- you can split your training data into small mini-batches
- do a gradient descent update for each mini-batch (the vectorized calculations run over just that one mini-batch each time); see the sketch after the figure below
- Andrew Ng: when your dataset is super large, mini-batch will always be faster than full-batch gradient descent, since you don't have to process the entire training set for every single update
- in terms of raw gradient computation, smaller mini-batches are less efficient (per example) than larger ones
- but they can make up for it by converging faster, which can mean fewer epochs are needed in total
![[CleanShot 2024-06-16 at [email protected]|350]]
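A minimal NumPy sketch of the loop described above: `mini_batch_gd_epoch` is a hypothetical helper (not from the course), shown here on plain linear regression with an MSE cost rather than a neural network, but the structure is the same idea: shuffle, slice the data into mini-batches, and take one vectorized gradient step per mini-batch.

```python
import numpy as np

def mini_batch_gd_epoch(X, y, w, b, lr=0.01, batch_size=64, rng=None):
    """Run one epoch of mini-batch gradient descent on a linear model with MSE cost."""
    rng = np.random.default_rng(0) if rng is None else rng
    m = X.shape[0]
    perm = rng.permutation(m)                 # shuffle so the mini-batches differ each epoch
    X, y = X[perm], y[perm]
    for start in range(0, m, batch_size):
        Xb = X[start:start + batch_size]      # one mini-batch
        yb = y[start:start + batch_size]
        err = Xb @ w + b - yb                 # vectorized over just this mini-batch
        grad_w = Xb.T @ err / len(yb)         # MSE gradient w.r.t. weights
        grad_b = err.mean()                   # MSE gradient w.r.t. bias
        w = w - lr * grad_w                   # one parameter update per mini-batch
        b = b - lr * grad_b
    return w, b

# tiny usage example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)
w, b = np.zeros(3), 0.0
for epoch in range(20):                       # one epoch = one full pass over all mini-batches
    w, b = mini_batch_gd_epoch(X, y, w, b, lr=0.1, batch_size=64)
print(w, b)                                   # w should approach [2.0, -1.0, 0.5]
```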
- choosing your mini-batch size
- if the training set is small (< 2000 examples): just use (full) batch gradient descent
- typical mini-batch sizes:
- 64, 128, 256, 512 (powers of 2 can run faster because of how computer memory is laid out and accessed)
- make sure each mini-batch fits in CPU/GPU memory, or else things get really slow
- an [[Epoch]] refers to one complete pass through the training set, i.e. after you have done an update for every mini-batch that makes up the data
- if the mini-batch size is 1, you get [[Stochastic Gradient Descent]] (see the comparison sketch after this list)
- the updates are very noisy, and you also lose all the speedup from vectorization
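As a quick way to see how the batch-size choice maps onto the three regimes, and how many updates one epoch gives you: with `batch_size = m` you recover full-batch gradient descent (one update per epoch), with `batch_size = 1` you get SGD (one noisy, unvectorized update per example), and anything in between is mini-batch gradient descent. The training-set size `m` below is just an assumed number for illustration.

```python
import math

m = 50_000                          # assumed training-set size, for illustration only

for batch_size in (m, 256, 1):      # full batch, a typical mini-batch size, SGD
    updates_per_epoch = math.ceil(m / batch_size)
    label = {m: "full-batch GD", 1: "SGD (size 1)"}.get(batch_size, "mini-batch GD")
    print(f"{label:>14}: batch size {batch_size:>6} -> {updates_per_epoch:>6} updates per epoch")
```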
https://www.kaggle.com/code/residentmario/full-batch-mini-batch-and-online-learning
https://www.coursera.org/learn/deep-neural-network/lecture/lBXu8/understanding-mini-batch-gradient-descent