## Primary Info
---
- normalizes node value based on batch (horizontal)
![[CleanShot 2024-07-10 at [email protected]|200]]
![[CleanShot 2024-07-10 at [email protected]|200]]
- for a given layer $l$, we take each node's values $z^{(1)} \dots z^{(m)}$ across the $m$ training examples in the batch and normalize each $z^{(i)}$
- below is the standard normalization with mean 0 & variance 1
![[CleanShot 2024-06-22 at [email protected]|350]]
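A minimal numpy sketch of the standard normalization step (the small epsilon for numerical stability is my addition, as in the usual formulation):

```python
import numpy as np

def normalize(z, eps=1e-8):
    """Normalize node values z (shape: nodes x batch) to mean 0, variance 1 per node."""
    mu = z.mean(axis=1, keepdims=True)    # mean over the batch (horizontal)
    var = z.var(axis=1, keepdims=True)    # variance over the batch
    return (z - mu) / np.sqrt(var + eps)

z = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
z_norm = normalize(z)
```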
- if we want different mean & variance for distribution, we do below instead:
	- note the below $\gamma$ & $\beta$ are learnable parameters like weights/bias
- $\gamma^{[1]}, \beta^{[1]}$ are for layer 1 ($\beta$ here has nothing to do with optimization $\beta$)
	- note that we actually get rid of bias $b$ as a parameter for the entire network: the mean subtraction cancels any constant added to $z$, so $b$ is redundant with $\beta$
![[CleanShot 2024-06-22 at [email protected]|350]]
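A sketch of the scale-and-shift version: after normalizing, $\tilde{z} = \gamma z_{norm} + \beta$ gives each node a learned mean and spread (in training, $\gamma$ and $\beta$ would be updated by gradient descent alongside the weights; here they are just fixed arrays for illustration):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """z_tilde = gamma * z_norm + beta, so each node ends up with
    (approximately) std gamma and mean beta, both learnable."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = rng.standard_normal((2, 100)) * 5 + 3   # 2 nodes, batch of 100
gamma = np.array([[2.0], [0.5]])            # per-node scale (learned in practice)
beta = np.array([[1.0], [-1.0]])            # per-node shift (learned in practice)
z_tilde = batch_norm(z, gamma, beta)
```

Setting `gamma = sqrt(var)` and `beta = mu` would recover the raw $z$, so the network can always undo the normalization if that helps.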
- typically implemented with [[Mini Batch Gradient Descent|Mini Batch]]
	- note the mean & variance used to normalize each node value come from that node's values across the entire mini batch at that layer, not from a single training example
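Putting the pieces together, a sketch of one hidden layer's forward pass with batch norm over a mini batch (the layer shapes, ReLU choice, and function name are illustrative; note there is no bias `b`):

```python
import numpy as np

def bn_layer_forward(A_prev, W, gamma, beta, eps=1e-8):
    """One hidden layer with batch norm computed over the mini batch.
    A_prev: (n_prev, m) activations from the previous layer, m = mini batch size."""
    Z = W @ A_prev                            # no bias b: redundant with beta
    mu = Z.mean(axis=1, keepdims=True)        # per-node mean across the mini batch
    var = Z.var(axis=1, keepdims=True)        # per-node variance across the mini batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta           # learnable mean & variance
    A = np.maximum(0, Z_tilde)                # ReLU activation
    return A

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 64))         # 4 input nodes, mini batch of 64
W = rng.standard_normal((3, 4))               # layer with 3 nodes
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
A = bn_layer_forward(A_prev, W, gamma, beta)
```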
## Side Info
---
- makes [[Hyperparameter Tuning]] much easier
- note when [[Normalizing Inputs]], we normalize $a^{[0]}$; batch normalization extends this to the layers after the input as well, e.g. normalizing $z^{[3]}$
- helps with training
	- for a post-input layer, the earlier layers' weights are constantly changing during training, so if we view that layer as an "input" layer, the distribution of its input data constantly shifts, which makes it harder for the layers after it to adapt correctly
- by using batch normalization, although this "input" data may vary, it will be governed by a mean & variance which makes it easier for later layers to adapt
- batch norm has a slight regularizing effect
- hard to use with sequential data
	- because if sequences are of variable length, the batch statistics get really hard to calculate
	- my thought: some training examples won't even have a value at a given position, so how would you calculate the mean for that node across the examples?
- so definitely bad for sequential data
https://www.youtube.com/watch?v=2V3Uduw1zwQ&ab_channel=AssemblyAI