- $v_{dW} = \beta\, v_{dW} + (1 - \beta)\, dW$
- $W = W - \alpha\, v_{dW}$
- almost always converges faster than vanilla [[Deep Learning Training]]
- the vanilla approach tends to oscillate on its way to the minimum, which slows learning and forces a smaller learning rate
- momentum smooths out the steps of GD
- uses [[Exponentially Weighted Averages]] of the gradients in place of the raw gradients for each update
- most common $\beta$ is 0.9, which averages over roughly the last 10 gradients (since $1/(1-\beta) = 10$); see the sketch below the screenshot
![[CleanShot 2024-06-17 at [email protected]|300]]
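A minimal NumPy sketch of the two update rules above for a single parameter tensor $W$; the name `momentum_update`, the toy objective, and the choice $\alpha = 0.1$ are illustrative assumptions, not from this note:

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha=0.1, beta=0.9):
    """One gradient-descent-with-momentum step for one parameter tensor."""
    v_dW = beta * v_dW + (1 - beta) * dW  # exponentially weighted average of gradients
    W = W - alpha * v_dW                  # step along the smoothed gradient
    return W, v_dW

# toy usage: minimize f(W) = ||W||^2 / 2, whose gradient is W itself
W = np.array([3.0, -4.0])
v_dW = np.zeros_like(W)  # common initialization: v = 0
for _ in range(100):
    dW = W               # gradient of the toy objective
    W, v_dW = momentum_update(W, dW, v_dW)
print(W)                 # converges toward [0, 0]
```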