- $v_{dW} = \beta \, v_{dW} + (1 - \beta) \, dW$
- $W = W - \alpha \, v_{dW}$ (sketched in code below)
- almost always works faster than vanilla [[Deep Learning Training]]
- vanilla approach tends to cause oscillations, which slow down learning
- momentum smooths out the steps of GD
- uses [[Exponentially Weighted Averages]] of each gradient value used in the update
- most common $\beta$ is 0.9, which averages over roughly the last 10 gradients ($\approx 1/(1-\beta)$)

![[CleanShot 2024-06-17 at [email protected]|300]]
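A minimal NumPy sketch of the momentum update above; the function name, learning rate, and toy quadratic loss are illustrative choices, not from the note:

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step for parameter W.

    v_dW holds the exponentially weighted average of past gradients;
    alpha (learning rate) and beta=0.9 match the note's common default.
    """
    v_dW = beta * v_dW + (1 - beta) * dW  # v_dW = beta*v_dW + (1-beta)*dW
    W = W - alpha * v_dW                  # W = W - alpha*v_dW
    return W, v_dW

# toy usage: minimize f(W) = ||W||^2, whose gradient is 2W
W = np.array([3.0, -2.0])
v_dW = np.zeros_like(W)  # velocity initialized to zeros
for _ in range(200):
    dW = 2 * W           # gradient of the toy loss
    W, v_dW = momentum_update(W, dW, v_dW)
print(W)  # converges toward [0, 0]
```

Because $v_{dW}$ averages recent gradients, components that oscillate in sign cancel out while consistent directions accumulate, which is why the steps are smoother than vanilla GD.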