- combines [[RMSprop Optimizer]] & [[Momentum Optimizer]]
- initialize $V_{dW} = 0$ & $S_{dW} = 0$
- $V_{dW} = \beta_1 * V_{dW} + (1 - \beta_1) dW$
- $S_{dW} = \beta_2 * S_{dW} + (1 - \beta_2) (dW)^2$
- bias correction (counteracts the zero initialization): $V_{dW}^{corrected} = \frac{V_{dW}}{1 - \beta_1^t}$, $S_{dW}^{corrected} = \frac{S_{dW}}{1 - \beta_2^t}$
- $W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \varepsilon}$
- $\alpha$ still needs to be tuned
- $\beta_1$ common choice is 0.9 (momentum, ~10 iteration average)
- the inventors of Adam recommend $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$
- practitioners generally keep these default $\beta$ values and tune only $\alpha$
- Adam typically trains networks much more quickly than plain gradient descent
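The update steps above can be sketched as a small NumPy function (the name `adam_step` and the toy quadratic objective are illustrative, not from the original notes):

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # momentum term: exponentially weighted average of gradients
    V = beta1 * V + (1 - beta1) * dW
    # RMSprop term: exponentially weighted average of squared gradients
    S = beta2 * S + (1 - beta2) * dW**2
    # bias correction, since V and S are initialized to zero (t starts at 1)
    V_hat = V / (1 - beta1**t)
    S_hat = S / (1 - beta2**t)
    # combined update; eps guards against division by zero
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return W, V, S

# usage: minimize f(w) = w^2, whose gradient is 2w
W = np.array([5.0])
V = np.zeros_like(W)
S = np.zeros_like(W)
for t in range(1, 2001):
    dW = 2 * W
    W, V, S = adam_step(W, dW, V, S, t, alpha=0.1)
print(W)  # converges toward 0
```

Note that `V` and `S` are carried across iterations, and the iteration counter `t` feeds the bias correction, which mainly matters during the first few dozen steps.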