- combines [[RMSprop Optimizer]] & [[Momentum Optimizer]]
- initialize $V_{dW} = 0$ & $S_{dW} = 0$
- on each iteration $t$, with gradient $dW$:
- $V_{dW} = \beta_1 V_{dW} + (1 - \beta_1) dW$
- $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2) dW^2$
- bias-correct both estimates: $V_{dW}^{corrected} = \frac{V_{dW}}{1 - \beta_1^t}$, $S_{dW}^{corrected} = \frac{S_{dW}}{1 - \beta_2^t}$
- $W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon}$ (the $\epsilon$ avoids division by zero; $10^{-8}$ is the recommended value)
- $\alpha$ still needs to be tuned
- $\beta_1$: common choice is 0.9 (momentum, ~10-iteration average)
- $\beta_2$: the inventors of Adam recommend 0.999
- people generally just use the above beta values; practitioners generally only tune $\alpha$
- helps train networks much more quickly
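The update above can be sketched as a single NumPy function; a minimal illustration, not a production implementation (the name `adam_step` and the default hyperparameters are just conventional choices):

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameters W given gradient dW.

    V and S are the running first/second moment estimates (same shape
    as W), t is the 1-based iteration count. Returns (W, V, S).
    """
    V = beta1 * V + (1 - beta1) * dW        # momentum: first moment
    S = beta2 * S + (1 - beta2) * dW**2     # RMSprop: second moment
    V_hat = V / (1 - beta1**t)              # bias correction
    S_hat = S / (1 - beta2**t)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return W, V, S
```

For example, minimizing $f(w) = w^2$ (gradient $2w$) by looping `W, V, S = adam_step(W, 2 * W, V, S, t)` for increasing `t` drives `W` toward 0.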