- $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$
  - higher $\beta$ means that you weight the past more (averaging over a bigger window into the past)
  - note there is a rough cutoff, about $1/(1-\beta)$ steps back, beyond which a past value's weight has decayed so much (to roughly $1/e$ of the newest value's weight) that its contribution is negligible
  - basically a historical average with more weight on recent updates
  - gives you a rough average with constant memory (just store the last $v_t$)
  - just 1 line of code too
  - an actual average over the last 50 days is more accurate, but not as space efficient
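
A minimal sketch of the update in Python, assuming a starting value of $v_0 = 0$ (these notes don't say how $v$ is initialized). The names `ema`, `values`, and `v0` are made up for illustration:

```python
def ema(values, beta=0.9, v0=0.0):
    """Return the running exponentially weighted average after each value."""
    v = v0  # assumption: start from 0, since the notes give no initial value
    history = []  # kept only so the result can be inspected; not needed for the update
    for theta in values:
        v = beta * v + (1 - beta) * theta  # the one-line update
        history.append(v)
    return history


print(ema([1.0, 1.0], beta=0.5))  # [0.5, 0.75]
```

Only `v` has to be carried from step to step, which is where the constant memory comes from. An exact 50-day average would instead need the last 50 values kept around.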