- when training [[Deep Learning|Neural Networks]], the derivatives (gradients) can become exponentially large or small
- happens especially in very deep networks
- i.e. your gradients might become exponentially small, causing learning to be very slow
- careful choices of [[Parameter Initialization]] can significantly reduce this problem
- as a function of $L$ (the number of layers), the activations moving forward through the network can exponentially increase (if $W > 1$) or decrease (if $W$ is slightly $< 1$); see the sketch below
- $\therefore$ you also get gradients that explode or vanish as a function of $L$
- if the gradient is exponentially small, then learning will be very slow
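- a minimal sketch of this scaling argument (not from these notes): a purely linear network where each layer multiplies by the same $W$, so the activation norm scales exponentially in $L$; the names and numbers (`forward_norm_scaled_identity`, `c`, `dim`) are illustrative, and the $1/\sqrt{n}$ scaling stands in for Xavier/He-style [[Parameter Initialization]]

```python
# Minimal sketch: a purely linear "network" to isolate the depth-scaling effect.
import numpy as np

def forward_norm_scaled_identity(c, L, dim=64, seed=0):
    """Norm of the activations after L layers with W = c * I."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    W = c * np.eye(dim)
    for _ in range(L):
        x = W @ x                  # each layer multiplies the norm by roughly c
    return np.linalg.norm(x)

def forward_norm_random(L, dim=64, seed=0):
    """Norm after L layers with Gaussian W scaled by 1/sqrt(dim) (Xavier/He-style)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    for _ in range(L):
        W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
        x = W @ x                  # variance-preserving on average, so the norm stays ~constant
    return np.linalg.norm(x)

for L in (10, 50, 100):
    print(f"L={L:3d}  "
          f"W=1.05*I: {forward_norm_scaled_identity(1.05, L):.2e}   "  # explodes like 1.05**L
          f"W=0.95*I: {forward_norm_scaled_identity(0.95, L):.2e}   "  # vanishes like 0.95**L
          f"scaled random W: {forward_norm_random(L):.2e}")            # stays around ||x||
```

- by the same argument, backprop multiplies the gradient by the same $W$'s (transposed), so the gradient norm grows or shrinks at the same exponential rate; keeping the per-layer scale near 1 via initialization is what counters this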