- when training [[Deep Learning|Neural Networks]], the derivatives (gradients) can become exponentially large or small
- happens especially in very deep networks
- i.e. your gradients might become exponentially small, causing learning to be very slow
- careful choices of [[Parameter Initialization]] can significantly reduce this problem
- as a function of $L$ (the number of layers), the activations moving forward through the network can exponentially increase (if $W > 1$) or decrease (if $W$ is slightly $< 1$); see the sketch below
- $\therefore$ you also get gradients that explode or vanish as a function of $L$
- if the gradient is exponentially small, then learning will be very slow
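- a minimal sketch of this scaling argument (not from these notes): a purely linear network where each layer multiplies by the same $W$, so the activation norm scales exponentially in $L$; the names and numbers (`forward_norm_scaled_identity`, `c`, `dim`) are illustrative, and the $1/\sqrt{n}$ scaling stands in for Xavier/He-style [[Parameter Initialization]]

```python
# Minimal sketch: a purely linear "network" to isolate the depth-scaling effect.
import numpy as np

def forward_norm_scaled_identity(c, L, dim=64, seed=0):
    """Norm of the activations after L layers with W = c * I."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    W = c * np.eye(dim)
    for _ in range(L):
        x = W @ x                  # each layer multiplies the norm by roughly c
    return np.linalg.norm(x)

def forward_norm_random(L, dim=64, seed=0):
    """Norm after L layers with Gaussian W scaled by 1/sqrt(dim) (Xavier/He-style)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    for _ in range(L):
        W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
        x = W @ x                  # variance-preserving on average, so the norm stays ~constant
    return np.linalg.norm(x)

for L in (10, 50, 100):
    print(f"L={L:3d}  "
          f"W=1.05*I: {forward_norm_scaled_identity(1.05, L):.2e}   "  # explodes like 1.05**L
          f"W=0.95*I: {forward_norm_scaled_identity(0.95, L):.2e}   "  # vanishes like 0.95**L
          f"scaled random W: {forward_norm_random(L):.2e}")            # stays around ||x||
```

- by the same argument, backprop multiplies the gradient by the same $W$'s (transposed), so the gradient norm grows or shrinks at the same exponential rate; keeping the per-layer scale near 1 via initialization is what counters this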