- [[He Initialization]]
- [[Xavier Initialization]]

## General

- we dislike weights that are too large, since they push tanh or sigmoid into their saturated regions, where gradients are near zero and learning slows to a crawl
- so for random initialization we multiply numpy `randn` values by 0.01 to keep them small
- you can't just set the initial weights to 0
	- in a fully connected NN, say you initialize all weights in $W^{[1]}$ to 0: every unit in layer 1 then computes the identical function of the input
	- as you train, they stay identical, because they receive identical gradients
	- note they each contribute to the next layer identically
	- having more than 1 of these units is redundant & useless
	- you fix this with random initialization (see the sketch below)
- you can initialize $b$ to 0, though

![[CleanShot 2024-06-10 at [email protected]]]
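A minimal numpy sketch of the idea above, for a 2-layer network (the layer sizes here are made up for illustration):

```python
import numpy as np

# hypothetical layer sizes: 2 inputs, 4 hidden units, 1 output
n_x, n_h, n_y = 2, 4, 1

# zero initialization: every hidden unit computes the same function,
# receives the same gradient, and stays identical -- symmetry never breaks
W1_zero = np.zeros((n_h, n_x))

# random initialization: small Gaussian values break the symmetry;
# the 0.01 factor keeps pre-activations near 0, where tanh/sigmoid
# gradients are largest
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))   # biases can safely start at 0

W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```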