- [[He Initialization]]
- [[Xavier Initialization]]
## General
- we avoid weights that are too large, since they push tanh or sigmoid activations into their saturated regions, where gradients are close to zero and learning slows to a crawl
- so we scale `np.random.randn` values by a small constant like 0.01 for random initialization (sketch after this list)
- you can't just set all the initial weights to 0
- in a fully connected NN, if you initialize every weight in $W^{[1]}$ to 0, every hidden unit in layer 1 computes the same function of the input
- as you train, they receive identical gradient updates, so they stay identical
- note they each contribute to the next layer identically
- having more than 1 of these is redundant & useless
- you fix this by random initialization, which breaks the symmetry (toy demo at the end of this note)
- you can initialize $b$ to 0, though
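
A minimal sketch of this recipe for a hypothetical 2-layer network; the layer sizes `n_x`, `n_h`, `n_y` are made up for illustration.

```python
import numpy as np

n_x, n_h, n_y = 4, 3, 1                # input, hidden, output units (made up)

W1 = np.random.randn(n_h, n_x) * 0.01  # small random values break symmetry
b1 = np.zeros((n_h, 1))                # biases can safely start at 0
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```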
![[CleanShot 2024-06-10 at [email protected]]]
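
A toy demonstration of the symmetry problem, assuming a made-up 2-layer tanh/sigmoid network where $W^{[1]}$ starts at 0 and every entry of $W^{[2]}$ starts at the same constant; the data and shapes are invented for illustration.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(2, 5)                      # 2 features, 5 made-up examples
Y = (np.random.rand(1, 5) > 0.5).astype(float)
n_x, n_h, m = 2, 3, X.shape[1]

W1 = np.zeros((n_h, n_x))                      # every weight in layer 1 is identical
b1 = np.zeros((n_h, 1))
W2 = np.full((1, n_h), 0.5)                    # identical entries, so symmetry holds
b2 = np.zeros((1, 1))

# forward pass
A1 = np.tanh(W1 @ X + b1)                      # every row of A1 is identical
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))         # sigmoid output

# backward pass (binary cross-entropy)
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)             # identical rows
dW1 = dZ1 @ X.T / m                            # identical rows -> hidden units get the same update

print(np.allclose(A1, A1[0]), np.allclose(dW1, dW1[0]))  # True True
```

Because every row of `dW1` is the same, the rows of $W^{[1]}$ remain equal after each update, so the hidden units never differentiate.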