- with very deep [[Deep Learning|Neural Networks]], you get [[Vanishing and Exploding Gradients]]
- in ResNets, we use shortcut/skip connections that carry an earlier activation forward and add it to a deeper layer's $z$ before it goes through the [[Activation Functions|Activation Function]], which mitigates this problem (see the sketch after the figures below)
- this lets the network train effectively even with many layers
- each blue segment below is a residual block
![[CleanShot 2024-07-04 at [email protected]]]
![[CleanShot 2024-07-04 at [email protected]]]
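- a minimal sketch of a residual block in PyTorch (assuming fully connected layers and ReLU; the names `ResidualBlock`, `dim`, and the layer sizes are illustrative, not taken from the figures): the shortcut adds $a^{[l]}$ to $z^{[l+2]}$ before the final activation
```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two linear layers whose output is added to the block input
    (the skip connection) before the final activation:
    a[l+2] = g(z[l+2] + a[l])."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, a_l: torch.Tensor) -> torch.Tensor:
        a_l1 = self.relu(self.fc1(a_l))   # a[l+1]
        z_l2 = self.fc2(a_l1)             # z[l+2]
        return self.relu(z_l2 + a_l)      # shortcut added before activation


# usage: input and output shapes must match so the addition works
block = ResidualBlock(64)
out = block(torch.randn(8, 64))           # (batch, dim) -> (batch, dim)
```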
- adding residual blocks won't hurt performance, because a residual block can easily learn the identity function: if the weights & biases in between are learned as 0, then $a^{[l+2]} = a^{[l]}$, so the network can basically ignore those layers (derivation below)
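	- worked out, assuming ReLU as $g$ and the standard notation $z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$ (not written out elsewhere in this note):
	  $$a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right) = g\left(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}\right)$$
	- if $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$, this reduces to $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$, since ReLU is the identity on non-negative inputs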