![[CleanShot 2024-06-14 at [email protected]|350]]
- don't normalize the training & test sets differently: use the same $\mu$ & $\sigma$ for both
- we need to normalize inputs, because otherwise the cost function looks like an "elongated bowl"
	- your features end up on vastly different scales, e.g. $w_1 \in [0,1]$ & $w_2 \in [-1000,1000]$
	- which forces you to use a small learning rate
- normalizing makes the cost function easier to optimize
	- and pretty much never causes any harm anyway
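A minimal NumPy sketch of the rule above (synthetic data, names of my own choosing): compute $\mu$ and $\sigma$ on the training set only, then apply those same statistics to the test set.

```python
import numpy as np

# synthetic data: two features on vastly different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.5, 0.0], scale=[0.3, 500.0], size=(1000, 2))
X_test = rng.normal(loc=[0.5, 0.0], scale=[0.3, 500.0], size=(200, 2))

# compute mu & sigma on the TRAINING set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# apply the SAME mu & sigma to both sets; never re-fit on the test set
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```

After this, each training feature has mean ≈ 0 and standard deviation ≈ 1, so gradient descent can use a larger learning rate; the test set is merely close to that, which is expected.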