- tanh is just a shifted and scaled version of sigmoid: tanh(z) = 2 * sigmoid(2z) - 1 (see the first sketch after this list)
- it almost always works better than sigmoid for hidden layers
- the mean of the activations in a hidden layer tends to be closer to 0 (tanh outputs lie in (-1, 1), while sigmoid outputs lie in (0, 1))
- this kind of "centers" the data going into the next layer
- Andrew Ng almost never uses sigmoid anymore; he just uses tanh
- the exception is the output layer: for binary classification the output should be between 0 and 1, so sigmoid's range makes sense there
- downside of both sigmoid and tanh: when z is very large or very small, the slope (gradient) of the function is nearly 0
- this can heavily slow down gradient descent (see the second sketch after this list)
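
A minimal NumPy sketch (my own, not from the course) of the first two points: tanh is sigmoid rescaled from (0, 1) to (-1, 1), and for roughly symmetric inputs its activations average out near 0 while sigmoid's average out near 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 101)

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

# with roughly zero-mean pre-activations, tanh outputs are roughly
# zero-centered, while sigmoid outputs sit around 0.5
z_pre = np.random.randn(10_000)   # hypothetical pre-activations
print(np.tanh(z_pre).mean())      # ~ 0.0
print(sigmoid(z_pre).mean())      # ~ 0.5
```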
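
And a quick check (again my own sketch) of the vanishing-gradient point: the slopes sigmoid'(z) = s(1 - s) and tanh'(z) = 1 - tanh(z)^2 both collapse toward 0 once |z| gets large.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# slope of each activation at a few values of z
for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  sigmoid'={s * (1 - s):.2e}  tanh'={1 - np.tanh(z) ** 2:.2e}")

# at z = 10 both slopes are ~0, so gradient descent barely updates
# the weights feeding into that unit
```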
![[CleanShot 2024-06-10 at [email protected]]]