- tanh is just a shifted and scaled version of sigmoid
- almost always works better than sigmoid
- the mean of the activations in hidden layers tends to be around 0, which kind of "centers" your data for the next layer
- Andrew Ng almost never uses sigmoid now, just uses tanh
	- exception is the output layer: for binary classification the output should be between 0 and 1, so sigmoid's 0-to-1 range makes sense
- downside of both sigmoid & tanh: if z is very large or very small, the gradient is very small
	- heavily slows down gradient descent

(see the quick sketch below)

![[CleanShot 2024-06-10 at [email protected]]]
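A quick numpy sketch (my own, not from the course) to check the "shifted sigmoid" claim via the identity tanh(z) = 2·sigmoid(2z) − 1, and to show how both gradients shrink toward 0 for large |z|:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# tanh is a shifted/scaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
z = np.linspace(-5, 5, 101)
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)

# gradients used in backprop:
#   sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
#   tanh'(z)    = 1 - tanh(z)**2
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

# for large |z| both gradients are nearly 0 -> gradient descent slows down
for z_val in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z_val:5.1f}  sigmoid'={sigmoid_grad(z_val):.8f}  tanh'={tanh_grad(z_val):.8f}")
```

At z = 10 the tanh gradient is already around 1e-8, which is the vanishing-gradient problem both activations share.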