- importance ranking of hyperparameters
1. $\alpha$ learning rate
2. $\beta$ [[Momentum Optimizer]] term, [[Mini Batch Gradient Descent|Mini Batch]] size, # hidden units
3. # layers, [[Learning Rate Decay]]
4. [[Adam Optimizer|Adam Optimizer]] terms (pretty much never tune these)
- $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
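- in practice the tier-4 Adam terms can simply be left at their library defaults; a minimal sketch assuming PyTorch, with a hypothetical one-layer stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # hypothetical stand-in model

# betas and eps match the defaults listed above and are left alone;
# the learning rate alpha is the term actually worth tuning
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # alpha: tier 1, tune this
    betas=(0.9, 0.999),  # beta_1, beta_2: tier 4, library defaults
    eps=1e-8,            # epsilon: numerical stability, never tuned
)
```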
- use random sampling instead of grid/fixed search
- hyperparameters differ sharply in importance: with a grid, most trials re-test the same few values of the critical hyperparameter while only stepping through an insignificant one, wasting computation; random sampling gives every trial a fresh value of each hyperparameter (see the sketch below)
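- a minimal sketch of the waste argument; the 5x5 grid, the batch-size set, and the log-scale range for $\alpha$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 25

# grid search: 25 trials over (alpha, batch size), but only
# 5 distinct values of the all-important learning rate
grid = [(lr, bs)
        for lr in np.logspace(-4, 0, 5)
        for bs in (32, 64, 128, 256, 512)]

# random search: the same 25 trials each draw a fresh alpha,
# so the important axis is probed 25 times instead of 5
random_trials = [(10 ** rng.uniform(-4, 0),           # alpha on a log scale
                  int(rng.choice([32, 64, 128, 256, 512])))
                 for _ in range(n_trials)]

print(len({lr for lr, _ in grid}))           # 5 distinct alphas
print(len({lr for lr, _ in random_trials}))  # 25 distinct alphas
```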