- importance ranking of hyperparameters (same number = same importance tier)
    1. $\alpha$ (learning rate)
    2. $\beta$ ([[Momentum Optimizer]] term)
    2. [[Mini Batch Gradient Descent|Mini Batch]] size
    2. # hidden units
    3. # layers
    3. [[Learning Rate Decay]]
    4. [[Adam Optimizer|Adam Optimizer]] parameters (pretty much never tune these): $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
- use random sampling instead of grid/fixed search (see the sketch below)
    - hyperparameters can differ starkly in importance: a grid re-tests the same few values of an important hyperparameter at every setting of an insignificant one, wasting computation, while random sampling tries a distinct value of every hyperparameter on each run
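
A minimal sketch of random search over two hyperparameters, assuming a hypothetical `train_and_evaluate` objective (a toy quadratic standing in for a real training run; the sampling ranges are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(alpha, beta):
    # Hypothetical stand-in for training a network and returning validation
    # error; this toy objective is minimized near alpha=3e-3, beta=0.9.
    return (np.log10(alpha) + 2.5) ** 2 + (beta - 0.9) ** 2

n_trials = 25
best_err, best_params = float("inf"), None

# Random search: each trial draws a fresh value of EVERY hyperparameter,
# so 25 trials explore 25 distinct learning rates (a 5x5 grid would
# explore only 5, repeating each across the other hyperparameter).
for _ in range(n_trials):
    alpha = 10 ** rng.uniform(-4, -1)  # learning rate, sampled on a log scale
    beta = rng.uniform(0.8, 0.999)     # momentum term
    err = train_and_evaluate(alpha, beta)
    if err < best_err:
        best_err, best_params = err, {"alpha": alpha, "beta": beta}

print(best_params, best_err)
```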