- importance ranking of hyperparameters
	1. $\alpha$ learning rate
	2. $\beta$ [[Momentum Optimizer]] term, [[Mini Batch Gradient Descent|Mini Batch]] size, number of hidden units
	3. number of layers, [[Learning Rate Decay]]
	4. [[Adam Optimizer]] parameters (pretty much never tuned; see the sketch after this list)
		- $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
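- a minimal sketch (not from the source) of a single Adam update, showing where $\beta_1$, $\beta_2$, and $\epsilon$ enter; variable names are illustrative

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1/beta2/eps are the defaults that rarely need tuning."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (RMS) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # eps avoids division by zero
    return w, m, v
```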
- use random sampling instead of grid/fixed search
	- hyperparameters can differ starkly in importance; a grid re-tests the same few values of the crucial ones while exhausting trials on an insignificant one (e.g. a $5 \times 5$ grid tries only 5 distinct values of $\alpha$, whereas 25 random samples try 25), as sketched below
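- a minimal sketch (assumed setup, not from the source) contrasting the two strategies for two hyperparameters; the ranges are illustrative, and sampling $\alpha$ log-uniformly is common practice

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a 5x5 grid spends 25 trials but tests only
# 5 distinct values of each hyperparameter.
alphas = np.logspace(-4, 0, 5)                 # 5 learning rates: 1e-4 ... 1
hidden = np.linspace(50, 250, 5, dtype=int)    # 5 hidden-unit counts
grid_trials = [(a, h) for a in alphas for h in hidden]

# Random search: the same 25 trials test 25 distinct values of each,
# so no trial is wasted if one hyperparameter turns out to be insignificant.
random_trials = [
    (10 ** rng.uniform(-4, 0),                 # alpha: log-uniform on [1e-4, 1]
     int(rng.integers(50, 251)))               # hidden units: uniform on [50, 250]
    for _ in range(25)
]
```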