- data is typically split into three sets:
  - training set
  - dev set (also called the holdout or cross-validation set)
  - test set
- you keep training models on the training set, then check which of the many candidate models performs best on the dev set
- once you've iterated long enough, you use the test set as the accurate, external final measure of performance
- if you have a million training examples, you could feasibly give only 10k examples (1%) each to the dev and test sets (see the sketch after this list)
  - this is the newer approach for large datasets; it overturns the old 60/20/20 split, which gave the dev and test sets far more data than they need
- for small datasets, the old 60/20/20 split still makes sense
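
As a concrete illustration, here is a minimal sketch in Python of the two split strategies; the `split_dataset` helper, the ratios, and the synthetic data are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def split_dataset(X, y, dev_frac, test_frac, seed=0):
    """Shuffle, then carve off dev and test sets; the rest is training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# large dataset: 1,000,000 examples -> 98/1/1 gives 10k each to dev and test
X_big = np.random.rand(1_000_000, 5)
y_big = np.random.randint(0, 2, size=1_000_000)
train, dev, test = split_dataset(X_big, y_big, dev_frac=0.01, test_frac=0.01)
print(len(train[0]), len(dev[0]), len(test[0]))  # 980000 10000 10000

# small dataset: 1,000 examples -> the classic 60/20/20 split
X_small = np.random.rand(1_000, 5)
y_small = np.random.randint(0, 2, size=1_000)
train, dev, test = split_dataset(X_small, y_small, dev_frac=0.2, test_frac=0.2)
print(len(train[0]), len(dev[0]), len(test[0]))  # 600 200 200
```

The point of the helper is that the dev/test fractions are parameters: with a million examples, 1% is already 10k examples, which is plenty to compare models, so nearly everything can go to training.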