- data is typically split into:
- training set
  - dev set (also called holdout or cross-validation set)
- test set
- you keep training candidate models on the training set, then compare them on the dev set to see which performs best
- once model selection is done, you evaluate the chosen model once on the test set, which serves as the unbiased final measure of performance
- if you have a million training examples, you could feasibly give just 10k examples (1%) each to the dev and test sets (see the sketch after this list)
- this is the modern approach, which overturns the old 60/20/20 split that allocated unnecessarily large dev and test sets
- for small datasets, the old 60/20/20 split is still reasonable
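
a minimal sketch of the 98/1/1 split on a million examples, assuming scikit-learn is available; the random data and ratios here are illustrative, not part of the original notes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical dataset of 1M examples with 10 features each
X = np.random.rand(1_000_000, 10)
y = np.random.randint(0, 2, size=1_000_000)

# first carve off 2% of the data for dev + test combined...
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.02, random_state=42)

# ...then split that 2% evenly into dev (1%) and test (1%)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 980000 10000 10000
```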