Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning
January 7, 2014
Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy. Validation is a good method for assessing a model’s predictive accuracy.
Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy. Specifically,
- split your data set into 2 sets: a training set and a validation set
- use the training set to fit your model (e.g. LASSO regression)
- feed the predictors in the validation set into the fitted model to predict the targets
- use some error measure (e.g. mean squared error) to assess the differences between the predicted targets and the actual targets in the validation set.
A good rule of thumb is to use 70% of your data for training and 30% of your data for validation.
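The steps above can be sketched in scikit-learn. This is a minimal illustration, not the only way to do it; the synthetic data and the penalty value alpha=0.1 are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data: 200 observations, 10 predictors,
# with the target driven mainly by the first predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(size=200)

# Split 70% training / 30% validation, per the rule of thumb above
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = Lasso(alpha=0.1).fit(X_train, y_train)  # fit on the training set
y_pred = model.predict(X_val)                   # predict on the validation set
val_mse = mean_squared_error(y_val, y_pred)     # error measure
print(val_mse)
```

The only numbers the model ever sees during fitting come from the training set; the validation set is touched only at prediction time, which is what makes the resulting MSE a fair estimate of predictive accuracy.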
You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression). The model with the lowest mean squared error on the validation set can be judged the best model.
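In code, comparing several penalty values amounts to a loop over candidate models scored on the same validation set. The candidate alphas and the synthetic data here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(size=200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical penalty values
val_errors = {}
for alpha in candidate_alphas:
    # Each candidate is fit on the same training set...
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    # ...and scored on the same validation set
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

# The penalty with the lowest validation MSE wins
best_alpha = min(val_errors, key=val_errors.get)
print(best_alpha, val_errors[best_alpha])
```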
I highly encourage you to also test your models on a separate data set – called a test set – drawn from the same population or probability distribution, and assess their predictive accuracy on it. This is a good way to check for any overfitting in your models.
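One way to sketch that full train/validation/test workflow: carve off the test set first, select the model on the validation set, and only then score the winner on the untouched test set. The split fractions and candidate alphas below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = X[:, 0] * 3.0 + rng.normal(size=300)

# Set aside a test set before any model fitting or selection happens
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=0
)

best_model, best_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0]:  # hypothetical candidate penalties
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_model, best_mse = model, mse

# The test set is used exactly once, after model selection is finished
test_mse = mean_squared_error(y_test, best_model.predict(X_test))
print(test_mse)
```

Because the test set played no role in either fitting or selection, a test-set MSE far above the validation MSE is a red flag that the selection process overfit the validation data.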