Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning

Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy.  Validation is a good method for assessing how well a model predicts new data.

Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy.  Specifically,

  1. split your data set into two sets: a training set and a validation set
  2. use the training set to fit your model (e.g. LASSO regression)
  3. use the fitted model and the predictors in the validation set to predict the targets
  4. use some error measure (e.g. mean squared error) to assess the differences between the predicted targets and the actual targets.

A good rule of thumb is to use 70% of your data for training and 30% for validation.
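To make these four steps concrete, here is a minimal sketch in Python with scikit-learn.  It assumes your predictors and targets are already stored in arrays named X and y, and the penalty value alpha = 0.1 is just a placeholder for illustration.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error

    # Step 1: split the data into a training set (70%) and a validation set (30%)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

    # Step 2: fit the model (here, LASSO regression) on the training set
    model = Lasso(alpha=0.1)   # alpha is the penalty parameter; 0.1 is a placeholder value
    model.fit(X_train, y_train)

    # Step 3: use the predictors in the validation set to predict the targets
    y_pred = model.predict(X_valid)

    # Step 4: assess predictive accuracy with an error measure (mean squared error)
    mse = mean_squared_error(y_valid, y_pred)
    print(mse)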

You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression).  The model with the lowest validation mean squared error can be judged as the best model.
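For example, the model comparison might look like the sketch below, which continues from the previous snippet (so X_train, y_train, X_valid and y_valid are assumed to exist); the grid of penalty values is arbitrary and purely for illustration.

    # fit one LASSO model per candidate penalty parameter and record its validation MSE
    candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]   # an arbitrary illustrative grid

    validation_mse = {}
    for alpha in candidate_alphas:
        model = Lasso(alpha=alpha)
        model.fit(X_train, y_train)
        validation_mse[alpha] = mean_squared_error(y_valid, model.predict(X_valid))

    # judge the model with the lowest validation MSE as the best model
    best_alpha = min(validation_mse, key=validation_mse.get)
    print(best_alpha, validation_mse[best_alpha])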

I highly encourage you to test your models on a separate data set – called a test set – from the same population or probability distribution and assess their predictive accuracies on the test set.  This is a good way to check for any overfitting in your models.
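A rough sketch of that final check, assuming you set aside a test set (X_test, y_test) from the same population before any model fitting, and that best_alpha comes from the comparison above:

    import numpy as np

    # refit the chosen model on all of the training and validation data,
    # then score it once on the untouched test set
    final_model = Lasso(alpha=best_alpha)
    final_model.fit(np.concatenate([X_train, X_valid]), np.concatenate([y_train, y_valid]))

    test_mse = mean_squared_error(y_test, final_model.predict(X_test))
    print(test_mse)   # a test MSE much larger than the validation MSE is a sign of overfitting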

Your thoughtful comments are much appreciated!
