Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on held-out data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.
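
To make the contrast concrete, here is a minimal sketch in Python that fits polynomials of increasing degree to simulated noisy data and compares their training and validation errors; the degrees, sample size, and noise level are arbitrary choices for illustration.  The degree-1 fit typically underfits (high error on both sets), while the degree-15 fit typically overfits (low training error but much higher validation error).

import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy data from a cubic trend.
x = np.linspace(-3, 3, 60)
y = x**3 - 2 * x + rng.normal(scale=3.0, size=x.size)

# Hold out 30% of the data for validation (a 70/30 split).
idx = rng.permutation(x.size)
train, valid = idx[:42], idx[42:]

# Fit an underfitting (degree 1), a reasonable (degree 3), and an overfitting (degree 15) model.
for degree in (1, 3, 15):
    coefs = np.polyfit(x[train], y[train], degree)
    train_mse = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    valid_mse = np.mean((np.polyval(coefs, x[valid]) - y[valid]) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_mse:7.2f}, validation MSE = {valid_mse:7.2f}")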

Machine Learning Lesson of the Day – K-Nearest Neighbours Regression

I recently introduced the K-nearest neighbours classifier.  Some slight adjustments to the same algorithm can make it into a regression technique.

Given a training set and a new input X, we can predict the target of the new input by

  1. identifying the K data (the K “neighbours”) in the training set that are closest to X by Euclidean distance
  2. predicting the target for X with a weighted average of the neighbours’ targets
  • the targets of the K “neighbours” are the values being averaged
  • the reciprocals of the neighbours’ distances to X are their respective weights

Validation or cross-validation can be used to determine the best value of K.
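
Here is a minimal sketch in Python (NumPy) of this procedure; the toy data and the choice of K = 3 are arbitrary and only for illustration.

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Step 1: find the K training points closest to x_new by Euclidean distance.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]

    # Step 2: predict the target as a weighted average of the neighbours' targets,
    # using the reciprocals of the distances as the weights.
    weights = 1.0 / (distances[nearest] + 1e-12)  # the small constant guards against division by zero
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

# Toy example: the target is roughly 2 * x1 + x2 plus noise.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 2.0], [4.0, 1.0]])
y_train = np.array([0.1, 3.0, 4.4, 8.1, 9.0])
print(knn_regress(X_train, y_train, np.array([2.5, 1.0]), k=3))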

Machine Learning Lesson of the Day – The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input X by

  1. identifying the K data (the K “neighbours”) in the training set that are closest to X
  2. counting the number of “neighbours” that belong to each class of the target variable
  3. classifying X by the most common class to which its neighbours belong

K is usually an odd number to avoid ties in the vote (at least when there are only two classes).

The proximity of the neighbours to X is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best value of K.
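
Here is a minimal sketch in Python (NumPy) of these three steps; the toy data and the choice of K = 3 are arbitrary and only for illustration.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    # Step 1: find the K training points closest to x_new by Euclidean distance.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]

    # Steps 2 and 3: count the neighbours' classes and return the most common one.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example with two classes.
X_train = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [5.0, 5.0], [5.5, 4.5]])
y_train = np.array(["red", "red", "red", "blue", "blue"])
print(knn_classify(X_train, y_train, np.array([0.8, 0.3]), k=3))  # expected: "red"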

Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand.  These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations.  This implies that a model that explains a certain situation well may fail in another situation.  In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem.  The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.
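
As a small illustration of that last point, here is a sketch that trains the same linear regression model by the normal equations and by gradient descent; the simulated data, learning rate, and number of iterations are arbitrary choices, and both approaches should recover roughly the same coefficients.

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept plus 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Normal equations: solve (X'X) beta = X'y.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error.
beta_gd = np.zeros(3)
for _ in range(5000):
    gradient = (2.0 / len(y)) * X.T @ (X @ beta_gd - y)
    beta_gd -= 0.05 * gradient

print(beta_ne)  # both should be close to [1.0, 2.0, -0.5]
print(beta_gd)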

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.

Machine Learning Lesson of the Day – Cross-Validation

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well.  However, what if the data set is not very large, and the small amount of data for training results in high sampling error?  A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

  1. Divide the data set into K partitions.
  2. Designate the first partition as the validation set and designate the other partitions as the training set.
  3. Use the training set to train the algorithm.
  4. Use the validation set to assess the predictive accuracy of the algorithm; a common measure of predictive accuracy is mean squared error.
  5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition.  (Essentially, rotate the designation of the validation set through every partition.)
  6. Calculate the average of the mean squared errors from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model.  Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.

If there are N data in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).
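
Here is a minimal sketch in Python (NumPy) of these steps for a single model – ordinary least-squares regression on simulated data with K = 5; all of these choices are arbitrary and only for illustration.  Setting K equal to the number of data would give LOOCV.

import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(size=n)

# Step 1: divide the (shuffled) data set into K partitions.
folds = np.array_split(rng.permutation(n), K)

mse = []
for k in range(K):
    # Step 2: designate one partition as the validation set and the rest as the training set.
    valid = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])

    # Step 3: train the algorithm on the training set (here, ordinary least squares).
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

    # Step 4: assess predictive accuracy on the validation set by mean squared error.
    mse.append(np.mean((X[valid] @ beta - y[valid]) ** 2))

# Steps 5 and 6: the loop rotates through every partition; average the mean squared errors.
print(np.mean(mse))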

There are some trade-offs between a large and a small K.  The estimator for the prediction error from a larger K results in

  • less bias because of more data being used for training
  • higher variance because of the higher similarity and lower diversity between the training sets
  • slower computation because of more data being used for training

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Pages 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning

Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy.  Validation is a good method for assessing a model’s predictive accuracy.

Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy.  Specifically,

  1. split your data set into 2 sets: a training set and a validation set
  2. use the training set to fit your model (e.g. LASSO regression)
  3. use the predictors in the validation set to predict the targets
  4. use some error measure (e.g. mean squared error) to assess the differences between the predicted targets and the actual targets.

A good rule of thumb is to use 70% of your data for training and 30% of your data for your validation.

You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression).  The model with the lowest mean squared error can be judged as the best model.
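
Here is a minimal sketch of this procedure; it assumes scikit-learn for the 70/30 split and for LASSO regression, and the simulated data and candidate penalty values are arbitrary choices for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=200)  # only 2 of the 10 predictors matter

# Step 1: split the data into a training set (70%) and a validation set (30%).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2 to 4, repeated for several values of the LASSO penalty parameter.
for alpha in (0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X_train, y_train)   # fit on the training set
    y_pred = model.predict(X_valid)                    # predict the validation targets
    print(f"penalty = {alpha}: validation MSE = {mean_squared_error(y_valid, y_pred):.3f}")

# The penalty with the smallest validation mean squared error would be judged the best model.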

I highly encourage you to test your models on a separate data set – called a test set – from the same population or probability distribution and assess their predictive accuracies on the test set.  This is a good way to check for any overfitting in your models.