Machine Learning Lesson of the Day – Overfitting

January 29, 2014

Any model in statistics or machine learning aims to capture the underlying trend, or systematic component, in a data set. That trend cannot be captured exactly because the data vary randomly around it. A model must have enough complexity to capture the trend, but not so much that it also captures the random variation. An overly complex model describes the noise in the data in addition to the underlying trend; this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target. If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation due to the trend in the target has been captured already. Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise. Thus, the new model with these added irrelevant predictors describes both the trend and the noise. It predicts the targets in the training set extremely well, but it predicts the targets in any new, fresh data set poorly, because the noise it has captured is unique to the training set.
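The effect described above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not part of the original lesson: the data-generating model (target = 2 + 3·x + noise), the sample sizes, the 20 irrelevant predictors, and the helper names `fit_ols`, `make_data`, and `evaluate` are all hypothetical choices made for illustration. It uses only the Python standard library; in practice a least-squares routine such as NumPy's `np.linalg.lstsq` would be the usual choice.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(X, beta):
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def make_data(n, n_noise, rng):
    """Assumed setup: the target depends on one predictor only;
    the remaining n_noise predictors are irrelevant."""
    X, y = [], []
    for _ in range(n):
        x_rel = rng.gauss(0, 1)
        irrelevant = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append([1.0, x_rel] + irrelevant)        # leading 1.0 is the intercept
        y.append(2.0 + 3.0 * x_rel + rng.gauss(0, 1))
    return X, y

rng = random.Random(0)
N_NOISE = 20
X_train, y_train = make_data(30, N_NOISE, rng)     # small training set
X_test, y_test = make_data(1000, N_NOISE, rng)     # fresh data from the same process

def evaluate(n_cols):
    """Fit on the first n_cols columns; return (training MSE, test MSE)."""
    X_tr = [row[:n_cols] for row in X_train]
    X_te = [row[:n_cols] for row in X_test]
    beta = fit_ols(X_tr, y_train)
    return mse(y_train, predict(X_tr, beta)), mse(y_test, predict(X_te, beta))

train_small, test_small = evaluate(2)              # intercept + relevant predictor
train_big, test_big = evaluate(2 + N_NOISE)        # plus 20 irrelevant predictors

print(f"right-sized model: train MSE {train_small:.3f}, test MSE {test_small:.3f}")
print(f"overfit model:     train MSE {train_big:.3f}, test MSE {test_big:.3f}")
```

Because the smaller model is nested inside the larger one, least squares can never give the larger model a higher training error. The test error is where the cost of the irrelevant predictors should show up.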

(The above explanation used a parametric model for illustration, but overfitting can also occur in non-parametric models.)

To generalize, a model that overfits its training set has low bias but high variance. It predicts the targets in the training set very accurately, but a slightly different training set would produce a noticeably different fitted model, and therefore vastly different predictions for the same new data.
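The "low bias but high variance" description can be made precise with the standard bias-variance decomposition of squared prediction error. The notation is an assumption introduced here for illustration, not taken from the post: the target is y = f(x) + ε with mean-zero noise of variance σ², the fitted model is f̂, and x₀ is a fixed new point. The expectation is taken over random training sets and over the noise in y₀. The snippet assumes the amsmath and amssymb packages.

```latex
% Assumed setup: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2.
% \hat{f} is the model fitted to a randomly drawn training set; x_0 is a fixed new point.
\[
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
\]
```

An overfit model keeps the first term small but inflates the second, so its expected error on new data grows even though its training error shrinks.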

Overfitting differs from multicollinearity, which I will explain in a later post. Overfitting involves irrelevant predictors, whereas multicollinearity involves redundant predictors.

Filed under: Machine Learning, Machine Learning Lesson of the Day, Statistics
Tagged with: linear regression, machine learning, multicollinearity, overfitting, statistics
