Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.

Applied Statistics Lesson of the Day – The Independent 2-Sample t-Test with Unequal Variances (Welch’s t-Test)

A common problem in statistics is determining whether or not the means of 2 populations are equal.  The independent 2-sample t-test is a popular parametric method to answer this question.  (In an earlier Statistics Lesson of the Day, I discussed how data collected from a completely randomized design with 1 binary factor can be analyzed by an independent 2-sample t-test.  I also discussed its possible use in the discovery of argon.)  I have learned 2 versions of the independent 2-sample t-test, and they differ on the variances of the 2 samples.  The 2 possibilities are

  • equal variances
  • unequal variances

Most statistics textbooks that I have read elaborate at length about the independent 2-sample t-test with equal variances (also called Student’s t-test).  However, the assumption of equal variances needs to be checked using the chi-squared test before proceeding with the Student’s t-test, yet this check does not seem to be universally done in practice.  Furthermore, conducting one test based on the results of another results possible inflation of Type 1 error (Ruxton, 2006).

Some books give due attention to the independent 2-sample t-test with unequal variances (also called Welch’s t-test), but some barely mention its value, and others do not even mention it at all.  I find this to be puzzling, because the assumption of equal variances is often violated in practice, and Welch’s t-test provides an easy solution to this problem.  There is a seemingly intimidating but straightforward calculation to approximate the number of degrees of freedom for Welch’s t-test, and this calculation is automatically incorporated in most software, including R and SAS.  Finally, Welch’s t-test removes the need to check for equal variances, and it is almost as powerful as Student’s t-test when the variances are equal (Ruxton, 2006).

For all of these reasons, I recommend Welch’s t-test when using the parametric approach to comparing the means of 2 populations.

 

Reference

Graeme D. Ruxton.  “The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test“.  Behavioral Ecology (July/August 2006) 17 (4): 688-690 first published online May 17, 2006

Machine Learning Lesson of the Day – Cross-Validation

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well.  However, what if the data set is not very large, and the small amount of data for training results in high sampling error?  A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

  1. Divide the data set into K partitions.
  2. Designate the first partition as the validation set and designate the other partitions as the training set.
  3. Use training set to train the algorithm.
  4. Use the validation set to assess the predictive accuracy of the algorithm; the common measure of predictive accuracy is mean squared error.
  5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition.  (Essentially, rotate the designation of validation set through every partition.)
  6. Calculate the average of the mean squared error from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model.  Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.

If there are N data in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).

There some trade-offs between a large and a small K.  The estimator for the prediction error from a larger K results in

  • less bias because of more data being used for training
  • higher variance because of the higher similarity and lower diversity between the training sets
  • slower computation because of more data being used for training

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Page 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots.  I will finally answer the question that I have asked (and hinted to answer) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

histogram and kernel density plot

Read the rest of this post to learn how to combine histograms with density curves like this above plot!

This is another post in my continuing series on exploratory data analysis (EDA).  Previous posts in this series on EDA include

Read more of this post

Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

ecdf standard normal

Read to rest of this post to learn what an empirical CDF is and how to produce the above plot!

Read more of this post

Follow

Get every new post delivered to your Inbox.

Join 251 other followers