Eric’s Enlightenment for Monday, May 11, 2015

  1. Benjamin Morris used statistics to assess the value of Dennis Rodman as a rebounder and as a basketball player in general – and wrote one of the most epic series of blog posts in sports analytics.  Contrary to popular opinion, he eloquently argued why Rodman was a better rebounder than Wilt Chamberlain and Bill Russell.  In a digression in Part 1/4 (a), he used assist percentage to assess John Stockton’s greatness as a passer.
  2. I enjoy reading David Sherrill’s notes on quantum and computational chemistry.
  3. Read the first slide of this biostatistics lecture to learn how to calculate the concordance statistic (a.k.a. the C-statistic or the area under the receiver-operating characteristic (ROC) curve).
  4. Here are all of the videos of David Zetland’s lectures for his course on natural resource economics at Simon Fraser University.

Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

  • its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}

  • its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}
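As a quick illustration of the two formulas above, here is a minimal Python sketch. The function name `predictive_values` and the four counts are hypothetical and chosen only for this example; they do not come from any real data set.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from the four cells of a 2x2 confusion matrix.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    # PPV: among the cases classified as positive, the proportion truly positive.
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    # NPV: among the cases classified as negative, the proportion truly negative.
    npv = tn / (tn + fn) if (tn + fn) > 0 else float("nan")
    return ppv, npv


# Made-up counts: 50 cases classified positive, 50 classified negative.
ppv, npv = predictive_values(tp=40, fp=10, tn=45, fn=5)
print(ppv, npv)
```

With these made-up counts, 40 of the 50 positively classified cases are truly positive and 45 of the 50 negatively classified cases are truly negative.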

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

\text{Sensitivity}

on the vertical axis and

1 - \text{Specificity}

on the horizontal axis, the line of no discrimination is the line that passes through the points

(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)

and

(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1).

In other words, the line of no discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we do not have any predictors, then we can rely only on random guessing, and a random variable with the distribution \text{Bernoulli}(0.5) is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard for a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher”, I mean that, for a given value of 1 - \text{Specificity}, the \text{Sensitivity} of the binary classifier is higher than the \text{Sensitivity} of the line of no discrimination.)
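To see why pure guessing lands on the diagonal, here is a small simulation sketch in Python.  The sample size, the 30% prevalence of the “Success” class, the seed and the helper name `sensitivity_specificity` are all assumptions that I made up for illustration.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


rng = random.Random(0)  # fixed seed so that the run is reproducible
n = 100_000             # assumed sample size

# Simulated true classes with an assumed 30% prevalence of "Success" (1).
y_true = [1 if rng.random() < 0.3 else 0 for _ in range(n)]

# Bernoulli(0.5) guesses that ignore every predictor.
y_guess = [1 if rng.random() < 0.5 else 0 for _ in range(n)]

sens, spec = sensitivity_specificity(y_true, y_guess)
print(sens, 1 - spec)
```

Because the guesses are independent of the true classes, both the \text{Sensitivity} and 1 - \text{Specificity} should come out close to 0.5, which is a point on the line of no discrimination.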

Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves

A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the \text{Sensitivity} (the true positive rate) versus 1 - \text{Specificity} (1 minus the true negative rate, i.e. the false positive rate) of a binary classifier while varying its discrimination threshold.  In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy of the logistic regression model.

To illustrate with an example, let’s consider the Bernoulli response variable Y and the covariates X_1, X_2, ..., X_p.  A logistic regression model takes the covariates as inputs and returns P(Y = 1).  You as the user of the model must decide above which value of P(Y = 1) you will predict that Y = 1; this value is the discrimination threshold.  A common threshold is P(Y = 1) = 0.5.
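Here is a minimal Python sketch of how a discrimination threshold turns P(Y = 1) into a predicted class.  The intercept, the coefficients and the covariate values are hypothetical numbers that I picked for illustration; in practice, they would come from fitting the model to your training set.

```python
import math


def predict_probability(x, intercept, coefs):
    """Return P(Y = 1) from a logistic regression with the given coefficients."""
    # Linear predictor: intercept + sum of coefficient * covariate.
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    # Inverse logit (logistic) function.
    return 1.0 / (1.0 + math.exp(-eta))


def classify(prob, threshold=0.5):
    """Predict Y = 1 when P(Y = 1) is above the discrimination threshold."""
    return 1 if prob > threshold else 0


# Hypothetical fitted coefficients and one observation with p = 2 covariates.
p = predict_probability(x=[1.5, 1.0], intercept=-1.0, coefs=[2.0, -0.5])
print(p, classify(p), classify(p, threshold=0.9))
```

The same predicted probability is classified as Y = 1 at the common threshold of 0.5 but as Y = 0 at a stricter threshold of 0.9, which is why the sensitivity and specificity change as the threshold varies.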

Once you finish fitting the model with a training set, you can construct an ROC curve by following these steps:

  1. Set a discrimination threshold.
  2. Use the covariates to predict Y for each observation in a validation set.
  3. Since you have the actual response values in the validation set, you can then calculate the sensitivity and specificity for your logistic regression model at that threshold.
  4. Repeat Steps 1-3 with a new threshold.
  5. Plot the values of \text{Sensitivity} versus 1 - \text{Specificity} for all thresholds.  The result is your ROC curve.
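The steps above can be sketched in plain Python as follows.  The validation responses, the predicted probabilities and the grid of thresholds are made-up values, and the function name `roc_points` is my own; the sketch also assumes that the validation set contains at least one observation from each class.

```python
def roc_points(y_true, p_hat, thresholds):
    """Return sorted (1 - specificity, sensitivity) pairs, one per threshold."""
    points = []
    for c in thresholds:                                  # Steps 1 and 4
        y_pred = [1 if p > c else 0 for p in p_hat]       # Step 2
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, yp in pairs if t == 1 and yp == 1)
        fn = sum(1 for t, yp in pairs if t == 1 and yp == 0)
        tn = sum(1 for t, yp in pairs if t == 0 and yp == 0)
        fp = sum(1 for t, yp in pairs if t == 0 and yp == 1)
        sensitivity = tp / (tp + fn)                      # Step 3
        specificity = tn / (tn + fp)
        points.append((1 - specificity, sensitivity))
    # Sort by the horizontal coordinate so the points trace the curve left to right.
    return sorted(points)


# Made-up validation set: actual responses and predicted P(Y = 1).
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
thresholds = [0.0, 0.2, 0.5, 0.75, 1.0]

curve = roc_points(y_val, p_val, thresholds)
for x, y in curve:                                        # Step 5 (plot these)
    print(round(x, 3), round(y, 3))
```

Each pair is one point on the ROC curve; passing the pairs to any plotting tool, with 1 - \text{Specificity} on the horizontal axis and \text{Sensitivity} on the vertical axis, completes Step 5.  A finer grid of thresholds gives a smoother curve.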

The use of a validation set to assess the predictive accuracy of a model is called validation, and it is a good practice for supervised learning.  If you have another fresh data set, it is also good practice to use that as a test set to assess the predictive accuracy of your model.

Note that you can perform Steps 2-5 for the training set, too – this is often done in statistics when you don’t have many data to work with, and the best that you can do is to assess the predictive accuracy of your model on the data set that you used to fit the model.