## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

$\text{Sensitivity}$

on the vertical axis and

$1 - \text{Specificity}$

on the horizontal axis, the line of no discrimination is the line that passes through the points

$(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$

and

$(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.

In other words, the line of no discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we do not have any predictors, then we can rely only on random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard for a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher”, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)
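To make the comparison concrete, here is a minimal Python sketch that builds the ROC points from predicted scores and checks them against the diagonal.  The scores, the labels, and the function names `roc_points` and `beats_line_of_no_discrimination` are all hypothetical and chosen only for illustration.  Because every ROC curve touches the diagonal at its two end points, the check asks for "at least as high everywhere, and strictly higher somewhere".

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    labels are 1 for the "Success" class and 0 for the "Failure" class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted as Success
    # Lower the threshold through each distinct score, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def beats_line_of_no_discrimination(points):
    """True if the curve is never below the diagonal and is above it somewhere."""
    never_below = all(sens >= fpr for fpr, sens in points)
    above_somewhere = any(sens > fpr for fpr, sens in points)
    return never_below and above_somewhere


# Hypothetical predicted scores and true classes for 8 observations.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
print(points)
print(beats_line_of_no_discrimination(points))
```

A curve that lies exactly on the diagonal, such as `[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]`, fails this check.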

## Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

$Y = \Phi \beta + \varepsilon$,

where $\Phi_{ij} = \phi_j (x_i)$.  In other words, $\Phi$ is the design matrix, and the element in row $i$ and column $j$ of this design matrix is the $i\text{th}$ predictor being evaluated in the $j\text{th}$ basis function.  (In this case, there is 1 predictor per datum.)
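As an illustration of how $\Phi$ is filled in, here is a small Python sketch.  It assumes the common Gaussian form $\phi_j(x) = \exp\left(-(x - \mu_j)^2 / (2 s^2)\right)$ for the basis functions; the centres $\mu_j$, the width $s$, the data, and the helper names are my own assumptions for this example and are not specified in this lesson.

```python
import math


def gaussian_basis(centre, width):
    """Return phi(x) = exp(-(x - centre)^2 / (2 * width^2)) (assumed form)."""
    return lambda x: math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


def design_matrix(xs, basis_functions):
    """Row i, column j holds the j-th basis function evaluated at x_i."""
    return [[phi(x) for phi in basis_functions] for x in xs]


# Hypothetical centres, width, and predictor values (1 predictor per datum).
centres = [0.0, 1.0, 2.0]
basis_functions = [gaussian_basis(c, 1.0) for c in centres]
xs = [0.0, 0.5, 1.0, 2.0]

Phi = design_matrix(xs, basis_functions)
for row in Phi:
    print([round(value, 4) for value in row])
```

Each row corresponds to one datum and each column to one basis function, so this example gives a 4-by-3 matrix.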

Applying the method of least squares, the coefficient vector, $\beta$, can be estimated by

$\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y$.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not $X$, but $\Phi$.

If you are not familiar with how $\hat{\beta}$ was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.
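For quick reference, here is a brief sketch of that derivation with $\Phi$ in place of $X$, assuming that $\Phi^{T} \Phi$ is invertible.  The symbol $S(\beta)$ for the sum of squared errors is my own notation.  The sum of squared errors is

$S(\beta) = (Y - \Phi \beta)^{T} (Y - \Phi \beta)$.

Setting its gradient with respect to $\beta$ to zero gives

$\frac{\partial S}{\partial \beta} = -2 \Phi^{T} (Y - \Phi \beta) = 0$,

which rearranges to the normal equations

$\Phi^{T} \Phi \hat{\beta} = \Phi^{T} Y$.

Multiplying both sides on the left by $(\Phi^{T} \Phi)^{-1}$ yields

$\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y$.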

## Machine Learning Lesson of the Day – Introduction to Linear Basis Function Models

Given a supervised learning problem of using $p$ inputs ($x_1, x_2, ..., x_p$) to predict a continuous target $Y$, the simplest model to use would be linear regression.  However, what if we know that the relationship between the inputs and the target is non-linear, but we are unsure of exactly what form this relationship has?

One way to overcome this problem is to use linear basis function models.  These models assume that the target is a linear combination of a set of $p+1$ basis functions (counting the constant function $\phi_0 = 1$ that accompanies the intercept $w_0$).

$Y = w_0 + w_1 \phi_1(x_1) + w_2 \phi_2(x_2) + ... + w_p \phi_p(x_p)$

This is a generalization of linear regression that essentially replaces each input with a function of the input.  (A linear basis function model that uses the identity function is just linear regression.)

The type of basis functions (i.e. the type of function given by $\phi$) is chosen to suitably model the non-linearity in the relationship between the inputs and the target.  It also needs to be chosen so that the computation is efficient.  I will discuss variations of linear basis function models in a later Machine Learning Lesson of the Day.
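The following Python sketch evaluates the model above for one observation.  The weights, the inputs, and the choice of the logarithm and the square root as basis functions are hypothetical; they only show that swapping the identity function for other functions changes the model from linear regression to a non-linear one without changing how the weights enter.

```python
import math


def predict(weights, basis_functions, inputs):
    """Evaluate w_0 + w_1 * phi_1(x_1) + ... + w_p * phi_p(x_p)."""
    total = weights[0]  # the intercept w_0
    for w, phi, x in zip(weights[1:], basis_functions, inputs):
        total += w * phi(x)
    return total


def identity(x):
    return x


# Hypothetical weights and inputs for p = 2.
weights = [1.0, 2.0, -0.5]
inputs = [3.0, 4.0]

# With identity basis functions, the model is ordinary linear regression.
linear = predict(weights, [identity, identity], inputs)

# With other basis functions, the same weights give a non-linear model.
nonlinear = predict(weights, [math.log, math.sqrt], inputs)

print(linear, nonlinear)
```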

## Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it uses fewer assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.
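As a rough illustration of the second example, here is a minimal Python sketch of K-nearest neighbours regression with a single predictor.  The data, the value of K, and the function name `knn_regress` are hypothetical.  Notice that there is no fitting step: the "model" is the stored training set, and all of the work happens when a prediction is requested.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict the target at `query` as the mean target of its k nearest training points."""
    # Rank the stored training points by their distance to the query.
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k


# Hypothetical training set; it must be kept in memory to make predictions.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.1, 1.9, 3.2, 3.9, 5.1]

prediction = knn_regress(train_x, train_y, 2.6, 2)
print(prediction)
```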

## Machine Learning Lesson of the Day: The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input $X$ by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$
2. counting the number of “neighbours” that belong to each class of the target variable
3. classifying $X$ by the most common class to which its neighbours belong

K is usually chosen to be an odd number to avoid ties when there are 2 classes.

The proximity of the neighbours to $X$ is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best value of K.
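The 3 steps above can be sketched in Python as follows.  The training points, the class labels, and the function name `knn_classify` are hypothetical; the sketch uses Euclidean distance and makes no special provision for ties.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, x, k):
    """Classify x by majority vote among its k nearest training points."""
    # Step 1: identify the k training points closest to x (Euclidean distance).
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], x),
    )
    neighbours = by_distance[:k]
    # Step 2: count how many neighbours belong to each class.
    votes = Counter(train_labels[i] for i in neighbours)
    # Step 3: return the most common class among the neighbours.
    return votes.most_common(1)[0][0]


# Hypothetical training set with 2 predictors and 2 classes.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, (2, 2), 3))
print(knn_classify(train_points, train_labels, (6, 5), 3))
```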

## Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand.  These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations.  This implies that a model that explains a certain situation well may fail in another situation.  In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem.  The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.
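To illustrate the remark above that one model can be trained by more than one algorithm, here is a small Python sketch that fits the same simple linear regression in 2 ways: by the closed-form least-squares solution (the normal equations for one predictor) and by gradient descent on the mean squared error.  The data, the learning rate, and the number of steps are hypothetical choices that happen to converge for this example; they are not general recommendations.

```python
def fit_normal_equations(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize the mean squared error by repeated gradient steps."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2.0 * sum(residuals) / n
        grad_slope = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Hypothetical data set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

exact = fit_normal_equations(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print(exact, iterative)
```

Both algorithms arrive at essentially the same intercept and slope here; they differ in speed and in how they scale, which is the kind of trade-off to weigh.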

## Machine Learning Lesson of the Day – Cross-Validation

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well.  However, what if the data set is not very large, and the small amount of data for training results in high sampling error?  A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

1. Divide the data set into K partitions.
2. Designate the first partition as the validation set and designate the other partitions as the training set.
3. Use the training set to train the algorithm.
4. Use the validation set to assess the predictive accuracy of the algorithm; the common measure of predictive accuracy is mean squared error.
5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition.  (Essentially, rotate the designation of validation set through every partition.)
6. Calculate the average of the mean squared error from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model.  Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.
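The steps above can be sketched in Python as follows.  The 2 candidate models (a mean-only model and a straight line), the data, and all function names are hypothetical.  The partitions are formed by taking every Kth observation so that the example is reproducible; in practice, the data would usually be shuffled first.

```python
def fit_mean(xs, ys):
    """Candidate model 1: always predict the training mean of the target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate model 2: simple linear regression by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_mse(xs, ys, k, fit):
    """Average validation mean squared error over k folds."""
    # Step 1: divide the indices into k partitions.
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    fold_errors = []
    # Steps 2 and 5: let each partition take its turn as the validation set.
    for validation in folds:
        held_out = set(validation)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Step 3: train on the remaining partitions.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Step 4: mean squared error on the validation set.
        mse = sum((ys[i] - model(xs[i])) ** 2 for i in validation) / len(validation)
        fold_errors.append(mse)
    # Step 6: average over all k validations.
    return sum(fold_errors) / k


# Hypothetical data: a linear trend plus small fixed perturbations.
xs = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15, 0.1, -0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

cv_errors = {
    "mean only": k_fold_mse(xs, ys, 5, fit_mean),
    "straight line": k_fold_mse(xs, ys, 5, fit_line),
}
best_model = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_model)
```

Calling `k_fold_mse` with `k` equal to the number of observations makes every partition a single observation.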

If there are N data in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).

There are some trade-offs between a large and a small K.  The estimator for the prediction error from a larger K results in

• less bias because of more data being used for training
• higher variance because of the higher similarity and lower diversity between the training sets
• slower computation because more models are trained, each on more data

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Pages 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

## Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning

Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy.  Validation is a good method for assessing a model’s predictive accuracy.

Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy.  Specifically,

1. split your data set into 2 sets: a training set and a validation set
2. use the training set to fit your model (e.g. LASSO regression)
3. use the predictors in the validation set to predict the targets
4. use some error measure (e.g. mean squared error) to assess the differences between the predicted targets and the actual targets.

A good rule of thumb is to use 70% of your data for training and 30% of your data for validation.

You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression).  The model with the lowest mean squared error can be judged as the best model.
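Here is a minimal Python sketch of this procedure.  To keep it free of external libraries, it does not fit LASSO; instead, it uses a ridge-style penalty on the slope of a one-predictor regression, which has a simple closed form and stands in for "several different values of the penalty parameter".  The data (listed in an arbitrary, already-shuffled order), the penalty values, and the function names are all hypothetical.

```python
def fit_penalised_line(xs, ys, penalty):
    """One-predictor regression with a ridge-style penalty on the slope.

    A penalty of 0 gives ordinary least squares; the intercept is not penalised.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / (s_xx + penalty)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data, already in random order.
xs = [3.0, 8.0, 1.0, 6.0, 0.0, 9.0, 4.0, 7.0, 2.0, 5.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15, 0.1, -0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

# Step 1: 70% of the data for training, 30% for validation.
n_train = (7 * len(xs)) // 10
train_x, valid_x = xs[:n_train], xs[n_train:]
train_y, valid_y = ys[:n_train], ys[n_train:]

# Steps 2-4, repeated for each candidate model.
validation_mse = {}
for penalty in [0.0, 10.0, 100.0]:
    model = fit_penalised_line(train_x, train_y, penalty)
    validation_mse[penalty] = mean_squared_error(model, valid_x, valid_y)

best_penalty = min(validation_mse, key=validation_mse.get)
print(validation_mse, best_penalty)
```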

I highly encourage you to test your models on a separate data set – called a test set – from the same population or probability distribution and assess their predictive accuracies on the test set.  This is a good way to check for any overfitting in your models.

## Machine Learning Lesson of the Day – Supervised Learning: Classification and Regression

Supervised learning has 2 categories:

• In classification, the target variable is categorical.
• In regression, the target variable is continuous.

Thus, regression in statistics is different from regression in supervised learning.

In statistics,

• regression is used to model relationships between predictors and targets, and the targets could be continuous or categorical.
• a regression model usually includes 2 components to describe such relationships:
  • a systematic component
  • a random component.  The random component of this relationship is mathematically described by some probability distribution.
• most regression models in statistics also have assumptions about the dependence (or independence) between the predictors and/or between the observations.
• many statistical models also aim to provide interpretable relationships between the predictors and targets.
  • For example, in simple linear regression, the slope parameter, $\beta_1$, predicts the change in the target, $Y$, for every unit increase in the predictor, $X$.

In supervised learning,

• target variables in regression must be continuous
• categorical target variables are modelled in classification
• regression has less or even no emphasis on using probability to describe the random variation between the predictor and the target
  • Random forests are powerful tools for both classification and regression, but they do not use probability to describe the relationship between the predictors and the target.
• regression has less or even no emphasis on providing interpretable relationships between the predictors and targets.
  • Neural networks are powerful tools for both classification and regression, but they do not provide interpretable relationships between the predictors and the target.

***The last 2 points are applicable to classification, too.

In general, supervised learning puts much more emphasis on accurate prediction than statistics does.

Since regression in supervised learning includes only continuous targets, this results in some confusing terminology between the 2 fields.  For example, logistic regression is a commonly used technique in both statistics and supervised learning.  However, despite its name, it is a classification technique in supervised learning, because the response variable in logistic regression is categorical.

## Machine Learning Lesson of the Day – Supervised and Unsupervised Learning

The 2 most commonly used and studied categories of machine learning are supervised learning and unsupervised learning.

• In supervised learning, there is a target variable, $Y$, and a set of predictor variables, $X_1, X_2, ..., X_p$.  The goal is to use $X_1, X_2, ..., X_p$ to predict $Y$.  Supervised learning is synonymous with predictive modelling, but the latter term does not connote learning from data to improve performance in future prediction.  Nonetheless, when I explain supervised learning to people who have some background in statistics or analytics, they usually understand what I mean when I tell them that it is just predictive modelling.
• In unsupervised learning, there are only predictor variables and no target variable.  The goal is to find interesting patterns in $X_1, X_2, ..., X_p$.  This is a much less concretely defined problem than supervised learning.  Unsupervised learning is sometimes called pattern discovery, pattern recognition, or knowledge discovery, though these are not commonly agreed upon synonyms.

## Presentation Slides: Machine Learning, Predictive Modelling, and Pattern Recognition in Business Analytics

I recently delivered a presentation entitled “Using Advanced Predictive Modelling and Pattern Recognition in Business Analytics” at the Statistical Society of Canada’s (SSC’s) Southern Ontario Regional Association (SORA) Business Analytics Seminar Series.  In this presentation, I

– discussed how traditional statistical techniques often fail in analyzing large data sets

– defined and described machine learning, supervised learning, unsupervised learning, and the many classes of techniques within these fields, as well as common examples in business analytics to illustrate these concepts

– introduced partial least squares regression and bootstrap forest (or random forest) as two examples of supervised learning (or predictive modelling) techniques that can effectively overcome the common failures of traditional statistical techniques and can be easily implemented in JMP

– illustrated how partial least squares regression and bootstrap forest were successfully used to solve some major problems for 2 different clients at Predictum, where I currently work as a statistician

## Presentation Slides – Overcoming Multicollinearity and Overfitting with Partial Least Squares Regression in JMP and SAS

My slides on partial least squares regression at the Toronto Area SAS Society (TASS) meeting on September 14, 2012, can be found here.

#### My Presentation on Partial Least Squares Regression

My first presentation to the Toronto Area SAS Society (TASS) was delivered on September 14, 2012.  I introduced a supervised learning/predictive modelling technique called partial least squares (PLS) regression; I showed how normal linear least squares regression is often problematic when used with big data because of multicollinearity and overfitting, explained how partial least squares regression overcomes these limitations, and illustrated how to implement it in SAS and JMP.  I also highlighted the variable importance for projection (VIP) score that can be used to conduct variable selection with PLS regression; in particular, I documented its effectiveness as a technique for variable selection by comparing some key journal articles on this issue in academic literature.

The green line is an overfitted classifier.  Not only does it model the underlying trend, but it also models the noise (the random variation) at the boundary.  It separates the blue and the red dots perfectly for this data set, but it will classify very poorly on a new data set from the same population.