## Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

• its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

• its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

$\text{Sensitivity}$

on the vertical axis and

$1 - \text{Specificity}$

on the horizontal axis, the line of no discrimination is the line that passes through the points

$(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$

and

$(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.

In other words, the line of discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we did not have any predictors, then we can rely on only random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard of a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher“, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)

## Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves

A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the $\text{Sensitivity}$ (the true positive rate) versus $1 - \text{Specificity}$ (1 minus the true negative rate) of a binary classifier while varying its discrimination threshold.  In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy of the logistic regression model.

To illustrate with an example, let’s consider the Bernoulli response variable $Y$ and the covariates $X_1, X_2, ..., X_p$.  A logistic regression model takes the covariates as inputs and returns $P(Y = 1)$.  You as the user of the model must decide above which value of $P(Y = 1)$ you will predict that $Y = 1$; this value is the discrimination threshold.  A common threshold is $P(Y = 1) = 0.5$.

Once you finish fitting the model with a training set, you can construct an ROC curve by following these steps below:

1. Set a discrimination threshold.
2. Use the covariates to predict $Y$ for each observation in a validation set.
3. Since you have the actual response values in the validation set, you can then calculate the sensitivity and specificity for your logistic regression model at that threshold.
4. Repeat Steps 1-3 with a new threshold.
5. Plot the values of $\text{Sensitivity}$ versus $1 - \text{Specificity}$ for all thresholds.  The result is your ROC curve.

The use of a validation set to assess the predictive accuracy of a model is called validation, and it is a good practice for supervised learning.  If you have another fresh data set, it is also good practice to use that as a test set to assess the predictive accuracy of your model.

Note that you can perform Steps 2-5 for the training set, too – this is often done in statistics when you don’t have many data to work with, and the best that you can do is to assess the predictive accuracy of your model on the data set that you used to fit the model.

## Machine Learning and Applied Statistics Lesson of the Day – Sensitivity and Specificity

To evaluate the predictive accuracy of a binary classifier, two useful (but imperfect) criteria are sensitivity and specificity.

Sensitivity is the proportion of truly positives cases that were classified as positive; thus, it is a measure of how well your classifier identifies positive cases.  It is also known as the true positive rate.  Formally,

$\text{Sensitivity} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives + Number of False Negatives)}$

Specificity is the proportion of truly negative cases that were classified as negative; thus, it is a measure of how well your classifier identifies negative cases.  It is also known as the true negative rate.  Formally,

$\text{Specificity} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives + Number of False Positives)}$

## Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

$Y = \Phi \beta + \varepsilon$,

where $\Phi_{ij} = \phi_j (x_i)$.  In other words, $\Phi$ is the design matrix, and the element in row $i$ and column $j$ of this design matrix is the $i\text{th}$ predictor being evaluated in the $j\text{th}$ basis function.  (In this case, there is 1 predictor per datum.)

Applying the method of least squares, the coefficient vector, $\beta$, can be estimated by

$\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y$.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not $X$, but $\Phi$.

If you are not familiar with how $\hat{\beta}$ was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.

## Machine Learning Lesson of the Day – Linear Gaussian Basis Function Models

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target.  A common type of basis function for such models is the Gaussian basis function.  This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as the basis function.

$\phi_j(x) = exp[-(x - \mu_j)^2 \div 2\sigma^2]$

The $\sigma$ in this basis function determines the spacing between the different basis functions that combine to form the model.

Notice that this is just the normal PDF without the scaling factor of $1/\sqrt{2\pi \sigma^2}$; the scaling factor ensures that the normal PDF integrates to 1 over its support set.  In a linear basis function model, the regression coefficients are the weights for the basis functions, and these weights will scale Gaussian basis functions to fit the data that are local to $\mu_j$.  Thus, there is no need to include that scaling factor of $1/\sqrt{2\pi \sigma^2}$, because the scaling is already being handled by the regression coefficients.

The Gaussian basis function model is useful because

• it can model many non-linear relationships between the predictor and the target surprisingly well,
• each basis function is non-zero over a very small interval and is zero everywhere else.  These local basis functions result in a very sparse design matrix (i.e. one with mostly zeros) that leads to much faster computation.

## Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.

## Machine Learning Lesson of the Day – Introduction to Linear Basis Function Models

Given a supervised learning problem of using $p$ inputs ($x_1, x_2, ..., x_p$) to predict a continuous target $Y$, the simplest model to use would be linear regression.  However, what if we know that the relationship between the inputs and the target is non-linear, but we are unsure of exactly what form this relationship has?

One way to overcome this problem is to use linear basis function models.  These models assume that the target is a linear combination of a set of $p+1$ basis functions.

$Y_i = w_0 + w_1 \phi_1(x_1) + w_2 \phi_2(x_2) + ... + w_p \phi_p(x_p)$

This is a generalization of linear regression that essentially replaces each input with a function of the input.  (A linear basis function model that uses the identity function is just linear regression.)

The type of basis functions (i.e. the type of function given by $\phi$) is chosen to suitably model the non-linearity in the relationship between the inputs and the target.  It also needs to be chosen so that the computation is efficient.  I will discuss variations of linear basis function models in a later Machine Learning Lesson of the Day.

## Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it uses less assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.

## Machine Learning Lesson of the Day – K-Nearest Neighbours Regression

I recently introduced the K-nearest neighbours classifier.  Some slight adjustments to the same algorithm can make it into a regression technique.

Given a training set and a new input $X$, we can predict the target of the new input by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$ by Euclidean distance
2. build a linear regression model to predict the target for $X$
• the K data are the predictors
• the reciprocals of the predictors’ distances to $X$ are their respective regression coefficients (the “weights”)

Validation or cross-validation can be used to determine the best number of “K”.

## Machine Learning Lesson of the Day: The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input $X$ by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$
2. counting the number of “neighbours” that belong to each class of the target variable
3. classifying $X$ by the most common class to which its neighbours belong

K is usually an odd number to avoid resolving ties.

The proximity of the neighbours to $X$ is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best number of “K”.

## Machine Learning Lesson of the Day – Overfitting

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set.  That underlying trend cannot be precisely captured because of the random variation in the data around that trend.  A model must have enough complexity to capture that trend, but not too much complexity to capture the random variation.  An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

• A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target.  If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation underlying the trend in the target has been captured already.  Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise.  Thus, the new model with these added irrelevant predictors describes the trend and the noise.  It predicts the targets in the training set extremely well, but very poorly for targets in any new, fresh data set – the model captures the noise that is unique to the training set.

(This above explanation used a parametric model for illustration, but overfitting can also occur for non-parametric models.)

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the predictors would result in vastly different predictions for the targets.

Overfitting differs from multicollinearity, which I will explain in later post.  Overfitting has irrelevant predictors, whereas multicollinearity has redundant predictors.

## Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand.  These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations.  This implies that a model that explains a certain situation well may fail in another situation.  In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem.  The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.

## Machine Learning Lesson of the Day – Cross-Validation

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well.  However, what if the data set is not very large, and the small amount of data for training results in high sampling error?  A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

1. Divide the data set into K partitions.
2. Designate the first partition as the validation set and designate the other partitions as the training set.
3. Use training set to train the algorithm.
4. Use the validation set to assess the predictive accuracy of the algorithm; the common measure of predictive accuracy is mean squared error.
5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition.  (Essentially, rotate the designation of validation set through every partition.)
6. Calculate the average of the mean squared error from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model.  Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.

If there are N data in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).

There some trade-offs between a large and a small K.  The estimator for the prediction error from a larger K results in

• less bias because of more data being used for training
• higher variance because of the higher similarity and lower diversity between the training sets
• slower computation because of more data being used for training

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Page 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

## Machine Learning Lesson of the Day – Parametric vs. Non-Parametric Models

A machine learning algorithm can be classified as either parametric or non-parametric.

A parametric algorithm has a fixed number of parameters.  A parametric algorithm is computationally faster, but makes stronger assumptions about the data; the algorithm may work well if the assumptions turn out to be correct, but it may perform badly if the assumptions are wrong.  A common example of a parametric algorithm is linear regression.

In contrast, a non-parametric algorithm uses a flexible number of parameters, and the number of parameters often grows as it learns from more data.  A non-parametric algorithm is computationally slower, but makes fewer assumptions about the data.  A common example of a non-parametric algorithm is K-nearest neighbour.

To summarize, the trade-offs between parametric and non-parametric algorithms are in computational cost and accuracy.

## Machine Learning Lesson of the Day – Babies and Non-Statisticians Practice Unsupervised Learning All the Time!

My recent lesson on unsupervised learning may make it seem like a rather esoteric field, with attempts to categorize it using words like “clustering“, “density estimation“, or “dimensionality reduction“.  However, unsupervised learning is actually how we as human beings often learn about the world that we live in – whether you are a baby learning what to eat or someone reading this blog.

• Babies use their mouths and their sense of taste to explore the world, and they can probably determine what satisfies their hunger and what doesn’t pretty quickly.  As they expose themselves to different objects – a formula bottle, a pacifier, a mother’s breast, their own fingers – their taste and digestive system are recognizing these inputs and detecting patterns of what satisfies their hunger and what doesn’t.  This all happens before they even fully understand what “food” or “hunger” means.  This will probably happen before someone says “This is food” to them and they have the language capacity to know what those 3 words mean.
• When a baby finall realizes what hunger feels like and develops the initiative to find something to eat, then that becomes a supervised learning problem: What attributes about an object will help me to determine if it’s food or not?
• I recent wrote a page called “About this Blog” to categorize the different types of posts that I have written on this blog so far.  I did not aim to predict anything about any blog post; I simply wanted to organize the 50-plus blog posts into a few categories and make it easier for you to find them.  I ultimately clustered my blog posts into 4 mutually exclusive categories (now with some overlaps).  You can think of each blog post as a vector-valued input, and I chose 2 elements – the length and the topic – of each vector to find a way to group them into classes that are very similar in length and topic within each class and very different in length and topic between the classes.  (I used those 2 elements – or features – to maximize the similarities within each category and minimized the dissimilarities between the 4 categories.)  There were other features that I could have used – whether it had an image (binary feature), the number of colours of the fonts (integer-valued feature), the time of publication of the post (continuous feature) – but length and topic were sufficient for me to arrive at the 4 categories of “Tutorials”, “Lessons”, “Advice”, and “Notifications about Presentations and Appearances at Upcoming Events”.

## Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning

Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy.  Validation is a good method for assessing a model’s predictive accuracy.

Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy.  Specifically,

1. split your data set into 2 sets: a training set and a validation set
2. use the training set to fit your model (e.g. LASSO regression)
3. use the predictors in the validation set to predict the targets
4. use some error measure (e.g mean squared error) to assess the differences between the predicted targets and the actual targets.

A good rule of thumb is to use 70% of your data for training and 30% of your data for your validation.

You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression).  The model with the lowest mean squared error can be judged as the best model.

I highly encourage you to test your models on a separate data set – called a test set – from the same population or probability distribution and assess their predictive accuracies on the test set.  This is a good way to check for any overfitting in your models.

## Machine Learning Lesson of the Day: Clustering, Density Estimation and Dimensionality Reduction

I struggle to categorize unsupervised learning.  It is not an easily defined field, and it is also hard to find generalizations of techniques that are exhaustive and mutually exclusive.

Nonetheless, here are some categories of unsupervised learning that cover many of its commonly used techniques.  I learned this categorization from Mathematical Monk, who posted a great set of videos on machine learning on Youtube.

• Clustering: Categorize the observed variables $X_1, X_2, ..., X_p$ into groups that maximize some similarity criterion, or, equivalently, minimize some dissimilarity criterion.
• Density Estimation: Use statistical models to find an underlying probability distribution that gives rise to the observed variables.
• Dimensionality Reduction: Find a smaller set of variables that captures the essential variations or patterns of the observed variables.  This smaller set of variables may be just a subset of the observed variables, or it may be a set of new variables that better capture the underlying variation of the observed variables.

Are there any other categories that you can think of?  How would you categorize hidden Markov models?  Your input is welcomed and appreciated in the comments!

## Machine Learning Lesson of the Day – Supervised Learning: Classification and Regression

Supervised learning has 2 categories:

• In classification, the target variable is categorical.
• In regression, the target variable is continuous.

Thus, regression in statistics is different from regression in supervised learning.

In statistics,

• regression is used to model relationships between predictors and targets, and the targets could be continuous or categorical.
• a regression model usually includes 2 components to describe such relationships:
• a systematic component
• a random component.  The random component of this relationship is mathematically described by some probability distribution.
• most regression models in statistics also have assumptions about the between the predictors and/or between the observations.
• many statistical models also aim to provide interpretable relationships between the predictors and targets.
• For example, in simple linear regression, the slope parameter, $\beta_1$, predicts the change in the target, $Y$, for every unit increase in the predictor, $X$.

In supervised learning,

• target variables in regression must be continuous
• categorical target variables are modelled in classification
• regression has less or even no emphasis on using probability to describe the random variation between the predictor and the target
• Random forests are powerful tools for both classification and regression, but they do not use probability to describe the relationship between the predictors and the target.
• regression has less or even no emphasis on providing interpretable relationships between the predictors and targets.
• Neural networks are powerful tools for both classification and regression, but they do not provide interpretable relationships between the predictors and the target.

***The last 2 points are applicable to classification, too.

In general, supervised learning puts much more emphasis on accurate prediction than statistics.

Since regression in supervised learning includes only continuous targets, this results in some confusing terminology between the 2 fields.  For example, logistic regression is a commonly used technique in both statistics and supervised learning.  However, despite its name, it is a classification technique in supervised learning, because the response variable in logistic regression is categorical.

## Machine Learning Lesson of the Day – Supervised and Unsupervised Learning

The 2 most commonly used and studied categories of machine learning are supervised learning and unsupervised learning.

• In supervised learning, there is a target variable, $Y$, and a set of predictor variables, $X_1, X_2, ..., X_p$.  The goal is to use $X_1, X_2, ..., X_p$ to predict $Y$.  Supervised learning is synonymous with predictive modelling, but the latter term does not connote with learning from data to improve performance in future prediction.  Nonetheless, when I explain supervised learning to people who have some background in statistics or analytics, they usually understand what I mean when I tell them that it is just predictive modelling.
• In unsupervised learning, there are only predictor variables and no target variable.  The goal is to find interesting patterns in $X_1, X_2, ..., X_p$.  This is a much less concretely defined problem than supervised learning.  Unsupervised learning is sometimes called pattern discovery, pattern recognition, or knowledge discovery, though these are not commonly agreed upon synonyms.