## Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a $p\text{th}$-order polynomial relationship between the response variable $Y$ and the explanatory variable $x$.

$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon$

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).

## Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

$Y = \Phi \beta + \varepsilon$,

where $\Phi_{ij} = \phi_j (x_i)$.  In other words, $\Phi$ is the design matrix, and the element in row $i$ and column $j$ of this design matrix is the $i\text{th}$ predictor being evaluated in the $j\text{th}$ basis function.  (In this case, there is 1 predictor per datum.)

Applying the method of least squares, the coefficient vector, $\beta$, can be estimated by

$\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y$.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not $X$, but $\Phi$.

If you are not familiar with how $\hat{\beta}$ was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.

## Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand.  These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations.  This implies that a model that explains a certain situation well may fail in another situation.  In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem.  The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.

## Presentation Slides – Overcoming Multicollinearity and Overfitting with Partial Least Squares Regression in JMP and SAS

My slides on partial least squares regression at the Toronto Area SAS Society (TASS) meeting on September 14, 2012, can be found here.

#### My Presentation on Partial Least Squares Regression

My first presentation to Toronto Area SAS Society (TASS) was delivered on September 14, 2012.  I introduced a supervised learning/predictive modelling technique called partial least squares (PLS) regression; I showed how normal linear least squares regression is often problematic when used with big data because of multicollinearity and overfitting, explained how partial least squares regression overcomes these limitations, and illustrated how to implement it in SAS and JMP.  I also highlighted the variable importance for projection (VIP) score that can be used to conduct variable selection with PLS regression; in particular, I documented its effectiveness as a technique for variable selection by comparing some key journal articles on this issue in academic literature.

The green line is an overfitted classifier.  Not only does it model the underlying trend, but it also models the noise (the random variation) at the boundary.  It separates the blue and the red dots perfectly for this data set, but it will classify very poorly on a new data set from the same population.