## Applied Statistics Lesson of the Day – What “Linear” in Linear Regression Really Means

Linear regression is one of the most commonly used tools in statistics, yet one of its fundamental features is often misunderstood by non-statisticians.  I have witnessed this misunderstanding on numerous occasions in my work experience, statistical consulting and statistical education.  It is important for all statisticians to be aware of it, to anticipate it when someone is about to make this mistake, and to educate that person about the correct meaning.

Consider the simple linear regression model:

$Y = \beta_0 + \beta_1x + \varepsilon$.

The “linear” in linear regression refers to the linearity between the response variable ($Y$) and the regression coefficients ($\beta_0$ and $\beta_1$).  It DOES NOT refer to the linearity between the response variable ($Y$) and the explanatory variable ($x$).  This is contrary to the usual mathematical description of a linear relationship; for example, when high school students learn about the equation of a line,

$y = mx + b$

the relationship is called “linear” because of the linearity between $y$ and $x$.  This is the source of the mistaken understanding about the meaning of “linear” in linear regression; I am grateful that my applied statistics professor, Dr. Boxin Tang, emphasized the statistical meaning of “linear” when he taught linear regression to me.
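
To make the distinction concrete, consider two illustrative models, chosen here only as examples.  The first is linear in the statistical sense, even though $Y$ is a curved function of $x$, because each coefficient simply multiplies a known quantity:

$Y = \beta_0 + \beta_1x^2 + \varepsilon$

The second is not linear in the statistical sense, because $\beta_1$ sits inside an exponent:

$Y = \beta_0 e^{\beta_1x} + \varepsilon$

One quick check is to differentiate the mean of $Y$ with respect to each coefficient.  In the first model, the derivatives are $1$ and $x^2$, which do not involve the coefficients.  In the second, they are $e^{\beta_1x}$ and $\beta_0 x e^{\beta_1x}$, which do.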

Why is this difference in terminology important?  A casual observer may be puzzled by this apparent nit-picking of the semantics.  This terminology is important because the estimation of the regression coefficients in a regression model depends on the relationship between the response variable and the regression coefficients.  If this relationship is linear, then the least-squares estimates have a closed-form solution that can be computed analytically by linear algebra.  If not, then the estimation can be very difficult and often cannot be done analytically; iterative numerical methods must be used instead.
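
As a rough illustration of what "done analytically by linear algebra" can look like, here is a minimal sketch in Python with NumPy.  It assumes ordinary least squares as the estimation method and solves the normal equations $(X^TX)\hat{\beta} = X^TY$ directly.  The function name `fit_ols` and the demo data are hypothetical and chosen only for this example.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares estimates (beta_0, beta_1) for Y = beta_0 + beta_1 * x + error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for the intercept, then the explanatory variable.
    X = np.column_stack([np.ones_like(x), x])
    # Solve the normal equations (X'X) beta = X'y in a single linear-algebra step.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data generated from y = 1 + 2x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 1.0 + 2.0 * x_demo
beta_hat = fit_ols(x_demo, y_demo)
print(beta_hat)
```

No iteration or starting values are needed; the estimates come from solving one linear system.  In practice, a routine such as `np.linalg.lstsq` is usually preferred over forming $X^TX$ explicitly, for numerical stability.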

Now, the simple linear regression model above does assume that the mean of the response variable ($Y$) is a linear function of the explanatory variable ($x$).  However, what if the scatter plot of $Y$ versus $x$ reveals a non-linear relationship, such as a quadratic relationship?  In that case, the solution is simple: just replace $x$ with $x^2$ (or, if a linear term is also needed, add $x^2$ as a second explanatory variable).  The resulting model, such as $Y = \beta_0 + \beta_1x^2 + \varepsilon$, is no longer linear in $x$, but it is still linear in $\beta_0$ and $\beta_1$, so it is still a linear regression model.  (Admittedly, if the interpretation of the regression coefficient is important, then such interpretation becomes more difficult with this transformation.  However, if prediction of the response is the key goal, then such interpretation is not necessary, and this is not a problem.)  The important point is that the estimation of the regression coefficients can still be done by linear algebra after the transformation of the explanatory variable.
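
The following sketch shows the transformation in practice, again in Python with NumPy and again assuming ordinary least squares.  The explanatory variable is squared first, and the same linear least-squares machinery is then applied to the transformed variable.  The function name `fit_quadratic_term` and the demo data are hypothetical.

```python
import numpy as np

def fit_quadratic_term(x, y):
    """Least-squares estimates (beta_0, beta_1) for Y = beta_0 + beta_1 * x**2 + error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Transform the explanatory variable; the model remains linear in the coefficients.
    z = x ** 2
    X = np.column_stack([np.ones_like(z), z])
    # Ordinary linear least squares on the transformed design matrix.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Hypothetical noise-free data generated from y = 2 + 0.5 * x^2.
x_quad = np.linspace(-3.0, 3.0, 13)
y_quad = 2.0 + 0.5 * x_quad ** 2
beta_quad = fit_quadratic_term(x_quad, y_quad)
print(beta_quad)
```

The only change from a straight-line fit is the column that goes into the design matrix.  Adding a further column for $x$ itself would give the full quadratic model $Y = \beta_0 + \beta_1x + \beta_2x^2 + \varepsilon$, which is also linear in its coefficients.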