## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as $y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, .., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a $p\text{th}$-order polynomial relationship between the response variable $Y$ and the explanatory variable $x$. $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon$

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).

## Applied Statistics Lesson of the Day – What “Linear” in Linear Regression Really Means

Linear regression is one of the most commonly used tools in statistics, yet one of its fundamental features is commonly misunderstood by many non-statisticians.  I have witnessed this misunderstanding on numerous occasions in my work experience, statistical consulting and statistical education, and it is important for all statisticians to be aware of this common misunderstanding, to anticipate it when someone is about to make this mistake, and to educate that person about the correct meaning.

Consider the simple linear regression model: $Y = \beta_0 + \beta_1x + \varepsilon$.

The “linear” in linear regression refers to the linearity between the response variable ( $Y$) and the regression coefficients ( $\beta_0$ and $\beta_1$).  It DOES NOT refer to the linearity between the response variable ( $Y$) and the explanatory variable ( $x$) This is contrary to mathematical descriptions of linear relationships; for example, when high school students learn about the equation of a line, $y = mx + b$

the relationship is called “linear” because of the linearity between $y$ and $x$.  This is the source of the mistaken understanding about the meaning of “linear” in linear regression; I am grateful that my applied statistics professor, Dr. Boxin Tang, emphasized the statistical meaning of “linear” when he taught linear regression to me.

Why is this difference in terminology important?  A casual observer may be puzzled by this apparent nit-picking of the semantics.  This terminology is important because the estimation of the regression coefficients in a regression model depends on the relationship between the response variable and the regression coefficients.  If this relationship is linear, then the estimation is very simple and can be done analytically by linear algebra.  If not, then the estimation can be very difficult and often cannot be done analytically – numerical methods must be used, instead.

Now, one of the assumptions of linear regression is the linearity between the response variable ( $Y$) and the explanatory variable ( $x$).  However, what if the scatter plot of $Y$ versus $x$ reveals a non-linear relationship, such as a quadratic relationship?  In that case, the solution is simple – just replace $x$ with $x^2$.  (Admittedly, if the interpretation of the regression coefficient is important, then such interpretation becomes more difficult with this transformation.  However, if prediction of the response is the key goal, then such interpretation is not necessary, and this is not a problem.)  The important point is that the estimation of the regression coefficients can still be done by linear algebra after the transformation of the explanatory variable.

## Applied Statistics Lesson of the Day – Basic Terminology in Experimental Design #1

The word “experiment” can mean many different things in various contexts.  In science and statistics, it has a very particular and subtle definition, one that is not immediately familiar to many people who work outside of the field of experimental design. This is the first of a series of blog posts to clarify what an experiment is, how it is conducted, and why it is so central to science and statistics.

Experiment: A procedure to determine the causal relationship between 2 variables – an explanatory variable and a response variable.  The value of the explanatory variable is changed, and the value of the response variable is observed for each value of the explantory variable.

• An experiment can have 2 or more explanatory variables and 2 or more response variables.
• In my experience, I find that most experiments have 1 response variable, but many experiments have 2 or more explanatory variables.  The interactions between the multiple explanatory variables are often of interest.
• All other variables are held constant in this process to avoid confounding.

Explanatory Variable or Factor: The variable whose values are set by the experimenter.  This variable is the cause in the hypothesis.  (*Many people call this the independent variable.  I discourage this usage, because “independent” means something very different in statistics.)

Response Variable: The variable whose values are observed by the experimenter as the explanatory variable’s value is changed.  This variable is the effect in the hypothesis.  (*Many people call this the dependent variable.  Further to my previous point about “independent variables”, dependence means something very different in statistics, and I discourage using this usage.)

Factor Level: Each possible value of the factor (explanatory variable).  A factor must have at least 2 levels.

Treatment: Each possible combination of factor levels.

• If the experiment has only 1 explanatory variable, then each treatment is simply each factor level.
• If the experiment has 2 explanatory variables, X and Y, then each treatment is a combination of 1 factor level from X and 1 factor level from Y.  Such combining of factor levels generalizes to experiments with more than 2 explanatory variables.

Experimental Unit: The object on which a treatment is applied.  This can be anything – person, group of people, animal, plant, chemical, guitar, baseball, etc.