## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply the product of their marginal distributions, and they are dependent otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, \ldots, x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, \ldots, X_n$ is an independent and identically distributed sample of random variables with a finite variance, then the distribution of their sample mean is approximately normal for large $n$, and this approximation gets better as the sample size gets bigger.
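
A small simulation can make this distinction concrete.  The sketch below is only an illustration under assumed settings (exponential data, sample sizes of 5 and 500, and a hand-written `skewness` helper, none of which come from the original lesson): the raw data stay strongly right-skewed, while the skewness of the sample means shrinks toward 0 as the sample size grows.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Raw data from a non-normal (exponential) distribution.
# Drawing more of them does not make their distribution normal.
raw_data <- rexp(10000, rate = 1)

# Draw `reps` samples of size n and return the mean of each sample.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_small <- sample_means(5)    # sample means with n = 5
means_large <- sample_means(500)  # sample means with n = 500

# Simple sample skewness: close to 0 for a symmetric distribution
# such as the normal distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw   <- skewness(raw_data)     # stays large
skew_small <- skewness(means_small)  # still noticeably skewed
skew_large <- skewness(means_large)  # close to 0
```

Plotting `hist(raw_data)` next to `hist(means_large)` shows the same contrast visually.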

## Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

#### Introduction

Continuing on the recently born series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the $Beta(2, 5)$ distribution actually integrates to 1 over its support set.  This post follows from my previous post on trapezoidal integration.


#### Conceptual Background of Rectangular Integration (a.k.a. The Midpoint Rule)

Rectangular integration is a numerical integration technique that approximates the integral of a function by using one or more rectangles to approximate the area under the curve.  Here are its features:

• The rectangle’s width is determined by the interval of integration.
• One rectangle could span the width of the interval of integration and approximate the entire integral.
• Alternatively, the interval of integration could be sub-divided into $n$ smaller intervals of equal lengths, and $n$ rectangles would be used to approximate the integral; each smaller rectangle has the width of the smaller interval.
• The rectangle’s height is the function’s value at the midpoint of its base.
• Within a fixed interval of integration, the approximation becomes more accurate as more rectangles are used; each rectangle becomes narrower, and the height of the rectangle better captures the values of the function within that interval.
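
The features above can be sketched in a few lines of R.  This is a minimal illustration, not the function from the full post; the name `rectangular_integration` and its arguments are assumptions made for this example.

```r
# Midpoint-rule approximation of the integral of f over [a, b],
# using n rectangles of equal width.
# (Hypothetical helper written for illustration.)
rectangular_integration <- function(f, a, b, n = 1000) {
  width <- (b - a) / n
  # Midpoint of the base of each rectangle.
  midpoints <- a + width * (seq_len(n) - 0.5)
  # Each rectangle's height is the function's value at its midpoint.
  sum(f(midpoints)) * width
}

# Probability density function of the Beta(2, 5) distribution.
beta_pdf <- function(x) dbeta(x, shape1 = 2, shape2 = 5)

# One rectangle spanning the whole interval gives a crude approximation.
area_1 <- rectangular_integration(beta_pdf, 0, 1, n = 1)

# Many narrow rectangles give an approximation very close to 1.
area_1000 <- rectangular_integration(beta_pdf, 0, 1, n = 1000)
```

With a single rectangle, the height is the density at 0.5, so the approximation is simply `dbeta(0.5, 2, 5)`; increasing `n` moves the result toward 1.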

## Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

#### Introduction

Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  Fully commented and readily usable R code will be provided at the end.

Given a probability density function (PDF) and its support set as vectors in an array programming language like R, how do you integrate the PDF over its support set to confirm that the integral equals 1?  Read the rest of this post to view my own R function to implement trapezoidal integration and learn how to use it to numerically approximate integrals.
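
As a rough preview of the idea, here is a minimal sketch that works directly on two vectors.  It is not the function from the full post; the name `trapezoidal_integration`, its arguments, and the grid spacing of 0.001 are assumptions chosen for illustration.

```r
# Trapezoidal approximation of an integral, given a sorted vector of
# support points x and the function values fx at those points.
# (Hypothetical helper written for illustration.)
trapezoidal_integration <- function(x, fx) {
  stopifnot(length(x) == length(fx), length(x) >= 2)
  n <- length(x)
  widths <- diff(x)                        # width of each sub-interval
  mean_heights <- (fx[-n] + fx[-1]) / 2    # average of the 2 end heights
  sum(widths * mean_heights)
}

# Support set and density values of the Beta(2, 5) distribution as vectors.
support <- seq(0, 1, by = 0.001)
beta_density <- dbeta(support, shape1 = 2, shape2 = 5)

beta_area <- trapezoidal_integration(support, beta_density)
```

Here `beta_area` should be very close to 1; a coarser grid gives a less accurate answer.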

## Checking for Normality with Quantile Ranges and the Standard Deviation

#### Introduction

I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before.  This turns out to be a good way to check for normality in a data set.

In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), talk about the method that I learned from Trosset’s book, then build upon this method by possibly coming up with a new way to check for normality.  I have not fully established this idea, so I welcome your thoughts and ideas.
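
This excerpt does not spell out the fact itself, so the sketch below rests on an assumption about what is meant: the standard result that, for any normal distribution, the interquartile range divided by the standard deviation equals $2 \Phi^{-1}(0.75) \approx 1.349$.  The helper name `iqr_sd_ratio` and the simulated data are hypothetical and only illustrate how such a check could look.

```r
set.seed(2)  # arbitrary seed, used only for reproducibility

# Theoretical IQR / SD ratio for any normal distribution (about 1.349).
normal_ratio <- 2 * qnorm(0.75)

# Sample version of the same ratio (hypothetical helper).
iqr_sd_ratio <- function(x) IQR(x) / sd(x)

# For normal data, the sample ratio should be close to normal_ratio.
ratio_normal <- iqr_sd_ratio(rnorm(10000, mean = 10, sd = 3))

# For exponential data, the ratio is noticeably smaller (about 1.1).
ratio_exp <- iqr_sd_ratio(rexp(10000, rate = 1))
```

A ratio far from 1.349 suggests non-normality, although a ratio near 1.349 does not by itself prove that the data are normal.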

## Some Subtle and Nuanced Concepts about Simple Linear Regression

#### Introduction

This blog post will focus on some conceptual foundations of simple linear regression, a very common technique in statistics and a precursor for understanding multiple linear regression.  I will expose and clarify many nuances and subtleties that I did not fully absorb until my Master’s degree in statistics at the University of Toronto.

#### What is Simple Linear Regression?

Simple linear regression is a predictive model that uses a predictor variable (x) to predict a continuous target variable (Y).  It is a formal and rigorous way to express 2 fundamental components of a statistical predictive model.

1) For each value of x, there is a probability distribution of Y.

2) The means of these probability distributions of Y vary with x in a systematic way.

Mathematically, the first component is reflected in a random error variable, and the second component is reflected in the constant that expresses the linear relationship between x and Y.  These two components add together to give the following mathematical model.

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \ \ \ i = 1,...,n$

$\varepsilon_i \sim Normal(0, \sigma^2)$

$\varepsilon_i \perp \varepsilon_j, \ \ \ \ \ i \neq j$

The last mathematical expression states that two different error terms are statistically independent.

Essentially, this model captures the tendency for Y to vary systematically with x.  The systematic part is the constant term, $\beta_0 + \beta_1 x_i$.  The tendency (rather than a direct relation) is reflected in the probability distribution of the error component.

Note that I capitalized the target Y because it is a random variable.  (It is a linear function of the random error, so it is also a random variable.)  I used lower-case for the predictor because it is a constant in the model.
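
To see the two components side by side, the model can be simulated directly.  This is a minimal sketch; the parameter values, the sample size, and the grid of predictor values are all hypothetical choices made for illustration.

```r
set.seed(3)  # arbitrary seed, used only for reproducibility

n      <- 200
beta_0 <- 1.5  # hypothetical intercept
beta_1 <- 0.8  # hypothetical slope
sigma  <- 2    # hypothetical standard deviation of the errors

# The predictor is a fixed constant with no random variation.
x <- seq(0, 10, length.out = n)

# Systematic part: the mean of Y at each value of x.
systematic <- beta_0 + beta_1 * x

# Random part: independent Normal(0, sigma^2) errors.
epsilon <- rnorm(n, mean = 0, sd = sigma)

# The target is the sum of the systematic part and the random error.
Y <- systematic + epsilon
```

Calling `plot(x, Y)` and then `lines(x, systematic)` shows the data scattered around the line of means.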

#### What are the Assumptions of Simple Linear Regression?

1) The predictor variable is a fixed constant with no random variation.  If you want to model the predictor as a random variable, use the errors-in-variables model (a.k.a. measurement errors model).

2) The target variable is a linear combination of the regression coefficients and the predictor.

3) The variance of the random error component is constant.  This assumption is called homoscedasticity.

4) The random errors are independent of each other.

5) The regression coefficients are constants.  If you want to model the regression coefficients as random variables, use the random effects model.  If you want to include both fixed and random coefficients in your model, use the mixed effects model.  The documentation for PROC MIXED in SAS/STAT has a nice explanation of the mixed effects model.  I also recommend the documentation for PROC GLM for more about the random effects model.

***6) The random errors are normally distributed with an expected value of 0 and a variance of $\sigma^2$.  As Assumption #3 states, this variance is constant for all $\varepsilon_i, \ i = 1,...,n$.

***This last assumption is not needed for the least-squares estimation of the regression coefficients.  However, it is needed for conducting statistical inference for the regression coefficients, such as testing hypotheses and constructing confidence intervals.

#### Important Clarifications about the Terminology

Let me clarify some common confusion about the 2 key terms in the name “simple linear regression”.

- It is called “simple” because it uses only one predictor, whereas multiple linear regression uses multiple predictors.  While it is relatively simple to understand, and while it is a simple model compared to other predictive models, there are many concepts and nuances behind linear regression that still make it difficult to understand for many people.  (I hope that this blog post will make it easier to understand this model!)

- It is called “linear” because the target variable is linear with respect to the parameters $\beta_0$ and $\beta_1$ (the regression coefficients), not because it is linear with respect to the predictor; this is a very common misunderstanding, and I did not learn this until the second course in which I learned about linear regression.  This is more than just a naming custom; it implies that the regression coefficients can be estimated using linear algebra, which has many benefits that will be described in a later post.

Simple linear regression does assume that the target variable has a linear relationship with the predictor variable.  However, if it doesn’t, this can often be resolved: the predictor and/or the target can often be transformed to make the relationship linear.  If, however, the target variable cannot be written as a linear combination of the parameters $\beta_0$ and $\beta_1$, then the model is no longer linear regression, even if the target is linear with respect to the predictor.
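
As a small illustration of the transformation idea, the sketch below simulates a target whose mean is linear in $\log(x)$ rather than in $x$, then fits it with `lm()`.  The true coefficient values (2 and 3), the noise level, and the choice of a log transformation are assumptions made for this example.

```r
set.seed(4)  # arbitrary seed, used only for reproducibility

x <- seq(1, 10, length.out = 100)

# The mean of Y is curved in x, but linear in the parameters 2 and 3.
Y <- 2 + 3 * log(x) + rnorm(100, mean = 0, sd = 0.2)

# Transforming the predictor makes the relationship linear,
# so this is still a simple linear regression.
fit <- lm(Y ~ log(x))
estimates <- coef(fit)  # should be close to 2 (intercept) and 3 (slope)
```

By contrast, a mean function such as $\beta_0 e^{\beta_1 x}$ is not a linear combination of $\beta_0$ and $\beta_1$, so fitting it directly would not be linear regression.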

#### How are the Regression Coefficients Estimated?

The regression coefficients are estimated by finding the values of $\beta_0$ and $\beta_1$ that minimize the sum of the squares of the deviations of the data from the regression line.  This is called the method of least squares.  My first linear regression textbook, “Applied Linear Statistical Models” by Kutner, Nachtsheim, Neter, and Li, uses the letter “Q” to denote this sum of squares.  The word “minimize” should trigger the idea of finding the global minimizers using differential calculus.

$Q = \sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2$

Differentiate Q with respect to $\beta_0$ and $\beta_1$; set the 2 derivatives to zero to get the normal equations.  The estimates are obtained by solving this system of 2 equations.
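
Writing $b_0$ and $b_1$ for the solutions (my notation for this illustration), the two normal equations are $\sum_{i=1}^n (y_i - b_0 - b_1 x_i) = 0$ and $\sum_{i=1}^n x_i (y_i - b_0 - b_1 x_i) = 0$.  Solving them gives $b_1 = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^n (x_i - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$.  The sketch below checks these formulas against `lm()` on simulated data; the data-generating values are hypothetical.

```r
set.seed(5)  # arbitrary seed, used only for reproducibility

# Simulated data with hypothetical true coefficients 4 and -1.2.
x <- runif(50, min = 0, max = 10)
y <- 4 - 1.2 * x + rnorm(50)

# Closed-form solutions of the 2 normal equations.
b_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b_0 <- mean(y) - b_1 * mean(x)

# The same estimates from R's built-in least-squares fit.
lm_coefficients <- unname(coef(lm(y ~ x)))

# Both normal equations should hold (up to rounding error).
residuals_ls <- y - b_0 - b_1 * x
normal_eq_1 <- sum(residuals_ls)
normal_eq_2 <- sum(x * residuals_ls)
```

Both `normal_eq_1` and `normal_eq_2` should be zero up to floating-point error, and `c(b_0, b_1)` should agree with `lm_coefficients`.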

#### Why is the Least-Squares Method Used to Estimate the Regression Coefficients?

A natural question arises: Why minimize the sum of the squares of the errors?  Why not minimize some other measure of the distances from the regression line to the data, like the sum of the absolute values of the errors?

$Q' = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|$

The answer lies within the Gauss-Markov theorem, which guarantees some very attractive properties for the least-squares estimators of the regression coefficients:

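- these estimators are unbiased

- among all unbiased estimators that are linear functions of the observed responses, these estimators have the minimum variance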

Thus, the least-squares estimators are both accurate and very precise.

Note that the Gauss-Markov theorem does not require the normality part of Assumption #6 above; it needs only that the errors have an expected value of zero, a constant variance of $\sigma^2$, and no correlation with each other.
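
A quick simulation can illustrate the unbiasedness part of this claim with errors that are not normal.  This is only a sketch under assumed settings (uniform errors on $[-3, 3]$, hypothetical true coefficients of 1 and 2, and 2000 replications); it does not demonstrate the minimum-variance property.

```r
set.seed(6)  # arbitrary seed, used only for reproducibility

x <- seq(0, 10, length.out = 30)  # fixed predictor values
beta_0 <- 1                       # hypothetical true intercept
beta_1 <- 2                       # hypothetical true slope

# Simulate one data set with non-normal errors and return the
# least-squares estimates of the 2 coefficients.
one_fit <- function() {
  # Uniform errors: expected value 0 and constant variance, but not normal.
  errors <- runif(length(x), min = -3, max = 3)
  y <- beta_0 + beta_1 * x + errors
  coef(lm(y ~ x))
}

# Each column holds the estimates from one simulated data set.
estimates <- replicate(2000, one_fit())

# Averages of the estimates across the simulated data sets;
# these should be close to the true values 1 and 2.
mean_estimates <- rowMeans(estimates)
```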