Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function f(x) that is continuous on the interval [x_1, x_2], where x_1 and x_2 are any 2 points in the domain of f(x).  Let

x_m = 0.5x_1 + 0.5x_2

be the midpoint of x_1 and x_2.  If

f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),

then f(x) is defined to be midpoint convex.

More generally, let’s consider any point within the interval [x_1, x_2].  We can denote this arbitrary point as

x_\lambda = \lambda x_1 + (1 - \lambda)x_2, where 0 < \lambda < 1.

If

f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be convex.  If

f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be strictly convex.  (For strict convexity, x_1 and x_2 must be distinct points.)

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable Y with a finite expected value and for any convex function g(y),

E[g(Y)] \geq g[E(Y)].

A function f(x) is defined to be concave if -f(x) is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable Z with a finite expected value and for any concave function h(z),

E[h(Z)] \leq h[E(Z)].
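
Here is a minimal R sketch (base R only; the choice of g(y) = y^2 and an exponential distribution for Y is mine, purely for illustration) that approximates both sides of Jensen’s inequality by simulation.

# Simulate Jensen's inequality for the convex function g(y) = y^2
# and Y ~ Exponential(rate = 1), so that E(Y) = 1 and E[g(Y)] = 2.
set.seed(1)
y <- rexp(1e6, rate = 1)
g <- function(y) y^2

mean(g(y))                  # approximates E[g(Y)]; close to 2
g(mean(y))                  # approximates g[E(Y)]; close to 1
mean(g(y)) >= g(mean(y))    # TRUE, as Jensen's inequality guarantees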

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced the conceptual foundations of empirical cumulative distribution functions (empirical CDFs) and showed how to plot them in R.

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of independent and identically distributed random variables, X_1, X_2, ..., X_n, with common CDF F_X(x), let \hat{F}_n(x) denote the empirical CDF based on the first n observations.  Then

P[\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.

In other words, the empirical CDF of X_1, X_2, ..., X_n converges uniformly to the true CDF with probability 1.
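
Here is a minimal R sketch (base R only; the standard normal distribution is an arbitrary choice) that illustrates the theorem: the largest vertical distance between the empirical CDF and the true CDF shrinks as the sample size n grows.

# Compute the supremum distance between the empirical CDF of a standard
# normal sample and the true CDF; it shrinks as the sample size grows.
set.seed(1)
sup.distance <- function(n) {
  x <- sort(rnorm(n))
  empirical.cdf <- ecdf(x)
  # The supremum of |F_n - F| is attained at the jump points of F_n,
  # so compare the true CDF to the ECDF just before and at each jump.
  max(abs(empirical.cdf(x) - pnorm(x)),
      abs((seq_along(x) - 1) / n - pnorm(x)))
}
sapply(c(10, 100, 1000, 10000), sup.distance)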

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An Introduction to Measure and Probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I introduced Chebyshev’s inequality and discussed the motivation and intuition behind Markov’s inequality.

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.  Recall its statement:

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

Markov’s inequality roughly says that a random variable X is most frequently observed near its expected value, \mu.  Remarkably, it quantifies just how often X is far away from \mu.  Chebyshev’s inequality goes one step further and quantifies that distance between X and \mu in terms of the number of standard deviations away from \mu.  It roughly says that the probability of X being at least k standard deviations away from \mu is at most k^{-2}.  Notice that this upper bound decreases as k increases – confirming our intuition that it is highly improbable for X to be far away from \mu.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable X, as long as E(X) and V(X) are finite.  (Markov’s inequality requires only E(X) to be finite.)  This is quite a marvelous result!
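
Here is a minimal R sketch (base R only; the exponential distribution is an arbitrary choice) that compares the observed proportion of values at least k standard deviations away from the mean against the 1 \div k^2 bound.

# Compare the observed tail proportion P[|X - mu| >= k*sigma] against
# Chebyshev's bound 1/k^2 for X ~ Exponential(rate = 1), which has
# mu = 1 and sigma = 1.
set.seed(1)
x <- rexp(1e6, rate = 1)
mu <- 1
sigma <- 1
for (k in 2:4) {
  observed <- mean(abs(x - mu) >= k * sigma)
  cat("k =", k, "; observed proportion =", round(observed, 4),
      "; Chebyshev bound =", round(1 / k^2, 4), "\n")
}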

Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable X is just an expected value of a function of X.  Specifically,

V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X).

Let’s substitute (X - \mu)^2 into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant c with another constant, b^2.

\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}

P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2

P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2

P[|X - \mu| \geq b] \leq V(X) \div b^2

Now, let’s replace b with k \sigma, where \sigma is the standard deviation of X and k > 0.  (I can make this substitution, because k \sigma is just another positive constant.)

\text{Let} \ k \sigma = b. \ \ \text{Then,}

P[|X - \mu| \geq k \sigma] \leq V(X) \div k^2 \sigma^2

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)
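
For a concrete instance of the bound, set k = 2:

P[|X - \mu| \geq 2 \sigma] \leq 1 \div 4

In other words, no more than 25% of the probability can lie 2 or more standard deviations away from the mean, regardless of the distribution of X.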

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

P(X \geq c) \leq E(X) \div c, where c > 0.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable X “far” away from its “centre” or “central value”?

Intuitively, the “central value” of X is the value of X that is most commonly (or most frequently) observed.  Thus, as X deviates further and further from its “central value”, we would expect those distant-from-the-centre values to be less frequently observed.

Recall that the expected value, E(X), is a measure of the “centre” of X.  Thus, we would expect that the probability of X being very far away from E(X) is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As c becomes much larger than E(X), the event X \geq c becomes less and less probable.

You can confirm this by substituting several key values of c.

  • If c = E(X), then P[X \geq E(X)] \leq 1; this is the largest upper bound that P(X \geq c) can attain.  This makes intuitive sense; X is frequently observed near its own expected value.
  • If c \rightarrow \infty, then the upper bound E(X) \div c \rightarrow 0, so P(X \geq c) \rightarrow 0.  By Kolmogorov’s axioms of probability, any probability must be between 0 and 1 inclusive, so the bound cannot fall below 0.  This makes intuitive sense; there is no possible way for X to exceed positive infinity.
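
Here is a minimal R sketch (base R only; the chi-squared distribution with 3 degrees of freedom is an arbitrary non-negative example) that shows how both P(X \geq c) and Markov’s upper bound E(X) \div c shrink as c grows.

# Compare P(X >= c) with Markov's bound E(X)/c as c increases,
# for X ~ chi-squared with 3 degrees of freedom, so that E(X) = 3.
expected.value <- 3
for (c.value in c(3, 6, 12, 24)) {
  tail.probability <- pchisq(c.value, df = 3, lower.tail = FALSE)  # P(X >= c)
  cat("c =", c.value, "; P(X >= c) =", round(tail.probability, 4),
      "; Markov bound =", round(expected.value / c.value, 4), "\n")
}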

Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, X, with the expected value of X.  It states that

P(X \geq c) \leq E(X) \div c,

where c > 0.

I find Markov’s inequality to be beautiful for 2 reasons:

  1. It applies to both continuous and discrete random variables.
  2. It applies to any non-negative random variable from any distribution with a finite expected value.
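
As a quick check of both points, here is a minimal R sketch (base R only; the Poisson and exponential distributions are arbitrary choices) that verifies the bound at c = 5 for one discrete and one continuous non-negative random variable, each with an expected value of 2, so that the bound is E(X) \div c = 0.4.

# Verify Markov's inequality at c = 5 for a discrete and a continuous
# non-negative random variable, each with expected value 2.
c0 <- 5
p.poisson <- ppois(c0 - 1, lambda = 2, lower.tail = FALSE)   # P(X >= 5) for Poisson(2)
p.exponential <- pexp(c0, rate = 0.5, lower.tail = FALSE)    # P(X >= 5) for Exponential(0.5)
p.poisson       # about 0.053, below the bound of 0.4
p.exponential   # about 0.082, also below the bound of 0.4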

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which has useful implications for understanding a data set.

Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

y = mx + b,

with y being called the dependent variable and x being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables X_1 and X_2 are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for Y and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for x_1, x_2, ..., x_p.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

h(t) = f(t) \div S(t).

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.
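
As a small sanity check of the relationship above, here is a minimal R sketch (base R only) confirming that, for an exponential distribution, f(t) \div S(t) returns the constant hazard rate.

# For an Exponential(rate = 2) distribution, the hazard function is
# constant and equal to the rate, so f(t)/S(t) should equal 2 for all t.
t <- seq(0.1, 5, by = 0.1)
f <- dexp(t, rate = 2)                        # density function f(t)
S <- pexp(t, rate = 2, lower.tail = FALSE)    # survival function S(t)
h <- f / S                                    # hazard function h(t)
range(h)                                      # both endpoints should be 2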

I am excited to continue adding to my Youtube channel’s collection of video tutorials.  Please stay tuned for more!

You can also watch this new video below the fold!

Read more of this post

Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

  • n random variables, X_1, ..., X_n, are independent and identically distributed with a finite mean and variance,
  • n is sufficiently large,

then the distribution of their sample mean, \bar{X}_n, is approximately normal, and this approximation is better as n increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of X_1, ..., X_n with a finite variance.  In my new Youtube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.
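
If you would like to try the thought experiment yourself before watching the video, here is a minimal R sketch (base R only) that simulates the sample mean of n fair dice and plots its histogram for a few values of n.

# Simulate the sample mean of n fair dice; as n grows, the histogram
# of the sample means looks more and more like a normal distribution.
set.seed(1)
par(mfrow = c(1, 3))
for (n in c(1, 2, 30)) {
  sample.means <- replicate(10000, mean(sample(1:6, size = n, replace = TRUE)))
  hist(sample.means, breaks = 30,
       main = paste("n =", n), xlab = "Sample mean of n dice")
}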

You can also watch the video below the fold!

Read more of this post

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if X_1, X_2, ..., X_n is an independent and identically distributed sample of random variables with a finite variance, then the distribution of their sample mean is approximately normal for a sufficiently large sample size, and this approximation gets better as the sample size gets bigger.
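
Here is a minimal R sketch (base R only; the exponential distribution is an arbitrary non-normal example) that contrasts the two: the histogram of the raw data stays right-skewed no matter how large the sample is, while the histogram of the sample means looks approximately normal.

# The data keep their non-normal (exponential) shape as n grows,
# but the sample mean of n observations is approximately normal.
set.seed(1)
n <- 1000
raw.data <- rexp(n, rate = 1)                              # one large sample
sample.means <- replicate(10000, mean(rexp(n, rate = 1)))  # many sample means
par(mfrow = c(1, 2))
hist(raw.data, breaks = 30, main = "Raw data (still skewed)")
hist(sample.means, breaks = 30, main = "Sample means (approximately normal)")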

Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

Introduction

Continuing the recently started series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show an R function that implements it, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  This post follows from my previous post on trapezoidal integration.

[Figure: the midpoint rule.  Image courtesy of Qef from Wikimedia Commons.]

Conceptual Background of Rectangular Integration (a.k.a. The Midpoint Rule)

Rectangular integration is a numerical integration technique that approximates the area under a curve with rectangles.  Here are its features:

  • The rectangle’s width is determined by the interval of integration.
    • One rectangle could span the width of the interval of integration and approximate the entire integral.
    • Alternatively, the interval of integration could be sub-divided into n smaller intervals of equal length, and n rectangles would be used to approximate the integral; each smaller rectangle has the width of the smaller interval.
  • The rectangle’s height is the function’s value at the midpoint of its base.
  • Within a fixed interval of integration, the approximation becomes more accurate as more rectangles are used; each rectangle becomes narrower, and the height of the rectangle better captures the values of the function within that interval.
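
Putting these features together, here is a minimal sketch of what a midpoint-rule integrator might look like in R (the function name rectangular.integration is my own placeholder; the fully commented function appears in the full post), applied to the Beta(2, 5) density.

# A bare-bones midpoint (rectangular) rule: approximate the integral of f
# over [a, b] with n rectangles whose heights are evaluated at the midpoints.
rectangular.integration <- function(f, a, b, n = 1000) {
  width <- (b - a) / n
  midpoints <- a + width * (seq_len(n) - 0.5)
  sum(f(midpoints) * width)
}

# Check that the Beta(2, 5) density integrates to (approximately) 1
# over its support set, [0, 1].
rectangular.integration(function(x) dbeta(x, 2, 5), 0, 1)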

Read more of this post

Applied Statistics Lesson of the Day – The Completely Randomized Design with 1 Factor

The simplest experimental design is the completely randomized design with 1 factor.  In this design, each experimental unit is randomly assigned to a factor level.  This design is most useful for a homogeneous population (one that does not have major differences between any sub-populations).  It is appealing because of its simplicity and flexibility – it can be used for a factor with any number of levels, and different treatments can have different sample sizes.  After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.  The means of the response variable in the different groups are compared; if there are significant differences, then there is evidence to suggest that the factor and the response have a causal relationship.  The single-factor analysis of variance (ANOVA) model is most commonly used to analyze the data in such an experiment, but it does assume that the data in each group have a normal distribution, and that all groups have equal variance.  The Kruskal-Wallis test is a non-parametric alternative to ANOVA in analyzing data from single-factor completely randomized experiments.

If the factor has 2 levels, you may think that an independent 2-sample t-test with equal variance can also be used to analyze the data.  This is true, but the square of the t-test statistic in this case is just the F-test statistic in a single-factor ANOVA with 2 groups.  Thus, the results of these 2 tests are the same.  ANOVA generalizes the independent 2-sample t-test with equal variance to more than 2 groups.
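
Here is a minimal R sketch (base R only, with simulated data) that verifies this relationship: the square of the equal-variance 2-sample t-statistic equals the F-statistic from the single-factor ANOVA.

# With 2 groups, the square of the equal-variance t-statistic equals
# the F-statistic from a single-factor ANOVA on the same data.
set.seed(1)
group <- factor(rep(c("A", "B"), each = 20))
response <- c(rnorm(20, mean = 5), rnorm(20, mean = 6))
t.statistic <- t.test(response ~ group, var.equal = TRUE)$statistic
f.statistic <- anova(lm(response ~ group))["group", "F value"]
t.statistic^2
f.statistic     # matches t.statistic^2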

Some textbooks state that “random assignment” means random assignment of experimental units to treatments, whereas other textbooks state that it means random assignment of treatments to experimental units.  I don’t think that there is any difference between these 2 definitions, but I welcome your thoughts in the comments.

Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

Introduction

Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  Fully commented and readily usable R code will be provided at the end.

[Figure: the Beta(2, 5) probability density function.]

Given a probability density function (PDF) and its support set as vectors in an array programming language like R, how do you integrate the PDF over its support set to check that it equals 1?  Read the rest of this post to view my own R function to implement trapezoidal integration and learn how to use it to numerically approximate integrals.
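
As a preview, here is a minimal sketch of a trapezoidal integrator (the function name trapezoidal.integration is my own placeholder; the fully commented function appears at the end of the post) that takes the support set and the density values as vectors and checks that the Beta(2, 5) density integrates to approximately 1.

# A bare-bones trapezoidal rule: given a vector of x-values and the
# corresponding function values y, approximate the integral of y over x.
trapezoidal.integration <- function(x, y) {
  sum(0.5 * (y[-1] + y[-length(y)]) * diff(x))
}

# Check that the Beta(2, 5) density integrates to (approximately) 1
# over its support set, [0, 1].
support <- seq(0, 1, by = 0.001)
density.values <- dbeta(support, 2, 5)
trapezoidal.integration(support, density.values)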

Read more of this post

Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

[Figure: the empirical CDF of 100 random numbers generated from the standard normal distribution.]
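
Here is a minimal sketch (base R only) of the kind of code that generates such a plot; the dashed curve overlays the true standard normal CDF for comparison.

# Plot the empirical CDF of 100 random numbers from the standard normal
# distribution, and overlay the true CDF (dashed) for reference.
set.seed(1)
x <- rnorm(100)
plot(ecdf(x), main = "Empirical CDF of 100 standard normal random numbers")
curve(pnorm(x), add = TRUE, lty = 2)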

Read the rest of this post to learn what an empirical CDF is and how to produce the above plot!

Read more of this post

Checking for Normality with Quantile Ranges and the Standard Deviation

Introduction

I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before.  This turns out to be a good way to check for normality in a data set.

In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), talk about the method that I learned from Trosset’s book, then build upon this method by possibly coming up with a new way to check for normality.  I have not fully established this idea, so I welcome your thoughts and ideas.

Read more of this post

Some Subtle and Nuanced Concepts about Simple Linear Regression

Introduction

This blog post will focus on some conceptual foundations of simple linear regression, a very common technique in statistics and a precursor for understanding multiple linear regression.  I will expose and clarify many nuances and subtleties that I did not fully absorb until my Master’s degree in statistics at the University of Toronto.

What is Simple Linear Regression?

Simple linear regression is a predictive model that uses a predictor variable (x) to predict a continuous target variable (Y).  It is a formal and rigorous way to express 2 fundamental components of a statistical predictive model.

1) For each value of x, there is a probability distribution of Y.

2) The means of these probability distributions vary with x in a systematic way.

Mathematically, the first component is reflected in a random error variable, and the second component is reflected in the non-random term that expresses the linear relationship between x and Y.  These two components add together to give the following mathematical model.

Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \ \ \ i = 1,...,n

\varepsilon_i \sim Normal(0, \sigma^2)

\varepsilon_i \perp \varepsilon_j, \ \ \ \ \ i \neq j

The last mathematical expression states that two different error terms are statistically independent.

Essentially, this model captures the tendency for Y to vary systematically with x.  The systematic part is the non-random term, \beta_0 + \beta_1 x_i.  The tendency (rather than a direct relation) is reflected in the probability distribution of the error component.

Note that I capitalized the target Y because it is a random variable.  (It is the sum of a non-random term and the random error, so it is also a random variable.)  I used lower-case for the predictor because it is a constant in the model.

What are the Assumptions of Simple Linear Regression?

1) The predictor variable is a fixed constant with no random variation.  If you want to model the predictor as a random variable, use the errors-in-variables model (a.k.a. measurement errors model).

2) The target variable is a linear combination of the regression coefficients and the predictor.

3) The variance of the random error component is constant.  This assumption is called homoscedasticity.

4) The random errors are independent of each other.

5) The regression coefficients are constants.  If you want to model the regression coefficients as random variables, use the random effects model.  If you want to include both fixed and random coefficients in your model, use the mixed effects model.  The documentation for PROC MIXED in SAS/STAT has a nice explanation of the mixed effects model.  I also recommend the documentation for PROC GLM for more about the random effects model.

***6) The random errors are normally distributed with an expected value of 0 and a variance of \sigma^2 .  As Assumption #3 states, this variance is constant for all \varepsilon_i, \ i = 1,...,n .

***This last assumption is not needed for the least-squares estimation of the regression coefficients.  However, it is needed for conducting statistical inference for the regression coefficients, such as testing hypotheses and constructing confidence intervals.

Important Clarifications about the Terminology

Let me clarify some common confusion about the 2 key terms in the name “simple linear regression”.

- It is called “simple” because it uses only one predictor, whereas multiple linear regression uses multiple predictors.  While it is relatively simple to understand, and while it is a simple model compared to other predictive models, there are many concepts and nuances behind linear regression that still make it difficult for many people to understand.  (I hope that this blog post will make it easier to understand this model!)

- It is called “linear” because the target variable is linear with respect to the parameters \beta_0 and \beta_1 (the regression coefficients), not because it is linear with respect to the predictor; this is a very common misunderstanding, and I did not learn this until the second course in which I learned about linear regression.  This is more than just a naming custom; it implies that the regression coefficients can be estimated using linear algebra, which has many benefits that will be described in a later post.

Simple linear regression does assume that the target variable has a linear relationship with the predictor variable.  However, if it doesn’t, this can often be resolved: the predictor and/or the target can often be transformed to make the relationship linear.  If the target variable cannot be written as a linear combination of the parameters \beta_0 and \beta_1, then the model is no longer linear regression, even if the target is linear with respect to the predictor.

How are the Regression Coefficients Estimated?

The regression coefficients are estimated by finding the values of \beta_0 and \beta_1 that minimize the sum of the squares of the deviations of the data from the regression line; this is called the method of least squares.  My first linear regression textbook, “Applied Linear Statistical Models” by Kutner, Nachtsheim, Neter, and Li, uses the letter “Q” to denote this quantity.  The word “minimize” should signal that the global minimizers can be found using differential calculus.

Q = \sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2

Differentiate Q with respect to \beta_0 and \beta_1; set the 2 derivatives to zero to get the normal equations.  The estimates are obtained by solving this system of 2 equations.
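
For reference, solving those 2 normal equations yields the familiar closed-form least-squares estimates (writing b_0 and b_1 for the estimates of \beta_0 and \beta_1):

b_1 = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \div \sum_{i=1}^n (x_i - \bar{x})^2

b_0 = \bar{y} - b_1 \bar{x}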

Why is the Least-Squares Method Used to Estimate the Regression Coefficients?

A natural question arises: Why minimize the sum of the squares of the errors?  Why not minimize some other measure of the distances from the regression line to the data, like the sum of the absolute values of the errors?

Q' = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|

The answer lies within the Gauss-Markov theorem, which guarantees some very attractive properties for the least-squares estimators of the regression coefficients:

- these estimators are unbiased

- out of all linear unbiased estimators, the least-squares estimators have the minimum variance

Thus, the least-squares estimators are both accurate (unbiased) and precise (minimum variance among all linear unbiased estimators).

Note that the Gauss-Markov theorem holds without Assumption #6 above, which states that the errors have a normal distribution with an expected value of zero and a variance of \sigma^2 .
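
To tie the estimation and Gauss-Markov sections together, here is a minimal R sketch (base R only, with simulated data) showing that the closed-form least-squares estimates match the coefficients returned by R’s lm() function.

# Fit a simple linear regression by least squares in two ways: with the
# closed-form formulas from the normal equations and with lm().
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))   # matches b0 and b1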
