## Mathematical Statistics Lesson of the Day – An Example of An Ancillary Statistic

Consider 2 independent random variables, $X_1$ and $X_2$, each drawn from the normal distribution $\text{Normal}(\mu, \sigma^2)$, where $\mu$ is unknown.  Then the statistic

$D = X_1 - X_2$

has the distribution

$\text{Normal}(0, 2\sigma^2)$.

The distribution of $D$ does not depend on $\mu$, so $D$ is an ancillary statistic for $\mu$.

Note that, if $\sigma^2$ is unknown, then $D$ is not ancillary for $\sigma^2$; its distribution, $\text{Normal}(0, 2\sigma^2)$, clearly depends on $\sigma^2$.
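The ancillarity of $D$ for $\mu$ is easy to see in simulation; here is a minimal sketch (the sample size, seed, and the choice $\sigma = 3$ are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
sigma = 3.0  # assume sigma = 3 purely for this illustration

# Simulate D = X1 - X2 under two very different values of mu.
for mu in [0.0, 50.0]:
    x1 = rng.normal(mu, sigma, n)
    x2 = rng.normal(mu, sigma, n)
    d = x1 - x2
    print(f"mu = {mu:5.1f}: mean(D) = {d.mean():.3f}, var(D) = {d.var():.3f}")

# Both runs give mean(D) near 0 and var(D) near 2 * sigma**2 = 18,
# regardless of mu -- the distribution of D does not depend on mu.
```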

## Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function $f(x)$ that is continuous on the interval $[x_1, x_2]$, where $x_1$ and $x_2$ are any 2 points in the domain of $f(x)$.  Let

$x_m = 0.5x_1 + 0.5x_2$

be the midpoint of $x_1$ and $x_2$.  If

$f(x_m) \leq 0.5f(x_1) + 0.5f(x_2)$

for every such pair of points $x_1$ and $x_2$, then $f(x)$ is defined to be midpoint convex.

More generally, let’s consider any point within the interval $[x_1, x_2]$.  We can denote this arbitrary point as

$x_\lambda = \lambda x_1 + (1 - \lambda)x_2,$ where $0 < \lambda < 1$.

If

$f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)$

for all such $x_1$, $x_2$ and all $0 < \lambda < 1$, then $f(x)$ is defined to be convex.  If the inequality is strict, i.e.

$f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2)$

for all $x_1 \neq x_2$ and all $0 < \lambda < 1$, then $f(x)$ is defined to be strictly convex.
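As a quick numerical sanity check, we can test the defining inequality at many random pairs of points and values of $\lambda$ for a function known to be convex, such as $f(x) = x^2$ (the sampling ranges and counts below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2  # a strictly convex function

x1 = rng.uniform(-10.0, 10.0, 10_000)
x2 = rng.uniform(-10.0, 10.0, 10_000)
lam = rng.uniform(0.0, 1.0, 10_000)

x_lam = lam * x1 + (1 - lam) * x2        # points between x1 and x2
lhs = f(x_lam)                           # the function at the interior point
rhs = lam * f(x1) + (1 - lam) * f(x2)    # the chord at the same point

# The convexity inequality holds at every sampled point
# (a small tolerance guards against floating-point rounding).
print(bool(np.all(lhs <= rhs + 1e-9)))
```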

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable $Y$ with a finite expected value and for any convex function $g(y)$,

$E[g(Y)] \geq g[E(Y)]$.

A function $f(x)$ is defined to be concave if $-f(x)$ is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable $Z$ with a finite expected value and for any concave function $h(z)$,

$E[h(Z)] \leq h[E(Z)]$.
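Both directions of Jensen's inequality are easy to check by simulation; here is a minimal sketch, using an exponentially distributed random variable with the convex function $g(y) = y^2$ and the concave function $h(z) = \ln(z)$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=1_000_000)  # a positive random variable, E(Y) = 2

# Convex g(y) = y^2: E[g(Y)] should be >= g(E[Y]).
print(np.mean(y ** 2), np.mean(y) ** 2)

# Concave h(z) = log(z): E[h(Z)] should be <= h(E[Z]).
print(np.mean(np.log(y)), np.log(np.mean(y)))
```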

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

## Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable $X$ is just an expected value of a function of $X$.  Specifically,

$V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X)$.

Let’s substitute $(X - \mu)^2$ into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant $c$ with another constant, $b^2$.

$\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}$

$P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2$

$P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2$

$P[|X - \mu| \geq b] \leq V(X) \div b^2$

Now, let’s substitute $b$ with $k \sigma$, where $\sigma$ is the standard deviation of $X$.  (I can make this substitution, because $\sigma$ is just another constant.)

$\text{Let} \ k \sigma = b. \ \ \text{Then,}$

$P[|X - \mu| \geq k \sigma] \leq V(X) \div k^2 \sigma^2$

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)
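The derivation above can be verified numerically for any distribution with a finite variance; here is a minimal sketch using a skewed gamma distribution chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=1.5, size=1_000_000)  # an arbitrary skewed distribution

mu = x.mean()    # sample estimate of E(X)
sigma = x.std()  # sample estimate of the standard deviation

for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    bound = 1.0 / k ** 2
    print(f"k = {k}: P(|X - mu| >= k*sigma) = {empirical:.4f} <= 1/k^2 = {bound:.4f}")
```

In each row, the empirical tail probability sits below the Chebyshev bound $1 / k^2$, as the derivation guarantees.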

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

$P(X \geq c) \leq E(X) \div c,$ where $c > 0$.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable $X$ “far” away from its “centre” or “central value”?

Intuitively, the “central value” of $X$ is the value of $X$ that is most commonly (or most frequently) observed.  Thus, as $X$ deviates further and further from its “central value”, we would expect those distant-from-the-centre values to be less frequently observed.

Recall that the expected value, $E(X)$, is a measure of the “centre” of $X$.  Thus, we would expect that the probability of $X$ being very far away from $E(X)$ is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As $c$ moves farther and farther above $E(X)$, the upper bound $E(X) \div c$ shrinks, so the event $X \geq c$ becomes less and less probable.

You can confirm this by substituting several key values of $c$.

• If $c = E(X)$, then $P[X \geq E(X)] \leq 1$; this is the loosest upper bound that Markov’s inequality can give, since any probability is at most $1$ anyway.  This makes intuitive sense; $X$ is going to be frequently observed near its own expected value.

• If $c \rightarrow \infty$, then the upper bound $E(X) \div c$ tends to $0$, so $P(X \geq c) \rightarrow 0$.  By Kolmogorov’s axioms of probability, any probability must be inclusively between $0$ and $1$, so the limiting probability is exactly $0$.  This makes intuitive sense; there is no possible way for $X$ to be bigger than positive infinity.
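These two extremes, and the values in between, can be seen numerically; here is a minimal sketch, assuming an exponential distribution with mean $1$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)  # non-negative, with E(X) = 1

ex = x.mean()  # sample estimate of E(X)
for c in [ex, 2 * ex, 5 * ex, 20 * ex]:
    empirical = np.mean(x >= c)
    bound = ex / c
    print(f"c = {c:5.2f}: P(X >= c) = {empirical:.4f} <= E(X)/c = {bound:.4f}")

# The bound E(X)/c starts at 1 when c = E(X) and shrinks toward 0 as c grows.
```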

## Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, $X$, with the expected value of $X$.  It states that

$P(X \geq c) \leq E(X) \div c,$

where $c > 0$.

I find Markov’s inequality to be beautiful for 2 reasons:

1. It applies to both continuous and discrete random variables.
2. It applies to any non-negative random variable from any distribution with a finite expected value.
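Reason 1 is easy to illustrate by simulation; here is a minimal sketch that checks the inequality for one discrete and one continuous non-negative distribution (the particular distributions and the threshold $c = 4$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
c = 4.0

# One discrete and one continuous non-negative random variable.
samples = {
    "Poisson(2), discrete": rng.poisson(lam=2.0, size=n),
    "Exponential(mean 2), continuous": rng.exponential(scale=2.0, size=n),
}

for name, x in samples.items():
    empirical = np.mean(x >= c)
    bound = x.mean() / c
    print(f"{name}: P(X >= {c}) = {empirical:.4f} <= E(X)/c = {bound:.4f}")
```

In both cases the empirical tail probability falls below the Markov bound, even though one variable is discrete and the other is continuous.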

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which has useful implications for understanding a data set.

## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

$\text{Sensitivity}$

on the vertical axis and

$1 - \text{Specificity}$

on the horizontal axis, the line of no discrimination is the line that passes through the points

$(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$

and

$(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.

In other words, the line of no discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of binary classifiers that predict the class of the target variable purely by the outcome of a Bernoulli random variable.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing.  A classifier that guesses the “Success” class with probability $p$ attains the point $(\text{Sensitivity} = p, 1 - \text{Specificity} = p)$, and varying $p$ from $0$ to $1$ traces out the entire line; guessing with equal probabilities of $0.5$ for the “Success” class and the “Failure” class gives its midpoint, $(0.5, 0.5)$.

If we do not have any predictors, then we can rely only on random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard for a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher”, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)
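The random-guessing behaviour behind the line of no discrimination can be simulated directly; here is a minimal sketch (the class proportion of $0.3$ and the guessing probabilities are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
y_true = rng.binomial(1, 0.3, n)  # true classes; 30% "Success" (coded as 1)

# A classifier that ignores all predictors and guesses "Success" with
# probability p lands at the ROC point (1 - Specificity = p, Sensitivity = p).
for p in [0.2, 0.5, 0.8]:
    y_pred = rng.binomial(1, p, n)
    sensitivity = np.mean(y_pred[y_true == 1])      # P(predict 1 | true 1)
    one_minus_spec = np.mean(y_pred[y_true == 0])   # P(predict 1 | true 0)
    print(f"p = {p}: sensitivity = {sensitivity:.3f}, "
          f"1 - specificity = {one_minus_spec:.3f}")
```

Every such point lies on the diagonal, so no choice of guessing probability lets a classifier that ignores its predictors rise above the line of no discrimination.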