## Arnab Chakraborty on The Monty Hall Problem and Bayes’ Theorem – The Central Equilibrium – Episode 6

I am pleased to welcome Arnab Chakraborty back to my talk show, “The Central Equilibrium“, to talk about the Monty Hall Problem and Bayes’ theorem.  In this episode, he shows 2 solutions to this classic puzzle in probability, and invokes Bayes’ Theorem for the second solution.

If you have not watched Arnab’s first episode on Bayes’ theorem, then I encourage you to do that first.

Marilyn Vos Savant provided a solution to this problem in PARADE Magazine in 1990-1991.  Thousands of readers disagreed with her solution and criticized her vehemently (and incorrectly) for her error.  Some of these critics were mathematicians!  She included some of those replies and provided alternative perspectives that led to the same conclusion.  Although I am dismayed by the disrespect that some people showed in their letters to her, I am glad that a magazine column on probability was able to attract so much readership and interest.  Arnab and I referred to one of her solutions in our episode.  Thank you, Marilyn!

Enjoy this episode of “The Central Equilibrium“!

## Video Tutorial – Obtaining the Expected Value of the Exponential Distribution Using the Moment Generating Function

In this video tutorial on YouTube, I use the exponential distribution’s moment generating function (MGF) to obtain the expected value of this distribution.  Visit my YouTube channel to watch more video tutorials!

## Video Tutorial – The Moment Generating Function of the Exponential Distribution

In this video tutorial on YouTube, I derive the moment generating function (MGF) of the exponential distribution.  Visit my YouTube channel to watch more video tutorials!

## Arnab Chakraborty on Bayes’ Theorem – The Central Equilibrium – Episode 3

Arnab Chakraborty kindly came to my new talk show, “The Central Equilibrium”, to talk about Bayes’ theorem.  He introduced the concept of conditional probability, stated Bayes’ theorem in its simple and general forms, and showed an example of how to use it in a calculation.

Check it out!

## Christopher Salahub on Markov Chains – The Central Equilibrium – Episode 2

It was a great pleasure to talk to Christopher Salahub about Markov chains in the second episode of my new talk show, The Central Equilibrium!  Chris graduated from the University of Waterloo with a Bachelor of Mathematics degree in statistics.  He just finished an internship in data development at Environics Analytics, and he is starting a Master’s program in statistics at ETH Zurich in Switzerland.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please watch the video on this blog.  You can also watch it directly on YouTube.

## Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is an intersection of 15 independent Bernoulli random variables, all with the probability of the home team winning being 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{n = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

• odds is only applicable to Bernoulli random variables (i.e. binary events)
• odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

## Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.

If

$E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,$

implies that

$P \{g[T(\mathbf{X})]\} = 0] = 1,$

then $T(\mathbf{X})$ is said to be complete.  To deconstruct this esoteric mathematical statement,

1. let $g(t)$ be a measurable function
2. if you want to use $g[T(\mathbf{X})]$ to form an unbiased estimator of the zero function,
3. and if the only such function is almost surely equal to the zero function,
4. then $T(\mathbf{X})$ is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*This above definition holds for discrete and continuous random variables.

## Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function $f(x)$ that is continuous on the interval $[x_1, x_2]$, where $x_1$ and $x_2$ are any 2 points in the domain of $f(x)$.  Let

$x_m = 0.5x_1 + 0.5x_2$

be the midpoint of $x_1$ and $x_2$.  Then, if

$f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),$

then $f(x)$ is defined to be midpoint convex.

More generally, let’s consider any point within the interval $[x_1, x_2]$.  We can denote this arbitrary point as

$x_\lambda = \lambda x_1 + (1 - \lambda)x_2,$ where $0 < \lambda < 1$.

Then, if

$f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be convex.  If

$f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be strictly convex.

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable $Y$ with a finite expected value and for any convex function $g(y)$,

$E[g(Y)] \geq g[E(Y)]$.

A function $f(x)$ is defined to be concave if $-f(x)$ is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable $Z$ with a finite expected value and for any concave function $h(z)$,

$E[h(Z)] \leq h[E(Z)]$.

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

## Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of $n$ independent and identically distributed random variables, $X_1, X_2, ..., X_n$,

$P[\lim_{n \to \infty} \sup_{x \epsilon \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.$

In other words, the empirical CDF of $X_1, X_2, ..., X_n$ converges uniformly to the true CDF.

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or the “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$.  Remarkably, it quantifies just how often $X$ is far away from $\mu$.  Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$.  It roughly says that the probability of $X$ being $k$ standard deviations away from $\mu$ is at most $k^{-2}$.  Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable $X$, as long as $E(X)$ and $V(X)$ are finite.  (Markov’s inequality requires only $E(X)$ to be finite.)  This is quite a marvelous result!

## Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable $X$ is just an expected value of a function of $X$.  Specifically,

$V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X)$.

Let’s substitute $(X - \mu)^2$ into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant $c$ with another constant, $b^2$.

$\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}$

$P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2$

$P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2$

$P[|X - \mu| \geq b] \leq V(X) \div b^2$

Now, let’s substitute $b$ with $k \sigma$, where $\sigma$ is the standard deviation of $X$.  (I can make this substitution, because $\sigma$ is just another constant.)

$\text{Let} \ k \sigma = b. \ \ \text{Then,}$

$P[|X - \mu| \geq k \sigma] \leq V(X) \div k^2 \sigma^2$

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

$P(X \geq c) \leq E(X) \div c,$ where $c > 0$.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable $X$ “far” away from its “centre” or “central value”?

Intuitively, the “central value” of $X$ is the value that of $X$ that is most commonly (or most frequently) observed.  Thus, as $X$ deviates farther and farther from its “central value”, we would expect those distant-from-the-centre values to be less frequently observed.

Recall that the expected value, $E(X)$, is a measure of the “centre” of $X$.  Thus, we would expect that the probability of $X$ being very far away from $E(X)$ is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As $c$ becomes really far away from $E(X)$, the event $X \geq c$ becomes less probable.

You can confirm this by substituting several key values of $c$.

• If $c = E(X)$, then $P[X \geq E(X)] \leq 1$; this is the highest upper bound that $P(X \geq c)$ can get.  This makes intuitive sense; $X$ is going to be frequently observed near its own expected value.

• If $c \rightarrow \infty$, then $P(X \geq \infty) \leq 0$.  By Kolmogorov’s axioms of probability, any probability must be inclusively between $0$ and $1$, so $P(X \geq \infty) = 0$.  This makes intuitive sense; there is no possible way that $X$ can be bigger than positive infinity.

## Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, $X$, with the expected value of $X$.  It states that

$P(X \geq c) \leq E(X) \div c,$

where $c > 0$.

I find Markov’s inequality to be beautiful for 2 reasons:

1. It applies to both continuous and discrete random variables.
2. It applies to any non-negative random variable from any distribution with a finite expected value.

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which has useful implications for understanding a data set.

## Organic and Inorganic Chemistry Lesson of the Day – Diastereomers

I previously introduced the concept of chirality and how it is a property of any molecule with only 1 stereogenic centre.  (A molecule with $n$ stereogenic centres may or may not be chiral, depending on its stereochemistry.)  I also defined 2 stereoisomers as enantiomers if they are non-superimposable mirror images of each other.  (Recall that chirality in inorganic chemistry can arise in 2 different ways.)

It is possible for 2 stereoisomers to NOT be enantiomers; in fact, such stereoisomers are called diastereomers.  Yes, I recognize that defining something as the negation of something else is unusual.  If you have learned set theory or probability (as I did in my mathematical statistics classes) then consider the set of all pairs of the stereoisomers of one compound – this is the sample space.  The enantiomers form a set within this sample space, and the diastereomers are the complement of the enantiomers.

It is important to note that, while diastereomers are not mirror images of each other, they are still non-superimposable.  Diastereomers often (but not always) arise from stereoisomers with 2 or more stereogenic centres; here is an example of how they can arise.  (A pair of cis/trans-isomers are also diastereomers, despite not having any stereogenic centres.)

1) Consider a stereoisomer with 2 tetrahedral stereogenic centres and no meso isomers*.  This isomer has $2^{n = 2}$ stereoisomers, where $n = 2$ denotes the number of stereogenic centres.

2) Find one pair of enantiomers based on one of the stereogenic centres.

3) Find the other pair enantiomers based on the other stereogenic centre.

4) Take any one molecule from Step #2 and any one molecule from Step #3.  These cannot be mirror images of each other.  (One molecule cannot have 2 different mirror images of itself.)  These 2 molecules are diastereomers.

Think back to my above description of enantiomers as a proper subset within the sample space of the pairs of one set of stereoisomers.  You can now see why I emphasized that the sample space consists of pairs, since multiple different pairs of stereoisomers can form enantiomers.  In my example above, Steps #2 and #3 produced 2 subsets of enantiomers.  It should be clear by now that enantiomers and diastereomers are defined as pairs.  To further illustrate this point,

a) call the 2 molecules in Step#2 A and B.

b) call the 2 molecules in Step #3 C and D.

A and B are enantiomers.  A and C are diastereomers.  Thus, it is entirely possible for one molecule to be an enantiomer with a second molecule and a diastereomer with a third molecule.

Here is an example of 2 diastereomers.  Notice that they have the same chemical formula but different 3-dimensional orientations – i.e. they are stereoisomers.  These stereoisomers are not mirror images of each other, but they are non-superimposable – i.e. they are diastereomers.

(-)-Threose

(-)-Erythrose

Images courtesy of Popnose, DMacks and Edgar181 on Wikimedia.  For brevity, I direct you to the Wikipedia entry for diastereomers showing these 4 images in one panel.

In a later Chemistry Lesson of the Day on optical rotation (a.k.a. optical activity), I will explain what the (-) symbol means in the names of those 2 diastereomers.

*I will discuss meso isomers in a separate lesson.

## Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals.  (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.)  Today, I want to share a second video of calculating expected counts – this time, using joint probabilities.  This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table.  Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multipled by the grand total of the entire sample.  This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.

I encourage you to watch both of my videos on expected counts in my YouTube channel to gain a deeper understanding of how and why they can be calculated.  Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.

## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, .., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

$h(t) = f(t) \div S(t)$.

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.

I am excited to continue adding to my Youtube channel‘s collection of video tutorials.  Please stay tuned for more!

## Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

• $n$ random variables, $X_1, ..., X_n$, are independent and identically distributed,
• $n$ is sufficiently large,

then the distribution of their sample mean, $\bar{X_n}$, is approximately normal, and this approximation is better as $n$ increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of $X_1, ..., X_n$.  In my new Youtube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.

## Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, ..., X_n$ is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.

## Video Tutorial – The Hazard Function is the Probability Density Function Divided by the Survival Function

In an earlier video, I introduced the definition of the hazard function and broke it down into its mathematical components.  Recall that the definition of the hazard function for events defined on a continuous time scale is

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$.

Did you know that the hazard function can be expressed as the probability density function (PDF) divided by the survival function?

$h(t) = f(t) \div S(t)$

In my new Youtube video, I prove how this relationship can be obtained from the definition of the hazard function!  I am very excited to post this second video in my new Youtube channel.