Christopher Salahub on Markov Chains – The Central Equilibrium – Episode 2

It was a great pleasure to talk to Christopher Salahub about Markov chains in the second episode of my new talk show, The Central Equilibrium!  Chris graduated from the University of Waterloo with a Bachelor of Mathematics degree in statistics.  He just finished an internship in data development at Environics Analytics, and he is starting a Master’s program in statistics at ETH Zurich in Switzerland.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please view this blog post beneath the fold to watch the video on this blog.  You can also watch it directly on YouTube.

Read more of this post

Advertisements

Neil Seoni on the Fourier Transform and the Sampling Theorem – The Central Equilibrium – Episode 1

I am very excited to publish the very first episode of my new talk show, The Central Equilibrium!  My guest is Neil Seoni, an undergraduate student in electrical and computer engineering at Rice University in Houston, Texas. He has studied data science in his spare time, most notably taking a course on machine learning by Andrew Ng on Coursera. He is finishing his summer job as a Data Science Intern at Environics Analytics in Toronto, Ontario.

Neil recommends reading Don Johnson’s course notes from Rice University and his free text book to learn more about the topics covered in his episode.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please view this blog post beneath the fold to watch the video on this blog.  You can also watch it directly on YouTube.

Read more of this post

Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

odds of all 15 home teams winning on same day

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is an intersection of 15 independent Bernoulli random variables, all with the probability of the home team winning being 0.5.

P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

P(\text{All 15 Home Teams Win}) = \prod_{n = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)

P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757

Now, let’s connect this probability to odds.

It is important to note that

  • odds is only applicable to Bernoulli random variables (i.e. binary events)
  • odds is the ratio of the probability of success to the probability of failure

For our example,

\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})

\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)

\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

Mathematical Statistics Lesson of the Day – Basu’s Theorem

Today’s Statistics Lesson of the Day will discuss Basu’s theorem, which connects the previously discussed concepts of minimally sufficient statistics, complete statistics and ancillary statistics.  As before, I will begin with the following set-up.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF) or probability mass function (PMF) for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistics based on \textbf{X}.

Basu’s theorem states that, if T(\textbf{X}) is a complete and minimal sufficient statistic, then T(\textbf{X}) is independent of every ancillary statistic.

Establishing the independence between 2 random variables can be very difficult if their joint distribution is hard to obtain.  This theorem allows the independence between minimally sufficient statistic and every ancillary statistic to be established without their joint distribution – and this is the great utility of Basu’s theorem.

However, establishing that a statistic is complete can be a difficult task.  In a later lesson, I will discuss another theorem that will make this task easier for certain cases.

Mathematics and Applied Statistics Lesson of the Day – Contrasts

A contrast is a linear combination of a set of variables such that the sum of the coefficients is equal to zero.  Notationally, consider a set of variables

\mu_1, \mu_2, ..., \mu_n.

Then the linear combination

c_1 \mu_1 + c_2 \mu_2 + ... + c_n \mu_n

is a contrast if

c_1 + c_2 + ... + c_n = 0.

There is a reason for why I chose to use \mu as the symbol for the variables in the above notation – in statistics, contrasts provide a very useful framework for comparing multiple population means in hypothesis testing.  In a later Statistics Lesson of the Day, I will illustrate some examples of contrasts, especially in the context of experimental design.

Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistic based on \mathbf{X}.

If

E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,

implies that

P \{g[T(\mathbf{X})]\} = 0] = 1,

then T(\mathbf{X}) is said to be complete.  To deconstruct this esoteric mathematical statement,

  1. let g(t) be a measurable function
  2. if you want to use g[T(\mathbf{X})] to form an unbiased estimator of the zero function,
  3. and if the only such function is almost surely equal to the zero function,
  4. then T(\mathbf{X}) is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*This above definition holds for discrete and continuous random variables.

Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has supplemented this by writing on his own blog to show that the median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, X_1, X_2, ..., X_n, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all n variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, T(\textbf{X}), such that

  1. T(\textbf{X}) is a sufficient statistic
  2. if U(\textbf{X}) is any other sufficient statistic, then there exists a function g such that

T(\textbf{X}) = g[U(\textbf{X})].

Note that, if there exists a one-to-one function h such that

T(\textbf{X}) = h[U(\textbf{X})],

then T(\textbf{X}) and U(\textbf{X}) are equivalent.

Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistic based on \mathbf{X}.  Let g_\theta(t) be the PDF for T(X).

If the conditional PDF

h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]

is independent of \theta, then T(\mathbf{X}) is a sufficient statistic for \theta.  In other words,

h_\theta(\mathbf{X}) = h(\mathbf{X}),

and \theta does not appear in h(\mathbf{X}).

Intuitively, this means that T(\mathbf{X}) contains everything you need to estimate \theta, so knowing T(\mathbf{X}) (i.e. conditioning f_\theta(x) on T(\mathbf{X})) is sufficient for estimating \theta.

Often, the sufficient statistic for \theta is a summary statistic of X_1, X_2, ..., X_n, such as their

  • sample mean
  • sample median – removed thanks to comment by Christian Robert (Xi’an)
  • sample minimum
  • sample maximum

If such a summary statistic is sufficient for \theta, then knowing this one statistic is just as useful as knowing all n data for estimating \theta.

*This above definition holds for discrete and continuous random variables.

Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function f(x) that is continuous on the interval [x_1, x_2], where x_1 and x_2 are any 2 points in the domain of f(x).  Let

x_m = 0.5x_1 + 0.5x_2

be the midpoint of x_1 and x_2.  Then, if

f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),

then f(x) is defined to be midpoint convex.

More generally, let’s consider any point within the interval [x_1, x_2].  We can denote this arbitrary point as

x_\lambda = \lambda x_1 + (1 - \lambda)x_2, where 0 < \lambda < 1.

Then, if

f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be convex.  If

f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be strictly convex.

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable Y with a finite expected value and for any convex function g(y),

E[g(Y)] \geq g[E(Y)].

A function f(x) is defined to be concave if -f(x) is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable Z with a finite expected value and for any concave function h(z),

E[h(Z)] \leq h[E(Z)].

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of n independent and identically distributed random variables, X_1, X_2, ..., X_n,

P[\lim_{n \to \infty} \sup_{x \epsilon \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.

In other words, the empirical CDF of X_1, X_2, ..., X_n converges uniformly to the true CDF.

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or the “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

Markov’s inequality roughly says that a random variable X is most frequently observed near its expected value, \mu.  Remarkably, it quantifies just how often X is far away from \mu.  Chebyshev’s inequality goes one step further and quantifies that distance between X and \mu in terms of the number of standard deviations away from \mu.  It roughly says that the probability of X being k standard deviations away from \mu is at most k^{-2}.  Notice that this upper bound decreases as k increases – confirming our intuition that it is highly improbable for X to be far away from \mu.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable X, as long as E(X) and V(X) are finite.  (Markov’s inequality requires only E(X) to be finite.)  This is quite a marvelous result!

Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable X is just an expected value of a function of X.  Specifically,

V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X).

Let’s substitute (X - \mu)^2 into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant c with another constant, b^2.

\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}

P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2

P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2

P[|X - \mu| \geq b] \leq V(X) \div b^2

Now, let’s substitute b with k \sigma, where \sigma is the standard deviation of X.  (I can make this substitution, because \sigma is just another constant.)

\text{Let} \ k \sigma = b. \ \ \text{Then,}

P[|X - \mu| \geq k \sigma] \leq V(X) \div k^2 \sigma^2

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

P(X \geq c) \leq E(X) \div c, where c > 0.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable X “far” away from its “centre” or “central value”?

Intuitively, the “central value” of X is the value that of X that is most commonly (or most frequently) observed.  Thus, as X deviates further and further from its “central value”, we would expect those distant-from-the-centre vales to be less frequently observed.

Recall that the expected value, E(X), is a measure of the “centre” of X.  Thus, we would expect that the probability of X being very far away from E(X) is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As c becomes really far away from E(X), the event X \geq c becomes less probable.

You can confirm this by substituting several key values of c.

 

  • If c = E(X), then P[X \geq E(X)] \leq 1; this is the highest upper bound that P(X \geq c) can get.  This makes intuitive sense; X is going to be frequently observed near its own expected value.

 

  • If c \rightarrow \infty, then P(X \geq \infty) \leq 0.  By Kolmogorov’s axioms of probability, any probability must be inclusively between 0 and 1, so P(X \geq \infty) = 0.  This makes intuitive sense; there is no possible way that X can be bigger than positive infinity.

Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, X, with the expected value of X.  It states that

P(X \geq c) \leq E(X) \div c,

where c > 0.

I find Markov’s inequality to be beautiful for 2 reasons:

  1. It applies to both continuous and discrete random variables.
  2. It applies to any non-negative random variable from any distribution with a finite expected value.

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which has useful implications for understanding a data set.

Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

  • 5% in the 1st year
  • 10% in the 2nd year
  • 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of 1 + r, where r is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, G.  The average factor of growth from the 3 years is

G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n},

where r_i is the rate of return in year i, i = 1, 2, 3, ..., n.  The average annual rate of return is G - 1.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

 

Returning to our example, our average factor of growth is

G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242.

Thus, our annual rate of return is G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%.

 

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

  1. The first set, S_1, consists of your data x_1, x_2, ..., x_n, and this set has a sample size of n.
  2. The second, S_2,  set also has a sample size of n, but all n values are the same – let’s call this common value y.
  • What number must y be such that the sums in S_1 and S_2 are equal?  This value of y is the arithmetic mean of the first set.
  • What number must y be such that the products in S_1 and S_2 are equal?  This value of y is the geometric mean of the first set.

Note that the geometric means is only applicable to positive numbers.

Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the cars travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i).

 

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

  1. 60 km/hr for 100 km
  2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]

= 46.45 \text{ km/hr}

 

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}

Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, H, for n positive real numbers x_1, x_2, ..., x_n is defined as

H = n \div (1/x_1 + 1/x_2 + .. + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

  1. 60 km/hr for 120 km
  2. 40 km/hr for another 120 km

Then its average speed for this trip is

S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

S_{avg} = \Delta \text{distance} \div \Delta \text{time}

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})

\Delta \text{time} = 5 \text{ hours}

Thus,

S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!

 

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Can Apply to the Sum

The central limit theorem (CLT) is often stated in terms of the sample mean of independent and identically distributed random variables.  An often unnoticed or forgotten aspect of the CLT is its applicability to the sample sum of those variables.  Since n, the sample size, is just a constant, it can be multiplied to \bar{X} to obtain \sum_{i = 1}^{n} X_i.  For a sufficiently large n, this new statistic still has an approximately normal distribution, just with a new expected value and a new variance.

\sum_{i = 1}^{n} X_i \overset{approx.}{\sim} \text{Normal} (n\mu, n\sigma^2)

Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

h(t) = f(t) \div S(t).

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.

I am excited to continue adding to my Youtube channel‘s collection of video tutorials.  Please stay tuned for more!

You can also watch this new video below the fold!

Read more of this post