Video Tutorial: Naive Bayes Classifiers

Naive Bayes classifiers are simple but powerful tools for classification in statistics and machine learning.  In this video tutorial, I use a simulated data set and illustrate the mathematical details of how this technique works.
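If you would like to experiment with the idea before watching the video, here is a minimal sketch of a Gaussian naive Bayes classifier on a simulated data set of my own; the class labels, means, and test points below are invented for illustration, and the example in the video differs in its details.  The key step is that each class's posterior score is its prior probability multiplied by the product of the per-feature likelihoods – the "naive" conditional-independence assumption.

import numpy as np

# Simulate 2 classes, each with 2 features drawn from normal distributions.
rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class_b = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(100, 2))

def gaussian_pdf(x, mean, sd):
    # Normal density evaluated at x (element-wise over the features).
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def classify(x_new):
    # Posterior score = prior x product of per-feature likelihoods; pick the larger score.
    scores = {}
    for label, data in {"A": class_a, "B": class_b}.items():
        prior = 0.5  # equal class sizes in this simulation
        likelihood = np.prod(gaussian_pdf(x_new, data.mean(axis=0), data.std(axis=0)))
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

print(classify(np.array([0.2, -0.3])))  # expected to print "A"
print(classify(np.array([2.8, 1.9])))   # expected to print "B"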

In my recent episode on The Central Equilibrium about word embeddings and text classification, Mandy Gu used naive Bayes classifiers to determine if a sentence is toxic or non-toxic – a very common objective when moderating discussions in online forums.  If you are not familiar with naive Bayes classifiers, then I encourage you to watch this video first before watching Mandy’s episode on The Central Equilibrium.

Mandy Gu on Word Embeddings and Text Classification – The Central Equilibrium – Episode 9

I am so grateful to Mandy Gu for being a guest on The Central Equilibrium to talk about word embeddings and text classification.  She began by showing how data from text can be encoded in vectors and matrices, and then she used a naive Bayes classifier to classify sentences as toxic or non-toxic – a very common problem for moderating discussions in online forums.  I learned a lot from her in this episode, and you can learn more from Mandy on her Medium blog.
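For readers who want to see that encode-then-classify workflow in runnable form, here is a minimal sketch using scikit-learn; the sentences, labels, and model settings below are invented for illustration and are not taken from Mandy's episode.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a few labelled sentences (invented for illustration).
sentences = [
    "thank you for the helpful answer",
    "this explanation is clear and useful",
    "you are an idiot",
    "nobody wants to read your garbage",
]
labels = ["non-toxic", "non-toxic", "toxic", "toxic"]

# Encode the text as a document-term matrix: one row per sentence, one column per word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Fit a multinomial naive Bayes classifier on the word counts.
model = MultinomialNB()
model.fit(X, labels)

# Classify a new sentence by transforming it with the same vectorizer.
print(model.predict(vectorizer.transform(["what a clear and helpful answer"])))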

If you are not familiar with naive Bayes classifiers, then I encourage you to watch my video tutorial about this topic first.

Mitchell Boggs on Game Theory in Behavioural Ecology – The Central Equilibrium – Episode 8

Mitchell Boggs kindly talked about game theory in behavioural ecology on my talk show, “The Central Equilibrium”!  He discussed 2 key examples:

  • when animals choose to share or fight for food
  • when parents choose to care for their offspring or seek new mates to produce more offspring

These examples illustrate why seemingly disadvantageous behaviours can persist or even dominate in the animal kingdom.
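The first example is often formalized as the hawk-dove game, and a few lines of arithmetic show the point.  Here is a minimal sketch of my own (the payoff values V and C below are invented for illustration, and Mitchell's presentation in the episode may differ): when the cost of fighting exceeds the value of the food, a mixed population of fighters and sharers is stable, so the seemingly disadvantageous sharing behaviour persists.

# Hawk-dove game: V = value of the contested food, C = cost of losing a fight.
V, C = 4.0, 10.0  # illustrative values with C > V, i.e. fighting is costly

def expected_payoff(strategy, p_hawk):
    # Expected payoff against a population that plays "hawk" with probability p_hawk.
    if strategy == "hawk":
        return p_hawk * (V - C) / 2 + (1 - p_hawk) * V
    else:  # a "dove" shares with other doves and retreats from hawks
        return p_hawk * 0.0 + (1 - p_hawk) * V / 2

# At the mixed evolutionarily stable strategy, hawks and doves earn the same payoff: p* = V / C.
p_star = V / C
print(expected_payoff("hawk", p_star), expected_payoff("dove", p_star))  # both print 1.2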

Mitch recommends a book called “Are We Smart Enough to Know How Smart Animals Are?” by Frans de Waal.

Thanks for being such a great guest, Mitchell!

David Veitch on Rational vs. Irrational Numbers and Countability – The Central Equilibrium – Episode 7

I am so grateful that David Veitch appeared on my talk show, “The Central Equilibrium”, to talk about rational vs. irrational numbers.  While defining irrational numbers, he proved that \sqrt{2} is an irrational number.  He then talked about the concept of bijections while defining countability, and he showed that rational numbers are countable.
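For reference, here is an outline of the classic proof by contradiction that David presents; his notation in the video may differ.  Suppose that \sqrt{2} were rational, so that

\sqrt{2} = \frac{p}{q}, \ \ \text{where } p, q \in \mathbb{Z}, \ q \neq 0, \ \gcd(p, q) = 1.

2q^2 = p^2 \implies p \text{ is even, say } p = 2k

2q^2 = 4k^2 \implies q^2 = 2k^2 \implies q \text{ is even}

Both p and q being even contradicts \gcd(p, q) = 1, so no such fraction exists, and \sqrt{2} must be irrational.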

David used to work as a bond trader for Bank of America.  He writes a personal blog, and you can follow him on Twitter (@daveveitch).  He recently earned admission into the Master of Science program in statistics at the University of Toronto, and he will begin that program soon.  Congratulations, David!  Thanks for being a guest on my show!

Part 1

 

Part 2

Arnab Chakraborty on The Monty Hall Problem and Bayes’ Theorem – The Central Equilibrium – Episode 6

I am pleased to welcome Arnab Chakraborty back to my talk show, “The Central Equilibrium”, to talk about the Monty Hall Problem and Bayes’ theorem.  In this episode, he shows 2 solutions to this classic puzzle in probability and invokes Bayes’ theorem for the second solution.
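As a quick reference for the second solution, here is the Bayes’ theorem calculation, assuming that you picked Door 1 and that the host – who always opens a door hiding a goat – opened Door 3.  The 3 terms in the denominator correspond to the car being behind Doors 1, 2, and 3, respectively.

P(\text{Car behind Door 2} \mid \text{Host opens Door 3}) = \frac{P(\text{Host opens Door 3} \mid \text{Car behind Door 2}) \times P(\text{Car behind Door 2})}{P(\text{Host opens Door 3})}

P(\text{Car behind Door 2} \mid \text{Host opens Door 3}) = \frac{1 \times \frac{1}{3}}{\frac{1}{2} \times \frac{1}{3} + 1 \times \frac{1}{3} + 0 \times \frac{1}{3}} = \frac{2}{3}

Thus, switching to Door 2 wins the car with probability 2/3, while staying with Door 1 wins with probability only 1/3.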

If you have not watched Arnab’s first episode on Bayes’ theorem, then I encourage you to do that first.

Marilyn vos Savant provided a solution to this problem in PARADE Magazine in 1990-1991.  Thousands of readers disagreed with her solution and criticized her vehemently (and incorrectly) for her supposed error.  Some of these critics were mathematicians!  She included some of those replies in her subsequent columns and provided alternative explanations that led to the same conclusion.  Although I am dismayed by the disrespect that some people showed in their letters to her, I am glad that a magazine column on probability was able to attract so much readership and interest.  Arnab and I referred to one of her solutions in our episode.  Thank you, Marilyn!

Enjoy this episode of “The Central Equilibrium”!

Benjamin Garden on Simple vs. Compound Interest in Finance – The Central Equilibrium – Episode 5

I am so pleased to publish this new episode of “The Central Equilibrium”, featuring Benjamin Garden.  He talked about simple and compound interest in the context of finance and investment, highlighting the power of compound interest to grow your money and to enlarge debt from credit cards.  We compared the formulas for calculating the accrued amounts under simple and compound interest, and we derived the formula for the Rule of 72, a short-cut to estimate the length of time needed to double your investment under compound interest.
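For readers who want the derivation at a glance, here is an outline; Ben's presentation in the episode may differ in its details.  If an investment earning an annual interest rate r, compounded annually, doubles after t years, then

(1 + r)^t = 2

t = \frac{\ln 2}{\ln(1 + r)} \approx \frac{0.6931}{r} \ \ \text{for small } r,

and 0.6931/r is the same as 69.31/(100r) when r is written as a percentage.  The number 72 is used instead of 69.31 partly because it is close and has many convenient divisors.  For example, at 8% interest, the rule gives 72/8 = 9 years, while the exact answer is \ln 2 \div \ln(1.08) \approx 9.01 years.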

Check out Ben’s blog, Twitter account (@GardenBenjamin), and Instagram account (@ben.garden) to get more advice about managing your money!

Part 1:

 

Part 2:

Layne Newhouse on representing neural networks – The Central Equilibrium – Episode 4

I am excited to present the first of a multi-episode series on neural networks on my talk show, “The Central Equilibrium”.  My guest in this series is Layne Newhouse, and he talked about how to represent neural networks.  We talked about the biological motivations behind neural networks, how to represent them in diagrams and mathematical equations, and a few of the common activation functions for neural networks.
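As a concrete companion to the episode, here is a minimal sketch of one layer's forward pass with a few common activation functions; the weights, biases, and inputs below are invented for illustration, and Layne's notation in the episode may differ.

import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Keeps positive values and zeroes out negative ones.
    return np.maximum(0.0, z)

# One layer of a neural network: weights times inputs, plus biases, then an activation function.
x = np.array([0.5, -1.2, 3.0])        # 3 inputs
W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.2]])      # 2 neurons, each with 3 weights
b = np.array([0.1, -0.3])             # 1 bias per neuron

z = W @ x + b                         # pre-activation values for the 2 neurons
print(sigmoid(z))                     # outputs under the sigmoid activation
print(relu(z))                        # outputs under the ReLU activation
print(np.tanh(z))                     # outputs under the hyperbolic tangent activation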

Check it out!

Video Tutorial – Obtaining the Expected Value of the Exponential Distribution Using the Moment Generating Function

In this video tutorial on YouTube, I use the exponential distribution’s moment generating function (MGF) to obtain the expected value of this distribution.  Visit my YouTube channel to watch more video tutorials!
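In outline, the calculation uses the fact that the first derivative of the MGF, evaluated at t = 0, gives the expected value.  Using the rate parametrization with rate \lambda (the video may use a different parametrization):

M_X(t) = \frac{\lambda}{\lambda - t}, \ \ t < \lambda

M_X'(t) = \frac{\lambda}{(\lambda - t)^2}

E(X) = M_X'(0) = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}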

Video Tutorial – The Moment Generating Function of the Exponential Distribution

In this video tutorial on YouTube, I derive the moment generating function (MGF) of the exponential distribution.  Visit my YouTube channel to watch more video tutorials!
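In outline, again using the rate parametrization with rate \lambda (the video may use a different parametrization):

M_X(t) = E(e^{tX}) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^\infty e^{-(\lambda - t)x} \, dx = \frac{\lambda}{\lambda - t}, \ \ \text{for } t < \lambda.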

Christopher Salahub on Markov Chains – The Central Equilibrium – Episode 2

It was a great pleasure to talk to Christopher Salahub about Markov chains in the second episode of my new talk show, The Central Equilibrium!  Chris graduated from the University of Waterloo with a Bachelor of Mathematics degree in statistics.  He just finished an internship in data development at Environics Analytics, and he is starting a Master’s program in statistics at ETH Zurich in Switzerland.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please watch the video on this blog.  You can also watch it directly on YouTube.

Neil Seoni on the Fourier Transform and the Sampling Theorem – The Central Equilibrium – Episode 1

I am very excited to publish the very first episode of my new talk show, The Central Equilibrium!  My guest is Neil Seoni, an undergraduate student in electrical and computer engineering at Rice University in Houston, Texas. He has studied data science in his spare time, most notably taking a course on machine learning by Andrew Ng on Coursera. He is finishing his summer job as a Data Science Intern at Environics Analytics in Toronto, Ontario.

Neil recommends reading Don Johnson’s course notes from Rice University and his free textbook to learn more about the topics covered in his episode.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please watch the video on this blog.  You can also watch it directly on YouTube.

Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

[Screenshot: odds of all 15 home teams winning on same day.  Captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.]

Out of curiosity, I wanted to reproduce this result.  This event is the intersection of 15 independent events – one Bernoulli trial per game – each with a probability of 0.5 that the home team wins.

P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

P(\text{All 15 Home Teams Win}) = \prod_{i = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)

P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757

Now, let’s connect this probability to odds.

It is important to note that

  • odds are only applicable to Bernoulli random variables (i.e. binary events)
  • odds are the ratio of the probability of success to the probability of failure

For our example,

\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})

\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)

\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185

The above article states that the odds are 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds that I just calculated; rather, it is the probability of all 15 home teams winning.  Thus, the article mislabels this probability as the odds.
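A few lines of code make the distinction concrete; this is just a quick verification of the arithmetic above, written in Python.

p = 0.5 ** 15          # probability that all 15 home teams win
odds = p / (1 - p)     # odds = P(success) / P(failure)

print(p)               # 3.0517578125e-05, which equals 1/32768 exactly
print(odds)            # about 3.0518508e-05, i.e. 1/32767 - slightly larger than the probability
print(1 / 32768)       # the figure quoted in the article matches the probability, not the odds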

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

Mathematical Statistics Lesson of the Day – Basu’s Theorem

Today’s Statistics Lesson of the Day will discuss Basu’s theorem, which connects the previously discussed concepts of minimally sufficient statistics, complete statistics and ancillary statistics.  As before, I will begin with the following set-up.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF) or probability mass function (PMF) for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistic based on \mathbf{X}.

Basu’s theorem states that, if T(\textbf{X}) is a complete and minimal sufficient statistic, then T(\textbf{X}) is independent of every ancillary statistic.

Establishing the independence between 2 random variables can be very difficult if their joint distribution is hard to obtain.  This theorem allows the independence between a complete, minimally sufficient statistic and every ancillary statistic to be established without obtaining their joint distribution – and this is the great utility of Basu’s theorem.
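A standard illustration of this utility (stated here as a supplementary example, not as part of the set-up above): let X_1, X_2, ..., X_n be independent N(\mu, \sigma^2) random variables with \sigma^2 known, so that \theta = \mu.  The sample mean

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

is a complete and minimally sufficient statistic for \mu, while the sample variance

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

is ancillary, because its distribution depends only on the known \sigma^2 and not on \mu.  Basu’s theorem then implies that \bar{X} and S^2 are independent – a famous result that would otherwise require deriving their joint distribution.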

However, establishing that a statistic is complete can be a difficult task.  In a later lesson, I will discuss another theorem that will make this task easier for certain cases.

Mathematics and Applied Statistics Lesson of the Day – Contrasts

A contrast is a linear combination of a set of variables such that the sum of the coefficients is equal to zero.  Notationally, consider a set of variables

\mu_1, \mu_2, ..., \mu_n.

Then the linear combination

c_1 \mu_1 + c_2 \mu_2 + ... + c_n \mu_n

is a contrast if

c_1 + c_2 + ... + c_n = 0.

There is a reason for why I chose to use \mu as the symbol for the variables in the above notation – in statistics, contrasts provide a very useful framework for comparing multiple population means in hypothesis testing.  In a later Statistics Lesson of the Day, I will illustrate some examples of contrasts, especially in the context of experimental design.
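In the meantime, here is one minimal illustration.  With 3 population means, the linear combination

\mu_1 - \tfrac{1}{2}\mu_2 - \tfrac{1}{2}\mu_3

is a contrast, because its coefficients are c_1 = 1, c_2 = -\tfrac{1}{2}, and c_3 = -\tfrac{1}{2}, which sum to zero; it compares the first population mean with the average of the other two.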

Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistic based on \mathbf{X}.

If

E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,

implies that

P\{g[T(\mathbf{X})] = 0\} = 1,

then T(\mathbf{X}) is said to be complete.  To deconstruct this esoteric mathematical statement,

  1. let g(t) be a measurable function
  2. if you want to use g[T(\mathbf{X})] to form an unbiased estimator of the zero function,
  3. and if the only such function is almost surely equal to the zero function,
  4. then T(\mathbf{X}) is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.
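In the meantime, a standard example may make the definition less abstract.  Let X_1, X_2, ..., X_n be independent Bernoulli(\theta) random variables with 0 < \theta < 1, and let T(\mathbf{X}) = \sum_{i=1}^{n} X_i, which follows a Binomial(n, \theta) distribution.  Then

E_\theta \{g[T(\mathbf{X})]\} = \sum_{t=0}^{n} g(t) \binom{n}{t} \theta^t (1 - \theta)^{n - t}.

If this expression equals zero for every \theta in (0, 1), then – after factoring out (1 - \theta)^n – it is a polynomial in \theta \div (1 - \theta) that vanishes on an entire interval, so every coefficient g(t) \binom{n}{t} must be zero.  Hence g(t) = 0 for t = 0, 1, ..., n, and T(\mathbf{X}) is a complete statistic.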

*The above definition holds for both discrete and continuous random variables; for a discrete random variable, replace the PDF with the probability mass function (PMF).

Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In a comment on one of my earlier posts, he wisely pointed out that the sample median cannot be a sufficient statistic.  He has since supplemented that comment with a post on his own blog that shows why the sample median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, X_1, X_2, ..., X_n, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all n variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, T(\textbf{X}), such that

  1. T(\textbf{X}) is a sufficient statistic
  2. if U(\textbf{X}) is any other sufficient statistic, then there exists a function g such that

T(\textbf{X}) = g[U(\textbf{X})].

Note that, if there exists a one-to-one function h such that

T(\textbf{X}) = h[U(\textbf{X})],

then T(\textbf{X}) and U(\textbf{X}) are equivalent.
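As a quick example, if X_1, X_2, ..., X_n are independent N(\theta, 1) random variables, then the entire data set is a sufficient statistic, and so is the sample mean

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.

The sample mean is in fact minimally sufficient: it can be written as a function of any other sufficient statistic, and, in particular, it is simply the averaging function applied to the full data set.  (I state this example without proof; verifying minimal sufficiency usually relies on a separate criterion based on ratios of likelihood functions.)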

Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.

Let

t = T(\mathbf{X})

be a statistic based on \mathbf{X}.  Let g_\theta(t) be the PDF for T(\mathbf{X}).

If the conditional PDF

h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]

is independent of \theta, then T(\mathbf{X}) is a sufficient statistic for \theta.  In other words,

h_\theta(\mathbf{X}) = h(\mathbf{X}),

and \theta does not appear in h(\mathbf{X}).

Intuitively, this means that T(\mathbf{X}) contains everything you need to estimate \theta, so knowing T(\mathbf{X}) (i.e. conditioning f_\theta(x) on T(\mathbf{X})) is sufficient for estimating \theta.

Often, the sufficient statistic for \theta is a summary statistic of X_1, X_2, ..., X_n, such as their

  • sample mean
  • sample median – removed thanks to comment by Christian Robert (Xi’an)
  • sample minimum
  • sample maximum

If such a summary statistic is sufficient for \theta, then knowing this one statistic is just as useful as knowing all n data for estimating \theta.
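Here is a quick example of the definition in action.  Let X_1, X_2, ..., X_n be independent Bernoulli(\theta) random variables, and let T(\mathbf{X}) = \sum_{i=1}^{n} X_i.  Then

h_\theta(\mathbf{X}) = \theta^{\sum_i x_i} (1 - \theta)^{n - \sum_i x_i} \div \left[ \binom{n}{\sum_i x_i} \theta^{\sum_i x_i} (1 - \theta)^{n - \sum_i x_i} \right] = 1 \div \binom{n}{\sum_{i=1}^{n} x_i},

which does not involve \theta, so the number of successes T(\mathbf{X}) is a sufficient statistic for \theta.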

*The above definition holds for both discrete and continuous random variables; for a discrete random variable, replace the PDF with the probability mass function (PMF).

Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function f(x) that is continuous on the interval [x_1, x_2], where x_1 and x_2 are any 2 points in the domain of f(x).  Let

x_m = 0.5x_1 + 0.5x_2

be the midpoint of x_1 and x_2.  Then, if

f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),

then f(x) is defined to be midpoint convex.

More generally, let’s consider any point within the interval [x_1, x_2].  We can denote this arbitrary point as

x_\lambda = \lambda x_1 + (1 - \lambda)x_2, where 0 < \lambda < 1.

Then, if

f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be convex.  If

f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),

whenever x_1 \neq x_2, then f(x) is defined to be strictly convex.

There is a very elegant and powerful result about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable Y with a finite expected value and for any convex function g(y),

E[g(Y)] \geq g[E(Y)].
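For example, applying Jensen’s inequality to the convex function g(y) = y^2 gives

E(Y^2) \geq [E(Y)]^2,

which is equivalent to the familiar fact that \text{Var}(Y) = E(Y^2) - [E(Y)]^2 \geq 0.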

A function f(x) is defined to be concave if -f(x) is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable Z with a finite expected value and for any concave function h(z),

E[h(Z)] \leq h[E(Z)].

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced the cumulative distribution function (CDF) and the empirical cumulative distribution function (empirical CDF).

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of n independent and identically distributed random variables, X_1, X_2, ..., X_n,

P[\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.

In other words, the empirical CDF of X_1, X_2, ..., X_n converges uniformly to the true CDF with probability 1 (i.e. almost surely).
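Here is a quick numerical illustration of this uniform convergence – a sketch of my own using simulated standard normal data with NumPy and SciPy; the sample sizes are arbitrary.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

for n in [100, 1000, 10000, 100000]:
    x = np.sort(rng.normal(size=n))        # simulated standard normal data
    F = norm.cdf(x)                        # true CDF evaluated at the sorted data points
    upper = np.arange(1, n + 1) / n - F    # empirical CDF just after each data point, minus the true CDF
    lower = F - np.arange(0, n) / n        # true CDF minus the empirical CDF just before each data point
    print(n, np.max(np.concatenate([upper, lower])))  # the supremum distance shrinks toward 0 as n grows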

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or the “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.