Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial in my Youtube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

You can also watch the video below the fold!

Read more of this post

Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

  • 5% in the 1st year
  • 10% in the 2nd year
  • 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of 1 + r, where r is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, G.  The average factor of growth from the 3 years is

G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n},

where r_i is the rate of return in year i, i = 1, 2, 3, ..., n.  The average annual rate of return is G - 1.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.


Returning to our example, our average factor of growth is

G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242.

Thus, our annual rate of return is G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%.


Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

  1. The first set, S_1, consists of your data x_1, x_2, ..., x_n, and this set has a sample size of n.
  2. The second, S_2,  set also has a sample size of n, but all n values are the same – let’s call this common value y.
  • What number must y be such that the sums in S_1 and S_2 are equal?  This value of y is the arithmetic mean of the first set.
  • What number must y be such that the products in S_1 and S_2 are equal?  This value of y is the geometric mean of the first set.

Note that the geometric means is only applicable to positive numbers.

Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the cars travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i).


For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

  1. 60 km/hr for 100 km
  2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]

= 46.45 \text{ km/hr}


Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}

Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.  A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of \text{Sensitivity} on the vertical axis and 1 - \text{Specificity} on the horizontal axis, the line of no discrimination is the line that passes through the points (\text{Sensitivity} = 0, 1 - \text{Specificity} = 0) and (\text{Sensitivity} = 1, 1 - \text{Specificity} = 1).  In other words, the line of discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we did not have any predictors, then we can rely on only random guessing, and a random variable with the distribution \text{Bernoulli}(0.5) is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard of a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher“, I mean that, for a given value of 1 - \text{Specificity}, the \text{Sensitivity} of the binary classifier is higher than the \text{Sensitivity} of the line of no discrimination.)

Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

y = mx + b,

with y being called the dependent variable and x being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables X_1 and X_2 are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for Y and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for x_1, x_2, .., x_p.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a p\text{th}-order polynomial relationship between the response variable Y and the explanatory variable x.

Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).

Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, H, for n positive real numbers x_1, x_2, ..., x_n is defined as

H = n \div (1/x_1 + 1/x_2 + .. + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

  1. 60 km/hr for 120 km
  2. 40 km/hr for another 120 km

Then its average speed for this trip is

S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

S_{avg} = \Delta \text{distance} \div \Delta \text{time}

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})

\Delta \text{time} = 5 \text{ hours}


S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!


Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Can Apply to the Sum

The central limit theorem (CLT) is often stated in terms of the sample mean of independent and identically distributed random variables.  An often unnoticed or forgotten aspect of the CLT is its applicability to the sample sum of those variables, too.  Since n, the sample size, is just a constant, it can be multiplied to \bar{X} to obtain \sum_{i = 1}^{n} X_i.  For a sufficiently large n, this new statistic still has an approximately normal distribution, just with a new expected value and a new variance.

\sum_{i = 1}^{n} X_i \overset{approx.}{\sim} \text{Normal} (n\mu, n\sigma^2)

Applied Statistics Lesson of the Day – What “Linear” in Linear Regression Really Means

Linear regression is one of the most commonly used tools in statistics, yet one of its fundamental features is commonly misunderstood by many non-statisticians.  I have witnessed this misunderstanding on numerous occasions in my work experience in statistical consulting and statistical education, and it is important for all statisticians to be aware of this common misunderstanding, to anticipate it when someone is about to make this mistake, and to educate that person about the correct meaning.

Consider the simple linear regression model:

Y = \beta_0 + \beta_1x + \varepsilon.

The “linear” in linear regression refers to the linearity between the response variable (Y) and the regression coefficients (\beta_0 and \beta_1).  It DOES NOT refer to the linearity between the response variable (Y) and the explanatory variable (x) This is contrary to mathematical descriptions of linear relationships; for example, when high school students learn about the equation of a line,

y = mx + b

the relationship is called “linear” because of the linearity between y and x.  This is the source of the mistaken understanding about the meaning of “linear” in linear regression; I am grateful that my applied statistics professor, Dr. Boxin Tang, emphasized the statistical meaning of “linear” when he taught linear regression to me.

Why is this difference in terminology important?  A casual observer may be puzzled by this apparent nit-picking of the semantics.  This terminology is important because the estimation of the regression coefficients in a regression model depends on the relationship between the response variable and the regression coefficients.  If this relationship is linear, then the estimation is very simple and can be done analytically by linear algebra.  If not, then the estimation can be very difficult and often cannot be done analytically – numerical methods must be used, instead.

Now, one of the assumptions of linear regression is the linearity between the response variable (Y) and the explanatory variable (x).  However, what if the scatter plot of Y versus x reveals a non-linear relationship, such as a quadratic relationship?  In that case, the solution is simple – just replace x with x^2.  (Admittedly, if the interpretation of the regression coefficient is important, then such interpretation becomes more difficult with this transformation.  However, if prediction of the response is the key goal, then such interpretation is not necessary, and this is not a problem.)  The important point is that the estimation of the regression coefficients can still be done by linear algebra after the transformation of the explantory variable.

Determining chemical concentration with standard addition: An application of linear regression in JMP – A Guest Blog Post for the JMP Blog

I am very excited to announce that I have been invited by JMP to be a guest blogger for its official blog!  My thanks to Arati Mejdal, Global Social Media Manager for the JMP Division of SAS, for welcoming me into the JMP blogging community with so much support and encouragement, and I am pleased to publish my first post on the JMP Blog!  Mark Bailey and Byron Wingerd from JMP provided some valuable feedback to this blog post, and I am fortunate to get the chance to work with and learn from them!

Following the tradition of The Chemical Statistician, this post combines my passions for statistics and chemistry by illustrating how simple linear regression can be used for the method of standard addition in analytical chemistry.  In particular, I highlight the useful capability of the “Inverse Prediction” function under “Fit Model” platform in JMP to estimate the predictor given an observed response value (i.e. estimate the value of x_i given y_i).  Check it out!

JMP blog post - standard addition

Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves

A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the \text{Sensitivity} (the true positive rate) versus 1 - \text{Specificity} (1 minus the true negative rate) of a binary classifier while varying its discrimination threshold.  In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy of the logistic regression model.

To illustrate with an example, let’s consider the Bernoulli response variable Y and the covariates X_1, X_2, ..., X_p.  A logistic regression model takes the covariates as inputs and returns P(Y = 1).  You as the user of the model must decide above which value of P(Y = 1) you will predict that Y = 1; this value is the discrimination threshold.  A common threshold is P(Y = 1) = 0.5.

Once you finish fitting the model with a training set, you can construct an ROC curve by following these steps below:

  1. Set a discrimination threshold.
  2. Use the covariates to predict Y for each observation in a validation set.
  3. Since you have the actual response values in the validation set, you can then calculate the sensitivity and specificity for your logistic regression model at that threshold.
  4. Repeat Steps 1-3 with a new threshold.
  5. Plot the values of \text{Sensitivity} versus 1 - \text{Specificity} for all thresholds.  The result is your ROC curve.

The use of a validation set to assess the predictive accuracy of a model is called validation, and it is a good practice for supervised learning.  If you have another fresh data set, it is also good practice to use that as a test set to assess the predictive accuracy of your model.

Note that you can perform Steps 2-5 for the training set, too – this is often done in statistics when you don’t have many data to work with, and the best that you can do is to assess the predictive accuracy of your model on the data set that you used to fit the model.

Presentation on Statistical Genetics at Vancouver SAS User Group – Wednesday, May 28, 2014

I am excited and delighted to be invited to present at the Vancouver SAS User Group‘s next meeting.  I will provide an introduction to statistical genetics; specifically, I will

  • define basic terminology in genetics
  • explain the Hardy-Weinberg equilibrium in detail
  • illustrate how Pearson’s chi-squared goodness-of-fit test can be used in PROC FREQ in SAS to check the Hardy-Weinberg equilibrium
  • illustrate how the Newton-Raphson algorithm can be used for maximum likelihood estimation in PROC IML in SAS

Eric Cai - Official Head Shot








You can register for this meeting here.  The meeting’s coordinates are

9:00am – 3:00pm

Wednesday, May 28th, 2014

BC Cancer Agency Research Centre

675 West 10th Avenue.

Vancouver, BC


If you will attend this meeting, please feel free to come up and say “Hello!”.  I look forward to meeting you!

Machine Learning and Applied Statistics Lesson of the Day – Sensitivity and Specificity

To evaluate the predictive accuracy of a binary classifier, two useful (but imperfect) criteria are sensitivity and specificity.

Sensitivity is the proportion of truly positives cases that were classified as positive; thus, it is a measure of how well your classifier identifies positive cases.  It is also known as the true positive rate.  Formally,

\text{Sensitivity} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives + Number of False Negatives)}


Specificity is the proportion of truly negative cases that were classified as negative; thus, it is a measure of how well your classifier identifies negative cases.  It is also known as the true negative rate.  Formally,

\text{Specificity} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives + Number of False Positives)}

Applied Statistics Lesson and Humour of the Day – Type I Error (False Positive) and Type 2 Error (False Negative)

In hypothesis testing,

  • a Type 1 error is the rejection of the null hypothesis when it is actually true
  • a Type 2 error is the acceptance of the null hypothesis when it is actually false.  (Some statisticians prefer to say “failure to reject” rather than “accept” the null hypothesis for Type 2 errors.)

A Type 1 error is also known as a false positive, and a Type 2 error is also known as a false negative.  This nomenclature comes from the conventional connotation of

  • the null hypothesis as the “negative” or the “boring” result
  • the alternative hypothesis as the “positive” or “exciting” result.

A great way to illustrate the meaning and the intuition of Type 1 errors and Type 2 errors is the following cartoon.


Source of Image: Effect Size FAQs by Paul Ellis

In this case, the null hypothesis (or the “boring” result) is “You’re not pregnant”, and the alternative hypothesis (or the “exciting” result) is “You’re pregnant!”.


This is the most effective way to explain Type 1 error and Type 2 error that I have encountered!

Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

Y = \Phi \beta + \varepsilon,

where \Phi_{ij} = \phi_j (x_i).  In other words, \Phi is the design matrix, and the element in row i and column j of this design matrix is the i\text{th} predictor being evaluated in the j\text{th} basis function.  (In this case, there is 1 predictor per datum.)

Applying the method of least squares, the coefficient vector, \beta, can be estimated by

\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not X, but \Phi.

If you are not familiar with how \hat{\beta} was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.

Applied Statistics Lesson of the Day – Notation for Fractional Factorial Designs

Fractional factorial designs use the L^{F-p} notation; unfortunately, this notation is not clearly explained in most textbooks or web sites about experimental design.  I hope that my explanation below is useful.

  • L is the number of levels in each factor; note that the L^{F-p} notation assumes that all factors have the same number of levels.
    • If a factor has 2 levels, then the levels are usually coded as +1 and -1.
    • If a factor has 3 levels, then the levels are usually coded as +1, 0, and -1.
  • F is the number of factors in the experiment
  • p is the number of times that the full factorial design is fractionated by L.  This number is badly explained by most textbooks and web sites that I have seen, because they simply say that p is the fraction – this is confusing, because a fraction has a numerator and a denominator, and p is just 1 number.  To clarify,
    • the fraction is L^{-p}
    • the number of treatments in the fractional factorial design is L^{-p} multiplied by the total possible number of treatments in the full factorial design, which is L^F.

If all L^F possible treatments are used in the experiment, then a full factorial design is used.  If a fractional factorial design is used instead, then L^{-p} denotes the fraction of the L^F treatments that is used.

Most factorial experiments use binary factors (i.e. factors with 2 levels, L = 2).  Thus,

  • if p = 1, then the fraction of treatments that is used is 2^{-1} = 1/2.
  • if p = 2, then the fraction of treatments that is used is 2^{-2} = 1/4.

This is why

  • a 2^{F-1} design is often called a half-fraction design.
  • a 2^{F-2} design is often called a quarter-fraction design.

However, most sources that I have read do not bother to mention that L can be greater than 2; experiments with 3-level factors are less frequent but still common.  Thus, the terms half-fraction design and half-quarter design only apply to binary factors.  If L = 3, then

  • a 3^{F-1} design uses one-third of all possible treatments.
  • a 3^{F-2} design uses one-ninth of all possible treatments.

Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

  • n random variables, X_1, ..., X_n, are independent and identically distributed,
  • n is sufficiently large,

then the distribution of their sample mean, \bar{X_n}, is approximately normal, and this approximation is better as n increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of X_1, ..., X_n.  In my new Youtube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.

You can also watch the video below the fold!

Read more of this post

Applied Statistics Lesson of the Day – Fractional Factorial Design and the Sparsity-of-Effects Principle

Consider again an experiment that seeks to determine the causal relationships between G factors and the response, where G > 1.  Ideally, the sample size is large enough for a full factorial design to be used.  However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead.  Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.  

When, then, are the significant causal relationships?  According to the sparsity-of-effects principle, it is unlikely that complex, higher-order effects exist, and that the most important effects are the lower-order effects.  Thus, assign the experimental units so that main (1st-order) effects and the 2nd-order interaction effects can be investigated.  This may neglect the discovery of a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the sample size available is low and the number of possible treatments is high.  

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if X_1, X_2, ..., X_n is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.

Applied Statistics Lesson of the Day – The Independent 2-Sample t-Test with Unequal Variances (Welch’s t-Test)

A common problem in statistics is determining whether or not the means of 2 populations are equal.  The independent 2-sample t-test is a popular parametric method to answer this question.  (In an earlier Statistics Lesson of the Day, I discussed how data collected from a completely randomized design with 1 binary factor can be analyzed by an independent 2-sample t-test.  I also discussed its possible use in the discovery of argon.)  I have learned 2 versions of the independent 2-sample t-test, and they differ on the variances of the 2 samples.  The 2 possibilities are

  • equal variances
  • unequal variances

Most statistics textbooks that I have read elaborate at length about the independent 2-sample t-test with equal variances (also called Student’s t-test).  However, the assumption of equal variances needs to be checked using the chi-squared test before proceeding with the Student’s t-test, yet this check does not seem to be universally done in practice.  Furthermore, conducting one test based on the results of another results possible inflation of Type 1 error (Ruxton, 2006).

Some books give due attention to the independent 2-sample t-test with unequal variances (also called Welch’s t-test), but some barely mention its value, and others do not even mention it at all.  I find this to be puzzling, because the assumption of equal variances is often violated in practice, and Welch’s t-test provides an easy solution to this problem.  There is a seemingly intimidating but straightforward calculation to approximate the number of degrees of freedom for Welch’s t-test, and this calculation is automatically incorporated in most software, including R and SAS.  Finally, Welch’s t-test removes the need to check for equal variances, and it is almost as powerful as Student’s t-test when the variances are equal (Ruxton, 2006).

For all of these reasons, I recommend Welch’s t-test when using the parametric approach to comparing the means of 2 populations.



Graeme D. Ruxton.  “The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test“.  Behavioral Ecology (July/August 2006) 17 (4): 688-690 first published online May 17, 2006


Get every new post delivered to your Inbox.

Join 303 other followers