Statistics Lesson and Warning of the Day – Confusion Between the Median and the Average

Yesterday, I attended an interesting seminar called “Transforming Healthcare through Big Data” at the Providence Health Care Research Institute’s 2014 Research Day.  The seminar was delivered by Martin Kohn from Jointly Health, and I enjoyed it overall.  However, I noticed a glaring error about basic statistics that needs correction.

Martin wanted to highlight the overconfidence that many doctors have about their abilities, and he quoted Vinod Khosla, the co-founder of Sun Microsystems, who said, “50% of doctors are below average.”  Martin then presented a study showing an absurdly high percentage of doctors who think that they are “above average”.  A Twitter conversation between attendees of a TED conference in San Francisco and Vinod himself confirms this quotation.

The statement “50% of doctors are below average” is wrong in general.  By definition, 50% of any population is below the median, and the median is only guaranteed to equal the average if the population’s distribution is symmetric.  (Examples of symmetric probability distributions are the normal distribution and the Student’s t-distribution.)  Vinod meant to say that “50% of doctors are below the median”, and he confirmed this in the aforementioned Twitter conversation; I am disappointed that he justified this mistake by claiming that “median” would be less understood.  I think that a TED audience would know what “median” means, and those who don’t can easily look up its meaning online or in books.
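
To see how far apart the two statistics can be, here is a minimal Python sketch with made-up, right-skewed numbers.  The data are purely illustrative; they do not come from the seminar or from any study of doctors.  The helper share_below() is a name that I chose for this example.

```python
from statistics import mean, median

# Hypothetical right-skewed population: one extreme value pulls the mean upward.
scores = [1, 2, 3, 4, 5, 6, 7, 100]

def share_below(values, cutoff):
    """Fraction of values strictly below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)

avg = mean(scores)      # 128 / 8 = 16
med = median(scores)    # midpoint of 4 and 5 = 4.5

below_mean = share_below(scores, avg)     # 7 of the 8 values
below_median = share_below(scores, med)   # 4 of the 8 values

print(f"mean = {avg}, share below mean = {below_mean:.1%}")
print(f"median = {med}, share below median = {below_median:.1%}")
```

In this toy population, half of the values are below the median, but seven out of eight are below the average.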

In communicating truth, let’s use the correct vocabulary.

Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

Y = \Phi \beta + \varepsilon,

where \Phi_{ij} = \phi_j (x_i).  In other words, \Phi is the design matrix, and the element in row i and column j of this design matrix is the j\text{th} basis function evaluated at the i\text{th} predictor value, x_i.  (In this case, there is 1 predictor per datum.)

Applying the method of least squares, the coefficient vector, \beta, can be estimated by

\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not X, but \Phi.
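
As a rough illustration, the estimator can be computed in plain Python.  This is only a sketch under several assumptions of mine: the basis functions are the unnormalised Gaussian kernels \exp[-(x - \mu_j)^2 / (2\sigma^2)], the centres, the width and the data are invented, the data are noise-free, and the helper names (gaussian_basis, design_matrix, solve, least_squares) are hypothetical.  It solves the normal equations by Gaussian elimination instead of forming the inverse explicitly.

```python
import math

def gaussian_basis(x, mu, sigma):
    """Unnormalised Gaussian basis function exp(-(x - mu)^2 / (2 sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def design_matrix(xs, centres, sigma):
    """Phi[i][j] = phi_j(x_i)."""
    return [[gaussian_basis(x, mu, sigma) for mu in centres] for x in xs]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Assumes a is square and non-singular (otherwise a division by zero occurs).
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def least_squares(phi, y):
    """Return beta_hat that solves (Phi^T Phi) beta = Phi^T y."""
    k = len(phi[0])
    gram = [[sum(row[p] * row[q] for row in phi) for q in range(k)] for p in range(k)]
    moment = [sum(row[p] * yi for row, yi in zip(phi, y)) for p in range(k)]
    return solve(gram, moment)

# Hypothetical noise-free data generated from known coefficients.
centres = [0.0, 1.0, 2.0]
sigma = 0.5
true_beta = [1.0, -2.0, 0.5]
xs = [0.25 * i for i in range(9)]   # 0.0, 0.25, ..., 2.0
phi = design_matrix(xs, centres, sigma)
y = [sum(b * p for b, p in zip(true_beta, row)) for row in phi]

beta_hat = least_squares(phi, y)
print(beta_hat)
```

Because the toy targets contain no noise, the estimate recovers the coefficients used to generate them, up to floating-point error.  With real data, a numerically stabler least-squares routine from a linear algebra library would be a better choice than this hand-rolled solver.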

If you are not familiar with how \hat{\beta} was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.
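
For convenience, here is a brief sketch of that derivation, assuming that \Phi^{T} \Phi is invertible.  The symbol S(\beta) for the sum of squared errors is my own notation.  The method of least squares minimizes

S(\beta) = (Y - \Phi \beta)^{T} (Y - \Phi \beta).

Differentiating with respect to \beta gives

\frac{\partial S}{\partial \beta} = -2 \Phi^{T} (Y - \Phi \beta).

Setting this derivative to zero yields the normal equations,

\Phi^{T} \Phi \hat{\beta} = \Phi^{T} Y,

and multiplying both sides on the left by (\Phi^{T} \Phi)^{-1} gives

\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y.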

Applied Statistics Lesson of the Day – Notation for Fractional Factorial Designs

Fractional factorial designs use the L^{F-p} notation; unfortunately, this notation is not clearly explained in most textbooks or web sites about experimental design.  I hope that my explanation below is useful.

  • L is the number of levels in each factor; note that the L^{F-p} notation assumes that all factors have the same number of levels.
    • If a factor has 2 levels, then the levels are usually coded as +1 and -1.
    • If a factor has 3 levels, then the levels are usually coded as +1, 0, and -1.
  • F is the number of factors in the experiment
  • p is the number of times that the full factorial design is fractionated by L.  This number is badly explained by most textbooks and web sites that I have seen, because they simply say that p is the fraction – this is confusing, because a fraction has a numerator and a denominator, and p is just 1 number.  To clarify,
    • the fraction is L^{-p}
    • the number of treatments in the fractional factorial design is L^{-p} multiplied by the total possible number of treatments in the full factorial design, which is L^F.

If all L^F possible treatments are used in the experiment, then a full factorial design is used.  If a fractional factorial design is used instead, then L^{-p} denotes the fraction of the L^F treatments that is used.

Most factorial experiments use binary factors (i.e. factors with 2 levels, L = 2).  Thus,

  • if p = 1, then the fraction of treatments that is used is 2^{-1} = 1/2.
  • if p = 2, then the fraction of treatments that is used is 2^{-2} = 1/4.

This is why

  • a 2^{F-1} design is often called a half-fraction design.
  • a 2^{F-2} design is often called a quarter-fraction design.

However, most sources that I have read do not bother to mention that L can be greater than 2; experiments with 3-level factors are less frequent but still common.  Thus, the terms half-fraction design and quarter-fraction design only apply to binary factors.  If L = 3, then

  • a 3^{F-1} design uses one-third of all possible treatments.
  • a 3^{F-2} design uses one-ninth of all possible treatments.
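
The arithmetic behind the notation can be checked with a short Python sketch.  The function name fractional_factorial_size and the example designs are my own choices for illustration; the calculation itself is just L^{-p} multiplied by L^F.

```python
from fractions import Fraction

def fractional_factorial_size(levels, factors, p):
    """Return (fraction used, number of treatments) for an L^(F-p) design."""
    if not 0 <= p < factors:
        raise ValueError("p must satisfy 0 <= p < F")
    full = levels ** factors               # L^F treatments in the full design
    fraction = Fraction(1, levels ** p)    # L^(-p)
    return fraction, int(fraction * full)  # L^(F-p) treatments are used

for levels, factors, p in [(2, 5, 1), (2, 5, 2), (3, 4, 1), (3, 4, 2)]:
    fraction, runs = fractional_factorial_size(levels, factors, p)
    print(f"{levels}^({factors}-{p}): fraction {fraction}, "
          f"{runs} of {levels ** factors} treatments")
```

For example, a 2^{5-1} design uses 16 of the 32 possible treatments, and a 3^{4-2} design uses 9 of the 81 possible treatments.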

Inorganic Chemistry Lesson of the Day – Coordination Complexes

A coordination complex is a compound that consists of Lewis bases bonded to a Lewis acid in its centre.  The charge of the complex can be neutral, positive, or negative; if the complex has a positive or a negative charge, then it is called a complex ion.  The Lewis acid is almost always a metal atom or a metal ion.  The Lewis bases are called ligands, and they are often covalently bonded to the Lewis acid.  Common ligands include carbon monoxide, water, and ammonia; what unifies them is the existence of at least one lone pair of electrons in their outermost energy level, and this lone pair of electrons is donated to the Lewis acid.

Some key terminology:

  • The donor atom is the atom within the ligand that is attached to the Lewis acid centre.
  • The coordination number is the number of donor atoms in the coordination complex.
  • The denticity of a ligand is the number of bonds that it forms with the Lewis acid centre.
    • If a ligand forms 1 bond with the Lewis acid centre, then it is monodentate (sometimes called unidentate).
    • If a ligand forms multiple bonds with the Lewis acid centre, then the ligand is polydentate.  For example, a bidentate ligand forms 2 bonds with the Lewis acid centre.
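
To connect these terms, here is a small Python sketch that adds up the denticities of the ligands to get the coordination number.  It assumes that every ligand binds through all of its donor atoms, which is a simplification; the lookup table and the function name coordination_number are my own illustrative choices.

```python
# Usual denticities of a few well-known ligands
# (number of donor atoms that bind to the Lewis acid centre).
DENTICITY = {
    "NH3": 1,   # ammonia, monodentate
    "H2O": 1,   # water, monodentate
    "CO": 1,    # carbon monoxide, monodentate
    "en": 2,    # ethylenediamine, bidentate
    "EDTA": 6,  # ethylenediaminetetraacetate, hexadentate
}

def coordination_number(ligands):
    """Sum of denticity x count over all ligands, given as {ligand: count}."""
    return sum(DENTICITY[name] * count for name, count in ligands.items())

print(coordination_number({"NH3": 6}))  # six monodentate ligands
print(coordination_number({"en": 3}))   # three bidentate ligands
```

Six ammonia ligands and three ethylenediamine ligands both give a coordination number of 6, even though the number of ligands differs.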

In later Inorganic Chemistry Lessons of the Day, I will only refer to coordination complexes with metal atoms or metal ions as the Lewis acid centres.

Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

  • n random variables, X_1, ..., X_n, are independent and identically distributed,
  • n is sufficiently large,

then the distribution of their sample mean, \bar{X}_n, is approximately normal, and this approximation improves as n increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of X_1, ..., X_n that has a finite variance.  On my new YouTube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.
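
For readers who like to experiment, here is a minimal Python sketch in the same spirit.  It is not a transcript of the video; it simply enumerates every equally likely outcome of rolling fair dice and tabulates the distribution of the sum.  The function name sum_distribution is my own.

```python
from collections import Counter
from itertools import product

def sum_distribution(n_dice, sides=6):
    """Exact distribution of the sum of n_dice fair dice, by enumeration."""
    rolls = product(range(1, sides + 1), repeat=n_dice)
    counts = Counter(sum(faces) for faces in rolls)
    total = sides ** n_dice
    return {s: counts[s] / total for s in sorted(counts)}

one = sum_distribution(1)   # flat: every face is equally likely
two = sum_distribution(2)   # peaked in the middle

# Crude text histogram for 2 dice: one '#' per outcome out of 36.
for s, p in two.items():
    print(f"{s:2d} {'#' * round(p * 36)}")
```

One die has a flat distribution, but the sum of 2 dice already piles up around 7, because more combinations of faces produce the middle sums than the extreme ones.  Calling sum_distribution with more dice shows the shape becoming progressively more bell-like.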


Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R


A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

[Figure: triple box plots with patterns]

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!


Machine Learning Lesson of the Day – Linear Gaussian Basis Function Models

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target.  A common type of basis function for such models is the Gaussian basis function.  This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as the basis function.

\phi_j(x) = \exp\left[-\frac{(x - \mu_j)^2}{2\sigma^2}\right]

The \mu_j in this basis function sets where the j\text{th} basis function is centred, and the \sigma determines its width, i.e. how far its influence extends around \mu_j.

Notice that this is just the normal PDF without the scaling factor of 1/\sqrt{2\pi \sigma^2}; the scaling factor ensures that the normal PDF integrates to 1 over its support set.  In a linear basis function model, the regression coefficients are the weights for the basis functions, and these weights will scale Gaussian basis functions to fit the data that are local to \mu_j.  Thus, there is no need to include that scaling factor of 1/\sqrt{2\pi \sigma^2}, because the scaling is already being handled by the regression coefficients.

The Gaussian basis function model is useful because

  • it can model many non-linear relationships between the predictor and the target surprisingly well,
  • each basis function is appreciably non-zero only over a small interval around \mu_j and is practically zero everywhere else.  These local basis functions result in a design matrix that is nearly sparse (i.e. one with mostly negligible entries), which can lead to much faster computation.
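
The locality of the basis functions can be seen in a small Python sketch.  The centres, the width, the grid of predictor values and the threshold of 10^{-6} are all arbitrary values that I picked for illustration.  Strictly speaking, a Gaussian is never exactly zero, so the sketch counts the entries of the design matrix that fall below the threshold.

```python
import math

def gaussian_basis(x, mu, sigma):
    """Unnormalised Gaussian basis function exp(-(x - mu)^2 / (2 sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

centres = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical mu_j
sigma = 0.3                                  # hypothetical common width
xs = [0.5 * i for i in range(21)]            # 0.0, 0.5, ..., 10.0

# Design matrix: row i holds every basis function evaluated at x_i.
phi = [[gaussian_basis(x, mu, sigma) for mu in centres] for x in xs]

threshold = 1e-6
entries = [value for row in phi for value in row]
negligible = sum(value < threshold for value in entries)

print(f"{negligible} of {len(entries)} entries are below {threshold}")
print(f"smallest entry: {min(entries):.3e}")
```

With these settings, most of the entries are negligible, so treating them as zeros gives a sparse approximation of the design matrix.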

Applied Statistics Lesson of the Day – Fractional Factorial Design and the Sparsity-of-Effects Principle

Consider again an experiment that seeks to determine the causal relationships between G factors and the response, where G > 1.  Ideally, the sample size is large enough for a full factorial design to be used.  However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead.  Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.  

Which, then, are the significant causal relationships?  According to the sparsity-of-effects principle, complex, higher-order effects are unlikely to exist, and the most important effects are the lower-order effects.  Thus, assign the experimental units so that the main (1st-order) effects and the 2nd-order interaction effects can be investigated.  This may miss a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the sample size available is low and the number of possible treatments is high.
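
As one concrete illustration, here is a Python sketch of a half-fraction design for 5 binary factors.  The construction is an assumption of mine and is not described above: the 5th factor is set to the product of the other four (E = ABCD), which is a common textbook way of choosing the fraction.  The factor labels and function names are hypothetical.  The sketch then checks whether any main effect or 2nd-order interaction is confounded with another one.

```python
from itertools import combinations, product
from math import prod

FACTORS = "ABCDE"

def half_fraction_5_factors():
    """Build 16 of the 32 treatments by setting E = A*B*C*D (levels coded -1/+1)."""
    runs = []
    for a, b, c, d in product((-1, +1), repeat=4):
        runs.append({"A": a, "B": b, "C": c, "D": d, "E": a * b * c * d})
    return runs

def effect_column(runs, effect):
    """Contrast column of an effect: product of its factors' levels in each run."""
    return [prod(run[f] for f in effect) for run in runs]

runs = half_fraction_5_factors()

# All main effects and all 2nd-order interactions.
effects = [(f,) for f in FACTORS] + list(combinations(FACTORS, 2))
columns = {effect: effect_column(runs, effect) for effect in effects}

# Two effects are confounded here if their contrast columns are not orthogonal.
confounded = [
    (u, v)
    for u, v in combinations(effects, 2)
    if sum(x * y for x, y in zip(columns[u], columns[v])) != 0
]

print(f"{len(runs)} treatments, {len(effects)} low-order effects")
print(f"confounded pairs among them: {confounded}")
```

In this particular fraction, none of the main effects or 2nd-order interactions are confounded with each other; they are only mixed up with higher-order effects, which the sparsity-of-effects principle treats as unlikely to matter.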

Apologies for Illness-Induced Absence

Dear Readers of The Chemical Statistician,

I apologize for the lack of posts in the past while.  I have been struck by a minor but persistent flu, and I have not fully recovered.  Thank you for your continued readership and viewership, and I will return to blogging as soon as I am completely healthy again.