Eric’s Enlightenment for Friday, May 22, 2015

  1. John Urschel (academically published mathematician and NFL football player) uses logistic regression, expected value and variance to anticipate that the new farther distance for the extra-point conversion will not reduce its use in the NFL.
  2. John Ioannidis is widely known for his 2005 paper “Why most published research findings are false“.  In 2014, he wrote another paper on the same topic called “How to Make More Published Research True“.
  3. Yoshitaka Fujii holds the record for the number of retractions of academic publications for a single author: 183 papers, or “roughly 7 percent of all retracted papers between 1980 and 2011″.
  4. The chemistry of why bread stales, and how to slow retrogradation.

Eric’s Enlightenment for Wednesday, May 20, 2015

  1. A common but bad criticism of basketball analytics is that statistics cannot capture the effect of teamwork when assessing the value of a player.  Dan Rosenbaum wrote a great article on how adjusted plus/minus accomplishes this goal.
  2. Citing Dan’s work above, Neil Paine used adjusted plus/minus (APM) to show why Jason Collins was one of the top defensive centres in the NBA and the most underrated player of the last 15 years of his career.  When Neil mentions regularized APM (RAPM) in the third-to-last paragraph, he calls it a Bayesian version of APM.  Most statisticians are more familiar with the term ridge regression, which is one type of regression that penalizes the inclusion of too many redundant predictors.  Make sure to check out that great plot of actual RAPM vs. expected PER at the bottom of the article.
  3. In a 33-page article that was published on 2015-05-14 in Physical Review Letters, only the first 9 pages describes the research done for the article; the other 24 pages were used to list its 5,514 authors – setting a record for the largest known number of authors for a single research article.  Hyperauthorship is common in physics, but not – apparently – in biology.  (Hat Tip: Tyler Cowen)
  4. Brandon Findlay explains why methanol/water mixtures make great cooling baths.  He wrote a very thorough follow-up blog post on how to make them, and he includes photos to aid the demonstration.

Eric’s Enlightenment for Friday, May 15, 2015

  1. An infographic compares R and Python for statistics, data analysis, and data visualization – in a lot of detail!
  2. Psychologist Brian Nosek tackles human biases in science – including motivated reasoning and confirmation bias – long but very worthwhile to read.
  3. Scott Sumner’s wife documents her observations of Beijing during her current trip – very interesting comparisons of how normal life has changed rapidly over the past 10 years.
  4. Is hot air or hot water more effective at melting a frozen pipe – a good answer based on heat capacity and heat resistivity ensues.

Eric’s Enlightenment for Tuesday, May 12, 2015

  1. A great list of public data sets on GitHub – most are free.
  2. Is the 4% withdrawal rule still effective for determining how much you can spend to attain perpetual retirement?
  3. Jeff Leek compiled a great list of awesome things that people did in statistics in 2014.  Here is his list for 2013.  (Hat Tip: Cici Chen and R-Bloggers)
  4. A video demonstration of the triple point of tert-butyl alcohol.

Eric’s Enlightenment for Wednesday, May 6, 2015

  1. Moldova has mysteriously lost one-eighth of its GDP, possibly to fraudulent loans.
  2. Kai Brothers was diagnosed with HIV in 1989, but did not show any symptoms for 25 years.  Does he have a natural defense against HIV?  Now that he is starting to show symptoms, should he start taking anti-retroviral drugs and deny scientists the chance to look for that natural defense in his blood?
  3. Use the VVALUE function in SAS to convert formatted values of a variable into new values of a character variable.
  4. Alex Reinhart diligently compiled and explained a list of major “egregious statistical fallacies regularly committed in the name of science”.  Check them out on his web site and in his book entitled “Statistics Done Wrong“.  I highly recommend reading the section entitled “The p value and the base rate fallacy“.

Eric’s Enlightenment for Thursday, April 30, 2015

  1. Simon Jackman from Stanford University provides some simple examples of obtaining the posterior distribution using conjugate priors.  If you are new to Bayesian statistics and need to develop the intuition for the basic ideas, then work through the math in these examples with pen and paper.
  2. Did you know that there are plastics that conduct electricity?  In fact, Alan J. Heeger, Alan G. MacDiarmid and Hideki Shirakawa won the 2000 Nobel Prize in Chemistry for the work on this fascinating subject.
  3. Jared Niemi provides a nice video introduction of mixed-effects models.  I highly encourage you to work through the math with pen and paper.
  4. Alberto Cairo adds a healthy dose of caution about the recent advent of data-driven journalism.  He emphasizes problems like confusing correlation with causation, ecological fallacies, and drawing conclusions based on small sample sizes or unrepresentative samples.

Eric’s Enlightenment for Wednesday, April 29, 2015

  1. Anscombe’s quartet is a collection of 4 data sets that have almost identical summary statistics but appear very differently when plotted.  They illustrate the importance of visualizing your data first before plugging them into a statistical model.
  2. A potential geochemical explanation for the existence of Blood Falls, an outflow of saltwater tainted with iron (III) oxide at the snout of the Taylor Glacier in Antarctica.  Here is the original Nature paper by Jill Mikucki et al.
  3. Jonathan Rothwell and Siddharth Kulkarni from the Brookings Institution use a value-added approach to rank 2-year and 4-year post-secondary institutions in the USA.  Some of the top-ranked universities by this measure are lesser known schools like Colgate University, Rose-Hulman Institute of Technology, and Carleton College.  I would love to see something similar for Canada!
  4. Heather Krause from Datassist provides tips on how to avoid (accidentally) lying with your data.  Do read the linked sources of further information!

Career Seminar at Department of Statistics and Actuarial Science, Simon Fraser University: 1:30 – 2:20 pm, Friday, February 20, 2015

I am very pleased to be invited to speak to the faculty and students in the Department of Statistics and Actuarial Science at Simon Fraser University on this upcoming Friday.  I look forward to sharing my career advice and answering questions from the students about how to succeed in a career in statistics.  If you will attend this seminar, please feel free to come and say “Hello”!

Eric Cai - Official Head Shot

Read more of this post

The advantages of using count() to get N-way frequency tables as data frames in R


I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().


2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
 cyl      3     4     5
 4        1     8     2
 6        2     4     1
 8        12    0     2

Read more of this post

Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.


t = T(\mathbf{X})

be a statistic based on \mathbf{X}.


E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,

implies that

P \{g[T(\mathbf{X})]\} = 0] = 1,

then T(\mathbf{X}) is said to be complete.  To deconstruct this esoteric mathematical statement,

  1. let g(t) be a measurable function
  2. if you want to use g[T(\mathbf{X})] to form an unbiased estimator of the zero function,
  3. and if the only such function is almost surely equal to the zero function,
  4. then T(\mathbf{X}) is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*This above definition holds for discrete and continuous random variables.

Performing Logistic Regression in R and SAS


My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

  1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
  2. treatment – Whether or not the patient completed an anger control treatment program.
  3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

Read more of this post

Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has supplemented this by writing on his own blog to show that the median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, X_1, X_2, ..., X_n, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all n variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, T(\textbf{X}), such that

  1. T(\textbf{X}) is a sufficient statistic
  2. if U(\textbf{X}) is any other sufficient statistic, then there exists a function g such that

T(\textbf{X}) = g[U(\textbf{X})].

Note that, if there exists a one-to-one function h such that

T(\textbf{X}) = h[U(\textbf{X})],

then T(\textbf{X}) and U(\textbf{X}) are equivalent.

Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF)* for X_1, X_2, ..., X_n.


t = T(\mathbf{X})

be a statistic based on \mathbf{X}.  Let g_\theta(t) be the PDF for T(X).

If the conditional PDF

h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]

is independent of \theta, then T(\mathbf{X}) is a sufficient statistic for \theta.  In other words,

h_\theta(\mathbf{X}) = h(\mathbf{X}),

and \theta does not appear in h(\mathbf{X}).

Intuitively, this means that T(\mathbf{X}) contains everything you need to estimate \theta, so knowing T(\mathbf{X}) (i.e. conditioning f_\theta(x) on T(\mathbf{X})) is sufficient for estimating \theta.

Often, the sufficient statistic for \theta is a summary statistic of X_1, X_2, ..., X_n, such as their

  • sample mean
  • sample median – removed thanks to comment by Christian Robert (Xi’an)
  • sample minimum
  • sample maximum

If such a summary statistic is sufficient for \theta, then knowing this one statistic is just as useful as knowing all n data for estimating \theta.

*This above definition holds for discrete and continuous random variables.

Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS


A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

Read more of this post

Online index of plots and corresponding R scripts

Dear Readers of The Chemical Statistician,

Joanna Zhao, an undergraduate researcher in the Department of Statistics at the University of British Columbia, produced a visual index of over 100 plots using ggplot2, the R package written by Hadley Wickham.

An example of a plot and its source R code on Joanna Zhao's catalogue.

An example of a plot and its source R code on Joanna Zhao’s catalog.

Click on a thumbnail of any picture in this catalog – you will see the figure AND all of the necessary code to reproduce it.  These plots are from Naomi Robbins‘ book “Creating More Effective Graphs”.

If you

  • want to produce an effective plot in R
  • roughly know what the plot should look like
  • but could really use an example to get started,

then this is a great resource for you!  A related GitHub repository has the code for ALL figures and the infrastructure for Joanna’s Shiny app.

I learned about this resource while working in my job at the British Columbia Cancer Agency; I am fortunate to attend a wonderful seminar series on statistics at the British Columbia Centre for Disease Control, and a colleague from this seminar told me about it.  By sharing this with you, I hope that it will immensely help you with your data visualization needs!

Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function f(x) that is continuous on the interval [x_1, x_2], where x_1 and x_2 are any 2 points in the domain of f(x).  Let

x_m = 0.5x_1 + 0.5x_2

be the midpoint of x_1 and x_2.  Then, if

f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),

then f(x) is defined to be midpoint convex.

More generally, let’s consider any point within the interval [x_1, x_2].  We can denote this arbitrary point as

x_\lambda = \lambda x_1 + (1 - \lambda)x_2, where 0 < \lambda < 1.

Then, if

f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be convex.  If

f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),

then f(x) is defined to be strictly convex.

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable Y with a finite expected value and for any convex function g(y),

E[g(Y)] \geq g[E(Y)].

A function f(x) is defined to be concave if -f(x) is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable Z with a finite expected value and for any concave function h(z),

E[h(Z)] \leq h[E(Z)].

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of n independent and identically distributed random variables, X_1, X_2, ..., X_n,

P[\lim_{n \to \infty} \sup_{x \epsilon \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.

In other words, the empirical CDF of X_1, X_2, ..., X_n converges uniformly to the true CDF.

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or the “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

Markov’s inequality roughly says that a random variable X is most frequently observed near its expected value, \mu.  Remarkably, it quantifies just how often X is far away from \mu.  Chebyshev’s inequality goes one step further and quantifies that distance between X and \mu in terms of the number of standard deviations away from \mu.  It roughly says that the probability of X being k standard deviations away from \mu is at most k^{-2}.  Notice that this upper bound decreases as k increases – confirming our intuition that it is highly improbable for X to be far away from \mu.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable X, as long as E(X) and V(X) are finite.  (Markov’s inequality requires only E(X) to be finite.)  This is quite a marvelous result!


Get every new post delivered to your Inbox.

Join 488 other followers