Christopher Salahub on Markov Chains – The Central Equilibrium – Episode 2

It was a great pleasure to talk to Christopher Salahub about Markov chains in the second episode of my new talk show, The Central Equilibrium!  Chris graduated from the University of Waterloo with a Bachelor of Mathematics degree in statistics.  He just finished an internship in data development at Environics Analytics, and he is starting a Master’s program in statistics at ETH Zurich in Switzerland.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please view this blog post beneath the fold to watch the video on this blog.  You can also watch it directly on YouTube.

Read more of this post


Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting.  I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems.  This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data setI will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements.  To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data =
     out = bvars (keep = name type);

Read more of this post

A Comprehensive Guide for Public Speaking at Scientific Conferences


I served as a judge for some of the student presentations at the 2016 Canadian Statistics Student Conference (CSSC).  The conference was both a learning opportunity and a networking opportunity for statistics students in Canada.  The presentations allowed the students to share their research and course projects with their peers, and it was a chance for them to get feedback about their work and learn new ideas from other students.

Unfortunately, I found most of the presentations to be very bad – not necessarily in terms of the content, but because of the delivery.  Although the students showed much earnestness and eagerness in sharing their work with others, most of them demonstrated poor competence in public speaking.

Public speaking is an important skill in knowledge-based industries, so these opportunities are valuable experiences for anybody to strengthen this skill.  You can only learn it by doing it many times, making mistakes, and learning from those mistakes.  Having delivered many presentations, learned from my share of mistakes, and received much praise for my seminars, I hope that the following tips will help anyone who presents at scientific conferences to improve their public-speaking skills.  In fact, most of these tips apply to public speaking in general.

I spoke at the 2016 Canadian Statistics Student Conference on career advice for students and new graduates in statistics.

Image courtesy of Peter Macdonald on Flickr.

Read more of this post

My Alumni Profile by Simon Fraser University – Where Are They Now?

I am happy and grateful to be featured by my alma mater, Simon Fraser University (SFU), in a recent profile.  I answered questions about how my transition from my academic education to my career in statistics and about how blogging and social media have helped me to advance my career.  Check it out!

During my undergraduate degree at SFU, I volunteered at its Career Services Centre for 5 years as a career advisor in its Peer Education program.  I began writing for its official blog, the Career Services Informer (CSI), during that time.  I have continued to write career advice for the CSI as an alumnus, and it is always a pleasure to give back to this wonderful centre!

You can find all of my advice columns here on my blog.


New Job at the Bank of Montreal in Toronto

I have accepted an offer from the Bank of Montreal to become a Manager of Operational Risk Analytics and Modelling at its corporate headquarter office in Toronto.  Thus, I have resigned from my job at the British Columbia Cancer Agency.  I will leave Vancouver at the end of December, 2015, and start my new job at the beginning of January, 2016.

I have learned some valuable skills and met some great people here in Vancouver over the past 2 years.  My R programming skills have improved a lot, especially in text processing.  My SAS programming skills have improved a lot, and I began a new section on my blog to SAS programming as a result of what I learned.  I volunteered and delivered presentations for the Vancouver SAS User Group (VanSUG) – once on statistical genetics, and another on sampling strategies in analytical chemistry, ANOVA, and PROC TRANSPOSE.  I have thoroughly enjoyed meeting some smart and helpful people at the Data Science, Machine Learning, and R Programming Meetups.

I lived in Toronto from 2011 to 2013 while pursuing my Master’s degree in statistics at the  University of Toronto and working as a statistician at Predicum.  I look forward to re-connecting with my colleagues there.

Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry“.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!



Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

I am very excited to write again for the official JMP blog as a guest blogger!  Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP“.  This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into the JMP and prepare them for ANOVA.  Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.

Stay tuned for “Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry“!



Vancouver SAS User Group Meeting – Wednesday, November 4, 2015

I am excited to present at the next Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.  I will illustrate data transposition and ANOVA in SAS and JMP using potato chips and analytical chemistry.  Come and check it out!  The following agenda contains all of the presentations, and you can register for this meeting on the SAS Canada web site.  This meeting is free, and a free breakfast will be served in the morning.


Update: My slides from this presentation have been posted on the VanSUG web site.


Date: Wednesday, November 4, 2015


Ballroom West and Centre

Holiday Inn – Vancouver Centre

711 West Broadway, Vancouver, BC

V5Z 3Y2

(604) 879-0511


8:30am – 9:00am: Registration

9:00am – 9:20am: Introductions and SAS Update – Matt Malczewski, SAS Canada

9:20am – 9:40am: Lessons On Transposing Data, Sampling & ANOVA in SAS & JMP – Eric Cai, Cancer Surveillance & Outcomes, BC Cancer Agency

9.40am – 10.20am: Make SAS Enterprise Guide Your Own – John Ladds, Statistics Canada

10:20am – 10:30am: A Beginner’s Experience Using SAS – Kim Burrus, Cancer Surveillance & Outcomes, BC Cancer Agency

10:30am – 11:00am: Networking Break

11:00am – 11.20am: Using SAS for Simple Calculations – Jay Shurgold, Rick Hansen Institute

11:20am – 11:50am: Yes, We Can… Save SAS Formats – John Ladds, Statistics Canada

11:50am – 12:20pm: Reducing Customer Attrition with Predictive Analytics – Nate Derby, Stakana Analytics

12:20pm – 12:30pm: Evaluations, Prize Draw & Closing Remarks

If you would like to be notified of upcoming SAS User Group Meetings in Vancouver, please subscribe to the Vancouver SAS User Group Distribution List.

SFU Statistics and Actuarial Science Gala – Wednesday, September 16, 2015

I look forward to attending the #SFU50 Gala at the Department of Statistics and Actuarial Science at Simon Fraser University on Wednesday, September 16, 2015.  There will be a poster presentation of undergraduate case studies, a short awards ceremony, and many opportunities to network with current and former students, professors and staff from that department.  If you will attend this event, please come and say “Hello”!


Time: 5:00 – 7:30 pm

Date: Wednesday, September 16, 2015

Place: Applied Sciences Building Atrium, Simon Fraser University, Burnaby, British Columbia, Canada

Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

odds of all 15 home teams winning on same day

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is an intersection of 15 independent Bernoulli random variables, all with the probability of the home team winning being 0.5.

P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

P(\text{All 15 Home Teams Win}) = \prod_{n = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)

P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757

Now, let’s connect this probability to odds.

It is important to note that

  • odds is only applicable to Bernoulli random variables (i.e. binary events)
  • odds is the ratio of the probability of success to the probability of failure

For our example,

\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})

\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)

\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

Analytical Chemistry Lesson of the Day – Linearity in Method Validation and Quality Assurance

In analytical chemistry, the quantity of interest is often estimated from a calibration line.  A technique or instrument generates the analytical response for the quantity of interest, so a calibration line is constructed from generating multiple responses from multiple standard samples of known quantities.  Linearity refers to how well a plot of the analytical response versus the quantity of interest follows a straight line.  If this relationship holds, then an analytical response can be generated from a sample containing an unknown quantity, and the calibration line can be used to estimate the unknown quantity with a confidence interval.

Note that this concept of “linear” is different from the “linear” in “linear regression” in statistics.

This is the the second blog post in a series of Chemistry Lessons of the Day on method validation in analytical chemistry.  Read the previous post on specificity, and stay tuned for future posts!

Analytical Chemistry Lesson of the Day – Specificity in Method Validation and Quality Assurance

In pharmaceutical chemistry, one of the requirements for method validation is specificity, the ability of an analytical method to distinguish the analyte from other chemicals in the sample.  The specificity of the method may be assessed by deliberately adding impurities into a sample containing the analyte and testing how well the method can identify the analyte.

Statistics is an important tool in analytical chemistry, and, ideally, there is no overlap in the vocabulary that is used between the 2 fields.  Unfortunately, the above definition of specificity is different from that in statistics.  In a previous Machine Learning and Applied Statistics Lesson of the Day, I introduced the concepts of sensitivity and specificity in binary classification.  In the context of assessing the predictive accuracy of a binary classifier, its specificity is the proportion of truly negative cases among the classified negative cases.

Mathematical Statistics Lesson of the Day – An Example of An Ancillary Statistic

Consider 2 random variables, X_1 and X_2, from the normal distribution \text{Normal}(\mu, \sigma^2), where \mu is unknown.  Then the statistic

D = X_1 - X_2

has the distribution

\text{Normal}(0, 2\sigma^2).

The distribution of D does not depend on \mu, so D is an ancillary statistic for \mu.

Note that, if \sigma^2 is unknown, then D is not ancillary for \sigma^2.

Data Science Seminar by David Campbell on Approximate Bayesian Computation and the Earthworm Invasion in Canada

My colleague, David Campbell, will be the feature speaker at the next Vancouver Data Science Meetup on Thursday, June 25.  (This is a jointly organized event with the Vancouver Machine Learning Meetup and the Vancouver R Users Meetup.)  He will present his research on approximate Bayesian computation and Markov Chain Monte Carlo, and he will highlight how he has used these tools to study the invasion of European earthworms in Canada, especially their drastic effects on the boreal forests in Alberta.

Dave is a statistics professor at Simon Fraser University, and I have found him to be very smart and articulate in my communication with him.  This seminar promises to be both entertaining and educational.  If you will attend it, then I look forward to seeing you there!  Check out Dave on Twitter and LInkedIn.

Title: The great Canadian worm invasion (from an approximate Bayesian computation perspective)

Speaker: David Campbell

Date: Thursday, June 25


HootSuite (Headquarters)

5 East 8th Avenue

Vancouver, BC


• 6:00 pm: Doors are open – feel free to mingle!
• 6:30 pm: Presentation begins.
• ~7:45 Off to a nearby restaurant for food, drinks, and breakout discussions.


After being brought in by pioneers for agricultural reasons, European earthworms have been taking North America by storm and are starting to change the Alberta Boreal forests. This talk uses an invasive species model to introduce the basic ideas behind estimating the rate of new worm introductions and how quickly they spread with the goal of predicting the future extent of the great Canadian worm invasion. To take on the earthworm invaders, we turn to Approximate Bayesian Computation methods. Bayesian statistics are used to gather and update knowledge as new information becomes available owing to their success in prediction and estimating ongoing and evolving processes. Approximate Bayesian Computation is a step in the right direction when it’s just not possible to actually do the right thing- in this case using the exact invasive species model is infeasible. These tools will be used within a Markov Chain Monte Carlo framework.

About Dave Campbell:

Dave Campbell is an Associate Professor in the Department of Statistics and Actuarial Science at Simon Fraser University and Director of the Management and Systems Science Program. Dave’s main research area is at the intersections of statistics with computer science, applied math, and numerical analysis. Dave has published papers on Bayesian algorithms, adaptive time-frequency estimation, and dealing with lack of identifiability. His students have gone on to faculty positions and worked in industry at video game companies and predicting behaviour in malls, chat rooms, and online sales.

Mathematical Statistics Lesson of the Day – Ancillary Statistics

The set-up for today’s post mirrors my earlier Statistics Lessons of the Day on sufficient statistics and complete statistics.

Suppose that you collected data

\mathbf{X} = X_1, X_2, ..., X_n

in order to estimate a parameter \theta.  Let f_\theta(x) be the probability density function (PDF) or probability mass function (PMF) for X_1, X_2, ..., X_n.


a = A(\mathbf{X})

be a statistics based on \textbf{X}.

If the distribution of A(\textbf{X}) does NOT depend on \theta, then A(\textbf{X}) is called an ancillary statistic.

An ancillary statistic contains no information about \theta; its distribution is fixed and known without any relation to \theta.  Why, then, would we care about A(\textbf{X})  I will address this question in later Statistics Lessons of the Day, and I will connect ancillary statistics to sufficient statistics, minimally sufficient statistics and complete statistics.

Eric’s Enlightenment for Wednesday, June 3, 2015

  1. Jodi Beggs uses the Rule of 70 to explain why small differences in GDP growth rates have large ramifications.
  2. Rick Wicklin illustrates the importance of choosing bin widths carefully when plotting histograms.
  3. Shana Kelley et al. have developed an electrochemical sensor for detecting selected mutated nucleic acids (i.e. cancer markers in DNA!).  “The sensor comprises gold electrical leads deposited on a silicon wafer, with palladium nano-electrodes.”
  4. Rhett Allain provides a very detailed and analytical critique of Mjölnir (Thor’s hammer) – specifically, its unrealistic centre of mass.  This is an impressive exercise in physics!
  5. Congratulations to the Career Services Centre at Simon Fraser University for winning TalentEgg’s Special Award for Innovation by a Career Centre!  I was fortunate to volunteer there as a career advisor for 5 years, and it was a wonderful place to learn, grow and give back to the community. My career has benefited greatly from that experience, and it is a pleasure to continue my involvement as a guest blogger for its official blog, The Career Services Informer. Way to go, everyone!

Eric’s Enlightenment for Monday, June 1, 2015

  1. A comprehensive graphic of public perceptions about chemistry in the United Kingdom – compiled by the Royal Society of Chemistry.  (Hat Tip: Neil Smithers)
  2. Qing Ke et al. compiled a list of “sleeping beauties” in science – articles that were not appreciated at the time of publication and required much passage in time before becoming popular in the scientific community.  (Unfortunately, that original article is gated by subscription.)  As reported in, “the longest sleeper in the top 15 is a statistics paper from Karl Pearson, entitled, ‘On lines and planes of closest fit to systems of points in space‘.  Published in Philosophical Magazine in 1901, this paper awoke only in 2002.”  Out of those top 15 sleeping beauties, 7 were in chemistry.  A full pre-published version of Ke et al.’s paper can be found on arXiv.
  3. What would the Earth’s stratospheric ozone layer look like if the Montreal Protocol was never enacted to ban halocarbon refrigerants, solvents, and aerosol-can propellants?  Using simulations, Martyn Chipperfield et al. “found that the Antarctic ozone hole would have grown by an additional 40% by 2013.”
  4. Jan Hoffman on new challenges in mental health for university students: “Anxiety has now surpassed depression as the most common mental health diagnosis among college students, though depression, too, is on the rise. More than half of students visiting campus clinics cite anxiety as a health concern, according to a recent study of more than 100,000 students nationwide by the Center for Collegiate Mental Health at Penn State.”

Eric’s Enlightenment for Wednesday, May 27, 2015

  1. Why do humans get schizophrenia, but other animals don’t?
  2. At Marginal Revolution, Ramez Naam recently argued that CRISPR (with all of the limitations in some recent research) should not be feared in two blog posts – Part 1 and Part 2.
  3. Ecological fallacies and exception fallacies – two common mistakes in reasoning, statistics and scientific research.
  4. Intrauterine devices (IUDs) are the most effective contraceptives, so why is their usage so low?  Shefali Luthra reports that – at least for teenage girls – pediatricians were not trained to insert them in their education.  Maddie Oatman finds more complicated reasons for women in general.

Eric’s Enlightenment for Friday, May 22, 2015

  1. John Urschel (academically published mathematician and NFL football player) uses logistic regression, expected value and variance to anticipate that the new farther distance for the extra-point conversion will not reduce its use in the NFL.
  2. John Ioannidis is widely known for his 2005 paper “Why most published research findings are false“.  In 2014, he wrote another paper on the same topic called “How to Make More Published Research True“.
  3. Yoshitaka Fujii holds the record for the number of retractions of academic publications for a single author: 183 papers, or “roughly 7 percent of all retracted papers between 1980 and 2011”.
  4. The chemistry of why bread stales, and how to slow retrogradation.

Eric’s Enlightenment for Wednesday, May 20, 2015

  1. A common but bad criticism of basketball analytics is that statistics cannot capture the effect of teamwork when assessing the value of a player.  Dan Rosenbaum wrote a great article on how adjusted plus/minus accomplishes this goal.
  2. Citing Dan’s work above, Neil Paine used adjusted plus/minus (APM) to show why Jason Collins was one of the top defensive centres in the NBA and the most underrated player of the last 15 years of his career.  When Neil mentions regularized APM (RAPM) in the third-to-last paragraph, he calls it a Bayesian version of APM.  Most statisticians are more familiar with the term ridge regression, which is one type of regression that penalizes the inclusion of too many redundant predictors.  Make sure to check out that great plot of actual RAPM vs. expected PER at the bottom of the article.
  3. In a 33-page article that was published on 2015-05-14 in Physical Review Letters, only the first 9 pages describes the research done for the article; the other 24 pages were used to list its 5,514 authors – setting a record for the largest known number of authors for a single research article.  Hyperauthorship is common in physics, but not – apparently – in biology.  (Hat Tip: Tyler Cowen)
  4. Brandon Findlay explains why methanol/water mixtures make great cooling baths.  He wrote a very thorough follow-up blog post on how to make them, and he includes photos to aid the demonstration.