## Sorting correlation coefficients by their magnitudes in a SAS macro

#### Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.

## University of Toronto Statistical Sciences Union Career Panel

I am delighted to be invited to speak at the University of Toronto Statistical Sciences Union’s first ever Career Panel.  If you plan to attend this event, I encourage you to read my advice columns on career development in advance.  In particular, I strongly encourage you to read the blog post “How to Find a Job in Statistics – Advice for Students and Recent Graduates“.  I will not cover all of the topics in these columns, but you are welcomed to ask questions about them during the question-and-answer period.

Here are the event’s details.

Time: 1 pm to 6 pm

• My session will be held from 5pm to 6 pm.

Date: Saturday, March 25, 2017

Location: Sidney Smith Hall, 100 St. George Street, Toronto, Ontario.

• Sidney Smith Hall is located on the St. George (Downtown) campus of the University of Toronto.
• Update: The seminars will be held in Rooms 2117 and 2118.  I will speak in Room 2117 at 5 pm.

If you will attend this event, please feel free to come and say “Hello”!

## Analytical Chemistry Lesson of the Day – Accuracy in Method Validation and Quality Assurance

In pharmaceutical chemistry, one of the requirements for method validation is accuracy, the ability of an analytical method to obtain a value of a measurement that is close to the true value. There are several ways of assessing an analytical method for accuracy.

1. Compare the value from your analytical method with an established or reference method.
2. Use your analytical method to obtain a measurement from a sample with a known quantity (i.e. a reference material), and compare the measured value with the true value.
3. If you don’t have a reference material for the second way, you can make your own by spiking a blank matrix with a measured quantity of the analyte.
4. If your matrix may interfere with the analytical signal, then you cannot spike a blank matrix as described in the third way.  Instead, spike your sample with an known quantity of the standard.  I elaborate on this in a separate tutorial on standard addition, a common technique in analytical chemistry for determining the quantity of a substance when matrix interference exists.  Standard addition is an example of the second way of assessing accuracy as I mentioned above.  You can view the original post of this tutorial on the official JMP blog.

## New Job as Data Science Consultant at Environics Analytics!

I am very excited to start a new job as a Data Science Consultant at Environics Analytics (EA)!  My new position is a dual role of data scientist and consultant – I will meet clients regularly to advise them on statistical modelling, data analysis, and marketing analytics, and I will conduct statistical research on a variety of problems.  I look forward to learning about and working in EA’s specialty of geodemography, as well as researching new areas of data analytics, such as text mining and sentiment analysis.

I began my new job in mid-August, and I have enjoyed meeting my new co-workers and serving my first client so far.  I have opened a second Twitter feed, @EricCaiEA, to share my work at my new company.  You can continue to find me on my own Twitter feed for The Chemical Statistician, @chemstateric.

## Analyst Finder – A Free Job-Matching Service for Statisticians, Data Scientists, Database Managers and Data Analysts

If you are a statistician, data scientist, database manager, or data analyst, then consider using Analyst Finder for your next job search.  It is a web site that connects job seekers in data analytics with employers.  The service is free for job seekers, and it earns money by charging companies and recruiters a small fee to find qualified candidates through its job-matching service.

To register for this service as a job seeker, you simply need to complete a check list of skills and preferences.  It’s quick and easy to do, and you can change this list whenever your wish to update your qualifications.

Employers and recruiters can register for this service to search for qualified candidates, and the fees are displayed on the web site.

This company was founded by Art Tabachneck, a former president of the Toronto Area SAS Society and a veteran analytics professional.  In case you’re interested in learning more about him, SAS has a profile about him in recognition of his expertise in SAS programming and his contributions to online support communities and user groups.

## My Alumni Profile by Simon Fraser University – Where Are They Now?

I am happy and grateful to be featured by my alma mater, Simon Fraser University (SFU), in a recent profile.  I answered questions about how my transition from my academic education to my career in statistics and about how blogging and social media have helped me to advance my career.  Check it out!

During my undergraduate degree at SFU, I volunteered at its Career Services Centre for 5 years as a career advisor in its Peer Education program.  I began writing for its official blog, the Career Services Informer (CSI), during that time.  I have continued to write career advice for the CSI as an alumnus, and it is always a pleasure to give back to this wonderful centre!

You can find all of my advice columns here on my blog.

## Career Advice Panel – Statistical Society of Canada’s Annual Student Conference

I am excited to go to Brock University in St. Catharines, Ontario, and speak at the Statistical Society of Canada‘s (SSC’s) Annual Student Conference on Saturday, May 28, 2016!  This one-day conference will be a chance for statistics students from all over Canada to share their research with each other, network with industry professionals, and get career advice from the career advice panel.  I will be one of 3 speakers on this panel, and I look forward to sharing my advice and answering the students’ questions.  Read the Final Program Booklet to get the schedule and learn about the backgrounds of all speakers at this conference.

If you will attend this, conference, please feel free to come and say “Hello”!

This event will occur before the 2016 Annual Conference of the Statistical Society of Canada.

## New Job at the Bank of Montreal in Toronto

I have accepted an offer from the Bank of Montreal to become a Manager of Operational Risk Analytics and Modelling at its corporate headquarter office in Toronto.  Thus, I have resigned from my job at the British Columbia Cancer Agency.  I will leave Vancouver at the end of December, 2015, and start my new job at the beginning of January, 2016.

I have learned some valuable skills and met some great people here in Vancouver over the past 2 years.  My R programming skills have improved a lot, especially in text processing.  My SAS programming skills have improved a lot, and I began a new section on my blog to SAS programming as a result of what I learned.  I volunteered and delivered presentations for the Vancouver SAS User Group (VanSUG) – once on statistical genetics, and another on sampling strategies in analytical chemistry, ANOVA, and PROC TRANSPOSE.  I have thoroughly enjoyed meeting some smart and helpful people at the Data Science, Machine Learning, and R Programming Meetups.

I lived in Toronto from 2011 to 2013 while pursuing my Master’s degree in statistics at the  University of Toronto and working as a statistician at Predicum.  I look forward to re-connecting with my colleagues there.

## Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry“.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!

## Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

I am very excited to write again for the official JMP blog as a guest blogger!  Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP“.  This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into the JMP and prepare them for ANOVA.  Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.

## Vancouver SAS User Group Meeting – Wednesday, November 4, 2015

I am excited to present at the next Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.  I will illustrate data transposition and ANOVA in SAS and JMP using potato chips and analytical chemistry.  Come and check it out!  The following agenda contains all of the presentations, and you can register for this meeting on the SAS Canada web site.  This meeting is free, and a free breakfast will be served in the morning.

Update: My slides from this presentation have been posted on the VanSUG web site.

Date: Wednesday, November 4, 2015

Place:

Ballroom West and Centre

Holiday Inn – Vancouver Centre

V5Z 3Y2

(604) 879-0511

Agenda:

8:30am – 9:00am: Registration

9:00am – 9:20am: Introductions and SAS Update – Matt Malczewski, SAS Canada

9:20am – 9:40am: Lessons On Transposing Data, Sampling & ANOVA in SAS & JMP – Eric Cai, Cancer Surveillance & Outcomes, BC Cancer Agency

10:20am – 10:30am: A Beginner’s Experience Using SAS – Kim Burrus, Cancer Surveillance & Outcomes, BC Cancer Agency

10:30am – 11:00am: Networking Break

11:00am – 11.20am: Using SAS for Simple Calculations – Jay Shurgold, Rick Hansen Institute

11:20am – 11:50am: Yes, We Can… Save SAS Formats – John Ladds, Statistics Canada

11:50am – 12:20pm: Reducing Customer Attrition with Predictive Analytics – Nate Derby, Stakana Analytics

12:20pm – 12:30pm: Evaluations, Prize Draw & Closing Remarks

If you would like to be notified of upcoming SAS User Group Meetings in Vancouver, please subscribe to the Vancouver SAS User Group Distribution List.

## How to Ask for Reference Letters From Your Professors

This following article was published on the Career Services Informer (CSI), the official career blog of Simon Fraser University (SFU).  I have been fortunate to be a guest blogger for the CSI since I was an undergraduate student at SFU, and you can read all of my recent articles as an alumnus here.

Image courtesy of Frank C. Müller on Wikimedia

I recently blogged about fast-approaching deadlines for professional programs and graduate studies. Applying to those programs and scholarships requires reference letters from professors, and – having done so as a student at SFU – I have learned that this task is far more intense than simply sending a quick email. Here are some tips for how to make it easier for your professors to write the best reference letters for you.

## SFU Statistics and Actuarial Science Gala – Wednesday, September 16, 2015

I look forward to attending the #SFU50 Gala at the Department of Statistics and Actuarial Science at Simon Fraser University on Wednesday, September 16, 2015.  There will be a poster presentation of undergraduate case studies, a short awards ceremony, and many opportunities to network with current and former students, professors and staff from that department.  If you will attend this event, please come and say “Hello”!

Time: 5:00 – 7:30 pm

Date: Wednesday, September 16, 2015

Place: Applied Sciences Building Atrium, Simon Fraser University, Burnaby, British Columbia, Canada

## Vancouver Python Day @ Mobify Vancouver – Saturday, September 12, 2015

I am excited to attend Vancouver Python Day on Saturday, September 12, 2015, at Mobify.  Learn about algorithmic trading, the Python Data Toolkit, using Python on mobile devices, and more! The conference is free to attend.  If you will go to this conference, then please feel free to come and say “Hello”!

Vancouver Python Day
Saturday September 12, 2015
9:30AM – 4:00PM

Mobify Vancouver
#300 – 948 Homer St, Vancouver, BC

Scheduled Presentations

Keynote: The State of Mobile Python
Russell Keith-Magee, Django Software Foundation

Socializing and Networking for Awkward Humans
Carly Slater

Spreading Python Skills to the Scientific Community
Bill Mills, Mozilla Science Lab

Using the Python Data Toolkit: A Live Demo!
Tiffany Timbers

Interesting New Features in Python 3.5
Brett Cannon, Microsoft

Simon Thornington

See the agenda for the full schedule.

Lightning Talks

Time will be provided for Python and Django lightning talks. Sign up will be on site.

This following article was published on the Career Services Informer (CSI), the official career blog of Simon Fraser University (SFU).  I have been fortunate to be a guest blogger for the CSI since I was an undergraduate student at SFU, and you can read all of my recent articles as an alumnus here.

As most students return to school in the upcoming semester, their academic studies and back-to-school logistics may be their top priorities.   However, if you want to pursue graduate studies or professional programs like medicine or law, then there are some important deadlines that are fast approaching, and they all involve time-consuming efforts to meet them. Now is a good time to tackle these deadlines and put forth your best effort while you are free of the burdens of exams and papers that await you later in the fall semester.

Image Courtesy of Melburnian at Wikimedia

Speaking from experience, these applications are very long and tiring, and they will take a lot of thought, planning, writing and re-writing. They also require a lot of coordination to get the necessary documents, like your transcripts and letters of recommendation from professors who can attest to your academic accomplishments and research potential.  Plan ahead for them accordingly, and consider using the Career Services Centre to help you with drafting your curriculum vitae, your statements of interest, and any interview preparation.

## Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is an intersection of 15 independent Bernoulli random variables, all with the probability of the home team winning being 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{n = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

• odds is only applicable to Bernoulli random variables (i.e. binary events)
• odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

## Analytical Chemistry Lesson of the Day – Linearity in Method Validation and Quality Assurance

In analytical chemistry, the quantity of interest is often estimated from a calibration line.  A technique or instrument generates the analytical response for the quantity of interest, so a calibration line is constructed from generating multiple responses from multiple standard samples of known quantities.  Linearity refers to how well a plot of the analytical response versus the quantity of interest follows a straight line.  If this relationship holds, then an analytical response can be generated from a sample containing an unknown quantity, and the calibration line can be used to estimate the unknown quantity with a confidence interval.

Note that this concept of “linear” is different from the “linear” in “linear regression” in statistics.

This is the the second blog post in a series of Chemistry Lessons of the Day on method validation in analytical chemistry.  Read the previous post on specificity, and stay tuned for future posts!

## Mathematical Statistics Lesson of the Day – Basu’s Theorem

Today’s Statistics Lesson of the Day will discuss Basu’s theorem, which connects the previously discussed concepts of minimally sufficient statistics, complete statistics and ancillary statistics.  As before, I will begin with the following set-up.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF) or probability mass function (PMF) for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistics based on $\textbf{X}$.

Basu’s theorem states that, if $T(\textbf{X})$ is a complete and minimal sufficient statistic, then $T(\textbf{X})$ is independent of every ancillary statistic.

Establishing the independence between 2 random variables can be very difficult if their joint distribution is hard to obtain.  This theorem allows the independence between minimally sufficient statistic and every ancillary statistic to be established without their joint distribution – and this is the great utility of Basu’s theorem.

However, establishing that a statistic is complete can be a difficult task.  In a later lesson, I will discuss another theorem that will make this task easier for certain cases.

## Analytical Chemistry Lesson of the Day – Specificity in Method Validation and Quality Assurance

In pharmaceutical chemistry, one of the requirements for method validation is specificity, the ability of an analytical method to distinguish the analyte from other chemicals in the sample.  The specificity of the method may be assessed by deliberately adding impurities into a sample containing the analyte and testing how well the method can identify the analyte.

Statistics is an important tool in analytical chemistry, and, ideally, there is no overlap in the vocabulary that is used between the 2 fields.  Unfortunately, the above definition of specificity is different from that in statistics.  In a previous Machine Learning and Applied Statistics Lesson of the Day, I introduced the concepts of sensitivity and specificity in binary classification.  In the context of assessing the predictive accuracy of a binary classifier, its specificity is the proportion of truly negative cases among the classified negative cases.

## Mathematical Statistics Lesson of the Day – An Example of An Ancillary Statistic

Consider 2 random variables, $X_1$ and $X_2$, from the normal distribution $\text{Normal}(\mu, \sigma^2)$, where $\mu$ is unknown.  Then the statistic

$D = X_1 - X_2$

has the distribution

$\text{Normal}(0, 2\sigma^2)$.

The distribution of $D$ does not depend on $\mu$, so $D$ is an ancillary statistic for $\mu$.

Note that, if $\sigma^2$ is unknown, then $D$ is not ancillary for $\sigma^2$.