## Using Your Vacation to Develop Your Career – Guest Blogging on Simon Fraser University’s Career Services Informer

The following post was originally published on the Career Services Informer.

I recently took a vacation from my former role as a statistician at the BC Centre for Excellence in HIV/AIDS. I did not plan a trip out of town – the spring weather was beautiful in Vancouver, and I wanted to spend time on the things that I like to do in this city. Many obvious things came to mind – walking along beaches, practicing Python programming and catching up with friends – just to name a few.

Yes, Python programming was one of the obvious things on my vacation to-do list, and I understand how ridiculous this may seem to some people. Why tax my brain during a time that is meant for mental relaxation, especially when the weather is great?

## Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals.  (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.)  Today, I want to share a second video of calculating expected counts – this time, using joint probabilities.  This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table.  Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multiplied by the grand total of the entire sample.  This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.

I encourage you to watch both of my videos on expected counts in my YouTube channel to gain a deeper understanding of how and why they can be calculated.  Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.
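For readers who prefer code, here is a minimal Python sketch of the joint-probability method. The 2-by-2 table of observed counts is hypothetical and is not the data set from the video; the variable names are my own.

```python
# Hypothetical observed counts (rows and columns are 2 categorical variables).
observed = [[30, 10],
            [20, 40]]

grand_total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Estimate the marginal probabilities from the marginal totals.
row_probs = [t / grand_total for t in row_totals]
col_probs = [t / grand_total for t in col_totals]

# Under independence, each joint probability is the product of its marginals.
joint_probs = [[rp * cp for cp in col_probs] for rp in row_probs]

# Expected count = estimated joint probability x grand total.
expected = [[p * grand_total for p in row] for row in joint_probs]

for row in expected:
    print([round(e, 2) for e in row])
```

The expected counts always add up to the grand total, which is a quick sanity check on the arithmetic.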

You can also watch the video below the fold!

## Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, $X$, with the expected value of $X$.  It states that

$P(X \geq c) \leq E(X) \div c,$

where $c > 0$.

I find Markov’s inequality to be beautiful for 2 reasons:

1. It applies to both continuous and discrete random variables.
2. It applies to any non-negative random variable from any distribution with a finite expected value.
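
As a quick numerical check, here is a small Python sketch that verifies the inequality for a fair six-sided die. The die is my own illustrative choice of a non-negative discrete random variable, and exact fractions are used to avoid round-off error.

```python
from fractions import Fraction

# A fair six-sided die: each outcome has probability 1/6 (illustrative choice).
outcomes = range(1, 7)
prob = Fraction(1, 6)

expected_value = sum(x * prob for x in outcomes)  # E(X) = 7/2

def tail_probability(c):
    """P(X >= c) for the fair die."""
    return sum(prob for x in outcomes if x >= c)

for c in range(1, 7):
    bound = expected_value / c  # Markov's upper bound E(X) / c
    print(c, tail_probability(c), bound, tail_probability(c) <= bound)
```

The bound holds for every value of $c$ in the loop, although it is often far from tight.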

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which have useful implications for understanding a data set.

## Organic and Inorganic Chemistry Lesson of the Day – Racemic Mixtures

A racemic mixture is a mixture that contains equal amounts of both enantiomers of a chiral molecule.  (By amount, I mean the usual unit of quantity in chemistry – the mole.  Of course, since enantiomers are isomers, their molar masses are equal, so a racemic mixture would contain equal masses of both enantiomers, too.)

In synthesizing enantiomers, if a set of reactants combine to form a racemic mixture, then the reactants are called non-stereoselective or non-stereospecific.

In 1895, Otto Wallach proposed that a racemic crystal is denser than a crystal with purely one of the enantiomers; this is known as Wallach’s rule.  Brock et al. (1991) substantiated this with crystallographic data.

Reference:

Brock, C. P., Schweizer, W. B., & Dunitz, J. D. (1991). On the validity of Wallach’s rule: on the density and stability of racemic crystals compared with their chiral counterparts. Journal of the American Chemical Society, 113(26), 9811-9820.

## Applied Statistics Lesson of the Day – The Coefficient of Variation

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set.  However, consider the following 2 hypothetical cases:

1. the standard deviation for the incomes of households in Canada is $2,000
2. the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks.  Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets.  One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.
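
To make the contrast concrete, here is a minimal Python sketch. The income figures are invented purely for illustration; they are chosen so that both data sets have a sample standard deviation of 2,000 but very different means.

```python
import statistics

def coefficient_of_variation(data):
    """Sample coefficient of variation: s divided by the sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical incomes; both sets have a sample standard deviation of 2,000.
households = [58_000, 60_000, 62_000]
banks = [x + 1_000_000_000 for x in households]

cv_households = coefficient_of_variation(households)
cv_banks = coefficient_of_variation(banks)

print(statistics.stdev(households), statistics.stdev(banks))
print(cv_households, cv_banks)
```

The standard deviations are identical, but the coefficient of variation for the hypothetical banks is several orders of magnitude smaller, which matches the intuition in the example.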

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes.  I hope that all students new to statistics get to learn this alternative measure of dispersion.

## Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

• its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

• its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$
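
Here is a short Python sketch of both formulas. The counts of true/false positives and negatives are hypothetical, and the function names are my own.

```python
def positive_predictive_value(true_pos, false_pos):
    """Proportion of positively classified cases that were truly positive."""
    return true_pos / (true_pos + false_pos)

def negative_predictive_value(true_neg, false_neg):
    """Proportion of negatively classified cases that were truly negative."""
    return true_neg / (true_neg + false_neg)

# Hypothetical confusion-matrix counts for a binary classifier.
true_pos, false_pos, true_neg, false_neg = 80, 20, 870, 30

ppv = positive_predictive_value(true_pos, false_pos)
npv = negative_predictive_value(true_neg, false_neg)
print(round(ppv, 4), round(npv, 4))
```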

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

## Organic and Inorganic Chemistry Lesson of the Day – Diastereomers

I previously introduced the concept of chirality and how it is a property of any molecule with only 1 stereogenic centre.  (A molecule with $n$ stereogenic centres may or may not be chiral, depending on its stereochemistry.)  I also defined an enantiomer as a molecule with a non-superimposable mirror image.  (Recall that chirality in inorganic chemistry can arise in 2 different ways.)

It is possible for 2 stereoisomers to NOT be mirror images of each other; in fact, such stereoisomers are called diastereomers.  Yes, I recognize that defining something as the negation of something else is unusual.  If you have learned set theory or probability – as I did in my mathematical statistics classes – then consider the set of all pairs of the stereoisomers of one compound with the same chemical formula – this is the sample space.  The enantiomers form a proper subset within this sample space, and the diastereomers are the complement of the enantiomers.

It is important to note that, while diastereomers are not mirror images of each other, they are still non-superimposable.  Diastereomers arise from stereoisomers with 2 or more stereogenic centres; here is an example of how they can arise.

1) Consider a compound with 2 stereogenic centres and no meso isomers*.  This compound has $2^n = 2^2 = 4$ stereoisomers, where $n = 2$ denotes the number of stereogenic centres.

2) Find one pair of enantiomers based on one of the stereogenic centres.

3) Find the other pair of enantiomers based on the other stereogenic centre.

4) Take any one molecule from Step #2 and any one molecule from Step #3.  These cannot be mirror images of each other.  (One molecule cannot have 2 different mirror images of itself.)  These 2 molecules are diastereomers.

Think back to my above description of enantiomers as a proper subset within the sample space of the pairs of one set of stereoisomers.  You can now see why I emphasized that the sample space consists of pairs, since multiple different pairs of stereoisomers can form enantiomers.  In my example above, Steps #2 and #3 produced 2 subsets of enantiomers.  It should be clear by now that enantiomers and diastereomers are defined as pairs.  To further illustrate this point,

a) call the 2 molecules in Step #2 A and B.

b) call the 2 molecules in Step #3 C and D.

A and B are enantiomers.  A and C are diastereomers.  Thus, it is entirely possible for one molecule to be an enantiomer with a second molecule and a diastereomer with a third molecule.
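
The set-theoretic picture can be sketched in Python. This is only an illustration under stated assumptions: each stereogenic centre is labelled R or S (a labelling that I have not introduced in this lesson), there are no meso isomers, and a mirror reflection is modelled as inverting every centre.

```python
from itertools import combinations, product

n_centres = 2

# All 2^n configurations, e.g. ('R', 'S'), assuming no meso isomers.
stereoisomers = list(product("RS", repeat=n_centres))

def mirror(config):
    """A mirror reflection inverts the configuration of every centre."""
    return tuple("S" if centre == "R" else "R" for centre in config)

# The sample space: all unordered pairs of distinct stereoisomers.
pairs = list(combinations(stereoisomers, 2))

# Enantiomers are the pairs that are mirror images of each other;
# diastereomers are the complement within the sample space.
enantiomers = [(a, b) for a, b in pairs if mirror(a) == b]
diastereomers = [(a, b) for a, b in pairs if mirror(a) != b]

print(len(pairs), len(enantiomers), len(diastereomers))
```

Every pair in the sample space falls into exactly one of the 2 subsets, and any single configuration appears in 1 enantiomeric pair and in 2 diastereomeric pairs.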

Here is an example of 2 diastereomers.  Notice that they have the same chemical formula but different 3-dimensional orientations – i.e. they are stereoisomers.  These stereoisomers are not mirror images of each other, but they are non-superimposable – i.e. they are diastereomers.

D-Threose

D-Erythrose

Images courtesy of Popnose, DMacks and Edgar181 on Wikimedia.  For brevity, I direct you to the Wikipedia entry for diastereomers showing these 4 images in one panel.

*I will discuss meso isomers in a separate lesson.

## New Job as Biostatistical Analyst at the British Columbia Cancer Agency

Dear Readers and Followers of The Chemical Statistician:

My apologies for the slower than usual posting frequency in the last few months, but I have been very busy preparing for a big transition – after a long and intense selection process that started in March, I was offered a new job as a biostatistical analyst at the British Columbia Cancer Agency (BCCA)!

I was sad to leave many of the kind and friendly co-workers whom I met at the British Columbia Centre for Excellence in HIV/AIDS during my 10 months of working there, but I was very excited to accept this offer and begin working for the BCCA – specifically, in the Cancer Surveillance and Outcomes (CSO) Unit.  I had already met several of my new co-workers from past meetings in the Vancouver SAS User Group, and I also know 2 people who worked for long periods in this same group in the past.  From all of these interactions, I got a very positive impression about the professionalism, expertise, and collegiality of this new group, so I was delighted to join this team.

I started my new job 3 weeks ago, and was plunged into 3 projects immediately.  I have been swamped with work right from the start, so I’m still adjusting to my new schedule and surroundings.  Nonetheless, I hope to resume blogging at my usual pace as I settle into my new job.  (I just posted a new video on calculating expected counts in contingency tables using joint and marginal probabilities.)  I also hope to use my work as inspiration for blogging topics here at The Chemical Statistician.

Thank you all for your patience and continued readership.  It has been a pleasure to learn from you, and I hope to continue a successful expansion of The Chemical Statistician for the rest of 2014 and beyond!

Eric

## Video Tutorial – Allelic Frequencies Remain Constant From Generation to Generation Under the Hardy-Weinberg Equilibrium

The Hardy-Weinberg law is a fundamental principle in statistical genetics.  If its 7 assumptions are fulfilled, then it predicts that the allelic frequency of a genetic trait will remain constant from generation to generation.  In this new video tutorial in my YouTube channel, I explain the math behind the Hardy-Weinberg theorem.  In particular, I clarify the origin of the connection between allelic frequencies and genotypic frequencies in the second generation – I have not found a single textbook or web site on this topic that explains this calculation, so I hope that my explanation is helpful to you.
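
As a rough numerical companion, here is a Python sketch of the standard textbook calculation for 1 locus with 2 alleles, which I call A and a here. It is not a transcription of the video, and the starting frequency of 3/10 is arbitrary.

```python
from fractions import Fraction

def next_generation_frequency(p):
    """Frequency of allele A in the next generation, assuming random mating
    and the other Hardy-Weinberg conditions."""
    q = 1 - p
    freq_AA = p * p      # genotypic frequency of AA
    freq_Aa = 2 * p * q  # genotypic frequency of Aa
    # AA individuals pass on only A; Aa individuals pass on A half of the time.
    return freq_AA + Fraction(1, 2) * freq_Aa

p = Fraction(3, 10)  # arbitrary starting frequency of allele A
for generation in range(1, 4):
    p = next_generation_frequency(p)
    print(generation, p)
```

Algebraically, $p^2 + pq = p(p + q) = p$, so the printed frequency is the same in every generation.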

You can also watch the video below the fold!

## Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial in my YouTube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.
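
Here is one possible Python sketch of this approach: each expected count is a row's marginal proportion multiplied by a column's marginal total. The 2-by-3 table is hypothetical and is not the data set from the video; the function name is my own.

```python
def expected_counts(observed):
    """Expected counts under independence:
    (row total / grand total) x column total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[(r / grand_total) * c for c in col_totals] for r in row_totals]

# Hypothetical observed counts.
table = [[10, 20, 30],
         [20, 10, 10]]

for row in expected_counts(table):
    print([round(e, 2) for e in row])
```

The expected table has the same row and column totals as the observed table, which is a useful check.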

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

You can also watch the video below the fold!

## Organic and Inorganic Chemistry Lesson of the Day – Stereogenic Centre

A stereogenic centre (often called a stereocentre) is an atom that satisfies 2 conditions:

1. it is bonded to at least 3 substituents.
2. interchanging any 2 of the substituents would result in a stereoisomer.

If a molecule has only 1 stereogenic centre, then it definitely has a non-superimposable mirror image (i.e. this molecule is chiral and is an enantiomer).  However, depending on its stereochemistry, it is possible for a molecule with 2 or more stereogenic centres to be achiral; such molecules are called meso isomers (or meso compounds), and I will discuss them in a later lesson.

In organic chemistry, the stereogenic centre is usually a carbon atom that is attached to 4 substituents in a tetrahedral geometry.  In inorganic chemistry, the stereogenic centre is usually the metal centre of a coordination complex.

In organic chemistry, stereogenic centres with substituents in a tetrahedral geometry are common.  Inorganic coordination complexes can also have a tetrahedral geometry.  A compound with $n$ tetrahedral stereogenic centres can have at most $2^n$ stereoisomers.  The “at most” caveat is important; as mentioned above, it is possible for a molecule with 2 or more stereogenic centres to have a spatial arrangement that DOES NOT have a non-superimposable mirror image; such isomers are meso isomers.  I will discuss meso isomers in more detail in a later lesson.

## Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

• 5% in the 1st year
• 10% in the 2nd year
• 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of $1 + r$, where $r$ is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of the last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, $G$.  The average factor of growth from the 3 years is

$G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n}$,

where $r_i$ is the rate of return in year $i$, $i = 1, 2, 3, ..., n$.  The average annual rate of return is $G - 1$.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

Returning to our example, our average factor of growth is

$G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242$.

Thus, our annual rate of return is $G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%$.

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

1. The first set, $S_1$, consists of your data $x_1, x_2, ..., x_n$, and this set has a sample size of $n$.
2. The second set, $S_2$, also has a sample size of $n$, but all $n$ values are the same – let’s call this common value $y$.
• What number must $y$ be such that the sums in $S_1$ and $S_2$ are equal?  This value of $y$ is the arithmetic mean of the first set.
• What number must $y$ be such that the products in $S_1$ and $S_2$ are equal?  This value of $y$ is the geometric mean of the first set.

Note that the geometric mean is only applicable to positive numbers.
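
The stock example above can be reproduced with a few lines of Python. This is a minimal sketch; the function name is my own, and `math.prod` requires Python 3.8 or later.

```python
import math

def average_rate_of_return(rates):
    """Geometric mean of the growth factors (1 + r), minus 1."""
    factors = [1 + r for r in rates]
    geometric_mean = math.prod(factors) ** (1 / len(factors))
    return geometric_mean - 1

rates = [0.05, 0.10, 0.15]
average_rate = average_rate_of_return(rates)
arithmetic_mean = sum(rates) / len(rates)

print(round(average_rate, 6), round(arithmetic_mean, 6))
```

Compounding the geometric-mean rate for 3 years reproduces the actual total growth of $1.05 \times 1.10 \times 1.15$, whereas compounding the arithmetic mean of 10% would overstate it.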

## Inorganic Chemistry Lesson of the Day – 2 Different Ways for Chirality to Arise in Coordination Complexes

In a previous Chemistry Lesson of the Day, I introduced chirality and enantiomers in organic chemistry; recall that chirality in organic chemistry often arises from an asymmetric carbon that is attached to 4 different substituents.  Chirality is also observed in coordination complexes in inorganic chemistry.  There are 2 ways for chirality to be observed in coordination complexes:

1.   The metal centre has an asymmetric arrangement of ligands around it.

• This type of chirality can be observed in octahedral complexes and tetrahedral complexes, but not square planar complexes.  (Recall that square planar complexes have a plane formed by the metal and its 4 ligands.  This plane can serve as a plane of reflection, and any mirror image of a square planar complex across this plane is clearly superimposable onto itself, so it cannot have chirality just by having 4 different ligands alone.)

2.   The metal centre has a chiral ligand (i.e. the ligand itself has a non-superimposable mirror image).

• Following the sub-bullet under Point #1, a square planar complex can be chiral if it has a chiral ligand.

## Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the car travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

$H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i)$.

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

1. 60 km/hr for 100 km
2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

$(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]$

$= 46.45 \text{ km/hr}$

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

$\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}$
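
Here is a minimal Python sketch of the weighted harmonic mean, applied to the trip above with the distances as the weights. The function name is my own.

```python
def weighted_harmonic_mean(values, weights):
    """Sum of the weights divided by the sum of (weight / value)."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

speeds = [60, 40]       # km/hr
distances = [100, 140]  # km, used as the weights

average_speed = weighted_harmonic_mean(speeds, distances)
print(round(average_speed, 2))
```

With equal weights of 120 km for each speed, the same function reduces to the ordinary harmonic mean from the earlier lesson.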

## Organic and Inorganic Chemistry Lesson of the Day – Chirality and Enantiomers

In chemistry, chirality is a property of a molecule such that the molecule has a non-superimposable mirror image.  In other words, a molecule is chiral if, upon reflection by any plane, it cannot be superimposed onto itself.

Chirality is a property of the 3-dimensional orientation of a molecule, and molecules exhibiting chirality are stereoisomers.  Specifically, two molecules are enantiomers of each other if they are non-superimposable mirror images of each other.  In organic chemistry, chirality commonly arises out of an asymmetric carbon atom, which is a carbon that is attached to 4 different substituents.  Chirality in inorganic chemistry is more complicated, and I will discuss this in a later lesson.

It is important to note that enantiomers are defined as pairs.  This will be later emphasized in the lesson on diastereomers.

## Useful Options For Every SAS Program – Lessons and Resources from Dr. Jerry Brunner

#### Introduction

Today, I want to share some useful options to put at the beginning of every SAS program that I write.  These options will make the practicality of using SAS much easier.  My applied statistics professor from the University of Toronto, Jerry Brunner*, taught me some of these options when I first learned SAS in our class, and I’m grateful for that.  In later instances of using SAS in team projects, I have met SAS programmers who were delightfully surprised by the existence of these options and desperately wished that they had learned them earlier.  I hope that they will help you with your SAS programming.  I have also learned some useful options by posting questions on the SAS Support Communities online forum.

#### Clearing Output

After running your SAS program many times to test and debug, you will have accumulated numerous pages of old and useless output and log.  Scrolling through and searching for the desired portion to read in either file can be tedious and difficult.  Thus, it’s really helpful to have the option of clearing all of the output and the log whenever you run your script.  I put the following commands on top of every one of my SAS scripts.

/*
Useful Options For Every SAS Program
- With Some Tips Learned From Dr. Jerry Brunner
by Eric Cai - The Chemical Statistician
*/

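/* Clear the Log and Output windows */
dm 'cle log; cle out;';

/* Close and reopen the HTML destination to discard old HTML output */
ods html close;
ods html;

/* Clear the Results window, then close and reopen the listing destination */
dm 'odsresults; clear';
ods listing close;
ods listing;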

## Vancouver Machine Learning and Data Science Meetup – NLP to Find User Archetypes for Search & Matching

I will attend the following seminar by Thomas Levi in the next R/Machine Learning/Data Science Meetup in Vancouver on Wednesday, June 25.  If you will also attend this event, please come up and say “Hello”!  I would be glad to meet you!

To register, sign up for an account on Meetup, and RSVP in the R Users Group, the Machine Learning group or the Data Science group.

Title: NLP to Find User Archetypes for Search & Matching

Speaker: Thomas Levi, Plenty of Fish

Location: HootSuite, 5 East 8th Avenue, Vancouver, BC

Time and Date: 6-8 pm, Wednesday, June 25, 2014

Abstract

As the world’s largest free dating site, Plenty Of Fish would like to be able to match with and allow users to search for people with similar interests. However, we allow our users to enter their interests as free text on their profiles. This presents a difficult problem in clustering, search and machine learning if we want to move beyond simple ‘exact match’ solutions to a deeper archetypal user profiling and thematic search system. Some of the common issues that arise are misspellings, synonyms (e.g. biking, cycling and bicycling) and similar interests (e.g. snowboarding and skiing) on a several million user scale. In this talk I will demonstrate how we built a system utilizing topic modelling with Latent Dirichlet Allocation (LDA) on a several hundred thousand word vocabulary over ten million+ North American users and explore its applications at POF.

Bio

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His post-doctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC.  In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behaviour analysis, social network analysis, scam detection, Bot detection, matching algorithms, topic modelling and semantic analysis.

Schedule
• 6:00PM Doors are open, feel free to mingle
• 6:30 Presentations start
• 8:00 Off to a nearby watering hole (Mr. Brownstone?) for a pint, food, and/or breakout discussions

## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.  A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of $\text{Sensitivity}$ on the vertical axis and $1 - \text{Specificity}$ on the horizontal axis, the line of no discrimination is the line that passes through the points $(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$ and $(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.  In other words, the line of no discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we do not have any predictors, then we can rely on only random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard of a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher”, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)
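
A small simulation can illustrate why random guessing lands on the diagonal. This Python sketch makes several assumptions of its own: the prevalence of 0.3, the sample size and the seed are arbitrary, and the guessing probability `p` is a parameter so that values other than 0.5 can be tried.

```python
import random

def random_guess_rates(p, n=20000, prevalence=0.3, seed=0):
    """Simulate a classifier that predicts "Success" with probability p,
    ignoring all predictors.  Returns (sensitivity, 1 - specificity)."""
    rng = random.Random(seed)
    true_pos = false_pos = positives = negatives = 0
    for _ in range(n):
        is_positive = rng.random() < prevalence  # the true class
        predicts_positive = rng.random() < p     # a Bernoulli(p) guess
        if is_positive:
            positives += 1
            true_pos += predicts_positive
        else:
            negatives += 1
            false_pos += predicts_positive
    return true_pos / positives, false_pos / negatives

sensitivity, one_minus_specificity = random_guess_rates(0.5)
print(round(sensitivity, 2), round(one_minus_specificity, 2))
```

Both rates come out close to the guessing probability, so the simulated classifier sits approximately on the line of no discrimination; a useful classifier should sit above it.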

## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, ..., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a $p\text{th}$-order polynomial relationship between the response variable $Y$ and the explanatory variable $x$.

$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon$

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).
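
To see this in practice, here is a Python sketch that fits a quadratic by ordinary least squares using only linear algebra. It is an illustration under stated assumptions: the data are noise-free and invented, the function name is my own, and the normal equations are solved with a hand-written Gaussian elimination rather than a statistical library.

```python
def fit_polynomial(x, y, degree):
    """Least-squares fit of a polynomial by solving the normal equations
    (X^T X) beta = X^T y, where X has columns 1, x, x^2, ..., x^degree."""
    X = [[xi ** k for k in range(degree + 1)] for xi in x]
    m = degree + 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    beta = [0.0] * m
    for i in reversed(range(m)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, m))
        beta[i] = (b[i] - s) / A[i][i]
    return beta

# Invented, noise-free data from Y = 1 + 2x + 3x^2.
x = [-2, -1, 0, 1, 2, 3]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

beta = fit_polynomial(x, y, degree=2)
print([round(coef, 6) for coef in beta])
```

The powers of $x$ are just extra columns in the design matrix; the estimation step is the same linear system that would be solved for any multiple linear regression.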