## Online index of plots and corresponding R scripts

Dear Readers of The Chemical Statistician,

Joanna Zhao, an undergraduate researcher in the Department of Statistics at the University of British Columbia, produced a visual index of over 100 plots using ggplot2, the R package written by Hadley Wickham.

An example of a plot and its source R code on Joanna Zhao’s catalog.

Click on a thumbnail of any picture in this catalog – you will see the figure AND all of the necessary code to reproduce it.  These plots are from Naomi Robbins’ book “Creating More Effective Graphs”.

If you

• want to produce an effective plot in R
• roughly know what the plot should look like
• but could really use an example to get started,

then this is a great resource for you!  A related GitHub repository has the code for ALL figures and the infrastructure for Joanna’s Shiny app.

I learned about this resource while working in my job at the British Columbia Cancer Agency; I am fortunate to attend a wonderful seminar series on statistics at the British Columbia Centre for Disease Control, and a colleague from this seminar told me about it.  By sharing this with you, I hope that it will immensely help you with your data visualization needs!

## Organic and Inorganic Chemistry Lesson of the Day – Stereoisomers

Two molecules are stereoisomers if they

• have the same molecular formula
• have the same sequence of bonds between each molecule’s constituent atoms
• have different 3-dimensional (spatial or geometric) orientations of the constituent atoms

Examples of stereoisomers include

It is important to emphasize that stereoisomers are defined for 2 or more molecules.  Consider 3 isomers, A, B and C.

• A and B may be stereoisomers.
• A and C may not be stereoisomers.  They may be structural isomers, which have the same atoms but different sequences of bonds.

## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

#### Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data set with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!
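To make the target output concrete, here is a minimal sketch of the same "sum and mean by group" task.  It is written in Python rather than SAS, so it illustrates the shape of the result and not PROC MEANS itself.  The records, the group labels and the function name `summarize_by_group` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (categorical group, continuous value).
records = [
    ("A", 10.0),
    ("B", 4.0),
    ("A", 14.0),
    ("B", 6.0),
    ("C", 5.0),
]


def summarize_by_group(rows):
    """Return one summary row per group: (group, sum, mean)."""
    values_by_group = defaultdict(list)
    for group, value in rows:
        values_by_group[group].append(value)
    # One output row per category, sorted by the category label.
    return [
        (group, sum(values), mean(values))
        for group, values in sorted(values_by_group.items())
    ]


summary = summarize_by_group(records)
for row in summary:
    print(row)
```

Each row of `summary` uses the category as the identifier and carries the calculated sum and mean as separate values, which is the layout described above.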

## Organic and Inorganic Chemistry Lesson of the Day – Optical Rotation is a Bulk Property

It is important to note that optical rotation is usually discussed as a bulk property, because it’s usually measured as a bulk property by a polarimeter.  Any individual enantiomeric molecule can almost certainly rotate linearly polarized light.  However, in a bulk sample of a chiral substance, there is usually another molecule that can rotate light in the opposite direction.  This is due to the uniform distribution of the stereochemistry of a random sample of the molecules of one compound.  (In other words, the substance consists of different stereoisomers of one compound, and the proportions of the different stereoisomers are roughly equal.)  Because one molecule’s rotation of the light can be cancelled by another molecule’s optical rotation in the opposite direction, such a random sample of the compound would have no net optical rotation.  This type of cancellation will definitely occur in a racemic mixture.  However, if a substance is enantiomerically pure, then all of the molecules in that substance will rotate linearly polarized light in the same direction – this substance is optically active.

## Organic and Inorganic Chemistry Lesson of the Day – The Difference Between (+)/(-) and (R)/(S) in Stereochemical Notation

In a previous Chemistry Lesson of the Day, I introduced the concept of optical rotation (a.k.a. optical activity).  You may also be familiar with the Cahn-Ingold-Prelog priority rules for designating stereogenic centres as either (R) or (S).   There is no direct association between the (+)/(-) designation and the (R)/(S) designation.  In other words, an (R)-enantiomer can be dextrorotary or levorotary – it must be determined on a case-by-case basis.  The same holds true for an (S)-enantiomer.

There is one case in which (R)/(S) can be used to distinguish between enantiomers: if the molecule has only 1 stereogenic centre, then the (R)/(S) designation of that centre also identifies which of the 2 enantiomers it is.

Furthermore, note that the designation of optical rotation applies to a molecule, whereas the R/S designation applies to a particular stereogenic centre within a molecule.  Thus, a molecule with 2 stereogenic centres may have one (R) stereogenic centre and one (S) stereogenic centre.  However, a chiral compound consisting purely of one enantiomer can rotate linearly polarized light in only one direction, and that direction must be determined on a case-by-case basis by a polarimeter.

## University of Toronto Alumni Reception with Meric Gertler – Tuesday, September 16, 2014 @ Sheraton Vancouver Wall Centre

I will attend the upcoming University of Toronto Alumni Reception in Vancouver to meet the new President of the University of Toronto, Meric Gertler.  If you will attend, please feel free to come up and say “Hello”!

Date: Tuesday, September 16, 2014

Time: 6:30 PM to 8:30 PM

Location:

Sheraton Vancouver Wall Centre
1088 Burrard St.
Vancouver, BC
V6Z 2R9

## Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function $f(x)$ that is continuous on the interval $[x_1, x_2]$, where $x_1$ and $x_2$ are any 2 points in the domain of $f(x)$.  Let

$x_m = 0.5x_1 + 0.5x_2$

be the midpoint of $x_1$ and $x_2$.  Then, if

$f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),$

then $f(x)$ is defined to be midpoint convex.

More generally, let’s consider any point within the interval $[x_1, x_2]$.  We can denote this arbitrary point as

$x_\lambda = \lambda x_1 + (1 - \lambda)x_2,$ where $0 < \lambda < 1$.

Then, if

$f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be convex.  If, whenever $x_1 \neq x_2$,

$f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be strictly convex.
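The convexity condition can be checked numerically for a given pair of points.  The sketch below is only a spot check on a grid of $\lambda$ values for one pair $(x_1, x_2)$, not a proof of convexity.  The function name `is_convex_on_grid`, the tolerance and the test functions are my own choices for illustration.

```python
def is_convex_on_grid(f, x1, x2, steps=99, tol=1e-12):
    """Check f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
    for lam on an evenly spaced grid strictly inside (0, 1)."""
    for i in range(1, steps + 1):
        lam = i / (steps + 1)
        x_lam = lam * x1 + (1 - lam) * x2
        chord = lam * f(x1) + (1 - lam) * f(x2)
        # A small tolerance absorbs floating-point rounding.
        if f(x_lam) > chord + tol:
            return False
    return True


def square(x):
    return x * x


def negative_square(x):
    return -x * x


print(is_convex_on_grid(square, -1.0, 3.0))           # x^2 satisfies the inequality
print(is_convex_on_grid(negative_square, -1.0, 3.0))  # -x^2 violates it
```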

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable $Y$ with a finite expected value and for any convex function $g(y)$,

$E[g(Y)] \geq g[E(Y)]$.

A function $f(x)$ is defined to be concave if $-f(x)$ is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable $Z$ with a finite expected value and for any concave function $h(z)$,

$E[h(Z)] \leq h[E(Z)]$.
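Both versions of Jensen’s inequality can be verified on a small example.  In this sketch, I assume a random variable that takes the values 1, 2, 3 and 4 with equal probability, and I use $y^2$ as the convex function and $\log(y)$ as the concave function.  These choices are mine, purely for illustration.

```python
import math
from fractions import Fraction

# Assumed example: Y takes the values 1, 2, 3, 4 with equal probability.
values = [1, 2, 3, 4]
prob = Fraction(1, 4)


def expectation(func):
    """E[func(Y)] for the discrete uniform Y defined above."""
    return sum(prob * func(y) for y in values)


mean_y = expectation(lambda y: y)            # E(Y)

# Convex case: g(y) = y^2, so Jensen says E[g(Y)] >= g[E(Y)].
convex_lhs = expectation(lambda y: y * y)    # E(Y^2), exact fraction
convex_rhs = mean_y ** 2                     # [E(Y)]^2, exact fraction

# Concave case: h(y) = log(y), so Jensen says E[h(Y)] <= h[E(Y)].
concave_lhs = expectation(math.log)          # E[log(Y)], floating point
concave_rhs = math.log(float(mean_y))        # log[E(Y)]

print(convex_lhs, ">=", convex_rhs, ":", convex_lhs >= convex_rhs)
print(concave_lhs, "<=", concave_rhs, ":", concave_lhs <= concave_rhs)
```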

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

## Organic and Inorganic Chemistry Lesson of the Day – DO NOT USE THE PREFIXES (d-) and (l-) TO CLASSIFY ENANTIOMERS

In a recent Chemistry Lesson of the Day, I introduced the concept of optical rotation, and I mentioned the use of (+) and (-) to denote dextrorotary and levorotary compounds, respectively.

Some people use d- and l- instead of (+) and (-), respectively.  I strongly discourage this, because there is an old system of classifying stereogenic centres that uses the prefixes D- and L-, and the obvious similarity between the prefixes of the 2 systems causes much confusion.

This old system classifies stereogenic centres based on the similarities of their configurations to the 2 enantiomers of glyceraldehyde.  It is confusing, non-intuitive, and outdated, so I will not discuss its rationale or details on my blog.  (If you are interested, here is a good explanation from the University of Maine’s chemistry department.)

Also, note that D- and L- classify stereogenic centres, whereas d- and l- classify enantiomers – this just adds more confusion.

In short,

• DO NOT use d- and l- to classify enantiomers; use (+) and (-) instead.
• DO NOT use D- and L- to classify stereogenic centres; use the Cahn-Ingold-Prelog priority rules (R/S) instead.

## Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of $n$ independent and identically distributed random variables, $X_1, X_2, ..., X_n$,

$P[\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.$

In other words, the empirical CDF of $X_1, X_2, ..., X_n$ converges uniformly to the true CDF.
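A small simulation can illustrate this convergence.  The sketch below assumes that the true distribution is Uniform(0, 1), so that $F_X(x) = x$ on $[0, 1]$, and it computes the largest gap between the empirical CDF and the true CDF for several sample sizes.  The seed, the sample sizes and the function name `sup_distance_uniform` are arbitrary choices of mine.

```python
import random


def sup_distance_uniform(sample):
    """Supremum over x of |F_n(x) - x| for a sample from Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical CDF is a step function that jumps at each sorted point,
    # so the largest gap occurs either at a jump or just before it.
    return max(
        max((i + 1) / n - x, x - i / n)
        for i, x in enumerate(xs)
    )


rng = random.Random(2014)  # fixed seed so that the run is reproducible
distances = {
    n: sup_distance_uniform([rng.random() for _ in range(n)])
    for n in (10, 100, 1000, 10000)
}

for n, d in distances.items():
    print(n, round(d, 4))
```

The printed gaps should tend to shrink as $n$ grows, although a single simulated run need not decrease at every step.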

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$.  Remarkably, it quantifies just how often $X$ is far away from $\mu$.  Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$.  It roughly says that the probability of $X$ being at least $k$ standard deviations away from $\mu$ is at most $k^{-2}$.  Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

Like Markov’s inequality, Chebyshev’s inequality holds regardless of the distribution of $X$, as long as $E(X)$ and $V(X)$ are finite.  (Markov’s inequality requires only $E(X)$ to be finite, but it also requires $X$ to be non-negative.)  This is quite a marvelous result!
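To see how loose or tight the bound can be, here is a sketch that compares the exact tail probability with Chebyshev’s bound for one specific distribution.  I assume $X$ follows an exponential distribution with rate 1, so that $\mu = 1$ and $\sigma = 1$.  This choice of distribution and the function name `exact_tail` are mine.

```python
import math

# Assumed example: X ~ Exponential(rate = 1), so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0


def exact_tail(k):
    """Exact P(|X - mu| >= k * sigma) for X ~ Exponential(1)."""
    # Upper tail: P(X >= a) = exp(-a) for a >= 0.
    upper = math.exp(-(mu + k * sigma))
    # Lower tail: P(X <= b) = 1 - exp(-b) if b > 0, and 0 otherwise.
    cutoff = mu - k * sigma
    lower = 1.0 - math.exp(-cutoff) if cutoff > 0 else 0.0
    return upper + lower


for k in (1, 2, 3, 4):
    print(k, round(exact_tail(k), 4), "<=", round(1 / k**2, 4))
```

For this distribution, the exact probabilities sit well below $1 \div k^2$, which is expected: the bound has to hold for every distribution with a finite mean and variance, so it is rarely tight for any particular one.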

## Organic and Inorganic Chemistry Lesson of the Day – Optical Rotation (a.k.a. Optical Activity)

A substance consisting of a chiral compound can rotate linearly polarized light – this property is known as optical rotation (more commonly called optical activity).  The direction in which the light is rotated is one way to distinguish between a pair of enantiomers, as they rotate linearly polarized light in opposite directions.

Imagine that you are an enantiomer, and linearly polarized light approaches you.

• If the light is rotated clockwise from your perspective, then you are a dextrorotary enantiomer.
• Otherwise, if the light is rotated counterclockwise from your perspective, then you are a levorotary enantiomer.

In a previous Chemistry Lesson of the Day, I introduced the concept of diastereomers, and I used threose as an example.  Let’s use threose to illustrate some notation about optical activity.

(-)-Threose

• Levorotary compounds are denoted by the prefix (-), followed by a hyphen, then followed by the name of the compound.  The above molecule is (-)-threose.
• Dextrorotary compounds are denoted by the prefix (+), followed by a hyphen, then followed by the name of the compound.  The enantiomer of (-)-threose is (+)-threose.

A compound’s optical rotation is determined by a polarimeter.

I strongly discourage the use of the prefixes d- and l- to distinguish between enantiomers.  Use (+) and (-) instead.

Beware of the difference between designating enantiomers as (+) or (-) and designating stereogenic centres as either (R) or (S).

It is important to note that optical rotation is usually referred to as a bulk property.

## Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable $X$ is just an expected value of a function of $X$.  Specifically,

$V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X)$.

Let’s substitute $(X - \mu)^2$ into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant $c$ with another constant, $b^2$.

$\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}$

$P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2$

$P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2$

$P[|X - \mu| \geq b] \leq V(X) \div b^2$

Now, let’s substitute $b$ with $k \sigma$, where $\sigma$ is the standard deviation of $X$.  (I can make this substitution, because $\sigma$ is just another constant.)

$\text{Let} \ k \sigma = b. \ \ \text{Then,}$

$P[|X - \mu| \geq k \sigma] \leq V(X) \div (k^2 \sigma^2)$

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)
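As a quick worked instance of Chebyshev’s inequality, take $k = 2$ (a value that I chose purely for illustration):

$P[|X - \mu| \geq 2 \sigma] \leq 1 \div 2^2 = 0.25$

Taking the complement gives

$P[|X - \mu| < 2 \sigma] \geq 1 - 0.25 = 0.75$

In other words, for any random variable with a finite mean and variance, at least 75% of the probability lies within 2 standard deviations of the mean.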

## Getting Ready for Mathematical Classes in the New Semester – Guest-Blogging on SFU’s Career Services Informer

The following blog post was slightly condensed for editorial brevity and then published on the Career Services Informer, the official blog of the Career Services Centre at my undergraduate alma mater, Simon Fraser University.

As a new Fall semester begins, many students start courses such as math, physics, computing science, engineering and statistics.  These can be tough classes with a rapid progression in workload and difficulty, but steady preparation can mount a strong defense to the inevitable pressure and stress.  Here are some tips to help you to get ready for those classes.

## Organic and Inorganic Chemistry Lesson of the Day – Cis/Trans Isomers Are Diastereomers

Recall that the definition of diastereomers is simply 2 stereoisomers that are NOT enantiomers.  Diastereomers often have at least 2 stereogenic centres, and my previous lesson showed an example of how such diastereomers can arise.

However, while an enantiomer must have at least 1 stereogenic centre, there is nothing in the definition of a diastereomer that requires it to have any stereogenic centres.  In fact, a diastereomer does not have to be chiral.  A pair of cis/trans isomers are also diastereomers.  Recall the example of trans-1,2-dibromoethylene and cis-1,2-dibromoethylene:

Image courtesy of Roland1952 on Wikimedia.

These 2 molecules are stereoisomers – they have the same atoms and sequence/connectivity of bonds, but they differ in their spatial orientations.  They are NOT mirror images of each other, let alone non-superimposable mirror images.  Thus, by definition, they are diastereomers, even though they are not chiral.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

$P(X \geq c) \leq E(X) \div c,$ where $c > 0$.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable $X$ “far” away from its “centre” or “central value”?

Intuitively, the “central value” of $X$ is the value of $X$ that is most commonly (or most frequently) observed.  Thus, as $X$ deviates further and further from its “central value”, we would expect those distant-from-the-centre values to be less frequently observed.

Recall that the expected value, $E(X)$, is a measure of the “centre” of $X$.  Thus, we would expect that the probability of $X$ being very far away from $E(X)$ is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As $c$ becomes really far away from $E(X)$, the event $X \geq c$ becomes less probable.

You can confirm this by substituting several key values of $c$.

• If $c = E(X)$, then $P[X \geq E(X)] \leq 1$; this is the highest upper bound that $P(X \geq c)$ can get.  This makes intuitive sense; $X$ is going to be frequently observed near its own expected value.

• If $c \rightarrow \infty$, then the upper bound $E(X) \div c$ shrinks to $0$, so $P(X \geq \infty) \leq 0$.  By Kolmogorov’s axioms of probability, any probability must be inclusively between $0$ and $1$, so $P(X \geq \infty) = 0$.  This makes intuitive sense; there is no possible way that $X$ can be bigger than positive infinity.
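The same kind of substitution can be done exactly for a concrete random variable.  In this sketch, I assume that $X$ is the outcome of a fair six-sided die, which is non-negative with $E(X) = 3.5$, and I compare $P(X \geq c)$ with $E(X) \div c$ for several values of $c$.  The die and the chosen values of $c$ are my own illustration.

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)
expected_value = sum(prob * x for x in outcomes)


def tail_probability(c):
    """Exact P(X >= c)."""
    return sum(prob for x in outcomes if x >= c)


def markov_bound(c):
    """Markov's upper bound E(X) / c, valid for c > 0."""
    return expected_value / c


for c in (2, 4, 6, 12):
    print(c, tail_probability(c), "<=", markov_bound(c))
```

As $c$ moves further above $E(X)$, both the exact probability and the bound get smaller, and the probability never exceeds the bound.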

## Organic and Inorganic Chemistry Lesson of the Day – Meso Isomers

A molecule is a meso isomer if it

Meso isomers have an internal plane of symmetry, which arises from 2 identically substituted but oppositely oriented stereogenic centres.  (By “oppositely oriented”, I mean the stereochemical orientation as defined by the Cahn-Ingold-Prelog priority system.  For example, in a meso isomer with 2 tetrahedral stereogenic centres, one stereogenic centre needs to be “R”, and the other stereogenic centre needs to be “S”. )  This symmetry results in the superimposability of a meso isomer’s mirror image.

By definition, a meso isomer and a chiral stereoisomer of the same compound are a pair of diastereomers.

Having at least 2 stereogenic centres is a necessary but not sufficient condition for a molecule to have meso isomers.  Recall that a molecule with $n$ tetrahedral stereogenic centres has at most $2^n$ stereoisomers; such a molecule would have fewer than $2^n$ stereoisomers if it has meso isomers.

Meso isomers are also called meso compounds.

Here is an example of a meso isomer; notice the internal plane of symmetry – the horizontal line that divides the 2 stereogenic carbons:

(2R,3S)-tartaric acid

Image courtesy of Project Osprey from Wikimedia (with a slight modification).
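The counting argument can be sketched for tartaric acid, which has $n = 2$ stereogenic centres.  The sketch below is a simplified bookkeeping model, and it assumes exactly 2 identically substituted centres, so that reading the molecule from the other end swaps the 2 labels without changing the molecule.  It should not be applied to other cases, and the function name is hypothetical.

```python
from itertools import product


def stereoisomers_two_identical_centres():
    """Distinct (R/S, R/S) labelings for 2 identically substituted centres."""
    distinct = set()
    for labels in product("RS", repeat=2):
        # Because the 2 centres carry the same substituents, (R, S) and
        # (S, R) describe the same molecule viewed from opposite ends.
        canonical = min(labels, labels[::-1])
        distinct.add(canonical)
    return sorted(distinct)


isomers = stereoisomers_two_identical_centres()
print(isomers)
print(len(isomers), "distinct stereoisomers, versus an upper bound of", 2**2)
```

The labeling with one “R” and one “S” corresponds to the meso isomer, which is why tartaric acid has 3 stereoisomers rather than $2^2 = 4$.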

## The Chi-Squared Test of Independence – An Example in Both R and SAS

#### Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, $X$ and $Y$, the chi-squared test of independence determines whether or not there exists a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

$H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y$

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.
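Before turning to R and SAS, here is a language-neutral sketch of the calculation itself, written in Python with only the standard library.  It computes the expected counts from the marginal totals and then Pearson’s chi-squared statistic, without any continuity correction.  The 2-by-2 table of counts is hypothetical, and so is the function name `chi_squared_independence`.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence: expected count = row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical 2-by-2 contingency table of observed counts.
observed = [[20, 30], [30, 20]]
statistic, df, expected = chi_squared_independence(observed)

# Shortcut valid for df = 1 only: a chi-squared variable with 1 degree of
# freedom is a squared standard normal, so P(chi2 > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)
print(statistic, df, round(p_value, 4))
```

For tables with more than 1 degree of freedom, the p-value needs the general chi-squared distribution, which this sketch does not implement.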

## Mathematical Statistics Lesson of the Day – Markov’s Inequality

Markov’s inequality is an elegant and very useful inequality that relates the probability of an event concerning a non-negative random variable, $X$, with the expected value of $X$.  It states that

$P(X \geq c) \leq E(X) \div c,$

where $c > 0$.

I find Markov’s inequality to be beautiful for 2 reasons:

1. It applies to both continuous and discrete random variables.
2. It applies to any non-negative random variable from any distribution with a finite expected value.

In a later lesson, I will discuss the motivation and intuition behind Markov’s inequality, which has useful implications for understanding a data set.

## Organic and Inorganic Chemistry Lesson of the Day – Racemic Mixtures

A racemic mixture is a mixture that contains equal amounts of both enantiomers of a chiral molecule.  (By amount, I mean the usual unit of quantity in chemistry – the mole.  Of course, since enantiomers are isomers, their molar masses are equal, so a racemic mixture would contain equal masses of both enantiomers, too.)

In synthesizing enantiomers, if a set of reactants combine to form a racemic mixture, then the reactants are called non-stereoselective or non-stereospecific.

In 1895, Otto Wallach proposed that a racemic crystal is more dense than a crystal with purely one of the enantiomers; this is known as Wallach’s rule.  Brock et al. (1991) substantiated this with crystallographic data.

Reference:

Brock, C. P., Schweizer, W. B., & Dunitz, J. D. (1991). On the validity of Wallach’s rule: on the density and stability of racemic crystals compared with their chiral counterparts. Journal of the American Chemical Society, 113(26), 9811-9820.

## Applied Statistics Lesson of the Day – The Coefficient of Variation

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set.  However, consider the following 2 hypothetical cases:

1. the standard deviation for the incomes of households in Canada is $2,000
2. the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks.  Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets.  One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.
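Here is a minimal sketch of the sample coefficient of variation, applied to 2 made-up data sets that mimic the household and bank example.  Both data sets are hypothetical, and I constructed them so that each has a sample standard deviation of 2,000 but a very different mean.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample coefficient of variation: s divided by x-bar."""
    return stdev(data) / mean(data)


# Hypothetical incomes; each set has a sample standard deviation of 2,000.
household_incomes = [48_000, 48_000, 50_000, 52_000, 52_000]
bank_incomes = [
    999_998_000,
    999_998_000,
    1_000_000_000,
    1_000_002_000,
    1_000_002_000,
]

cv_households = coefficient_of_variation(household_incomes)
cv_banks = coefficient_of_variation(bank_incomes)

print(stdev(household_incomes), stdev(bank_incomes))
print(cv_households, cv_banks)
```

The standard deviations match, but the coefficient of variation for the households is many orders of magnitude larger than that for the banks, which captures the difference in relative variability.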

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes.  I hope that all new statistics students get to learn this alternative measure of dispersion.