## Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

• 5% in the 1st year
• 10% in the 2nd year
• 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of $1 + r$, where $r$ is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of your last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, $G$.  The average factor of growth over $n$ years is

$G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n}$,

where $r_i$ is the rate of return in year $i$, $i = 1, 2, 3, ..., n$.  The average annual rate of return is $G - 1$.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

Returning to our example, our average factor of growth is

$G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242$.

Thus, our average annual rate of return is $G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%$.
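The calculation above can be sketched in a few lines of Python; `geometric_mean` is a hypothetical helper written here for illustration, computed via the mean of the logarithms for numerical stability:

```python
import math

def geometric_mean(factors):
    """Geometric mean of positive numbers, via the mean of the logarithms."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))

# Annual growth factors from the example: 1 + r for each year's return r
rates = [0.05, 0.10, 0.15]
G = geometric_mean([1 + r for r in rates])
avg_rate = G - 1

print(round(G, 6))         # 1.099242
print(round(avg_rate, 6))  # 0.099242
```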

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

1. The first set, $S_1$, consists of your data $x_1, x_2, ..., x_n$, and this set has a sample size of $n$.
2. The second set, $S_2$, also has a sample size of $n$, but all $n$ values are the same – let’s call this common value $y$.
• What number must $y$ be such that the sums in $S_1$ and $S_2$ are equal?  This value of $y$ is the arithmetic mean of the first set.
• What number must $y$ be such that the products in $S_1$ and $S_2$ are equal?  This value of $y$ is the geometric mean of the first set.

Note that the geometric mean is only applicable to positive numbers.

## Inorganic Chemistry Lesson of the Day – 2 Different Ways for Chirality to Arise in Coordination Complexes

In a previous Chemistry Lesson of the Day, I introduced chirality and enantiomers in organic chemistry; recall that chirality in organic chemistry often arises from an asymmetric carbon that is attached to 4 different substituents.  Chirality is also observed in coordination complexes in inorganic chemistry.  There are 2 ways for chirality to be observed in coordination complexes:

1.   The metal centre has an asymmetric arrangement of ligands around it.

• This type of chirality can be observed in octahedral complexes and tetrahedral complexes, but not square planar complexes.  (Recall that square planar complexes have a plane formed by the metal and its 4 ligands.  This plane can serve as a plane of reflection, and the mirror image of a square planar complex across this plane is clearly superimposable onto the original, so such a complex cannot be chiral just by having 4 different ligands alone.)

2.   The metal centre has a chiral ligand (i.e. the ligand itself has a non-superimposable mirror image).

• Following the sub-bullet under Point #1, a square planar complex can be chiral if it has a chiral ligand.

## Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the car travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

$H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i)$.

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

1. 60 km/hr for 100 km
2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

$(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]$

$= 46.45 \text{ km/hr}$

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

$\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}$
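A quick check of this example in Python (a minimal sketch; `weighted_harmonic_mean` is a hypothetical helper mirroring the formula above):

```python
def weighted_harmonic_mean(values, weights):
    """H = sum(w_i) / sum(w_i / x_i)."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

speeds = [60, 40]       # km/hr
distances = [100, 140]  # km; the distances act as the weights

H = weighted_harmonic_mean(speeds, distances)
print(round(H, 2))  # 46.45

# Same number as Average Speed = total distance / total time:
total_time = 100 / 60 + 140 / 40  # hours
print(round(240 / total_time, 2))  # 46.45
```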

## Organic and Inorganic Chemistry Lesson of the Day – Chirality and Enantiomers

In chemistry, chirality is a property of a molecule such that the molecule has a non-superimposable mirror image.  In other words, a molecule is chiral if, upon reflection by any plane, it cannot be superimposed onto itself.

Chirality is a property of the 3-dimensional orientation of a molecule, and molecules exhibiting chirality are stereoisomers.  Specifically, two molecules are enantiomers of each other if they are non-superimposable mirror images of each other.  In organic chemistry, chirality commonly arises out of an asymmetric carbon atom, which is a carbon that is attached to 4 different substituents.  Chirality in inorganic chemistry is more complicated, and I will discuss this in a later lesson.

It is important to note that enantiomers are defined as pairs.  This will be later emphasized in the lesson on diastereomers.

## Useful Options For Every SAS Program – Lessons and Resources from Dr. Jerry Brunner

#### Introduction

Today, I want to share some useful options that I put at the beginning of every SAS program that I write.  These options make using SAS much easier in practice.  My applied statistics professor from the University of Toronto, Jerry Brunner, taught me some of these options when I first learned SAS in his class, and I’m grateful for that.  In later team projects, I have met SAS programmers who were delightfully surprised by the existence of these options and wished that they had learned them earlier.  I hope that they will help you with your SAS programming.  I have also learned some useful options by posting questions on the SAS Support Communities online forum.

#### Clearing Output

After running your SAS program many times to test and debug, you will have accumulated numerous pages of old and useless output and log.  Scrolling through and searching for the desired portion to read in either file can be tedious and difficult.  Thus, it’s really helpful to have the option of clearing all of the output and the log whenever you run your script.  I put the following commands at the top of every one of my SAS scripts.

```sas
/*
Useful Options For Every SAS Program
- With Some Tips Learned From Dr. Jerry Brunner
by Eric Cai - The Chemical Statistician
*/

/* clear the log and output windows */
dm 'cle log; cle out;';

/* close and re-open the HTML output destination to clear it */
ods html close;
ods html;

/* clear the ODS results window, then reset the listing destination */
dm 'odsresults; clear';
ods listing close;
ods listing;
```

## Vancouver Machine Learning and Data Science Meetup – NLP to Find User Archetypes for Search & Matching

I will attend the following seminar by Thomas Levi in the next R/Machine Learning/Data Science Meetup in Vancouver on Wednesday, June 25.  If you will also attend this event, please come up and say “Hello”!  I would be glad to meet you!

To register, sign up for an account on Meetup, and RSVP in the R Users Group, the Machine Learning group or the Data Science group.

Title: NLP to Find User Archetypes for Search & Matching

Speaker: Thomas Levi, Plenty of Fish

Location: HootSuite, 5 East 8th Avenue, Vancouver, BC

Time and Date: 6-8 pm, Wednesday, June 25, 2014

Abstract

As the world’s largest free dating site, Plenty Of Fish would like to be able to match with and allow users to search for people with similar interests. However, we allow our users to enter their interests as free text on their profiles. This presents a difficult problem in clustering, search and machine learning if we want to move beyond simple ‘exact match’ solutions to a deeper archetypal user profiling and thematic search system. Some of the common issues that arise are misspellings, synonyms (e.g. biking, cycling and bicycling) and similar interests (e.g. snowboarding and skiing) on a several million user scale. In this talk I will demonstrate how we built a system utilizing topic modelling with Latent Dirichlet Allocation (LDA) on a several hundred thousand word vocabulary over ten million+ North American users and explore its applications at POF.

Bio

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His post-doctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC.  In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behaviour analysis, social network analysis, scam detection, Bot detection, matching algorithms, topic modelling and semantic analysis.

Schedule
• 6:00PM Doors are open, feel free to mingle
• 6:30 Presentations start
• 8:00 Off to a nearby watering hole (Mr. Brownstone?) for a pint, food, and/or breakout discussions

## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

$\text{Sensitivity}$

on the vertical axis and

$1 - \text{Specificity}$

on the horizontal axis, the line of no discrimination is the line that passes through the points

$(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$

and

$(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.

In other words, the line of no discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we did not have any predictors, then we can rely on only random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard for a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher”, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)
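A small simulation (with made-up labels) illustrates why random guessing traces the diagonal: a classifier that ignores the predictors and guesses “Success” with any probability $q$ has sensitivity and $1 - \text{Specificity}$ both approximately equal to $q$.

```python
import random

random.seed(0)
labels = [random.randint(0, 1) for _ in range(100_000)]

points = []
for q in (0.2, 0.5, 0.8):
    # Predict "Success" with probability q, ignoring all predictors.
    preds = [1 if random.random() < q else 0 for _ in labels]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    # (1 - specificity, sensitivity): both coordinates are close to q,
    # so each point lies on the diagonal line of no discrimination.
    points.append((fp / (fp + tn), tp / (tp + fn)))
```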

## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, ..., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a $p\text{th}$-order polynomial relationship between the response variable $Y$ and the explanatory variable $x$.

$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon$

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).
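As a minimal sketch (with simulated data), the coefficients of a quadratic model can indeed be estimated by ordinary least squares, because the design matrix simply gains columns for the powers of $x$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# The design matrix has columns 1, x, x^2; the model is still linear
# in the coefficients beta, so ordinary least squares applies directly.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta is close to the true coefficients [1.0, 2.0, -3.0]
```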

## Organic and Inorganic Chemistry Lesson of the Day – Cis/Trans Isomers

Cis/Trans isomerism is a type of stereoisomerism in which the relative positions of 2 functional groups differ between the isomers.  An isomer is cis if the 2 functional groups of interest are closer to each other, and trans if they are farther from each other.  You may find these definitions to be non-rigorous based on the subjectivity of “closer” and “farther”, but cis/trans isomers have only 2 possible relative positions for these functional groups, so “closer” and “farther” are actually obvious to identify.  It’s easier to illustrate this with some examples.

Image courtesy of Roland1952 on Wikimedia.

The molecule on the left is trans-1,2-dibromoethylene, and the molecule on the right is cis-1,2-dibromoethylene.  The 2 functional groups of interest are the 2 bromides, and the isomerism arises from the 2 different ways that these bromides can be positioned relative to each other.  (Notice that the 2 bromides are bonded to different carbon atoms, thus the “1,2-” designation in their names.)  Relative to the other bromide, one bromide can either be on the same side of the double bond (“closer”) or on the opposite side of the double bond (“farther”).  To view the isomerism from another perspective, the double bond serves as the plane of separation, and the bromides can be on different sides of that plane (trans) or the same side of the plane (cis).  Cis/Trans isomerism often arises in organic chemistry because of a bond with restricted rotation, and such restriction is often due to a double bond or a ring structure.  Such a bond often serves as the plane of separation on which the relative positions of the 2 functional groups can be established.

Let’s now consider a coordination complex in inorganic chemistry.

Image courtesy of Anypodetos on Wikimedia.

Cisplatin and transplatin are both 4-coordinated complexes with a square planar geometry.  Their ligands are 2 chlorides and 2 ammonias.  When looking at the pictures above, it’s obvious that there are only 2 relative positions for one chloride to take compared to the other chloride – they can be either closer to each other (cis) or farther apart (trans).

Cis/Trans isomerism can also arise in 6-coordinated octahedral complexes in inorganic chemistry.

## Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, $H$, for $n$ positive real numbers $x_1, x_2, ..., x_n$ is defined as

$H = n \div (1/x_1 + 1/x_2 + ... + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}$.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

1. 60 km/hr for 120 km
2. 40 km/hr for another 120 km

Then its average speed for this trip is

$S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}$

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

$S_{avg} = \Delta \text{distance} \div \Delta \text{time}$

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

$\Delta \text{time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})$

$\Delta \text{time} = 5 \text{ hours}$

Thus,

$S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text{ km/hr}$

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!
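Both calculations can be verified in a few lines of Python (a minimal sketch; note that Python’s standard `statistics.harmonic_mean` computes the same quantity):

```python
speeds = [60, 40]  # km/hr, with equal 120-km legs, so equal weights

# Direct transcription of H = n / sum(1 / x_i)
H = len(speeds) / sum(1 / s for s in speeds)

# Cross-check with kinematics: total distance / total time
total_time = 120 / 60 + 120 / 40  # 5 hours
avg_speed = 240 / total_time

print(round(H, 6), round(avg_speed, 6))  # 48.0 48.0
```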

## Inorganic Chemistry Lesson of the Day: 5-Coordinated Complexes

There are 2 common geometries for 5-coordinated complexes:

• Square pyramid: The metal centre is coordinated to 4 ligands in a plane and a 5th ligand above the plane.
• Trigonal bipyramid: The metal centre is coordinated to 3 ligands in a plane and 2 more ligands, one above and one below the plane.

## Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Can Apply to the Sum

The central limit theorem (CLT) is often stated in terms of the sample mean of independent and identically distributed random variables.  An often unnoticed or forgotten aspect of the CLT is its applicability to the sample sum of those variables.  Since $n$, the sample size, is just a constant, it can be multiplied by $\bar{X}$ to obtain $\sum_{i = 1}^{n} X_i$.  For a sufficiently large $n$, this new statistic still has an approximately normal distribution, just with a new expected value and a new variance.

$\sum_{i = 1}^{n} X_i \overset{approx.}{\sim} \text{Normal} (n\mu, n\sigma^2)$
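A quick simulation (with parameters chosen purely for illustration) makes this concrete: sums of $n = 50$ Exponential(1) draws, for which $\mu = 1$ and $\sigma^2 = 1$, have sample mean and sample variance close to $n\mu = 50$ and $n\sigma^2 = 50$.

```python
import random
import statistics

random.seed(42)
n, trials = 50, 10_000

# Each element below is the sum of n independent Exponential(1) draws.
sums = [sum(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

print(statistics.mean(sums))      # close to n * mu = 50
print(statistics.variance(sums))  # close to n * sigma^2 = 50
```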

## Inorganic Chemistry Lesson of the Day: 2-Coordinated Complexes

Some coordination complexes have just 2 ligands attached to the metal centre.  These complexes have a linear geometry; this allows the greatest separation of the electron clouds in the metal-ligand bonds, which minimizes electron repulsion.

## Applied Statistics Lesson of the Day – What “Linear” in Linear Regression Really Means

Linear regression is one of the most commonly used tools in statistics, yet one of its fundamental features is commonly misunderstood by many non-statisticians.  I have witnessed this misunderstanding on numerous occasions in my work experience, statistical consulting and statistical education, and it is important for all statisticians to be aware of this common misunderstanding, to anticipate it when someone is about to make this mistake, and to educate that person about the correct meaning.

Consider the simple linear regression model:

$Y = \beta_0 + \beta_1x + \varepsilon$.

The “linear” in linear regression refers to the linearity between the response variable ($Y$) and the regression coefficients ($\beta_0$ and $\beta_1$).  It DOES NOT refer to the linearity between the response variable ($Y$) and the explanatory variable ($x$).  This is contrary to mathematical descriptions of linear relationships; for example, when high school students learn about the equation of a line,

$y = mx + b$

the relationship is called “linear” because of the linearity between $y$ and $x$.  This is the source of the mistaken understanding about the meaning of “linear” in linear regression; I am grateful that my applied statistics professor, Dr. Boxin Tang, emphasized the statistical meaning of “linear” when he taught linear regression to me.

Why is this difference in terminology important?  A casual observer may be puzzled by this apparent nit-picking of the semantics.  This terminology is important because the estimation of the regression coefficients in a regression model depends on the relationship between the response variable and the regression coefficients.  If this relationship is linear, then the estimation is very simple and can be done analytically by linear algebra.  If not, then the estimation can be very difficult and often cannot be done analytically – numerical methods must be used, instead.

Now, one of the assumptions of linear regression is the linearity between the response variable ($Y$) and the explanatory variable ($x$).  However, what if the scatter plot of $Y$ versus $x$ reveals a non-linear relationship, such as a quadratic relationship?  In that case, the solution is simple – just replace $x$ with $x^2$.  (Admittedly, if the interpretation of the regression coefficient is important, then such interpretation becomes more difficult with this transformation.  However, if prediction of the response is the key goal, then such interpretation is not necessary, and this is not a problem.)  The important point is that the estimation of the regression coefficients can still be done by linear algebra after the transformation of the explanatory variable.
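As a brief illustration (with simulated data), replacing $x$ with $x^2$ keeps the estimation within linear algebra:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 40)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.05, size=x.size)

# The scatter plot of y versus x is curved, but the model is still
# linear in the coefficients once x is replaced by x**2, so ordinary
# least squares still estimates the coefficients.
X = np.column_stack([np.ones_like(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta is close to the true coefficients [2.0, 0.5]
```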

## Inorganic Chemistry Lesson of the Day: 4-Coordinated Complexes

My last lesson stated that the most common coordination number for coordination complexes is 6.  The next most common coordination number is 4, and complexes with this type of coordination adopt either the tetrahedral or the square planar geometry.  The tetrahedron is far more common than the square plane for 4-coordinated complexes, and the type of geometry depends a lot on the size and bonding strength of the ligands.  If the ligands are too big, then a tetrahedral geometry provides greater separation between ligands and minimizes electron repulsion.  If the ligands are too small, then there is room for 2 extra ligands to bond to the metal centre to form a 6-coordinated complex, and an octahedral geometry is adopted instead.

The square planar geometry is usually adopted by 4-coordinated complexes with metal ions that have a $d^8$ electronic configuration.  Examples of such ions include Ni$^{2+}$, Pd$^{2+}$, Pt$^{2+}$, and Au$^{3+}$.

## Determining chemical concentration with standard addition: An application of linear regression in JMP – A Guest Blog Post for the JMP Blog

I am very excited to announce that I have been invited by JMP to be a guest blogger for its official blog!  My thanks to Arati Mejdal, Global Social Media Manager for the JMP Division of SAS, for welcoming me into the JMP blogging community with so much support and encouragement, and I am pleased to publish my first post on the JMP Blog!  Mark Bailey and Byron Wingerd from JMP provided some valuable feedback to this blog post, and I am fortunate to get the chance to work with and learn from them!

Following the tradition of The Chemical Statistician, this post combines my passions for statistics and chemistry by illustrating how simple linear regression can be used for the method of standard addition in analytical chemistry.  In particular, I highlight the useful capability of the “Inverse Prediction” function under the “Fit Model” platform in JMP to estimate the predictor given an observed response value (i.e. estimate the value of $x_i$ given $y_i$).  Check it out!

## Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves

A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the $\text{Sensitivity}$ (the true positive rate) versus $1 - \text{Specificity}$ (1 minus the true negative rate) of a binary classifier while varying its discrimination threshold.  In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy of the logistic regression model.

To illustrate with an example, let’s consider the Bernoulli response variable $Y$ and the covariates $X_1, X_2, ..., X_p$.  A logistic regression model takes the covariates as inputs and returns $P(Y = 1)$.  You, as the user of the model, must decide above which value of $P(Y = 1)$ you will predict that $Y = 1$; this value is the discrimination threshold.  A common threshold is $P(Y = 1) = 0.5$.

Once you finish fitting the model with a training set, you can construct an ROC curve by following these steps below:

1. Set a discrimination threshold.
2. Use the covariates to predict $Y$ for each observation in a validation set.
3. Since you have the actual response values in the validation set, you can then calculate the sensitivity and specificity for your logistic regression model at that threshold.
4. Repeat Steps 1-3 with a new threshold.
5. Plot the values of $\text{Sensitivity}$ versus $1 - \text{Specificity}$ for all thresholds.  The result is your ROC curve.

The use of a validation set to assess the predictive accuracy of a model is called validation, and it is a good practice for supervised learning.  If you have another fresh data set, it is also good practice to use that as a test set to assess the predictive accuracy of your model.

Note that you can perform Steps 2-5 for the training set, too – this is often done in statistics when you don’t have many data to work with, and the best that you can do is to assess the predictive accuracy of your model on the data set that you used to fit the model.
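The steps above can be sketched in Python; the probabilities and labels below are made up to stand in for a fitted model’s output on a validation set:

```python
def roc_points(probs, labels, thresholds):
    """For each threshold, compute (1 - specificity, sensitivity)."""
    points = []
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1 - specificity, sensitivity))
    return points

# Hypothetical fitted probabilities and actual labels from a validation set:
probs  = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,   0,   1,   1,   1  ]
curve = roc_points(probs, labels, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Plotting `curve` (sensitivity against $1 - \text{Specificity}$) gives the ROC curve; a threshold of 0 always lands at $(1, 1)$ and a threshold above every fitted probability lands at $(0, 0)$.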

## Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

$h(t) = f(t) \div S(t)$.

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.
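For quick reference, here is a compact summary of these standard relationships (the video derives them carefully); writing $H(t) = \int_0^t h(u) \, du$ for the cumulative hazard:

$f(t) = -\dfrac{dS(t)}{dt}$ and $S(t) = \int_t^{\infty} f(u) \, du$,

$h(t) = -\dfrac{d}{dt} \ln S(t)$ and $S(t) = \exp[-H(t)]$,

$f(t) = h(t) \exp[-H(t)]$ and $h(t) = f(t) \div S(t)$.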

I am excited to continue adding to my YouTube channel’s collection of video tutorials.  Please stay tuned for more!

## Inorganic Chemistry Lesson of the Day: 6-Coordinated Complexes

The most common coordination number for inorganic coordination complexes is 6, and these complexes will most commonly adopt an octahedral geometry.  This geometry is especially common for coordination complexes with a first-row transition metal ion as the Lewis-acid centre.  It consists of 4 ligands forming a plane, with 1 ligand above and 1 ligand below the plane.  The “octa-” prefix in “octahedral” refers to the 8 faces that this geometry has.

Two alternative geometries of 6-coordinated complexes are the trigonal prism and hexagonal plane; these are far less common than the octahedron.