## Video Tutorial: Allelic Frequencies Remain Constant From Generation to Generation Under the Hardy-Weinberg Equilibrium

The Hardy-Weinberg law is a fundamental principle in statistical genetics.  If its 7 assumptions are fulfilled, then it predicts that the allelic frequency of a genetic trait will remain constant from generation to generation.  In this new video tutorial on my YouTube channel, I explain the math behind the Hardy-Weinberg theorem.  In particular, I clarify the origin of the connection between allelic frequencies and genotypic frequencies in the second generation – I have not found a single textbook or web site on this topic that explains this calculation, so I hope that my explanation is helpful to you.
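As a quick numerical check of the claim that the allelic frequency stays constant, here is a minimal sketch.  It assumes a single locus with 2 alleles (labelled `A` and `a` here for illustration), random mating, and a hypothetical starting frequency of 0.3 for `A`.  It uses only the standard textbook genotype proportions and is not a substitute for the derivation in the video.

```python
def genotype_frequencies(p):
    """Return the (AA, Aa, aa) frequencies under random mating,
    where p is the frequency of allele A and q = 1 - p is that of allele a."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def allele_frequency_from_genotypes(freq_AA, freq_Aa):
    """Every AA individual carries 2 copies of A and every Aa individual carries 1,
    so the frequency of A is freq_AA plus half of freq_Aa."""
    return freq_AA + 0.5 * freq_Aa


p = 0.3  # hypothetical starting frequency of allele A
freq_AA, freq_Aa, freq_aa = genotype_frequencies(p)
p_next = allele_frequency_from_genotypes(freq_AA, freq_Aa)

print("Genotype frequencies (AA, Aa, aa):", freq_AA, freq_Aa, freq_aa)
print("Frequency of A in the next generation:", p_next)
```

Algebraically, this is $p^2 + \frac{1}{2}(2pq) = p(p + q) = p$, so the frequency of `A` is unchanged.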

You can also watch the video below the fold!

## Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial on my YouTube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.
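Here is a minimal sketch of that calculation, assuming a hypothetical 2-by-3 table of observed counts (the numbers are made up for illustration).  Under independence, the expected count in a cell is the row's marginal total multiplied by the column's marginal proportion.

```python
# Hypothetical observed counts: 2 rows and 3 columns of a contingency table.
observed = [
    [10, 20, 10],
    [20, 30, 30],
]

row_totals = [sum(row) for row in observed]        # marginal totals of the rows
col_totals = [sum(col) for col in zip(*observed)]  # marginal totals of the columns
grand_total = sum(row_totals)

# Marginal proportion of each column.
col_proportions = [c / grand_total for c in col_totals]

# Expected count = (row's marginal total) x (column's marginal proportion).
expected = [[r * prop for prop in col_proportions] for r in row_totals]

for row in expected:
    print([round(value, 3) for value in row])
```

The expected counts in each row add up to that row's marginal total, which is a handy check on the arithmetic.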

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

You can also watch the video below the fold!

## Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

• 5% in the 1st year
• 10% in the 2nd year
• 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of $1 + r$, where $r$ is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of your last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, $G$.  The average factor of growth over $n$ years is

$G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n}$,

where $r_i$ is the rate of return in year $i$, $i = 1, 2, 3, ..., n$.  The average annual rate of return is $G - 1$.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

Returning to our example, our average factor of growth is

$G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242$.

Thus, our average annual rate of return is $G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%$.
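To reproduce this calculation, here is a minimal sketch in Python.  The function name `geometric_mean_growth` is my own label for illustration.  It applies the geometric mean to the growth factors $1 + r_i$, not to the rates themselves.

```python
from math import prod


def geometric_mean_growth(rates):
    """Average growth factor G for a list of per-period rates of return."""
    factors = [1.0 + r for r in rates]  # convert each rate into a growth factor
    return prod(factors) ** (1.0 / len(factors))


rates = [0.05, 0.10, 0.15]
G = geometric_mean_growth(rates)
average_rate = G - 1.0
arithmetic_mean_rate = sum(rates) / len(rates)

print("Average growth factor G:", round(G, 6))
print("Average annual rate of return:", round(average_rate, 6))
print("Arithmetic mean of the rates:", round(arithmetic_mean_rate, 6))
```

Growing at the factor $G$ for 3 years gives the same final value as growing at 5%, 10% and 15% in turn, which is not the case for the arithmetic mean of 10%.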

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

1. The first set, $S_1$, consists of your data $x_1, x_2, ..., x_n$, and this set has a sample size of $n$.
2. The second set, $S_2$, also has a sample size of $n$, but all $n$ values are the same – let’s call this common value $y$.
• What number must $y$ be such that the sums in $S_1$ and $S_2$ are equal?  This value of $y$ is the arithmetic mean of the first set.
• What number must $y$ be such that the products in $S_1$ and $S_2$ are equal?  This value of $y$ is the geometric mean of the first set.
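These 2 defining properties can be checked numerically.  The sketch below assumes a small hypothetical data set of 3 positive numbers.

```python
from math import prod

data = [2.0, 4.0, 8.0]  # hypothetical positive data
n = len(data)

arithmetic_mean = sum(data) / n
geometric_mean = prod(data) ** (1.0 / n)

# n copies of the arithmetic mean have the same sum as the data.
sum_of_copies = n * arithmetic_mean

# n copies of the geometric mean have the same product as the data.
product_of_copies = geometric_mean ** n

print(sum(data), sum_of_copies)
print(prod(data), product_of_copies)
```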

Note that the geometric mean is only applicable to positive numbers.

## Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at each speed, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the car travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

$H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i)$.

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

1. 60 km/hr for 100 km
2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

$(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]$

$= 46.45 \text{ km/hr}$

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

$\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}$
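Here is a minimal sketch that checks this equivalence for the example above.  The function name `weighted_harmonic_mean` is my own; the weights are the distances travelled at each speed.

```python
def weighted_harmonic_mean(values, weights):
    """Sum of the weights divided by the sum of (weight / value)."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


speeds = [60.0, 40.0]       # km/hr
distances = [100.0, 140.0]  # km, used as the weights

H = weighted_harmonic_mean(speeds, distances)

# Kinematics: average speed = total distance / total time.
total_time = sum(d / s for d, s in zip(distances, speeds))  # hours
average_speed = sum(distances) / total_time

print("Weighted harmonic mean:", round(H, 2), "km/hr")
print("Distance / time:", round(average_speed, 2), "km/hr")
```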

## Vancouver Machine Learning and Data Science Meetup – NLP to Find User Archetypes for Search & Matching

I will attend the following seminar by Thomas Levi in the next R/Machine Learning/Data Science Meetup in Vancouver on Wednesday, June 25.  If you will also attend this event, please come up and say “Hello”!  I would be glad to meet you!

To register, sign up for an account on Meetup, and RSVP in the R Users Group, the Machine Learning group or the Data Science group.

Title: NLP to Find User Archetypes for Search & Matching

Speaker: Thomas Levi, Plenty of Fish

Location: HootSuite, 5 East 8th Avenue, Vancouver, BC

Time and Date: 6-8 pm, Wednesday, June 25, 2014

Abstract

As the world’s largest free dating site, Plenty Of Fish would like to be able to match with and allow users to search for people with similar interests. However, we allow our users to enter their interests as free text on their profiles. This presents a difficult problem in clustering, search and machine learning if we want to move beyond simple ‘exact match’ solutions to a deeper archetypal user profiling and thematic search system. Some of the common issues that arise are misspellings, synonyms (e.g. biking, cycling and bicycling) and similar interests (e.g. snowboarding and skiing) on a several million user scale. In this talk I will demonstrate how we built a system utilizing topic modelling with Latent Dirichlet Allocation (LDA) on a several hundred thousand word vocabulary over ten million+ North American users and explore its applications at POF.

Bio

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His post-doctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC.  In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behaviour analysis, social network analysis, scam detection, Bot detection, matching algorithms, topic modelling and semantic analysis.

Schedule
• 6:00PM Doors are open, feel free to mingle
• 6:30 Presentations start
• 8:00 Off to a nearby watering hole (Mr. Brownstone?) for a pint, food, and/or breakout discussions

## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, ..., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

## Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a $p\text{th}$-order polynomial relationship between the response variable $Y$ and the explanatory variable $x$.

$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon$

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).
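To illustrate, here is a minimal sketch in plain Python that fits a 2nd-order polynomial with ordinary least squares.  The data are hypothetical and noise-free (generated from $y = 1 + 2x + 3x^2$), and the helper names are my own.  The only "polynomial" part is the construction of the design matrix; after that, the coefficients come from solving a linear system (the normal equations), just as in any linear regression.

```python
def design_matrix(xs, degree):
    """Each row is [1, x, x^2, ..., x^degree]; the columns are the predictors."""
    return [[x ** k for k in range(degree + 1)] for x in xs]


def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def fit_polynomial(xs, ys, degree):
    """Least-squares estimates from the normal equations (X'X) beta = X'y."""
    X = design_matrix(xs, degree)
    p = degree + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]  # hypothetical noise-free data
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(b, 6) for b in coefficients])
```

In practice, a QR decomposition is numerically safer than the normal equations; I use them here only because they keep the sketch short.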

## Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, $H$, for $n$ positive real numbers $x_1, x_2, ..., x_n$ is defined as

$H = n \div (1/x_1 + 1/x_2 + ... + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}$.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

1. 60 km/hr for 120 km
2. 40 km/hr for another 120 km

Then its average speed for this trip is

$S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}$

Notice that the speeds for the 2 segments of the trip have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

$S_{avg} = \Delta \text{distance} \div \Delta \text{time}$

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

$\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})$

$\Delta \text{time} = 5 \text{ hours}$

Thus,

$S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}$

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!
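The same check can be done in a few lines of Python.  This is a minimal sketch, with `harmonic_mean` written out by hand so that it mirrors the formula above.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the values."""
    return len(values) / sum(1.0 / x for x in values)


speeds = [60.0, 40.0]  # km/hr
segment_distance = 120.0  # km travelled at each speed

H = harmonic_mean(speeds)

# Kinematics: total distance divided by total time.
total_distance = segment_distance * len(speeds)
total_time = sum(segment_distance / s for s in speeds)
average_speed = total_distance / total_time

arithmetic_mean = sum(speeds) / len(speeds)

print("Harmonic mean:", H)
print("Distance / time:", average_speed)
print("Arithmetic mean:", arithmetic_mean)
```

The arithmetic mean of the 2 speeds is 50 km/hr, which overstates the average speed because the car spends more time at the slower speed.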

## Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

$h(t) = f(t) \div S(t)$.

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.
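As a quick sanity check of $h(t) = f(t) \div S(t)$, here is a minimal sketch that assumes an exponential survival time with a hypothetical rate of 0.5.  For this distribution, $f(t) = \lambda e^{-\lambda t}$ and $S(t) = e^{-\lambda t}$, so the hazard should be the constant $\lambda$ at every time point.  This example is my own illustration and is not taken from the video.

```python
from math import exp

rate = 0.5  # hypothetical rate parameter (lambda) of an exponential distribution


def f(t):
    """Probability density function of the exponential distribution."""
    return rate * exp(-rate * t)


def S(t):
    """Survival function: the probability of surviving beyond time t."""
    return exp(-rate * t)


def h(t):
    """Hazard function, computed as f(t) / S(t)."""
    return f(t) / S(t)


times = [0.1, 1.0, 5.0]
hazards = [h(t) for t in times]
print(hazards)
```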

I am excited to continue adding to my YouTube channel’s collection of video tutorials.  Please stay tuned for more!

You can also watch this new video below the fold!

## Presentation on Statistical Genetics at Vancouver SAS User Group – Wednesday, May 28, 2014

I am excited and delighted to be invited to present at the Vancouver SAS User Group’s next meeting.  I will provide an introduction to statistical genetics; specifically, I will

• define basic terminology in genetics
• explain the Hardy-Weinberg equilibrium in detail
• illustrate how Pearson’s chi-squared goodness-of-fit test can be used in PROC FREQ in SAS to check the Hardy-Weinberg equilibrium
• illustrate how the Newton-Raphson algorithm can be used for maximum likelihood estimation in PROC IML in SAS

You can register for this meeting here.  The meeting’s coordinates are

9:00am – 3:00pm

Wednesday, May 28th, 2014

BC Cancer Agency Research Centre

675 West 10th Avenue.

Vancouver, BC

If you will attend this meeting, please feel free to come up and say “Hello!”.  I look forward to meeting you!

## Machine Learning and Applied Statistics Lesson of the Day – Sensitivity and Specificity

To evaluate the predictive accuracy of a binary classifier, two useful (but imperfect) criteria are sensitivity and specificity.

Sensitivity is the proportion of truly positive cases that were classified as positive; thus, it is a measure of how well your classifier identifies positive cases.  It is also known as the true positive rate.  Formally,

$\text{Sensitivity} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives + Number of False Negatives)}$

Specificity is the proportion of truly negative cases that were classified as negative; thus, it is a measure of how well your classifier identifies negative cases.  It is also known as the true negative rate.  Formally,

$\text{Specificity} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives + Number of False Positives)}$
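Here is a minimal sketch of both formulas, assuming hypothetical counts from a confusion matrix (100 truly positive cases and 100 truly negative cases).

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """True negative rate: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts for illustration only.
tp, fn = 80, 20  # the 100 truly positive cases
tn, fp = 90, 10  # the 100 truly negative cases

print("Sensitivity:", sensitivity(tp, fn))
print("Specificity:", specificity(tn, fp))
```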

## Applied Statistics Lesson and Humour of the Day – Type 1 Error (False Positive) and Type 2 Error (False Negative)

In hypothesis testing,

• a Type 1 error is the rejection of the null hypothesis when it is actually true
• a Type 2 error is the acceptance of the null hypothesis when it is actually false.  (Some statisticians prefer to say “failure to reject” rather than “accept” the null hypothesis for Type 2 errors.)

A Type 1 error is also known as a false positive, and a Type 2 error is also known as a false negative.  This nomenclature comes from the conventional connotation of

• the null hypothesis as the “negative” or the “boring” result
• the alternative hypothesis as the “positive” or “exciting” result.

A great way to illustrate the meaning and the intuition of Type 1 errors and Type 2 errors is the following cartoon.

Source of Image: Effect Size FAQs by Paul Ellis

In this case, the null hypothesis (or the “boring” result) is “You’re not pregnant”, and the alternative hypothesis (or the “exciting” result) is “You’re pregnant!”.

This is the most effective way to explain Type 1 error and Type 2 error that I have encountered!

## Statistics Lesson and Warning of the Day – Confusion Between the Median and the Average

Yesterday, I attended an interesting seminar called “Transforming Healthcare through Big Data” at the Providence Health Care Research Institute’s 2014 Research Day.  The seminar was delivered by Martin Kohn from Jointly Health, and I enjoyed it overall.  However, I noticed a glaring error about basic statistics that needs correction.

Martin wanted to highlight the overconfidence that many doctors have about their abilities, and he quoted Vinod Khosla, the co-founder of Sun Microsystems, who said, “50% of doctors are below average.”  Martin then presented a study showing an absurdly high percentage of doctors who think that they are “above average”.  A Twitter conversation between attendees of a TED conference in San Francisco and Vinod himself confirms this quotation.

The statement “50% of doctors are below average” is wrong in general.  By definition, 50% of any population is below the median, and the median is guaranteed to equal the average (i.e. the mean) only if the population’s distribution is symmetric.  (Examples of symmetric probability distributions are the normal distribution and the Student’s t-distribution.)  Vinod meant to say that “50% of doctors are below the median”, and he confirmed this in the aforementioned Twitter conversation; I am disappointed that he justified this mistake by claiming that “median” would be less understood.  I think that a TED audience would know what “median” means, and those who don’t can easily search for its meaning online or in books on their own.
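A small numerical example makes the difference clear.  The sketch below assumes a hypothetical, right-skewed set of 6 scores (made up for illustration); half of the values are below the median, but far more than half are below the mean.

```python
from statistics import mean, median

# Hypothetical right-skewed data: one very large value pulls the mean upward.
scores = [1, 2, 3, 4, 5, 105]

average = mean(scores)
middle = median(scores)

below_mean = sum(1 for s in scores if s < average) / len(scores)
below_median = sum(1 for s in scores if s < middle) / len(scores)

print("Mean:", average, "- proportion below the mean:", round(below_mean, 3))
print("Median:", middle, "- proportion below the median:", below_median)
```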

In communicating truth, let’s use the correct vocabulary.

## Applied Statistics Lesson of the Day – Notation for Fractional Factorial Designs

Fractional factorial designs use the $L^{F-p}$ notation; unfortunately, this notation is not clearly explained in most textbooks or web sites about experimental design.  I hope that my explanation below is useful.

• $L$ is the number of levels in each factor; note that the $L^{F-p}$ notation assumes that all factors have the same number of levels.
  • If a factor has 2 levels, then the levels are usually coded as $+1$ and $-1$.
  • If a factor has 3 levels, then the levels are usually coded as $+1$, $0$, and $-1$.
• $F$ is the number of factors in the experiment.
• $p$ is the number of times that the full factorial design is fractionated by $L$.  This number is badly explained by most textbooks and web sites that I have seen, because they simply say that $p$ is the fraction – this is confusing, because a fraction has a numerator and a denominator, and $p$ is just 1 number.  To clarify,
  • the fraction is $L^{-p}$
  • the number of treatments in the fractional factorial design is $L^{-p}$ multiplied by the total possible number of treatments in the full factorial design, which is $L^F$.

If all $L^F$ possible treatments are used in the experiment, then a full factorial design is used.  If a fractional factorial design is used instead, then $L^{-p}$ denotes the fraction of the $L^F$ treatments that is used.

Most factorial experiments use binary factors (i.e. factors with 2 levels, $L = 2$).  Thus,

• if $p = 1$, then the fraction of treatments that is used is $2^{-1} = 1/2$.
• if $p = 2$, then the fraction of treatments that is used is $2^{-2} = 1/4$.

This is why

• a $2^{F-1}$ design is often called a half-fraction design.
• a $2^{F-2}$ design is often called a quarter-fraction design.

However, most sources that I have read do not bother to mention that $L$ can be greater than 2; experiments with 3-level factors are less frequent but still common.  Thus, the terms half-fraction design and quarter-fraction design only apply to binary factors.  If $L = 3$, then

• a $3^{F-1}$ design uses one-third of all possible treatments.
• a $3^{F-2}$ design uses one-ninth of all possible treatments.
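The arithmetic behind this notation can be sketched in a few lines.  The function names below are my own, and the numbers of factors in the examples are hypothetical.

```python
from fractions import Fraction


def fraction_used(levels, p):
    """The fraction L^(-p) of the full factorial design that is used."""
    return Fraction(1, levels ** p)


def treatments_in_design(levels, factors, p):
    """Number of treatments in an L^(F - p) fractional factorial design."""
    return levels ** (factors - p)


# A 2^(5-1) design: half of the 2^5 = 32 possible treatments.
print(fraction_used(2, 1), treatments_in_design(2, 5, 1))

# A 3^(4-2) design: one-ninth of the 3^4 = 81 possible treatments.
print(fraction_used(3, 2), treatments_in_design(3, 4, 2))
```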

## Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

• $n$ random variables, $X_1, ..., X_n$, are independent and identically distributed,
• $n$ is sufficiently large,

then the distribution of their sample mean, $\bar{X}_n$, is approximately normal, and this approximation improves as $n$ increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of $X_1, ..., X_n$ with a finite variance.  On my new YouTube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.
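For readers who want to experiment, here is a minimal sketch that enumerates all 36 equally likely outcomes of rolling 2 fair dice and tabulates the exact distribution of their mean.  This is my own illustration of the thought experiment and may differ from the presentation in the video.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair 6-sided die

# Count how many of the 36 outcomes give each possible sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))

# Exact distribution of the mean of the 2 dice.
pmf_of_mean = {total / 2: Fraction(count, 36) for total, count in sorted(counts.items())}

for value, probability in pmf_of_mean.items():
    print(value, probability, "#" * counts[int(value * 2)])
```

A single die has a flat distribution, but the mean of 2 dice already peaks in the middle at 3.5, because more combinations of faces produce middling means than extreme ones.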

You can also watch the video below the fold!

## Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

## Machine Learning Lesson of the Day – Linear Gaussian Basis Function Models

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target.  A common type of basis function for such models is the Gaussian basis function.  This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as the basis function.

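$\phi_j(x) = \exp[-(x - \mu_j)^2 \div (2\sigma^2)]$

The $\mu_j$ in this basis function determines its location, and the $\sigma$ determines its width (i.e. its spatial scale).  The spacing between the different basis functions that combine to form the model is set by the spacing between the $\mu_j$’s.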

Notice that this is just the normal PDF without the scaling factor of $1/\sqrt{2\pi \sigma^2}$; the scaling factor ensures that the normal PDF integrates to 1 over its support set.  In a linear basis function model, the regression coefficients are the weights for the basis functions, and these weights will scale Gaussian basis functions to fit the data that are local to $\mu_j$.  Thus, there is no need to include that scaling factor of $1/\sqrt{2\pi \sigma^2}$, because the scaling is already being handled by the regression coefficients.

The Gaussian basis function model is useful because

• it can model many non-linear relationships between the predictor and the target surprisingly well,
• each basis function is appreciably non-zero only over a small interval around $\mu_j$ and is negligibly close to zero everywhere else.  These local basis functions result in a design matrix whose entries are mostly near zero; it can be treated as sparse (i.e. one with mostly zeros), which leads to much faster computation.

## Applied Statistics Lesson of the Day – Fractional Factorial Design and the Sparsity-of-Effects Principle

Consider again an experiment that seeks to determine the causal relationships between $G$ factors and the response, where $G > 1$.  Ideally, the sample size is large enough for a full factorial design to be used.  However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead.  Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.

What, then, are the significant causal relationships?  According to the sparsity-of-effects principle, complex, higher-order effects are unlikely to exist, and the most important effects are the lower-order effects.  Thus, assign the experimental units so that the main (1st-order) effects and the 2nd-order interaction effects can be investigated.  This may neglect the discovery of a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the sample size available is low and the number of possible treatments is high.

## Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.

## Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, ..., X_n$ is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.
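To make the distinction concrete, here is a minimal sketch that assumes a hypothetical right-skewed parent distribution on the values 0, 1, 2 and 3 (my own choice for illustration).  It computes the exact distribution of the sum of $n$ independent draws by repeated convolution.  The sample mean is just this sum divided by $n$, so it has the same skewness.  A normal distribution has a skewness of 0.

```python
from math import sqrt


def convolve(p, q):
    """Distribution of the sum of 2 independent variables with pmfs p and q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def sum_pmf(pmf, n):
    """Exact pmf of the sum of n independent draws from pmf."""
    total = pmf
    for _ in range(n - 1):
        total = convolve(total, pmf)
    return total


def skewness(pmf):
    """Standardized 3rd central moment of a pmf on the values 0, 1, 2, ..."""
    mean = sum(k * p for k, p in enumerate(pmf))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    third = sum((k - mean) ** 3 * p for k, p in enumerate(pmf))
    return third / variance ** 1.5


parent = [0.6, 0.25, 0.1, 0.05]  # hypothetical skewed parent distribution

parent_skewness = skewness(parent)
skewness_of_mean = {n: skewness(sum_pmf(parent, n)) for n in (1, 5, 30)}

print("Skewness of the data's distribution:", round(parent_skewness, 4))
for n, value in skewness_of_mean.items():
    print("Skewness of the sample mean with n =", n, ":", round(value, 4))
```

The skewness of the parent distribution does not depend on the sample size, whereas the skewness of the sample mean shrinks in proportion to $1/\sqrt{n}$.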