Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

• $n$ random variables, $X_1, ..., X_n$, are independent and identically distributed,
• $n$ is sufficiently large,

then the distribution of their sample mean, $\bar{X}_n$, is approximately normal, and this approximation improves as $n$ increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of $X_1, ..., X_n$ that has a finite variance.  In my new YouTube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.

You can also watch the video below the fold!
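
As a rough companion to the thought experiment, here is a minimal Python sketch (my own illustration, not taken from the video) that enumerates every outcome of rolling fair dice and tabulates the exact distribution of their sample mean.  The function name `mean_distribution` is hypothetical.  With 1 die the distribution is flat; with 2 dice it already peaks in the middle.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def mean_distribution(n_dice, sides=6):
    """Exact distribution of the sample mean of n_dice fair dice."""
    rolls = product(range(1, sides + 1), repeat=n_dice)
    counts = Counter(sum(roll) for roll in rolls)
    total = sides ** n_dice
    return {Fraction(s, n_dice): Fraction(c, total)
            for s, c in sorted(counts.items())}


one_die = mean_distribution(1)
two_dice = mean_distribution(2)

# Text histogram for 2 dice: one '#' per outcome (out of 36) giving that mean.
for mean, prob in two_dice.items():
    print(f"{float(mean):4.1f}  {'#' * int(prob * 36)}")
```

Only 1 of the 36 outcomes gives a mean of 1, but 6 of them give a mean of 3.5, which is why the middle values dominate.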

Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

Machine Learning Lesson of the Day – Linear Gaussian Basis Function Models

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target.  A common type of basis function for such models is the Gaussian basis function.  This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as the basis function.

$\phi_j(x) = \exp\left[-\frac{(x - \mu_j)^2}{2\sigma^2}\right]$

The $\mu_j$ in this basis function sets the location of the $j$th basis function, and the $\sigma$ determines its width, and therefore how much neighbouring basis functions overlap.

Notice that this is just the normal PDF without the scaling factor of $1/\sqrt{2\pi \sigma^2}$; the scaling factor ensures that the normal PDF integrates to 1 over its support set.  In a linear basis function model, the regression coefficients are the weights for the basis functions, and these weights will scale the normal PDF to fit the data that are local to $\mu_j$.  Thus, there is no need to include that scaling factor of $1/\sqrt{2\pi \sigma^2}$, because the scaling is already being handled by the regression coefficients.

The Gaussian basis function model is useful because

• it can model many non-linear relationships between the predictor and the target surprisingly well,
• each basis function is appreciably non-zero only over a small interval around $\mu_j$ and is effectively zero everywhere else.  These local basis functions result in a design matrix that is nearly sparse (i.e. one whose entries are mostly negligible), which can make the computation much faster.
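
To make the locality concrete, here is a small Python sketch that builds the design matrix for a Gaussian basis function model.  The centres, the value of $\sigma$ and the function names are assumptions chosen purely for illustration; the first column is the intercept.

```python
import math


def gaussian_basis(x, mu, sigma):
    """Unnormalized Gaussian kernel centred at mu with width sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def design_matrix(xs, centres, sigma):
    """One row per observation: an intercept plus one column per centre."""
    return [[1.0] + [gaussian_basis(x, mu, sigma) for mu in centres]
            for x in xs]


xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical predictor values
centres = [0.0, 0.5, 1.0]          # hypothetical mu_j
sigma = 0.1                        # hypothetical width

Phi = design_matrix(xs, centres, sigma)
for row in Phi:
    print(["%.4f" % value for value in row])
```

Each row has a sizeable entry only for centres near that observation; entries for distant centres are negligibly small.  The regression coefficients fitted to these columns then play the role of the omitted scaling factor.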

Applied Statistics Lesson of the Day – Fractional Factorial Design and the Sparsity-of-Effects Principle

Consider again an experiment that seeks to determine the causal relationships between $G$ factors and the response, where $G > 1$.  Ideally, the sample size is large enough for a full factorial design to be used.  However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead.  Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.

What, then, are the significant causal relationships?  According to the sparsity-of-effects principle, it is unlikely that complex, higher-order effects exist, and the most important effects are the lower-order effects.  Thus, assign the experimental units so that main (1st-order) effects and the 2nd-order interaction effects can be investigated.  This may neglect the discovery of a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the sample size available is low and the number of possible treatments is high.
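
As one common illustration (an assumption on my part, not a design taken from this lesson), consider 4 factors with 2 levels each, coded as -1 and +1.  The full factorial design has 16 treatments; a half fraction keeps only the 8 treatments that satisfy the defining relation D = ABC.  The sketch below constructs that fraction.

```python
from itertools import product

# Full 2^4 factorial: every combination of the coded levels -1 and +1.
full = list(product([-1, 1], repeat=4))

# Half fraction: keep the runs where D equals the product A * B * C.
half = [(a, b, c, d) for a, b, c, d in full if d == a * b * c]

print(len(full), len(half))
for run in half:
    print(run)
```

In this fraction, each main effect is confounded only with a 3-factor interaction (for example, A with BCD), which the sparsity-of-effects principle treats as unlikely to matter.  The 2-factor interactions are confounded with each other in pairs, which is part of the price of running half as many treatments.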

Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.
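
Here is a minimal sketch of that validation idea, using k-nearest-neighbour regression on simulated data.  The data-generating function, the noise level and the choices of k are all assumptions made for illustration.  With k = 1, the model memorizes the training set; with k equal to the training set size, it predicts the overall mean everywhere.

```python
import math
import random

random.seed(1)


def make_data(n):
    """Simulate a sine trend plus Gaussian noise (hypothetical example)."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model on the given data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


train, valid = make_data(60), make_data(60)
for k in (1, 7, 60):
    print(k, round(mse(train, train, k), 3), round(mse(train, valid, k), 3))
```

The k = 1 model has a training error of exactly 0 because each training point is its own nearest neighbour, so only the validation column gives an honest picture of how each model would predict new data.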

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, ..., X_n$ is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.
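
A quick simulation can make the distinction visible.  The sketch below draws data from an exponential distribution (my choice of a skewed parent distribution) and compares the skewness of the raw data with the skewness of the sample means of groups of size 30.  The sample sizes are assumptions for illustration.

```python
import random
import statistics

random.seed(42)


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


n, reps = 30, 5000
raw = [random.expovariate(1.0) for _ in range(n * reps)]
means = [statistics.fmean(raw[i * n:(i + 1) * n]) for i in range(reps)]

print("skewness of the data:        ", round(skewness(raw), 2))
print("skewness of the sample means:", round(skewness(means), 2))
```

However many observations are drawn, the data stay strongly right-skewed; it is only the sample means whose distribution becomes closer to symmetric.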

Video Tutorial – The Hazard Function is the Probability Density Function Divided by the Survival Function

In an earlier video, I introduced the definition of the hazard function and broke it down into its mathematical components.  Recall that the definition of the hazard function for events defined on a continuous time scale is

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$.

Did you know that the hazard function can be expressed as the probability density function (PDF) divided by the survival function?

$h(t) = f(t) \div S(t)$

In my new YouTube video, I prove how this relationship can be obtained from the definition of the hazard function!  I am very excited to post this second video in my new YouTube channel.  You can also view the video below the fold!
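
For readers who prefer to see it written out, here is a sketch of one standard route to this identity; the steps in the video may differ.  It assumes that $X$ is continuous with cumulative distribution function $F$, PDF $f = F'$, and survival function $S(t) = P(X > t) = 1 - F(t) > 0$.  Since the event $t < X \leq t + \Delta t$ already implies $X > t$,

$P(t < X \leq t + \Delta t \ | \ X > t) = \frac{P(t < X \leq t + \Delta t)}{P(X > t)} = \frac{F(t + \Delta t) - F(t)}{S(t)}$

Dividing by $\Delta t$ and taking the limit gives

$h(t) = \frac{1}{S(t)} \lim_{\Delta t \rightarrow 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} = \frac{f(t)}{S(t)}$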

Applied Statistics Lesson of the Day – The Independent 2-Sample t-Test with Unequal Variances (Welch’s t-Test)

A common problem in statistics is determining whether or not the means of 2 populations are equal.  The independent 2-sample t-test is a popular parametric method to answer this question.  (In an earlier Statistics Lesson of the Day, I discussed how data collected from a completely randomized design with 1 binary factor can be analyzed by an independent 2-sample t-test.  I also discussed its possible use in the discovery of argon.)  I have learned 2 versions of the independent 2-sample t-test, and they differ on the variances of the 2 samples.  The 2 possibilities are

• equal variances
• unequal variances

Most statistics textbooks that I have read elaborate at length about the independent 2-sample t-test with equal variances (also called Student’s t-test).  However, the assumption of equal variances needs to be checked (for example, with an F-test or Levene’s test) before proceeding with the Student’s t-test, yet this check does not seem to be universally done in practice.  Furthermore, conducting one test based on the results of another can inflate the Type 1 error rate (Ruxton, 2006).

Some books give due attention to the independent 2-sample t-test with unequal variances (also called Welch’s t-test), but some barely mention its value, and others do not even mention it at all.  I find this to be puzzling, because the assumption of equal variances is often violated in practice, and Welch’s t-test provides an easy solution to this problem.  There is a seemingly intimidating but straightforward calculation to approximate the number of degrees of freedom for Welch’s t-test, and this calculation is automatically incorporated in most software, including R and SAS.  Finally, Welch’s t-test removes the need to check for equal variances, and it is almost as powerful as Student’s t-test when the variances are equal (Ruxton, 2006).

For all of these reasons, I recommend Welch’s t-test when using the parametric approach to comparing the means of 2 populations.
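
The calculation is less intimidating once it is written out.  Below is a minimal Python sketch of the Welch test statistic and the Welch–Satterthwaite approximation to the degrees of freedom; the function name `welch_t` and the data are made up for illustration.  It stops short of the p-value, which requires the t distribution; in practice, software handles the whole procedure.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Estimated variance of each sample mean: s^2 / n
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


group_1 = [1, 2, 3, 4, 5]        # hypothetical data
group_2 = [2, 3, 4, 5, 6]        # same spread, shifted mean
group_3 = [10, 20, 30]           # much larger spread, smaller sample

print(welch_t(group_1, group_2))
print(welch_t(group_1, group_3))
```

When the 2 samples have equal sizes and equal sample variances, as `group_1` and `group_2` do, the approximate degrees of freedom reduce to the familiar $n_1 + n_2 - 2$ from Student’s t-test.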

Reference

Graeme D. Ruxton.  “The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test”.  Behavioral Ecology (July/August 2006) 17 (4): 688-690.  First published online May 17, 2006.

Applied Statistics Lesson of the Day – Additive Models vs. Interaction Models in 2-Factor Experimental Designs

In a recent “Machine Learning Lesson of the Day”, I discussed the difference between a supervised learning model in machine learning and a regression model in statistics.  In that lesson, I mentioned that a statistical regression model usually consists of a systematic component and a random component.  Today’s lesson strictly concerns the systematic component.

An additive model is a statistical regression model in which the systematic component is the arithmetic sum of the individual effects of the predictors.  Consider the simple case of an experiment with 2 factors.  If $Y$ is the response and $X_1$ and $X_2$ are the 2 predictors, then an additive linear model for the relationship between the response and the predictors is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

In other words, the effect of $X_1$ on $Y$ does not depend on the value of $X_2$, and the effect of $X_2$ on $Y$ does not depend on the value of $X_1$.

In contrast, an interaction model is a statistical regression model in which the systematic component is not the arithmetic sum of the individual effects of the predictors.  In other words, the effect of $X_1$ on $Y$ depends on the value of $X_2$, or the effect of $X_2$ on $Y$ depends on the value of $X_1$.  Thus, such a regression model would have 3 effects on the response:

1. $X_1$
2. $X_2$
3. the interaction effect of $X_1$ and $X_2$

For example, a full factorial design with 2 factors uses the 2-factor ANOVA model and assumes a linear relationship between the response and the above 3 effects.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$

Note that additive models and interaction models are not confined to experimental design; I have merely used experimental design to provide examples for these 2 types of models.
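
To see the difference numerically, here is a small Python sketch of the systematic components of the 2 models.  The coefficient values are hypothetical, and the helper `effect_of_x1` simply measures how the systematic component changes when $X_1$ goes from 0 to 1 while $X_2$ is held fixed.

```python
def additive(x1, x2, b0=1.0, b1=2.0, b2=3.0):
    """Systematic component of the additive model."""
    return b0 + b1 * x1 + b2 * x2


def interaction(x1, x2, b0=1.0, b1=2.0, b2=3.0, b3=4.0):
    """Systematic component of the interaction model."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def effect_of_x1(model, x2):
    """Change in the systematic component when x1 goes from 0 to 1."""
    return model(1, x2) - model(0, x2)


for x2 in (0, 1):
    print(x2, effect_of_x1(additive, x2), effect_of_x1(interaction, x2))
```

In the additive model, the effect of $X_1$ equals $\beta_1$ at every value of $X_2$.  In the interaction model, it equals $\beta_1 + \beta_3 X_2$, so it changes with $X_2$.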

Applied Statistics Lesson of the Day – The Full Factorial Design

An experimenter may seek to determine the causal relationships between $G$ factors and the response, where $G > 1$.  On first instinct, you may be tempted to conduct $G$ separate experiments, each using the completely randomized design with 1 factor.  Often, however, it is possible to conduct 1 experiment with $G$ factors at the same time.  This is better than the first approach because

• it is faster
• it uses fewer resources to answer the same questions
• the interactions between the $G$ factors can be examined

Such an experiment requires the full factorial design.  After controlling for confounding variables and choosing the appropriate range and number of levels of each factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels.  Such an experiment would result in $2 \times 2 = 4$ treatments, each being a combination of 1 level from the first factor and 1 level from the second factor.  Since this is a full factorial design, experimental units are independently assigned to all treatments.  The 2-factor ANOVA model is commonly used to analyze data from such designs.
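
Enumerating the treatments of a full factorial design is a one-liner in most languages.  Here is a Python sketch; the factor names and levels are hypothetical.

```python
from itertools import product

# Hypothetical factors, each with 2 levels.
factors = {
    "temperature": ["low", "high"],
    "pressure": ["low", "high"],
}

# Every treatment is one level from each factor.
treatments = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for treatment in treatments:
    print(treatment)
```

Adding a third factor with 3 levels to the dictionary would yield $2 \times 2 \times 3 = 12$ treatments, which shows how quickly the number of treatments grows.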

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.

Machine Learning Lesson of the Day – Overfitting

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set.  That underlying trend cannot be precisely captured because of the random variation in the data around that trend.  A model must have enough complexity to capture that trend, but not too much complexity to capture the random variation.  An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

• A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target.  If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation underlying the trend in the target has been captured already.  Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise.  Thus, the new model with these added irrelevant predictors describes the trend and the noise.  It predicts the targets in the training set extremely well, but very poorly for targets in any new, fresh data set – the model captures the noise that is unique to the training set.

(This above explanation used a parametric model for illustration, but overfitting can also occur for non-parametric models.)

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the training data would result in vastly different predictions for the targets.

Overfitting differs from multicollinearity, which I will explain in a later post.  Overfitting has irrelevant predictors, whereas multicollinearity has redundant predictors.

Applied Statistics Lesson of the Day – Blocking and the Randomized Complete Blocked Design (RCBD)

A completely randomized design works well for a homogeneous population - one that does not have major differences between any sub-populations.  However, what if a population is heterogeneous?

Consider an example that commonly occurs in medical studies.  An experiment seeks to determine the effectiveness of a drug on curing a disease, and 100 patients are recruited for this double-blinded study – 50 are men, and 50 are women.  An abundance of biological knowledge tells us that men and women have significantly different physiologies, and this is a heterogeneous population with respect to gender.  If a completely randomized design is used for this study, gender could be a confounding variable; this is especially true if the experimental group has a much higher proportion of one gender, and the control group has a much higher proportion of the other gender.  (For instance, purely due to the randomness, 45 males may be assigned to the experimental group, and 45 females may be assigned to the control group.)  If a statistically significant difference in the patients’ survival from the disease is observed between such a pair of experimental and control groups, this effect could be attributed to the drug or to gender, and that would ruin the goal of determining the cause-and-effect relationship between the drug and survival from the disease.

To overcome this heterogeneity and control for the effect of gender, a randomized blocked design could be used.  Blocking is the division of the experimental units into homogeneous sub-populations before assigning treatments to them.  A randomized blocked design for our above example would divide the males and females into 2 separate sub-populations, and then each of these 2 groups is split into the experimental and control group.  Thus, the experiment actually has 4 groups:

1. 25 men take the drug (experimental)
2. 25 men take a placebo (control)
3. 25 women take the drug (experimental)
4. 25 women take a placebo (control)

Essentially, the population is divided into blocks of homogeneous sub-populations, and a completely randomized design is applied to each block.  This minimizes the effect of gender on the response and increases the precision of the estimate of the effect of the drug.
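
The assignment procedure can be sketched in a few lines of Python.  The patient identifiers, the function name and the seed below are hypothetical; the idea is simply to shuffle the units within each block and then alternate the treatments, so that every block is split evenly.

```python
import random
from collections import Counter


def randomized_block_assignment(units_by_block, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = units[:]          # copy, so the input is not modified
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment


patients = {
    "men": [f"M{i}" for i in range(50)],
    "women": [f"W{i}" for i in range(50)],
}
assignment = randomized_block_assignment(patients, ["drug", "placebo"], seed=0)
counts = Counter(assignment.values())
print(counts)
```

Whatever the seed, each of the 4 groups ends up with exactly 25 patients, so gender cannot be unbalanced between the drug and placebo groups.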

Statistics and Chemistry Lesson of the Day – Illustrating Basic Concepts in Experimental Design with the Synthesis of Ammonia

To summarize what we have learned about experimental design in the past few Applied Statistics Lessons of the Day, let’s use an example from physical chemistry to illustrate these basic principles.

Ammonia (NH$_3$) is widely used as a fertilizer in industry.  It is commonly synthesized by the Haber process, which involves a reaction between hydrogen gas and nitrogen gas.

$\mathrm{N_2 + 3 H_2 \rightarrow 2 NH_3}$   ($\Delta H = -92.4 \ \mathrm{kJ \cdot mol^{-1}}$)

Recall that ΔH is the change in enthalpy.  Under constant pressure (which is the case for most chemical reactions), ΔH is the heat absorbed or released by the system.

Applied Statistics Lesson of the Day – The Completely Randomized Design with 1 Factor

The simplest experimental design is the completely randomized design with 1 factor.  In this design, each experimental unit is randomly assigned to one of the factor levels.  This design is most useful for a homogeneous population (one that does not have major differences between any sub-populations).  It is appealing because of its simplicity and flexibility – it can be used for a factor with any number of levels, and different treatments can have different sample sizes.  After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.  The means of the response variable in the different groups are compared; if there are significant differences, then there is evidence to suggest that the factor and the response have a causal relationship.  The single-factor analysis of variance (ANOVA) model is most commonly used to analyze the data in such an experiment, but it does assume that the data in each group have a normal distribution, and that all groups have equal variance.  The Kruskal-Wallis test is a non-parametric alternative to ANOVA in analyzing data from single-factor completely randomized experiments.

If the factor has 2 levels, you may think that an independent 2-sample t-test with equal variance can also be used to analyze the data.  This is true, but the square of the t-test statistic in this case is just the F-test statistic in a single-factor ANOVA with 2 groups.  Thus, the results of these 2 tests are the same.  ANOVA generalizes the independent 2-sample t-test with equal variance to more than 2 groups.
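
This equivalence is easy to check numerically.  The Python sketch below computes the pooled-variance t statistic and the single-factor ANOVA F statistic for 2 small groups of made-up data; the function names are my own.

```python
import statistics


def pooled_t(a, b):
    """Student's t statistic for 2 independent samples, equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def anova_f(groups):
    """F statistic for a single-factor ANOVA."""
    values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


a = [5.1, 4.8, 6.0, 5.5, 5.3]   # hypothetical responses, level 1
b = [6.2, 6.8, 5.9, 7.1]        # hypothetical responses, level 2

t = pooled_t(a, b)
f = anova_f([a, b])
print(t ** 2, f)
```

The 2 printed numbers agree up to floating-point rounding, and the groups do not need to have the same sample size for this to hold.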

Some textbooks state that “random assignment” means random assignment of experimental units to treatments, whereas other textbooks state that it means random assignment of treatments to experimental units.  I don’t think that there is any difference between these 2 definitions, but I welcome your thoughts in the comments.

Applied Statistics Lesson of the Day – Positive Control in Experimental Design

In my recent lesson on controlling for confounders in experimental design, the control group was described as one that received a neutral or standard treatment, and the standard treatment may simply be nothing.  This is a negative control group.  Not all experiments require a negative control group; some experiments instead have a positive control group.

A positive control group is a group of experimental units that receive a treatment that is known to cause an effect on the response.  Such a causal relationship would have been previously established, and its inclusion in the experiment allows a new treatment to be compared to this existing treatment.  Again, both the positive control group and the experimental group experience the same experimental procedures and conditions except for the treatment.  The existing treatment with the known effect on the response is applied to the positive control group, and the new treatment with the unknown effect on the response is applied to the experimental group.  If the new treatment has a causal relationship with the response, both the positive control group and the experimental group should have the same responses.  (This assumes, of course, that the response can only be changed in 1 direction.  If the response can increase or decrease in value (or, more generally, change in more than 1 way), then it is possible for the positive control group and the experimental group to have different responses.)

In short, in an experiment with a positive control group, an existing treatment is known to “work”, and the new treatment is being tested to see if it can “work” just as well or even better.  Experiments to test for the effectiveness of new medical therapies or a disease detector often have positive controls; there are existing therapies or detectors that work well, and the new therapy or detector is being evaluated for its effectiveness.

Experiments with positive controls are useful for ensuring that the experimental procedures and conditions proceed as planned.  If the positive control does not show the expected response, then something is wrong with the experimental procedures or conditions, and any “good” result from the new treatment should be considered with skepticism.

Machine Learning Lesson of the Day – Supervised Learning: Classification and Regression

Supervised learning has 2 categories:

• In classification, the target variable is categorical.
• In regression, the target variable is continuous.

Thus, regression in statistics is different from regression in supervised learning.

In statistics,

• regression is used to model relationships between predictors and targets, and the targets could be continuous or categorical.
• a regression model usually includes 2 components to describe such relationships:
    • a systematic component
    • a random component.  The random component of this relationship is mathematically described by some probability distribution.
• most regression models in statistics also have assumptions about the dependence between the predictors and/or between the observations.
• many statistical models also aim to provide interpretable relationships between the predictors and targets.
    • For example, in simple linear regression, the slope parameter, $\beta_1$, gives the expected change in the target, $Y$, for every unit increase in the predictor, $X$.

In supervised learning,

• target variables in regression must be continuous
• categorical target variables are modelled in classification
• regression has less or even no emphasis on using probability to describe the random variation between the predictor and the target
    • Random forests are powerful tools for both classification and regression, but they do not use probability to describe the relationship between the predictors and the target.
• regression has less or even no emphasis on providing interpretable relationships between the predictors and targets.
    • Neural networks are powerful tools for both classification and regression, but they do not provide interpretable relationships between the predictors and the target.

***The last 2 points are applicable to classification, too.

In general, supervised learning puts much more emphasis on accurate prediction than statistics does.

Since regression in supervised learning includes only continuous targets, this results in some confusing terminology between the 2 fields.  For example, logistic regression is a commonly used technique in both statistics and supervised learning.  However, despite its name, it is a classification technique in supervised learning, because the response variable in logistic regression is categorical.

Applied Statistics Lesson of the Day – Basic Terminology in Experimental Design #1

Experiment: A procedure to determine the causal relationship between 2 variables – an explanatory variable and a response variable.  The value of the explanatory variable is changed, and the value of the response variable is observed for each value of the explanatory variable.

• An experiment can have 2 or more explanatory variables and 2 or more response variables.
• In my experience, I find that most experiments have 1 response variable, but many experiments have 2 or more explanatory variables.  The interactions between the multiple explanatory variables are often of interest.
• All other variables are held constant in this process to avoid confounding.

Explanatory Variable or Factor: The variable whose values are set by the experimenter.  This variable is the cause in the hypothesis.  (*Many people call this the independent variable.  I discourage this usage, because “independent” means something very different in statistics.)

Response Variable: The variable whose values are observed by the experimenter as the explanatory variable’s value is changed.  This variable is the effect in the hypothesis.  (*Many people call this the dependent variable.  Further to my previous point about “independent variables”, dependence means something very different in statistics, and I discourage this usage.)

Factor Level: Each possible value of the factor (explanatory variable).  A factor must have at least 2 levels.

Treatment: Each possible combination of factor levels.

• If the experiment has only 1 explanatory variable, then each treatment is simply each factor level.
• If the experiment has 2 explanatory variables, X and Y, then each treatment is a combination of 1 factor level from X and 1 factor level from Y.  Such combining of factor levels generalizes to experiments with more than 2 explanatory variables.

Experimental Unit: The object to which a treatment is applied.  This can be anything – person, group of people, animal, plant, chemical, guitar, baseball, etc.

Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

Introduction

Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  Fully commented and readily usable R code will be provided at the end.

Given a probability density function (PDF) and its support set as vectors in an array programming language like R, how do you integrate the PDF over its support set to check that the integral equals 1?  Read the rest of this post to view my own R function to implement trapezoidal integration and learn how to use it to numerically approximate integrals.
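
As a rough preview of the idea (in Python rather than R, and not the R function from the full post), the sketch below applies the trapezoidal rule to the Beta(2, 5) PDF, which is $f(x) = 30x(1 - x)^4$ on $[0, 1]$.  The grid of 1001 equally spaced points is an assumption for illustration.

```python
def trapezoid(xs, ys):
    """Trapezoidal rule: sum the areas of the trapezoids between grid points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def beta_2_5_pdf(x):
    """PDF of the Beta(2, 5) distribution; 30 = 1 / B(2, 5)."""
    return 30 * x * (1 - x) ** 4


n = 1000
xs = [i / n for i in range(n + 1)]       # support set: grid on [0, 1]
ys = [beta_2_5_pdf(x) for x in xs]       # PDF evaluated on the grid

area = trapezoid(xs, ys)
print(area)
```

The result is very close to 1; a finer grid reduces the small discretization error further.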

Detecting Unfair Dice in Casinos with Bayes’ Theorem

Introduction

I saw an interesting problem that requires Bayes’ Theorem and some simple R programming while reading a bioinformatics textbook.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively.

The Problem

The following question is a slightly modified version of Exercise #1.2 on Page 8 in “Biological Sequence Analysis” by Durbin, Eddy, Krogh and Mitchison.

“An occasionally dishonest casino uses 2 types of dice.  Of its dice, 97% are fair but 3% are unfair, and a “five” comes up 35% of the time for these unfair dice.  If you pick a die randomly and roll it, how many “fives” in a row would you need to see before it was most likely that you had picked an unfair die?”

Read more to learn how to create the following plot and how it invokes Bayes’ Theorem to solve the above problem!
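
For readers who want to check the arithmetic first, here is a short Python sketch of the Bayes’ Theorem calculation.  It assumes that “most likely” means a posterior probability above 0.5 and that a fair die shows a “five” with probability 1/6; the function name is my own.

```python
def posterior_unfair(n_fives, prior_unfair=0.03, p_five_unfair=0.35, p_five_fair=1 / 6):
    """P(unfair die | n_fives consecutive fives), by Bayes' Theorem."""
    unfair = prior_unfair * p_five_unfair ** n_fives
    fair = (1 - prior_unfair) * p_five_fair ** n_fives
    return unfair / (unfair + fair)


# Find the smallest number of consecutive fives that makes "unfair"
# more probable than "fair".
n = 0
while posterior_unfair(n) <= 0.5:
    n += 1

for k in range(n + 1):
    print(k, round(posterior_unfair(k), 3))
```

Each additional “five” multiplies the odds in favour of the unfair die by $0.35 \div (1/6) = 2.1$, so the 3% prior is overturned after 5 consecutive “fives”.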

Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

Previous posts in this series on EDA include

Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!
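
As a rough sketch of the construction (in Python, with made-up numbers rather than the “Ozone” data), a normal Q-Q plot pairs each sorted observation with the matching quantile of a normal distribution fitted to the data.  The plotting positions $(i - 0.5)/n$ used here are one common convention and an assumption on my part; other conventions exist.

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(data):
    """Pair theoretical normal quantiles with the sorted sample values."""
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))
    # Plotting positions (i - 0.5) / n for i = 1, ..., n
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))


sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9, 12.5]   # hypothetical values
points = normal_qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the points fall close to the line where the 2 coordinates are equal, then the normal distribution is a reasonable description of the data; systematic curvature suggests skewness or heavy tails.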