## Applied Statistics Lesson of the Day – Additive Models vs. Interaction Models in 2-Factor Experimental Designs

In a recent “Machine Learning Lesson of the Day”, I discussed the difference between a supervised learning model in machine learning and a regression model in statistics.  In that lesson, I mentioned that a statistical regression model usually consists of a systematic component and a random component.  Today’s lesson strictly concerns the systematic component.

An additive model is a statistical regression model in which the systematic component is the arithmetic sum of the individual effects of the predictors.  Consider the simple case of an experiment with 2 factors.  If $Y$ is the response and $X_1$ and $X_2$ are the 2 predictors, then an additive linear model for the relationship between the response and the predictors is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

In other words, the effect of $X_1$ on $Y$ does not depend on the value of $X_2$, and the effect of $X_2$ on $Y$ does not depend on the value of $X_1$.

In contrast, an interaction model is a statistical regression model in which the systematic component is not the arithmetic sum of the individual effects of the predictors.  In other words, the effect of $X_1$ on $Y$ depends on the value of $X_2$, or the effect of $X_2$ on $Y$ depends on the value of $X_1$.  Thus, such a regression model would have 3 effects on the response:

1. $X_1$
2. $X_2$
3. the interaction effect of $X_1$ and $X_2$

For example, a full factorial design with 2 factors uses the 2-factor ANOVA model and assumes a linear relationship between the response and the above 3 effects.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$

Note that additive models and interaction models are not confined to experimental design; I have merely used experimental design to provide examples for these 2 types of models.
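As a quick illustration, here is a minimal sketch of how these 2 models can be fit in R; the simulated data and the variable names are my own choices, purely for illustration.

# simulate a small 2-predictor data set (hypothetical values, for illustration only)
set.seed(1)
x1 = rnorm(100)
x2 = rnorm(100)
y = 2 + 3*x1 - 1.5*x2 + 0.8*x1*x2 + rnorm(100)

# additive model: Y = beta0 + beta1*X1 + beta2*X2 + error
additive.model = lm(y ~ x1 + x2)

# interaction model: adds the beta3*X1*X2 term; lm(y ~ x1*x2) is equivalent shorthand
interaction.model = lm(y ~ x1 + x2 + x1:x2)

summary(additive.model)
summary(interaction.model)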

## Applied Statistics Lesson of the Day – The Full Factorial Design

An experimenter may seek to determine the causal relationships between $G$ factors and the response, where $G > 1$.  On first instinct, you may be tempted to conduct $G$ separate experiments, each using the completely randomized design with 1 factor.  Often, however, it is possible to conduct 1 experiment with $G$ factors at the same time.  This is better than the first approach because

• it is faster
• it uses fewer resources to answer the same questions
• the interactions between the $G$ factors can be examined

Such an experiment requires the full factorial design.  After controlling for confounding variables and choosing the appropriate range and number of levels of each factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels.  Such an experiment would result in $2 \times 2 = 4$ treatments, each being a combination of 1 level from the first factor and 1 level from the second factor.  Since this is a full factorial design, experimental units are randomly assigned across all 4 treatments, so every treatment is used.  The 2-factor ANOVA model is commonly used to analyze data from such designs.
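As a rough sketch in R, the 4 treatments of such a 2 × 2 design can be enumerated with expand.grid(), and experimental units can then be randomly assigned to them; the factor names and the number of experimental units below are hypothetical.

# enumerate all treatments in a 2 x 2 full factorial design
treatments = expand.grid(factor.A = c("low", "high"),
                         factor.B = c("low", "high"))
treatments   # 4 rows - 1 row for each treatment

# randomly assign 20 experimental units to the 4 treatments (5 units per treatment)
set.seed(1)
n.units = 20
assignment = sample(rep(1:nrow(treatments), length.out = n.units))
design = treatments[assignment, ]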

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.

## Machine Learning Lesson of the Day – Overfitting

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set.  That underlying trend cannot be precisely captured because of the random variation in the data around that trend.  A model must have enough complexity to capture that trend, but not so much complexity that it also captures the random variation.  An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

• A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target.  If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all of the variation in the target that is due to the underlying trend has already been captured.  Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise.  Thus, the new model with these added irrelevant predictors describes the trend and the noise.  It predicts the targets in the training set extremely well, but it predicts the targets in any new, fresh data set very poorly – the model captures the noise that is unique to the training set.

(The above explanation used a parametric model for illustration, but overfitting can also occur with non-parametric models.)
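Here is a minimal sketch in R of this phenomenon, using polynomial terms as the irrelevant predictors; the simulated data, the sample sizes, and the polynomial degree are all my own illustrative choices.

set.seed(1)

# the true underlying trend is linear; everything else is random noise
x.train = runif(20); y.train = 2 + 3*x.train + rnorm(20, sd = 0.5)
x.test  = runif(20); y.test  = 2 + 3*x.test  + rnorm(20, sd = 0.5)

simple.fit  = lm(y.train ~ x.train)             # enough complexity for the trend
complex.fit = lm(y.train ~ poly(x.train, 10))   # 9 irrelevant higher-order terms

# training error: the complex model typically fits the training set more closely
mean(residuals(simple.fit)^2)
mean(residuals(complex.fit)^2)

# test error: the complex model typically predicts new data much more poorly
mean((y.test - predict(simple.fit,  newdata = data.frame(x.train = x.test)))^2)
mean((y.test - predict(complex.fit, newdata = data.frame(x.train = x.test)))^2)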

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but slight changes to the training data would result in a vastly different fitted model and, consequently, vastly different predictions for the targets.

Overfitting differs from multicollinearity, which I will explain in a later post.  Overfitting involves irrelevant predictors, whereas multicollinearity involves redundant predictors.

## Applied Statistics Lesson of the Day – Blocking and the Randomized Complete Block Design (RCBD)

A completely randomized design works well for a homogeneous population – one that does not have major differences between any sub-populations.  However, what if a population is heterogeneous?

Consider an example that commonly occurs in medical studies.  An experiment seeks to determine the effectiveness of a drug on curing a disease, and 100 patients are recruited for this double-blinded study – 50 are men, and 50 are women.  An abundance of biological knowledge tells us that men and women have significantly different physiologies, so this is a heterogeneous population with respect to gender.  If a completely randomized design is used for this study, gender could be a confounding variable; this is especially true if the experimental group has a much higher proportion of one gender, and the control group has a much higher proportion of the other gender.  (For instance, purely due to the randomness, 45 males may be assigned to the experimental group, and 45 females may be assigned to the control group.)  If a statistically significant difference in the patients’ survival from the disease is observed between such a pair of experimental and control groups, this effect could be attributed to the drug or to gender, and that would ruin the goal of determining the cause-and-effect relationship between the drug and survival from the disease.

To overcome this heterogeneity and control for the effect of gender, a randomized block design could be used.  Blocking is the division of the experimental units into homogeneous sub-populations before assigning treatments to them.  A randomized block design for our above example would divide the males and females into 2 separate sub-populations, and then each of these 2 sub-populations is randomly split into an experimental group and a control group.  Thus, the experiment actually has 4 groups:

1. 25 men take the drug (experimental)
2. 25 men take a placebo (control)
3. 25 women take the drug (experimental)
4. 25 women take a placebo (control)

Essentially, the population is divided into blocks of homogeneous sub-populations, and a completely randomized design is applied within each block.  This removes the influence of gender from the comparison between the drug and the placebo and increases the precision of the estimate of the drug’s effect.
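Here is a rough sketch in R of this block-randomization scheme; the data frame, the patient IDs, and the function name are hypothetical and purely for illustration.

set.seed(1)

# 50 men and 50 women, identified by hypothetical patient IDs
patients = data.frame(id = 1:100,
                      gender = rep(c("male", "female"), each = 50))

# within each gender block, randomly assign half to the drug and half to the placebo
assign.within.block = function(block)
{
     block$treatment = sample(rep(c("drug", "placebo"), length.out = nrow(block)))
     block
}
design = do.call(rbind, lapply(split(patients, patients$gender), assign.within.block))

table(design$gender, design$treatment)   # 25 patients in each of the 4 groups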

## Statistics and Chemistry Lesson of the Day – Illustrating Basic Concepts in Experimental Design with the Synthesis of Ammonia

To summarize what we have learned about experimental design in the past few Applied Statistics Lessons of the Day, let’s use an example from physical chemistry to illustrate these basic principles.

Ammonia (NH3) is widely used as a fertilizer in industry.  It is commonly synthesized by the Haber process, which involves a reaction between hydrogen gas and nitrogen gas.

N₂ + 3 H₂ → 2 NH₃   (ΔH = −92.4 kJ·mol⁻¹)

Recall that ΔH is the change in enthalpy.  Under constant pressure (which is the case for most chemical reactions), ΔH is the heat absorbed or released by the system.

## Applied Statistics Lesson of the Day – The Completely Randomized Design with 1 Factor

The simplest experimental design is the completely randomized design with 1 factor.  In this design, each experimental unit is randomly assigned to one of the factor levels.  This design is most useful for a homogeneous population (one that does not have major differences between any sub-populations).  It is appealing because of its simplicity and flexibility – it can be used for a factor with any number of levels, and different treatments can have different sample sizes.  After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.  The means of the response variable in the different groups are compared; if there are significant differences, then there is evidence to suggest that the factor and the response have a causal relationship.

The single-factor analysis of variance (ANOVA) model is most commonly used to analyze the data from such an experiment, but it does assume that the data in each group have a normal distribution and that all groups have equal variance.  The Kruskal-Wallis test is a non-parametric alternative to ANOVA for analyzing data from single-factor completely randomized experiments.
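As a brief sketch of such an analysis in R, and assuming for illustration that the built-in “InsectSprays” data set (insect counts under 6 different sprays) came from a single-factor completely randomized experiment:

# single-factor ANOVA: do the mean insect counts differ across the 6 sprays?
anova.fit = aov(count ~ spray, data = InsectSprays)
summary(anova.fit)

# Kruskal-Wallis test: a non-parametric alternative to the single-factor ANOVA
kruskal.test(count ~ spray, data = InsectSprays)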

If the factor has 2 levels, you may think that an independent 2-sample t-test with equal variance can also be used to analyze the data.  This is true, but the square of the t-test statistic in this case is just the F-test statistic in a single-factor ANOVA with 2 groups.  Thus, the results of these 2 tests are the same.  ANOVA generalizes the independent 2-sample t-test with equal variance to more than 2 groups.
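Here is a quick sketch in R that checks this equivalence on simulated data; the group sizes and means are arbitrary choices of mine.

set.seed(1)
group = factor(rep(c("A", "B"), each = 30))
response = c(rnorm(30, mean = 5), rnorm(30, mean = 6))

# independent 2-sample t-test with equal variance
t.stat = t.test(response ~ group, var.equal = TRUE)$statistic

# single-factor ANOVA with 2 groups
f.stat = anova(lm(response ~ group))[1, "F value"]

t.stat^2   # the squared t-statistic equals the F-statistic, up to rounding
f.stat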

Some textbooks state that “random assignment” means random assignment of experimental units to treatments, whereas other textbooks state that it means random assignment of treatments to experimental units.  I don’t think that there is any difference between these 2 definitions, but I welcome your thoughts in the comments.

## Applied Statistics Lesson of the Day – Positive Control in Experimental Design

In my recent lesson on controlling for confounders in experimental design, the control group was described as one that received a neutral or standard treatment, and the standard treatment may simply be nothing.  This is a negative control group.  Not all experiments require a negative control group; some experiments instead have a positive control group.

A positive control group is a group of experimental units that receive a treatment that is known to cause an effect on the response.  Such a causal relationship would have been previously established, and its inclusion in the experiment allows a new treatment to be compared to this existing treatment.  Again, both the positive control group and the experimental group experience the same experimental procedures and conditions except for the treatment.  The existing treatment with the known effect on the response is applied to the positive control group, and the new treatment with the unknown effect on the response is applied to the experimental group.  If the new treatment has a causal relationship with the response, both the positive control group and the experimental group should have the same responses.  (This assumes, of course, that the response can only be changed in 1 direction.  If the response can increase or decrease in value (or, more generally, change in more than 1 way), then it is possible for the positive control group and the experimental group to have different responses.)

In short, in an experiment with a positive control group, an existing treatment is known to “work”, and the new treatment is being tested to see if it can “work” just as well or even better.  Experiments to test the effectiveness of a new medical therapy or a new disease detector often have positive controls; there are existing therapies or detectors that work well, and the new therapy or detector is being evaluated for its effectiveness.

Experiments with positive controls are useful for ensuring that the experimental procedures and conditions proceed as planned.  If the positive control does not show the expected response, then something is wrong with the experimental procedures or conditions, and any “good” result from the new treatment should be considered with skepticism.

## Machine Learning Lesson of the Day – Classification and Regression

Supervised learning has 2 categories:

• In classification, the target variable is categorical.
• In regression, the target variable is continuous.

Thus, regression in statistics is different from regression in supervised learning.

In statistics,

• regression is used to model relationships between predictors and targets, and the targets could be continuous or categorical.
• a regression model usually includes 2 components to describe such relationships:
  • a systematic component
  • a random component.  The random component of this relationship is mathematically described by some probability distribution.
• most regression models in statistics also have assumptions about the relationships between the predictors and/or between the observations.
• many statistical models also aim to provide interpretable relationships between the predictors and targets.
  • For example, in simple linear regression, the slope parameter, $\beta_1$, represents the change in the mean of the target, $Y$, for every unit increase in the predictor, $X$.

In supervised learning,

• target variables in regression must be continuous.
• categorical target variables are modelled in classification.
• regression places less or even no emphasis on using probability to describe the random variation between the predictors and the target.
  • Random forests are powerful tools for both classification and regression, but they do not use probability to describe the relationship between the predictors and the target.
• regression places less or even no emphasis on providing interpretable relationships between the predictors and targets.
  • Neural networks are powerful tools for both classification and regression, but they do not provide interpretable relationships between the predictors and the target.

***The last 2 points are applicable to classification, too.

In general, supervised learning puts much more emphasis on accurate prediction than statistics does.

Since regression in supervised learning includes only continuous targets, this results in some confusing terminology between the 2 fields.  For example, logistic regression is a commonly used technique in both statistics and supervised learning.  However, despite its name, it is a classification technique in supervised learning, because the response variable in logistic regression is categorical.
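To make this concrete, here is a minimal sketch in R of logistic regression being used as a classifier; the simulated data and the 0.5 classification threshold are my own illustrative choices.

set.seed(1)

# a binary (categorical) target and one continuous predictor
x = rnorm(200)
y = rbinom(200, size = 1, prob = plogis(-1 + 2*x))

# logistic regression is a regression model in the statistical sense ...
logistic.fit = glm(y ~ x, family = binomial)

# ... but, in supervised learning, it is used as a classifier:
# the predicted probabilities are converted into predicted classes
predicted.prob  = predict(logistic.fit, type = "response")
predicted.class = ifelse(predicted.prob > 0.5, 1, 0)
table(observed = y, predicted = predicted.class)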

## Applied Statistics Lesson of the Day – Basic Terminology in Experimental Design #1

Experiment: A procedure to determine the causal relationship between 2 variables – an explanatory variable and a response variable.  The value of the explanatory variable is changed, and the value of the response variable is observed for each value of the explanatory variable.

• An experiment can have 2 or more explanatory variables and 2 or more response variables.
• In my experience, I find that most experiments have 1 response variable, but many experiments have 2 or more explanatory variables.  The interactions between the multiple explanatory variables are often of interest.
• All other variables are held constant in this process to avoid confounding.

Explanatory Variable or Factor: The variable whose values are set by the experimenter.  This variable is the cause in the hypothesis.  (*Many people call this the independent variable.  I discourage this usage, because “independent” means something very different in statistics.)

Response Variable: The variable whose values are observed by the experimenter as the explanatory variable’s value is changed.  This variable is the effect in the hypothesis.  (*Many people call this the dependent variable.  Further to my previous point about “independent variables”, dependence means something very different in statistics, and I discourage this usage as well.)

Factor Level: Each possible value of the factor (explanatory variable).  A factor must have at least 2 levels.

Treatment: Each possible combination of factor levels.

• If the experiment has only 1 explanatory variable, then each treatment is simply each factor level.
• If the experiment has 2 explanatory variables, X and Y, then each treatment is a combination of 1 factor level from X and 1 factor level from Y.  Such combining of factor levels generalizes to experiments with more than 2 explanatory variables.

Experimental Unit: The object on which a treatment is applied.  This can be anything – person, group of people, animal, plant, chemical, guitar, baseball, etc.

## Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

#### Introduction

Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  Fully commented and readily usable R code will be provided at the end.

Given a probability density function (PDF) and its support set as vectors in an array programming language like R, how do you integrate the PDF over its support set to check that the integral equals 1?  Read the rest of this post to view my own R function to implement trapezoidal integration and learn how to use it to numerically approximate integrals.
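Here is a rough sketch of the idea; the function name and the grid spacing below are illustrative choices of mine, not necessarily those of the final code.

# approximate the integral of y = f(x) over x with the trapezoidal rule:
# sum the areas of the trapezoids formed by consecutive (x, y) points
trapezoidal.integration = function(x, y)
{
     sum(0.5 * (y[-1] + y[-length(y)]) * diff(x))
}

# check that the Beta(2, 5) PDF integrates to (approximately) 1 over its support set [0, 1]
support = seq(0, 1, by = 0.001)
pdf.values = dbeta(support, shape1 = 2, shape2 = 5)
trapezoidal.integration(support, pdf.values)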

## Detecting Unfair Dice in Casinos with Bayes’ Theorem

#### Introduction

I saw an interesting problem that requires Bayes’ Theorem and some simple R programming while reading a bioinformatics textbook.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively.

#### The Problem

The following question is a slightly modified version of Exercise #1.2 on Page 8 in “Biological Sequence Analysis” by Durbin, Eddy, Krogh and Mitchison.

“An occasionally dishonest casino uses 2 types of dice.  Of its dice, 97% are fair but 3% are unfair, and a “five” comes up 35% of the time for these unfair dice.  If you pick a die randomly and roll it, how many “fives” in a row would you need to see before it was most likely that you had picked an unfair die?”

Read more to learn how to create the following plot and how it invokes Bayes’ Theorem to solve the above problem!
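As a sketch of the underlying calculation (without the plot), Bayes’ Theorem gives the posterior probability of having picked an unfair die after observing $n$ “fives” in a row:

$P(\text{unfair} \mid n \text{ fives}) = \dfrac{0.03 \times 0.35^n}{0.03 \times 0.35^n + 0.97 \times (1/6)^n}$

The R code below is my own illustration of that computation.

# prior probabilities and the probability of a "five" for each type of die
prior.unfair = 0.03;  p.five.unfair = 0.35
prior.fair   = 0.97;  p.five.fair   = 1/6

# posterior probability of an unfair die after observing n "fives" in a row
n = 1:10
posterior.unfair = (prior.unfair * p.five.unfair^n) /
     (prior.unfair * p.five.unfair^n + prior.fair * p.five.fair^n)

# smallest number of consecutive "fives" for which the unfair die is more likely
min(n[posterior.unfair > 0.5])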

## Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

#### Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.


Learn how to create a quantile-quantile plot like this one with R code in the rest of this post!
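As a minimal sketch, a basic normal Q-Q plot of the “Ozone” data can be drawn in R as follows; note that the missing values are removed first.

# remove the missing values from the "Ozone" data before plotting
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# normal quantile-quantile plot with a reference line through the quartiles
qqnorm(ozone, main = "Normal Q-Q Plot of Ozone Concentrations")
qqline(ozone)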

## Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

#### Introduction

Data in R are often stored in data frames, because they can store multiple types of data.  (In R, data frames are more general than matrices, because matrices can only store one type of data.)  Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.  I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis.

#### Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns).  The output is a vector.

> dim(InsectSprays)
[1] 72 2

Use nrow() and ncol() to get the number of rows and number of columns, respectively.  You can get the same information by extracting the first and second element of the output vector from dim().

> nrow(InsectSprays)
# same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)
# same as dim(InsectSprays)[2]
[1] 2
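A few other built-in functions are also handy for a first look at a data frame; here is a brief sketch using the same data set (this is only a sample of such functions).

str(InsectSprays)       # structure: variable names, types, and the first few values
head(InsectSprays)      # the first 6 rows
summary(InsectSprays)   # summary statistics for numeric columns and counts for factors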

## Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series.  I will define and calculate the 5-number summary in 2 different ways that are commonly used in R.  (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.)  I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function.  This post expands on a recent comment that I made to correct an error in the post on box plots.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
Min.   1st Qu.   Median    Mean     3rd Qu.    Max.
1.0     3.5       6.0       6.0      8.5       11.0

Why do these 2 methods of calculating the 5-number summary in R give different results?  Read the rest of this post to find out the answer!


## Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

#### Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots.  I will finally answer the question that I have asked (and hinted at answering) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like the plot above!
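As a rough sketch of how such an overlay can be produced in R: the histogram is drawn on the density scale, a normal curve using the sample mean and standard deviation serves as the parametric density, and a kernel density estimate serves as the non-parametric one.  The default bin and bandwidth settings below are my own simplifications.

# remove the missing values from the "Ozone" data before plotting
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# histogram on the density scale, so that density curves can be overlaid
hist(ozone, freq = FALSE, main = "Histogram of Ozone with Density Curves",
     xlab = "Ozone concentration")

# parametric density: a normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(ozone), sd = sd(ozone)), add = TRUE, lty = 2)

# non-parametric density: a kernel density estimate
lines(density(ozone))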

This is another post in my continuing series on exploratory data analysis (EDA).

## Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.  In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.


Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!
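As a minimal sketch of such a histogram in R: by default, hist() chooses the number of bins via Sturges’ rule, and the “breaks” argument below is only a suggested number of bins, not an exact one.

# remove the missing values from the "Ozone" data before plotting
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# default binning (Sturges' rule)
hist(ozone, main = "Histogram of Ozone Concentrations in New York",
     xlab = "Ozone concentration")

# "breaks" suggests a different number of bins
hist(ozone, breaks = 15, main = "Histogram of Ozone Concentrations in New York",
     xlab = "Ozone concentration")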

## Exploratory Data Analysis – Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

Update on July 15, 2013:

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plots should have non-negative support sets.  This has been corrected in this post by

- defining new variables called max.ozone and max.ozone2

- using the options “from = 0” and “to = max.ozone” or “to = max.ozone2” in the density() function when defining density.ozone and density.ozone2 in the R code.

Update on February 2, 2014:

Harlan also noted in the above comment that any truncated kernel density estimator (KDE) from density() in R does not integrate to 1 over its support set.  Thanks to Julian Richer Daily for suggesting on AnalyticBridge to scale any truncated kernel density estimator (KDE) from density() by its integral to get a KDE that integrates to 1 over its support set.  I have used my own function for trapezoidal integration to do so, and this has been added below.
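As a rough sketch of this correction, I reuse the hypothetical trapezoidal.integration() function from my sketch in the numerical integration post above; “ozone” and “ozone2” stand for the New York and Ozonopolis data vectors, and these names are assumptions rather than the exact names in the original code.

# hypothetical sketch; ozone and ozone2 are the New York and Ozonopolis data vectors
max.ozone  = max(ozone,  na.rm = TRUE)
max.ozone2 = max(ozone2, na.rm = TRUE)

# kernel density estimates truncated to the non-negative support sets
density.ozone  = density(ozone,  na.rm = TRUE, from = 0, to = max.ozone)
density.ozone2 = density(ozone2, na.rm = TRUE, from = 0, to = max.ozone2)

# rescale each truncated KDE by its integral so that it integrates to 1
density.ozone$y  = density.ozone$y  / trapezoidal.integration(density.ozone$x,  density.ozone$y)
density.ozone2$y = density.ozone2$y / trapezoidal.integration(density.ozone2$x, density.ozone2$y)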

I thank everyone for your patience while I took the time to write a post about numerical integration before posting this correction.  I was in the process of moving between jobs and cities when Harlan first brought this issue to my attention, and I had also been planning a major expansion of this blog since then.  I am glad that I have finally started a series on numerical integration to provide the conceptual background for the correction of this error, and I hope that they are helpful.  I recognize that this is a rather late correction, and I apologize for any confusion.

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation.

#### Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (CDFs), and how to plot empirical CDFs in R.

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!
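As a minimal sketch of such a combination for the New York data (the truncation to a non-negative support set follows the update above, and removing the missing values first is my own simplification):

# remove the missing values from the "Ozone" data
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# kernel density plot, truncated to the non-negative support set
plot(density(ozone, from = 0, to = max(ozone)),
     main = "Kernel Density Plot of Ozone Concentrations in New York")

# rug plot: a short tick mark along the horizontal axis for every observation
rug(ozone)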

## Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R.  (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

1. using the built-in ecdf() and plot() functions in R
2. calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R.  Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs.  That plot will be compared to the plots of the empirical CDFs of the ozone data to check if they came from a normal distribution.
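As a rough sketch of these 2 ways in R (the missing values are removed first, as discussed above, and the plot titles are my own):

# remove the missing values from the "Ozone" data
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# way 1: the built-in ecdf() function, plotted with plot()
plot(ecdf(ozone), main = "Empirical CDF of Ozone (ecdf function)")

# way 2: plot the cumulative probabilities against the ordered data
ozone.ordered = sort(ozone)
cumulative.probabilities = (1:length(ozone.ordered)) / length(ozone.ordered)
plot(ozone.ordered, cumulative.probabilities, type = "s",
     main = "Empirical CDF of Ozone (manual calculation)")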

## Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

Read the rest of this post to learn what an empirical CDF is and how to produce the above plot!

## Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

#### Introduction

Recently, I began a series on exploratory data analysis (EDA), and I have written about descriptive statistics, box plots, and kernel density plots so far.  As previously mentioned in my post on box plots, there is a way to combine box plots and kernel density plots.  This combination results in violin plots, and I will show how to create them in R today.

Continuing from my previous posts on EDA, I will use 2 univariate data sets.  One is the “ozone” data vector that is part of the “airquality” data set that is built into R; this data set contains data on New York’s air pollution.  The other is a simulated data set of ozone pollution in a fictitious city called “Ozonopolis”.  It is important to remember that the ozone data from New York has missing values, and this has created complications that needed to be addressed in previous posts; missing values need to be addressed for violin plots, too, and in a different way than before.

The vioplot() command in the “vioplot” package creates violin plots; the plotting options in this function are different from and less versatile than those of the other plotting functions that I have used in R.  Thus, I needed to be more creative with the plot(), title(), and axis() functions to create the plots that I want.  Read the details carefully to understand and benefit fully from the code.
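As a bare-bones sketch of the basic call (the “vioplot” package must be installed first; the simulated Ozonopolis data and the extra formatting with plot(), title(), and axis() are not reproduced here):

library(vioplot)   # install.packages("vioplot") if it is not already installed

# remove the missing values from the New York ozone data
ozone = airquality$Ozone[!is.na(airquality$Ozone)]

# a basic violin plot of the New York ozone data; the fictitious Ozonopolis data
# would be added as a second argument to compare the 2 cities side by side
vioplot(ozone, names = "New York", col = "grey")
title(main = "Violin Plot of Ozone Concentrations", ylab = "Ozone concentration")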

Read further to learn how to create these violin plots that combine box plots with kernel density plots!  Be careful – the syntax is more complicated than usual!