## Sorting correlation coefficients by their magnitudes in a SAS macro

#### Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in both the numerator and the denominator, it cancels out, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.
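
The 3 steps above can be sketched in R as follows.  This is only a minimal illustration and not the SAS macro that this post is about; it assumes the built-in mtcars data set, and it treats "mpg" as a hypothetical response variable with all other columns as covariates.

```r
# Assumption: mpg is the response; every other column of mtcars is a covariate.
response   <- "mpg"
covariates <- setdiff(names(mtcars), response)

# Step 1: compute the Pearson correlation of each covariate with the response.
r <- sapply(mtcars[covariates], function(x) cor(x, mtcars[[response]]))

# Step 2: obtain the absolute values.
corr_table <- data.frame(variable = names(r),
                         r        = unname(r),
                         abs_r    = abs(unname(r)))

# Step 3: sort by the absolute values, largest first.
corr_table <- corr_table[order(corr_table$abs_r, decreasing = TRUE), ]
print(corr_table, row.names = FALSE)
```

Keeping the signed coefficient next to its absolute value preserves the direction of each relationship after sorting.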

## Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry”.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!

## Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

I am very excited to write again for the official JMP blog as a guest blogger!  Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP”.  This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into JMP and prepare them for ANOVA.  Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.

## Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is the intersection of 15 independent events, each of which is a Bernoulli trial in which the home team wins with probability 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{i = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

• odds is only applicable to Bernoulli random variables (i.e. binary events)
• odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.
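
The arithmetic above can be checked with a few lines of R.  This is just a sketch of the calculation under the same assumptions as the article (15 independent games, each a 50-50 proposition); the variable names are my own.

```r
n_games    <- 15
p_home_win <- 0.5

# Probability that all 15 home teams win (independent games)
p_sweep <- p_home_win^n_games

# Odds = probability of success divided by probability of failure
odds_sweep <- p_sweep / (1 - p_sweep)

print(1 / p_sweep)     # the "1 in ..." figure for the probability
print(1 / odds_sweep)  # the "1 to ..." figure for the odds
```

The probability corresponds to 1 in 32,768, while the odds correspond to 1 to 32,767.  The two numbers are close here only because the event is so rare.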

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

## Mathematical Statistics Lesson of the Day – Basu’s Theorem

Today’s Statistics Lesson of the Day will discuss Basu’s theorem, which connects the previously discussed concepts of minimally sufficient statistics, complete statistics and ancillary statistics.  As before, I will begin with the following set-up.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF) or probability mass function (PMF) for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\textbf{X}$.

Basu’s theorem states that, if $T(\textbf{X})$ is a complete and minimal sufficient statistic, then $T(\textbf{X})$ is independent of every ancillary statistic.

Establishing the independence between 2 random variables can be very difficult if their joint distribution is hard to obtain.  This theorem allows the independence between a complete and minimally sufficient statistic and every ancillary statistic to be established without their joint distribution – and this is the great utility of Basu’s theorem.
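
As a brief illustration of how the theorem is used (a standard textbook example, not one taken from this lesson), suppose that $X_1, X_2, ..., X_n$ are independent $\text{Normal}(\mu, \sigma^2)$ random variables with $\sigma^2$ known.  The sample mean

$\bar{X} = \frac{1}{n} \sum_{i = 1}^{n} X_i$

is a complete and minimally sufficient statistic for $\mu$, while the sample variance

$S^2 = \frac{1}{n - 1} \sum_{i = 1}^{n} (X_i - \bar{X})^2$

is ancillary for $\mu$, because $(n - 1)S^2 \div \sigma^2$ follows the $\chi^2_{n - 1}$ distribution regardless of the value of $\mu$.  Basu’s theorem then gives the independence of $\bar{X}$ and $S^2$ without any need to derive their joint distribution.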

However, establishing that a statistic is complete can be a difficult task.  In a later lesson, I will discuss another theorem that will make this task easier for certain cases.

## Mathematical Statistics Lesson of the Day – An Example of An Ancillary Statistic

Consider 2 independent random variables, $X_1$ and $X_2$, from the normal distribution $\text{Normal}(\mu, \sigma^2)$, where $\mu$ is unknown.  Then the statistic

$D = X_1 - X_2$

has the distribution

$\text{Normal}(0, 2\sigma^2)$.

The distribution of $D$ does not depend on $\mu$, so $D$ is an ancillary statistic for $\mu$.

Note that, if $\sigma^2$ is unknown, then $D$ is not ancillary for $\sigma^2$.
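
A small simulation can make this concrete.  The sketch below is only illustrative; the value of $\sigma$, the 2 values of $\mu$, the number of replicates and the function name are arbitrary choices of mine.  If $D$ is ancillary for $\mu$, its simulated mean and variance should be close to 0 and $2\sigma^2$ no matter which $\mu$ is used.

```r
set.seed(123)
sigma <- 2   # assumed known standard deviation

# Simulate D = X1 - X2 for independent X1, X2 ~ Normal(mu, sigma^2)
simulate_d <- function(mu, n_rep = 1e5) {
  rnorm(n_rep, mean = mu, sd = sigma) - rnorm(n_rep, mean = mu, sd = sigma)
}

d_mu0  <- simulate_d(mu = 0)
d_mu50 <- simulate_d(mu = 50)

# Both rows should be close to a mean of 0 and a variance of 2 * sigma^2 = 8
summary_d <- rbind(mu_0  = c(mean = mean(d_mu0),  variance = var(d_mu0)),
                   mu_50 = c(mean = mean(d_mu50), variance = var(d_mu50)))
print(summary_d)
```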

## Mathematical Statistics Lesson of the Day – Ancillary Statistics

The set-up for today’s post mirrors my earlier Statistics Lessons of the Day on sufficient statistics and complete statistics.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF) or probability mass function (PMF) for $X_1, X_2, ..., X_n$.

Let

$a = A(\mathbf{X})$

be a statistic based on $\textbf{X}$.

If the distribution of $A(\textbf{X})$ does NOT depend on $\theta$, then $A(\textbf{X})$ is called an ancillary statistic.

An ancillary statistic contains no information about $\theta$; its distribution is fixed and known without any relation to $\theta$.  Why, then, would we care about $A(\textbf{X})$?  I will address this question in later Statistics Lessons of the Day, and I will connect ancillary statistics to sufficient statistics, minimally sufficient statistics and complete statistics.

## Mathematics and Applied Statistics Lesson of the Day – Contrasts

A contrast is a linear combination of a set of variables such that the sum of the coefficients is equal to zero.  Notationally, consider a set of variables

$\mu_1, \mu_2, ..., \mu_n$.

Then the linear combination

$c_1 \mu_1 + c_2 \mu_2 + ... + c_n \mu_n$

is a contrast if

$c_1 + c_2 + ... + c_n = 0$.

There is a reason why I chose to use $\mu$ as the symbol for the variables in the above notation – in statistics, contrasts provide a very useful framework for comparing multiple population means in hypothesis testing.  In a later Statistics Lesson of the Day, I will illustrate some examples of contrasts, especially in the context of experimental design.
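
In the meantime, here is a minimal sketch in R of how the definition can be checked numerically.  The helper function is_contrast(), the coefficient vectors and the 3 group means are all hypothetical and are used only for illustration.

```r
# A set of coefficients defines a contrast if the coefficients sum to zero.
is_contrast <- function(coefs, tol = 1e-12) {
  abs(sum(coefs)) < tol
}

pairwise   <- c(1, -1, 0)        # mu_1 versus mu_2
one_vs_avg <- c(1, -0.5, -0.5)   # mu_1 versus the average of mu_2 and mu_3
not_one    <- c(1, 1, 1)         # coefficients sum to 3, so not a contrast

print(c(is_contrast(pairwise), is_contrast(one_vs_avg), is_contrast(not_one)))

# Value of the second contrast for some hypothetical means
mu <- c(10, 12, 14)
contrast_value <- sum(one_vs_avg * mu)
print(contrast_value)
```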

## The advantages of using count() to get N-way frequency tables as data frames in R

#### Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will highlight the advantages of this function over other functions like table() and xtabs().

#### 2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

    > y = xtabs(~ cyl + gear, mtcars)
    > y
       gear
    cyl  3  4  5
      4  1  8  2
      6  2  4  1
      8 12  0  2
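
For comparison, the same counts can be shown in the list format, with 1 row for each combination of cyl and gear.  The sketch below uses only Base R to convert the cross-tabulation; it is not the count() approach that this post advocates, and the object names are my own.

```r
# Cross-tabulated format: cyl in the rows, gear in the columns
cross_tab <- xtabs(~ cyl + gear, data = mtcars)

# List format: one row per (cyl, gear) combination, with the count in 'Freq'
list_format <- as.data.frame(cross_tab)
print(list_format)
```

Note that this conversion keeps the combinations with a count of zero, such as 8 cylinders with 4 gears.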

## How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

#### Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

#### The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

    > head(subset(mtcars, select = 'gear'))
                      gear
    Mazda RX4            4
    Mazda RX4 Wag        4
    Datsun 710           4
    Hornet 4 Drive       3
    Hornet Sportabout    3
    Valiant              3
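
As a rough sketch of the goal, the code below shows that table() returns an object of class "table" rather than a data frame, and then obtains a data frame in 2 ways.  The Base R conversion is an alternative that I am adding for illustration; the count() call is run only if the “plyr” package happens to be installed.

```r
# table() gives the counts, but the result is not a data frame
gear_table <- table(mtcars$gear)
print(class(gear_table))

# Base R alternative: convert the table, then give the columns clearer names
gear_freq <- as.data.frame(gear_table, stringsAsFactors = FALSE)
names(gear_freq) <- c("gear", "freq")
print(gear_freq)

# The count() function from 'plyr' returns a data frame directly
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::count(mtcars, "gear"))
}
```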

## Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

## Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.

If

$E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,$

implies that

$P_\theta \{g[T(\mathbf{X})] = 0\} = 1, \ \ \forall \ \theta,$

then $T(\mathbf{X})$ is said to be complete.  To deconstruct this esoteric mathematical statement,

1. let $g(t)$ be a measurable function
2. if you want to use $g[T(\mathbf{X})]$ to form an unbiased estimator of the zero function,
3. and if the only such function is almost surely equal to the zero function,
4. then $T(\mathbf{X})$ is a complete statistic.
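
A standard textbook example (not part of the original lesson) may help to make this concrete.  Suppose that $X_1, X_2, ..., X_n$ are independent $\text{Bernoulli}(\theta)$ random variables with $0 < \theta < 1$, and let $T(\mathbf{X}) = X_1 + X_2 + ... + X_n$, which has the $\text{Binomial}(n, \theta)$ distribution.  Then

$E_\theta \{g[T(\mathbf{X})]\} = \sum_{t = 0}^{n} g(t) \binom{n}{t} \theta^t (1 - \theta)^{n - t} = (1 - \theta)^n \sum_{t = 0}^{n} g(t) \binom{n}{t} r^t,$

where $r = \theta \div (1 - \theta)$.  If this expectation is zero for every $\theta$, then the polynomial in $r$ on the right-hand side is zero for every $r > 0$, so each of its coefficients must be zero.  Hence, $g(t) = 0$ for $t = 0, 1, ..., n$, and $T(\mathbf{X})$ is complete.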

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*The above definition holds for discrete and continuous random variables.

## Performing Logistic Regression in R and SAS

#### Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

#### The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.
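
Since the spreadsheet itself is not reproduced here, the sketch below simulates a hypothetical data set with the same 3 variable names and fits a logistic regression model with glm() in R.  The sample size and the coefficients used in the simulation are arbitrary assumptions of mine, so the output will not match the results from the actual data.

```r
set.seed(2014)
n <- 200

# Hypothetical stand-in for heart attack.xlsx (simulated, not the real data)
treatment <- rbinom(n, size = 1, prob = 0.5)          # 1 = completed the program
anxiety   <- round(rnorm(n, mean = 60, sd = 10))      # anxiety score
log_odds  <- -4 - 1 * treatment + 0.07 * anxiety      # assumed true model
ha2       <- rbinom(n, size = 1, prob = plogis(log_odds))
heart     <- data.frame(ha2, treatment, anxiety)

# Logistic regression: binomial family with the logit link
fit <- glm(ha2 ~ treatment + anxiety,
           family = binomial(link = "logit"),
           data   = heart)
print(summary(fit)$coefficients)
```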

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

## Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has supplemented this by writing on his own blog to show that the median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

## Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, $X_1, X_2, ..., X_n$, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all $n$ variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, $T(\textbf{X})$, such that

1. $T(\textbf{X})$ is a sufficient statistic
2. if $U(\textbf{X})$ is any other sufficient statistic, then there exists a function $g$ such that

$T(\textbf{X}) = g[U(\textbf{X})].$

Note that, if there exists a one-to-one function $h$ such that

$T(\textbf{X}) = h[U(\textbf{X})],$

then $T(\textbf{X})$ and $U(\textbf{X})$ are equivalent.

## Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.  Let $g_\theta(t)$ be the PDF for $T(\mathbf{X})$.

If the conditional PDF

$h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]$

is independent of $\theta$, then $T(\mathbf{X})$ is a sufficient statistic for $\theta$.  In other words,

$h_\theta(\mathbf{X}) = h(\mathbf{X})$,

and $\theta$ does not appear in $h(\mathbf{X})$.

Intuitively, this means that $T(\mathbf{X})$ contains everything you need to estimate $\theta$, so knowing $T(\mathbf{X})$ (i.e. conditioning $f_\theta(x)$ on $T(\mathbf{X})$) is sufficient for estimating $\theta$.
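
As a worked example (a standard one that I am adding for illustration), suppose that $X_1, X_2, ..., X_n$ are independent $\text{Bernoulli}(\theta)$ random variables, and let $T(\mathbf{X}) = X_1 + X_2 + ... + X_n$ with observed value $t$.  The joint PMF of the data is

$f_\theta(x) = \theta^t (1 - \theta)^{n - t},$

and, since $T(\mathbf{X})$ has the $\text{Binomial}(n, \theta)$ distribution, its PMF is

$g_\theta(t) = \binom{n}{t} \theta^t (1 - \theta)^{n - t}.$

Their ratio is

$f_\theta(x) \div g_\theta(t) = 1 \div \binom{n}{t},$

which does not contain $\theta$.  Thus, the number of successes is a sufficient statistic for $\theta$.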

Often, the sufficient statistic for $\theta$ is a summary statistic of $X_1, X_2, ..., X_n$, such as their

• sample mean
• sample median – removed thanks to comment by Christian Robert (Xi’an)
• sample minimum
• sample maximum

If such a summary statistic is sufficient for $\theta$, then knowing this one statistic is just as useful as knowing all $n$ data for estimating $\theta$.

*The above definition holds for discrete and continuous random variables.

## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

#### Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data set with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

## Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function $f(x)$ that is continuous on the interval $[x_1, x_2]$, where $x_1$ and $x_2$ are any 2 points in the domain of $f(x)$.  Let

$x_m = 0.5x_1 + 0.5x_2$

be the midpoint of $x_1$ and $x_2$.  Then, if

$f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),$

then $f(x)$ is defined to be midpoint convex.

More generally, let’s consider any point within the interval $[x_1, x_2]$.  We can denote this arbitrary point as

$x_\lambda = \lambda x_1 + (1 - \lambda)x_2,$ where $0 < \lambda < 1$.

Then, if

$f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be convex.  If

$f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be strictly convex.

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable $Y$ with a finite expected value and for any convex function $g(y)$,

$E[g(Y)] \geq g[E(Y)]$.

A function $f(x)$ is defined to be concave if $-f(x)$ is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable $Z$ with a finite expected value and for any concave function $h(z)$,

$E[h(Z)] \leq h[E(Z)]$.
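
Both versions of the inequality can be checked numerically.  The sketch below is a quick illustration rather than a proof; the distributions, the sample size, and the choice of $\exp$ as the convex function and $\log$ as the concave function are my own.  Sample means stand in for expected values, and Jensen’s inequality holds exactly for the empirical distribution of any sample.

```r
set.seed(42)

# Convex case: g(y) = exp(y), so mean(exp(y)) >= exp(mean(y))
y <- rnorm(1000, mean = 1, sd = 2)
convex_lhs <- mean(exp(y))
convex_rhs <- exp(mean(y))
print(c(E_g_Y = convex_lhs, g_E_Y = convex_rhs))

# Concave case: h(z) = log(z) for z > 0, so mean(log(z)) <= log(mean(z))
z <- rexp(1000, rate = 1)
concave_lhs <- mean(log(z))
concave_rhs <- log(mean(z))
print(c(E_h_Z = concave_lhs, h_E_Z = concave_rhs))
```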

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

## Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

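In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced the conceptual foundations of empirical cumulative distribution functions (CDFs) and 2 ways of plotting them in R.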

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of $n$ independent and identically distributed random variables, $X_1, X_2, ..., X_n$,

$P[\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.$

In other words, the empirical CDF of $X_1, X_2, ..., X_n$ converges uniformly to the true CDF with probability 1.
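
The following sketch in R illustrates this convergence by simulation.  It assumes standard normal data, and the sample sizes, the seed and the helper function sup_distance() are arbitrary choices of mine.  Because the empirical CDF is a step function, the supremum is attained at a data point, either just before or at one of its jumps.

```r
# Largest vertical distance between the empirical CDF of x and a true CDF
sup_distance <- function(x, cdf = pnorm) {
  n <- length(x)
  f <- cdf(sort(x))
  # compare the true CDF with the empirical CDF at and just before each jump
  max(seq_len(n) / n - f, f - (seq_len(n) - 1) / n)
}

set.seed(2013)
sizes     <- c(100, 1000, 10000)
distances <- sapply(sizes, function(n) sup_distance(rnorm(n)))
print(data.frame(n = sizes, sup_distance = distances))
```

The printed distances should shrink towards zero as $n$ grows.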

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I introduced Markov’s inequality and Chebyshev’s inequality.

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$.  Remarkably, it quantifies just how often $X$ is far away from $\mu$.  Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$.  It roughly says that the probability of $X$ being at least $k$ standard deviations away from $\mu$ is at most $k^{-2}$.  Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable $X$, as long as $E(X)$ and $V(X)$ are finite.  (Markov’s inequality requires only $E(X)$ to be finite.)  This is quite a marvelous result!
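
To see how conservative the bound can be, the sketch below compares it with the exact tail probability for one specific distribution.  The exponential distribution with rate 1 (so that $\mu = 1$ and $\sigma = 1$) and the values of $k$ are my own choices for illustration.

```r
mu    <- 1   # mean of the Exponential(rate = 1) distribution
sigma <- 1   # its standard deviation
k     <- c(1.5, 2, 3, 4)

# Exact P(|X - mu| >= k * sigma) = P(X >= mu + k * sigma) + P(X <= mu - k * sigma)
exact_tail <- pexp(mu + k * sigma, rate = 1, lower.tail = FALSE) +
              pexp(mu - k * sigma, rate = 1)

# Chebyshev's upper bound
bound <- 1 / k^2

comparison <- data.frame(k, exact_tail, bound)
print(comparison)
```

For every $k$, the exact probability sits below the bound, often by a wide margin; the inequality trades sharpness for generality.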