The advantages of using count() to get N-way frequency tables as data frames in R

Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

 

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
          gear
 cyl      3     4     5
 4        1     8     2
 6        2     4     1
 8        12    0     2

Read more of this post

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

Read more of this post

Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

Performing Logistic Regression in R and SAS

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

  1. ha2  - Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
  2. treatment – Whether or not the patient completed an anger control treatment program.
  3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

Read more of this post

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

P[|X - \mu| \geq k \sigma] \leq 1 \div k^2

Markov’s inequality roughly says that a random variable X is most frequently observed near its expected value, \mu.  Remarkably, it quantifies just how often X is far away from \mu.  Chebyshev’s inequality goes one step further and quantifies that distance between X and \mu in terms of the number of standard deviations away from \mu.  It roughly says that the probability of X being k standard deviations away from \mu is at most k^{-2}.  Notice that this upper bound decreases as k increases – confirming our intuition that it is highly improbable for X to be far away from \mu.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable X, as long as E(X) and V(X) are finite.  (Markov’s inequality requires only E(X) to be finite.)  This is quite a marvelous result!

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

P(X \geq c) \leq E(X) \div c, where c > 0.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable X “far” away from its “centre” or “central value”?

Intuitively, the “central value” of X is the value that of X that is most commonly (or most frequently) observed.  Thus, as X deviates further and further from its “central value”, we would expect those distant-from-the-centre vales to be less frequently observed.

Recall that the expected value, E(X), is a measure of the “centre” of X.  Thus, we would expect that the probability of X being very far away from E(X) is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As c becomes really far away from E(X), the event X \geq c becomes less probable.

You can confirm this by substituting several key values of c.

 

  • If c = E(X), then P[X \geq E(X)] \leq 1; this is the highest upper bound that P(X \geq c) can get.  This makes intuitive sense; X is going to be frequently observed near its own expected value.

 

  • If c \rightarrow \infty, then P(X \geq \infty) \leq 0.  By Kolmogorov’s axioms of probability, any probability must be inclusively between 0 and 1, so P(X \geq \infty) = 0.  This makes intuitive sense; there is no possible way that X can be bigger than positive infinity.

The Chi-Squared Test of Independence – An Example in Both R and SAS

Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, X and Y, the chi-squared test of independence determines whether or not there exists a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.

Read more of this post

Applied Statistics Lesson of the Day – The Coefficient of Variation

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set.  However, consider the following 2 hypothetical cases:

  1. the standard deviation for the incomes of households in Canada is $2,000
  2. the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks.  Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets.  One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},

where s is the sample standard deviation and \bar{x} is the sample mean.

Analogously, the coefficient of variation for a random variable is

\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,

where \sigma is the random variable’s standard deviation and \mu is the random variable’s expected value.

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes.  I hope that all new statistics students get to learn this alternative measure of dispersion.

Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

  • its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}

  • its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals.  (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.)  Today, I want to share a second video of calculating expected counts – this time, using joint probabilities.  This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table.  Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multipled by the grand total of the entire sample.  This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.

I encourage you to watch both of my videos on expected counts in my Youtube channel to gain a deeper understanding of how and why they can be calculated.  Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.

You can also watch the video below the fold!

Read more of this post

Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial in my Youtube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

You can also watch the video below the fold!

Read more of this post

Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

  • 5% in the 1st year
  • 10% in the 2nd year
  • 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of 1 + r, where r is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, G.  The average factor of growth from the 3 years is

G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n},

where r_i is the rate of return in year i, i = 1, 2, 3, ..., n.  The average annual rate of return is G - 1.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

 

Returning to our example, our average factor of growth is

G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242.

Thus, our annual rate of return is G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%.

 

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

  1. The first set, S_1, consists of your data x_1, x_2, ..., x_n, and this set has a sample size of n.
  2. The second, S_2,  set also has a sample size of n, but all n values are the same – let’s call this common value y.
  • What number must y be such that the sums in S_1 and S_2 are equal?  This value of y is the arithmetic mean of the first set.
  • What number must y be such that the products in S_1 and S_2 are equal?  This value of y is the geometric mean of the first set.

Note that the geometric means is only applicable to positive numbers.

Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the cars travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i).

 

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

  1. 60 km/hr for 100 km
  2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]

= 46.45 \text{ km/hr}

 

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}

Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

\text{Sensitivity}

on the vertical axis and

1 - \text{Specificity}

on the horizontal axis, the line of no discrimination is the line that passes through the points

(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)

and

(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1).

In other words, the line of discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we did not have any predictors, then we can rely on only random guessing, and a random variable with the distribution \text{Bernoulli}(0.5) is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard of a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher“, I mean that, for a given value of 1 - \text{Specificity}, the \text{Sensitivity} of the binary classifier is higher than the \text{Sensitivity} of the line of no discrimination.)

Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

y = mx + b,

with y being called the dependent variable and x being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables X_1 and X_2 are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for Y and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for x_1, x_2, .., x_p.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

Applied Statistics Lesson of the Day – Polynomial Regression is Actually Just Linear Regression

Continuing from my previous Statistics Lesson of the Day on what “linear” really means in “linear regression”, I want to highlight a common example involving this nomenclature that can mislead non-statisticians.  Polynomial regression is a commonly used multiple regression technique; it models the systematic component of the regression model as a p\text{th}-order polynomial relationship between the response variable Y and the explanatory variable x.

Y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_p x^p + \varepsilon

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients.  The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).

Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, H, for n positive real numbers x_1, x_2, ..., x_n is defined as

H = n \div (1/x_1 + 1/x_2 + .. + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

  1. 60 km/hr for 120 km
  2. 40 km/hr for another 120 km

Then its average speed for this trip is

S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

S_{avg} = \Delta \text{distance} \div \Delta \text{time}

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})

\Delta \text{time} = 5 \text{ hours}

Thus,

S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!

 

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Can Apply to the Sum

The central limit theorem (CLT) is often stated in terms of the sample mean of independent and identically distributed random variables.  An often unnoticed or forgotten aspect of the CLT is its applicability to the sample sum of those variables.  Since n, the sample size, is just a constant, it can be multiplied to \bar{X} to obtain \sum_{i = 1}^{n} X_i.  For a sufficiently large n, this new statistic still has an approximately normal distribution, just with a new expected value and a new variance.

\sum_{i = 1}^{n} X_i \overset{approx.}{\sim} \text{Normal} (n\mu, n\sigma^2)

Applied Statistics Lesson of the Day – What “Linear” in Linear Regression Really Means

Linear regression is one of the most commonly used tools in statistics, yet one of its fundamental features is commonly misunderstood by many non-statisticians.  I have witnessed this misunderstanding on numerous occasions in my work experience in statistical consulting and statistical education, and it is important for all statisticians to be aware of this common misunderstanding, to anticipate it when someone is about to make this mistake, and to educate that person about the correct meaning.

Consider the simple linear regression model:

Y = \beta_0 + \beta_1x + \varepsilon.

The “linear” in linear regression refers to the linearity between the response variable (Y) and the regression coefficients (\beta_0 and \beta_1).  It DOES NOT refer to the linearity between the response variable (Y) and the explanatory variable (x) This is contrary to mathematical descriptions of linear relationships; for example, when high school students learn about the equation of a line,

y = mx + b

the relationship is called “linear” because of the linearity between y and x.  This is the source of the mistaken understanding about the meaning of “linear” in linear regression; I am grateful that my applied statistics professor, Dr. Boxin Tang, emphasized the statistical meaning of “linear” when he taught linear regression to me.

Why is this difference in terminology important?  A casual observer may be puzzled by this apparent nit-picking of the semantics.  This terminology is important because the estimation of the regression coefficients in a regression model depends on the relationship between the response variable and the regression coefficients.  If this relationship is linear, then the estimation is very simple and can be done analytically by linear algebra.  If not, then the estimation can be very difficult and often cannot be done analytically – numerical methods must be used, instead.

Now, one of the assumptions of linear regression is the linearity between the response variable (Y) and the explanatory variable (x).  However, what if the scatter plot of Y versus x reveals a non-linear relationship, such as a quadratic relationship?  In that case, the solution is simple – just replace x with x^2.  (Admittedly, if the interpretation of the regression coefficient is important, then such interpretation becomes more difficult with this transformation.  However, if prediction of the response is the key goal, then such interpretation is not necessary, and this is not a problem.)  The important point is that the estimation of the regression coefficients can still be done by linear algebra after the transformation of the explantory variable.

Determining chemical concentration with standard addition: An application of linear regression in JMP – A Guest Blog Post for the JMP Blog

I am very excited to announce that I have been invited by JMP to be a guest blogger for its official blog!  My thanks to Arati Mejdal, Global Social Media Manager for the JMP Division of SAS, for welcoming me into the JMP blogging community with so much support and encouragement, and I am pleased to publish my first post on the JMP Blog!  Mark Bailey and Byron Wingerd from JMP provided some valuable feedback to this blog post, and I am fortunate to get the chance to work with and learn from them!

Following the tradition of The Chemical Statistician, this post combines my passions for statistics and chemistry by illustrating how simple linear regression can be used for the method of standard addition in analytical chemistry.  In particular, I highlight the useful capability of the “Inverse Prediction” function under “Fit Model” platform in JMP to estimate the predictor given an observed response value (i.e. estimate the value of x_i given y_i).  Check it out!

JMP blog post - standard addition

Follow

Get every new post delivered to your Inbox.

Join 445 other followers