Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry“.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!

Using PROC SGPLOT to Produce Box Plots with Contrasting Colours in SAS

I previously explained the statistics behind box plots and how to produce them in R in a very detailed tutorial.  I also illustrated how to produce side-by-side box plots with contrasting patterns in R.

Here is an example of how to make box plots in SAS using the VBOX statement in PROC SGPLOT.  I modified the built-in data set SASHELP.CLASS to generate one that suits my needs.

The PROC TEMPLATE statement specifies the contrasting colours to be used for different classes.  I also include code for exportingthe result into a PDF file using ODS PDF.  (I used varying shades of grey to allow the contrast to be shown when printed in black and white.)

Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

Visual index of plots and corresponding R scripts

Dear Readers of The Chemical Statistician,

Joanna Zhao, an undergraduate researcher in the Department of Statistics at the University of British Columbia, produced a visual index of over 100 plots using ggplot2, the R package written by Hadley Wickham.

An example of a plot and its source R code on Joanna Zhao’s catalog.

Click on a thumbnail of any picture in this catalog – you will see the figure AND all of the necessary code to reproduce it.  These plots are from Naomi Robbins‘ book “Creating More Effective Graphs”.

If you

• want to produce an effective plot in R
• roughly know what the plot should look like
• but could really use an example to get started,

then this is a great resource for you!  A related GitHub repository has the code for ALL figures and the infrastructure for Joanna’s Shiny app.

I learned about this resource while working in my job at the British Columbia Cancer Agency; I am fortunate to attend a wonderful seminar series on statistics at the British Columbia Centre for Disease Control, and a colleague from this seminar told me about it.  By sharing this with you, I hope that it will immensely help you with your data visualization needs!

Determining chemical concentration with standard addition: An application of linear regression in JMP – A Guest Blog Post for the JMP Blog

I am very excited to announce that I have been invited by JMP to be a guest blogger for its official blog!  My thanks to Arati Mejdal, Global Social Media Manager for the JMP Division of SAS, for welcoming me into the JMP blogging community with so much support and encouragement, and I am pleased to publish my first post on the JMP Blog!  Mark Bailey and Byron Wingerd from JMP provided some valuable feedback to this blog post, and I am fortunate to get the chance to work with and learn from them!

Following the tradition of The Chemical Statistician, this post combines my passions for statistics and chemistry by illustrating how simple linear regression can be used for the method of standard addition in analytical chemistry.  In particular, I highlight the useful capability of the “Inverse Prediction” function under “Fit Model” platform in JMP to estimate the predictor given an observed response value (i.e. estimate the value of $x_i$ given $y_i$).  Check it out!

Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

Introduction

Continuing on the recently born series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the $Beta(2, 5)$ distribution actually integrates to 1 over its support set.  This post follows from my previous post on trapezoidal integration.

Image courtesy of Qef from

Conceptual Background of Rectangular Integration (a.k.a. The Midpoint Rule)

Rectangular integration is a numerical integration technique that approximates the integral of a function with a rectangle.  It uses rectangles to approximate the area under the curve.  Here are its features:

• The rectangle’s width is determined by the interval of integration.
• One rectangle could span the width of the interval of integration and approximate the entire integral.
• Alternatively, the interval of integration could be sub-divided into $n$ smaller intervals of equal lengths, and $n$ rectangles would used to approximate the integral; each smaller rectangle has the width of the smaller interval.
• The rectangle’s height is the function’s value at the midpoint of its base.
• Within a fixed interval of integration, the approximation becomes more accurate as more rectangles are used; each rectangle becomes narrower, and the height of the rectangle better captures the values of the function within that interval.

Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

Introduction

Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will introduce trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density function actually integrates to 1 over its support set.  Fully commented and readily usable R code will be provided at the end.

Given a probability density function (PDF) and its support set as vectors in an array programming language like R, how do you integrate the PDF over its support set to ensure that it equals to 1?  Read the rest of this post to view my own R function to implement trapezoidal integration and learn how to use it to numerically approximate integrals.

Detecting Unfair Dice in Casinos with Bayes’ Theorem

Introduction

I saw an interesting problem that requires Bayes’ Theorem and some simple R programming while reading a bioinformatics textbook.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively.

The Problem

The following question is a slightly modified version of Exercise #1.2 on Page 8 in “Biological Sequence Analysis” by Durbin, Eddy, Krogh and Mitchison.

An occasionally dishonest casino uses 2 types of dice.  Of its dice, 97% are fair but 3% are unfair, and a “five” comes up 35% of the time for these unfair dice.  If you pick a die randomly and roll it, how many “fives”  in a row would you need to see before it was most likely that you had picked an unfair die?”

Read more to learn how to create the following plot and how it invokes Bayes’ Theorem to solve the above problem!

Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

Previous posts in this series on EDA include

Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!

Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots.  I will finally answer the question that I have asked (and hinted to answer) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like this above plot!

This is another post in my continuing series on exploratory data analysis (EDA).  Previous posts in this series on EDA include

Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.  In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!

Exploratory Data Analysis – Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

Update on July 15, 2013:

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plot should have non-negative support sets.  This has been corrected in this post by

– defining new variables called max.ozone and max.ozone2

– using the options “from = 0” and “to = max.ozone” or “to = max.ozone2” in the density() function when defining density.ozone and density.ozone2 in the R code.

Update on February 2, 2014:

Harlan also noted in the above comment that any truncated kernel density estimator (KDE) from density() in R does not integrate to 1 over its support set.  Thanks to Julian Richer Daily for suggesting on AnalyticBridge to scale any truncated kernel density estimator (KDE) from density() by its integral to get a KDE that integrates to 1 over its support set.  I have used my own function for trapezoidal integration to do so, and this has been added below.

I thank everyone for your patience while I took the time to write a post about numerical integration before posting this correction.  I was in the process of moving between jobs and cities when Harlan first brought this issue to my attention, and I had also been planning a major expansion of this blog since then.  I am glad that I have finally started a series on numerical integration to provide the conceptual background for the correction of this error, and I hope that they are helpful.  I recognize that this is a rather late correction, and I apologize for any confusion.

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation.

Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (CDFs), and how to plot empirical CDFs in R.

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!

Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Introduction

Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R.  (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

1. using the built-in ecdf() and plot() functions in R
2. calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R.  Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs.  That plot will be compared to the plots of the empirical CDFs of the ozone data to check if they came from a normal distribution.

Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

Read to rest of this post to learn what an empirical CDF is and how to produce the above plot!

Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Introduction

Recently, I began a series on exploratory data analysis (EDA), and I have written about descriptive statistics, box plots, and kernel density plots so far.  As previously mentioned in my post on box plots, there is a way to combine box plots and kernel density plots.  This combination results in violin plots, and I will show how to create them in R today.

Continuing from my previous posts on EDA, I will use 2 univariate data sets.  One is the “ozone” data vector that is part of the “airquality” data set that is built into R; this data set contains data on New York’s air pollution.  The other is a simulated data set of ozone pollution in a fictitious city called “Ozonopolis”.  It is important to remember that the ozone data from New York has missing values, and this has created complications that needed to be addressed in previous posts; missing values need to be addressed for violin plots, too, and in a different way than before.

The vioplot() command in the “vioplot” package creates violin plots; the plotting options in this function are different and less versatile than other plotting functions that I have used in R.  Thus, I needed to be more creative with the plot(), title(), and axis() functions to create the plots that I want.  Read the details carefully to understand and benefit fully from the code.

Read further to learn how to create these violin plots that combine box plots with kernel density plots!  Be careful – the syntax is more complicated than usual!

Exploratory Data Analysis: Kernel Density Estimation – Conceptual Foundations

For the sake of brevity, this post has been created from the first half of a previous long post on kernel density estimation.  This first half focuses on the conceptual foundations of kernel density estimationThe second half will focus on constructing kernel density plots and rug plots in R.

Introduction

Recently, I began a series on exploratory data analysis; so far, I have written about computing descriptive statistics and creating box plots in R for a univariate data set with missing values.  Today, I will continue this series by introducing the underlying concepts of kernel density estimation, a useful non-parametric technique for visualizing the underlying distribution of a continuous variable.  In the follow-up post, I will show how to construct kernel density estimates and plot them in R.  I will also introduce rug plots and show how they can complement kernel density plots.

But first – read the rest of this post to learn the conceptual foundations of kernel density estimation.

Exploratory Data Analysis: Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Introduction

Last week, I wrote the first post in a series on exploratory data analysis (EDA).  I began by calculating summary statistics on a univariate data set of ozone concentration in New York City in the built-in data set “airquality” in R.  In particular, I talked about how to calculate those statistics when the data set has missing values.  Today, I continue this series by creating box plots in R and showing different variations and extensions that can be added; be sure to examine the details of this post’s R code for some valuable details.  I learned many of these tricks from Robert Kabacoff’s “R in Action” (2011).  Robert also has a nice blog called Quick-R that I consult often.

Recall that I the “Ozone” vector in the data set “airquality” has missing values.  Let’s remove those missing values first before constructing the box plots.

# abstract the raw data vector
ozone0 = airquality\$Ozone

# remove the missing values
ozone = ozone0[!is.na(ozone)]



Box Plots – What They Represent

The simplest box plot can be obtained by using the basic settings in the boxplot() command.  As usual, I use png() and dev.off() to print the image to a local folder on my computer.

png('INSERT YOUR DIRECTORY HERE/box plot ozone.png')
boxplot(ozone, ylab = 'Ozone (ppb)', main = 'Box Plot of Ozone in New York')
dev.off()

When Does the Kinetic Theory of Gases Fail? Examining its Postulates with Assistance from Simple Linear Regression in R

Introduction

The Ideal Gas Law, $\text{PV} = \text{nRT}$, is a very simple yet useful relationship that describes the behaviours of many gases pretty well in many situations.  It is “Ideal” because it makes some assumptions about gas particles that make the math and the physics easy to work with; in fact, the simplicity that arises from these assumptions allows the Ideal Gas Law to be easily derived from the kinetic theory of gases.  However, there are situations in which those assumptions are not valid, and, hence, the Ideal Gas Law fails.

Boyle’s law is inherently a part of the Ideal Gas Law.  It states that, at a given temperature, the pressure of an ideal gas is inversely proportional to its volume.  Equivalently, it states the product of the pressure and the volume of an ideal gas is a constant at a given temperature.

$\text{P} \propto \text{V}^{-1}$

An Example of The Failure of the Ideal Gas Law

This law is valid for many gases in many situations, but consider the following data on the pressure and volume of 1.000 g of oxygen at 0 degrees Celsius.  I found this data set in Chapter 5.2 of “General Chemistry” by Darrell Ebbing and Steven Gammon.

               Pressure (atm)      Volume (L)              Pressure X Volume (atm*L)
[1,]           0.25                2.8010                  0.700250
[2,]           0.50                1.4000                  0.700000
[3,]           0.75                0.9333                  0.699975
[4,]           1.00                0.6998                  0.699800
[5,]           2.00                0.3495                  0.699000
[6,]           3.00                0.2328                  0.698400
[7,]           4.00                0.1744                  0.697600
[8,]           5.00                0.1394                  0.697000

The right-most column is the product of pressure and temperature, and it is not constant.  However, are the differences between these values significant, or could it be due to some random variation (perhaps round-off error)?

Here is the scatter plot of the pressure-volume product with respect to pressure.

These points don’t look like they are on a horizontal line!  Let’s analyze these data using normal linear least-squares regression in R.

How to Calculate a Partial Correlation Coefficient in R: An Example with Oxidizing Ammonia to Make Nitric Acid

Introduction

Today, I will talk about the math behind calculating partial correlation and illustrate the computation in R.  The computation uses an example involving the oxidation of ammonia to make nitric acid, and this example comes from a built-in data set in R called stackloss.

I read Pages 234-237 in Section 6.6 of “Discovering Statistics Using R” by Andy Field, Jeremy Miles, and Zoe Field to learn about partial correlation.  They used a data set called “Exam Anxiety.dat” available from their companion web site (look under “6 Correlation”) to illustrate this concept; they calculated the partial correlation coefficient between exam anxiety and revision time while controlling for exam score.  As I discuss further below, the plot between the 2 above residuals helps to illustrate the calculation of partial correlation coefficients.  This plot makes intuitive sense; if you take more time to study for an exam, you tend to have less exam anxiety, so there is a negative correlation between revision time and exam anxiety.

They used a function called pcor() in a package called “ggm”; however, I suspect that this package is no longer working properly, because it depends on a deprecated package called “RBGL” (i.e. “RBGL” is no longer available in CRAN).  See this discussion thread for further information.  Thus, I wrote my own R function to illustrate partial correlation.

Partial correlation is the correlation between 2 random variables while holding other variables constant.  To calculate the partial correlation between X and Y while holding Z constant (or controlling for the effect of Z, or averaging out Z),