Exploratory Data Analysis: Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis


Last week, I wrote the first post in a series on exploratory data analysis (EDA).  I began by calculating summary statistics on a univariate data set of ozone concentration in New York City from the built-in data set “airquality” in R.  In particular, I talked about how to calculate those statistics when the data set has missing values.  Today, I continue this series by creating box plots in R and showing different variations and extensions that can be added; be sure to read through this post’s R code, which contains some valuable details.  I learned many of these tricks from Robert Kabacoff’s “R in Action” (2011).  Robert also has a nice blog called Quick-R that I consult often.

Recall that the “Ozone” vector in the data set “airquality” has missing values.  Let’s remove those missing values before constructing the box plots.

# extract the raw data vector
ozone0 = airquality$Ozone

# remove the missing values (test ozone0 itself; "ozone" does not exist yet)
ozone = ozone0[!is.na(ozone0)]

Box Plots – What They Represent

The simplest box plot can be obtained by using the basic settings in the boxplot() command.  As usual, I use png() and dev.off() to print the image to a local folder on my computer.

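# open a PNG graphics device; replace the placeholder with your own folder
png('INSERT YOUR DIRECTORY HERE/box plot ozone.png')
boxplot(ozone, ylab = 'Ozone (ppb)', main = 'Box Plot of Ozone in New York')
# close the device so that the image file is written
dev.off()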

box plot ozone

What do the different parts of this box plot mean?
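One way to see the numbers behind the picture is base R's boxplot.stats(), which returns the values that boxplot() draws with its default settings.  The following is only a minimal sketch and not necessarily the approach taken in the full post; the variable names are chosen here for illustration, and the ozone vector is rebuilt so that the block runs on its own.

```r
# Rebuild the ozone vector without missing values so this block is self-contained
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() computes the quantities used to draw a default box plot
bp <- boxplot.stats(ozone)

# $stats holds five numbers, from bottom to top of the plot:
# lower whisker end, lower hinge, median, upper hinge, upper whisker end
five <- setNames(bp$stats,
                 c("lower whisker", "lower hinge", "median",
                   "upper hinge", "upper whisker"))
print(five)

# $out holds the points drawn individually beyond the whiskers
# (by default, more than 1.5 times the box length away from the box)
print(bp$out)

# $n is the number of non-missing observations used
print(bp$n)
```

The hinges are close to, but not always identical to, the quartiles returned by quantile(), which is worth keeping in mind when comparing the plot against summary statistics.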

Read more of this post

When Does the Kinetic Theory of Gases Fail? Examining its Postulates with Assistance from Simple Linear Regression in R


The Ideal Gas Law, \text{PV} = \text{nRT} , is a very simple yet useful relationship that describes the behaviours of many gases pretty well in many situations.  It is “Ideal” because it makes some assumptions about gas particles that make the math and the physics easy to work with; in fact, the simplicity that arises from these assumptions allows the Ideal Gas Law to be easily derived from the kinetic theory of gases.  However, there are situations in which those assumptions are not valid, and, hence, the Ideal Gas Law fails.

Boyle’s law is inherently a part of the Ideal Gas Law.  It states that, at a given temperature, the pressure of an ideal gas is inversely proportional to its volume.  Equivalently, it states that the product of the pressure and the volume of an ideal gas is a constant at a given temperature.

\text{P} \propto \text{V}^{-1}
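
To spell out why the two statements are equivalent: if the amount of gas n and the temperature T are both held fixed, the right-hand side of the Ideal Gas Law is a single constant (called k here purely for illustration), so any two states 1 and 2 of the same gas sample satisfy

\text{PV} = \text{nRT} = k \quad \Rightarrow \quad \text{P}_1 \text{V}_1 = \text{P}_2 \text{V}_2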

An Example of The Failure of the Ideal Gas Law

This law is valid for many gases in many situations, but consider the following data on the pressure and volume of 1.000 g of oxygen at 0 degrees Celsius.  I found this data set in Chapter 5.2 of “General Chemistry” by Darrell Ebbing and Steven Gammon.

               Pressure (atm)      Volume (L)              Pressure X Volume (atm*L)
[1,]           0.25                2.8010                  0.700250
[2,]           0.50                1.4000                  0.700000
[3,]           0.75                0.9333                  0.699975
[4,]           1.00                0.6998                  0.699800
[5,]           2.00                0.3495                  0.699000
[6,]           3.00                0.2328                  0.698400
[7,]           4.00                0.1744                  0.697600
[8,]           5.00                0.1394                  0.697000

The right-most column is the product of pressure and volume, and it is not constant.  However, are the differences between these values significant, or could they be due to some random variation (perhaps round-off error)?
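
The right-most column can be reproduced from the first two.  Here is a small sketch that enters the tabulated values by hand; the variable names are my own choice for illustration.

```r
# Pressure (atm) and volume (L) of 1.000 g of oxygen at 0 degrees Celsius,
# typed in from the table above
pressure <- c(0.25, 0.50, 0.75, 1.00, 2.00, 3.00, 4.00, 5.00)
volume   <- c(2.8010, 1.4000, 0.9333, 0.6998, 0.3495, 0.2328, 0.1744, 0.1394)

# Boyle's law predicts that this product is the same in every row
pv <- pressure * volume
print(cbind(pressure, volume, pv))

# Difference between the largest and the smallest product (atm*L)
pv_spread <- diff(range(pv))
print(pv_spread)
```

Under Boyle’s law the spread would be zero apart from measurement and rounding error, so its size relative to the precision of the volumes is the thing to judge.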

Here is the scatter plot of the pressure-volume product with respect to pressure.

scatter plot pv vs pressure

These points don’t look like they are on a horizontal line!  Let’s analyze these data using ordinary least-squares linear regression in R.
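
As a rough sketch of what such an analysis can look like (the full post has the actual code and its interpretation; the names below are assumptions made for illustration), lm() can regress the pressure-volume product on pressure.  If Boyle’s law held exactly, the slope would be zero.

```r
# Data typed in from the table above
pressure <- c(0.25, 0.50, 0.75, 1.00, 2.00, 3.00, 4.00, 5.00)                 # atm
volume   <- c(2.8010, 1.4000, 0.9333, 0.6998, 0.3495, 0.2328, 0.1744, 0.1394)  # L
pv <- pressure * volume                                                        # atm*L

# Simple linear regression of the product on pressure
fit <- lm(pv ~ pressure)
print(summary(fit))

# Estimated slope and the p-value of the t-test that the slope is zero
slope   <- coef(fit)[["pressure"]]
slope_p <- summary(fit)$coefficients["pressure", "Pr(>|t|)"]
print(c(slope = slope, p.value = slope_p))
```

With only 8 points, the usual regression assumptions are hard to check, so the p-value is best read as a rough guide rather than a definitive verdict.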

Read more of this post

Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City


This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.
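
To give a flavour of the missing-value issue, here is a minimal sketch using only base R; it is not the code from the full post, and the variable names are chosen for illustration.  Most summary functions return NA when the input contains missing values unless na.rm = TRUE is supplied.

```r
# The raw ozone vector from the built-in data set, including missing values
ozone_raw <- airquality$Ozone

# Count the missing values
n_missing <- sum(is.na(ozone_raw))
print(n_missing)

# Without na.rm = TRUE, the mean of a vector containing NA is itself NA
mean_default <- mean(ozone_raw)
print(mean_default)

# With na.rm = TRUE, the missing values are dropped before computing
mean_clean   <- mean(ozone_raw, na.rm = TRUE)
median_clean <- median(ozone_raw, na.rm = TRUE)
sd_clean     <- sd(ozone_raw, na.rm = TRUE)
print(c(mean = mean_clean, median = median_clean, sd = sd_clean))

# summary() reports the number of NA's alongside the usual statistics
print(summary(ozone_raw))
```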

Read more of this post

Spatial Statistics Seminar in Toronto – Tuesday, May 21, 2013 @ SAS Canada Headquarters

I volunteer with the Southern Ontario Regional Association (SORA) of the Statistical Society of Canada (SSC) to organize a seminar series on business analytics here in Toronto.  The final seminar of the 2012-2013 series will be held on Tuesday, May 21 at SAS Canada Headquarters.  If you’re interested in attending, please email seminar.sora@gmail.com with the following subject line to get a confirmation: Registration: Seminar by BBM Canada 

Speakers: Derrick Gray and Ricardo Gomez-Insausti – BBM Canada 

Title: The Power of the Latitude and Longitude – An Application of Spatial Techniques to Audience Measurement Data

Date: Tuesday, May 21, 2013 



SAS Headquarters Office 

Suite 500

280 King Street East



Networking: 2:00 – 2:30 pm 

Introductory Remarks: 2:30 – 2:45 pm 

Seminar Time: 2:45 – 3:45 pm 

Discussion and Networking: 3:45 – 5:00 pm 

Read the entire post to see the abstract and the speakers’ biographies.

Read more of this post

Toronto Data Mining Forum @ SAS Canada: Wednesday, May 15, 2013

In a recent post, I mentioned the value of learning from and networking with other statisticians and analytics professionals at SAS User Group meetings, especially here in Toronto.  I will attend the Toronto Data Mining Forum on Wednesday, May 15.  If you plan on going, I hope to see/meet you there!  Remember – these User Group meetings are free to attend!

The agenda and the registration information can be found here.

Toronto Data Mining Forum

Wednesday, May 15, 2013

8:30 AM to 12:00 PM

(Breakfast: 8:30 AM to 9:00 AM)

SAS Canada Toronto office

280 King St. East, Suite 500

Toronto, Ontario

M5A 1L1

Webinar – Advanced Predictive Modelling for Manufacturing

The company that I work for, Predictum, is about to begin a free webinar series on statistics and analytics, and I will present the first one on Tuesday, May 14, at 2 pm EDT.  This first webinar will focus on how partial least squares regression can be used as a predictive modelling technique; the data sets are written in the context of manufacturing, but the technique is definitely applicable to all industries that need techniques beyond basic statistical tools like linear regression for predictive modelling.  JMP, a software package that Predictum uses extensively, will be used to illustrate how partial least squares regression can be implemented.  This presentation will not be heavy in mathematical detail, so it will be accessible to a wide audience, including statisticians, analysts, managers, and executives. 

Eric Cai - Official Head Shot

Attend my company’s free webinar to listen to me talking about advanced predictive modelling and partial least squares regression!

To register for this free webinar, visit the webinar’s registration page on Webex.

How to Calculate a Partial Correlation Coefficient in R: An Example with Oxidizing Ammonia to Make Nitric Acid


Today, I will talk about the math behind calculating partial correlation and illustrate the computation in R.  The computation uses an example involving the oxidation of ammonia to make nitric acid, and this example comes from a built-in data set in R called stackloss.

I read Pages 234-237 in Section 6.6 of “Discovering Statistics Using R” by Andy Field, Jeremy Miles, and Zoe Field to learn about partial correlation.  They used a data set called “Exam Anxiety.dat” available from their companion web site (look under “6 Correlation”) to illustrate this concept; they calculated the partial correlation coefficient between exam anxiety and revision time while controlling for exam score.  As I discuss further below, the plot of the residuals of exam anxiety against the residuals of revision time, each after controlling for exam score, helps to illustrate the calculation of partial correlation coefficients.  This plot makes intuitive sense; if you take more time to study for an exam, you tend to have less exam anxiety, so there is a negative correlation between revision time and exam anxiety.

residuals plot anxiety and revision time controlling exam score

They used a function called pcor() in a package called “ggm”; however, I suspect that this package is no longer working properly, because it depends on a deprecated package called “RBGL” (i.e. “RBGL” is no longer available on CRAN).  See this discussion thread for further information.  Thus, I wrote my own R function to illustrate partial correlation.
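
To be clear, the block below is not my function from the full post; it is only a minimal sketch of the residual-based idea for a single control variable.  The helper name partial_cor and the choice of which stackloss columns to correlate and to control for are assumptions made purely for illustration.

```r
# Hypothetical helper: partial correlation between x and y, controlling for z.
# It regresses x on z and y on z, then correlates the two sets of residuals.
partial_cor <- function(x, y, z) {
  res_x <- resid(lm(x ~ z))  # the part of x not linearly explained by z
  res_y <- resid(lm(y ~ z))  # the part of y not linearly explained by z
  cor(res_x, res_y)
}

# Illustrative choice of variables from the built-in stackloss data set:
# stack loss and air flow, controlling for water temperature
pc <- with(stackloss, partial_cor(stack.loss, Air.Flow, Water.Temp))
print(pc)

# For comparison, the ordinary correlation that ignores water temperature
print(with(stackloss, cor(stack.loss, Air.Flow)))
```

Working with residuals keeps the connection to the residual plot above explicit: the partial correlation is simply the ordinary correlation of the points in that kind of plot.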

Partial correlation is the correlation between 2 random variables while holding other variables constant.  To calculate the partial correlation between X and Y while holding Z constant (or controlling for the effect of Z, or averaging out Z),

Read more of this post