## Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

#### Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots.  I will finally answer the question that I have asked (and hinted to answer) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like this above plot!

This is another post in my continuing series on exploratory data analysis (EDA).  Previous posts in this series on EDA include

## Exploratory Data Analysis – Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

Update on July 15, 2013:

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plot should have non-negative support sets.  This has been corrected in this post by

- defining new variables called max.ozone and max.ozone2

- using the options “from = 0″ and “to = max.ozone” or “to = max.ozone2″ in the density() function when defining density.ozone and density.ozone2 in the R code.

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation.

#### Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (CDFs), and how to plot empirical CDFs in R.

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!

## Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R.  (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

1. using the built-in ecdf() and plot() functions in R
2. calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R.  Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs.  That plot will be compared to the plots of the empirical CDFs of the ozone data to check if they came from a normal distribution.

## Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

Read to rest of this post to learn what an empirical CDF is and how to produce the above plot!

## Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

#### Introduction

Recently, I began a series on exploratory data analysis (EDA), and I have written about descriptive statistics, box plots, and kernel density plots so far.  As previously mentioned in my post on box plots, there is a way to combine box plots and kernel density plots.  This combination results in violin plots, and I will show how to create them in R today.

Continuing from my previous posts on EDA, I will use 2 univariate data sets.  One is the “ozone” data vector that is part of the “airquality” data set that is built into R; this data set contains data on New York’s air pollution.  The other is a simulated data set of ozone pollution in a fictitious city called “Ozonopolis”.  It is important to remember that the ozone data from New York has missing values, and this has created complications that needed to be addressed in previous posts; missing values need to be addressed for violin plots, too, and in a different way than before.

The vioplot() command in the “vioplot” package creates violin plots; the plotting options in this function are different and less versatile than other plotting functions that I have used in R.  Thus, I needed to be more creative with the plot(), title(), and axis() functions to create the plots that I want.  Read the details carefully to understand and benefit fully from the code.

Read further to learn how to create these violin plots that combine box plots with kernel density plots!  Be careful – the syntax is more complicated than usual!

## Exploratory Data Analysis: Kernel Density Estimation – Conceptual Foundations

For the sake of brevity, this post has been created from the first half of a previous long post on kernel density estimation.  This first half focuses on the conceptual foundations of kernel density estimationThe second half will focus on constructing kernel density plots and rug plots in R.

#### Introduction

Recently, I began a series on exploratory data analysis; so far, I have written about computing descriptive statistics and creating box plots in R for a univariate data set with missing values.  Today, I will continue this series by introducing the underlying concepts of kernel density estimation, a useful non-parametric technique for visualizing the underlying distribution of a continuous variable.  In the follow-up post, I will show how to construct kernel density estimates and plot them in R.  I will also introduce rug plots and show how they can complement kernel density plots.

A follow-up post will show how to construct these kernel density plots in R!

But first – read the rest of this post to learn the conceptual foundations of kernel density estimation.

## When Does the Kinetic Theory of Gases Fail? Examining its Postulates with Assistance from Simple Linear Regression in R

#### Introduction

The Ideal Gas Law, $\text{PV} = \text{nRT}$, is a very simple yet useful relationship that describes the behaviours of many gases pretty well in many situations.  It is “Ideal” because it makes some assumptions about gas particles that make the math and the physics easy to work with; in fact, the simplicity that arises from these assumptions allows the Ideal Gas Law to be easily derived from the kinetic theory of gases.  However, there are situations in which those assumptions are not valid, and, hence, the Ideal Gas Law fails.

Boyle’s law is inherently a part of the Ideal Gas Law.  It states that, at a given temperature, the pressure of an ideal gas is inversely proportional to its volume.  Equivalently, it states the product of the pressure and the volume of an ideal gas is a constant at a given temperature.

$\text{P} \propto \text{V}^{-1}$

#### An Example of The Failure of the Ideal Gas Law

This law is valid for many gases in many situations, but consider the following data on the pressure and volume of 1.000 g of oxygen at 0 degrees Celsius.  I found this data set in Chapter 5.2 of ”General Chemistry” by Darrell Ebbing and Steven Gammon.

               Pressure (atm)      Volume (L)              Pressure X Volume (atm*L)
[1,]           0.25                2.8010                  0.700250
[2,]           0.50                1.4000                  0.700000
[3,]           0.75                0.9333                  0.699975
[4,]           1.00                0.6998                  0.699800
[5,]           2.00                0.3495                  0.699000
[6,]           3.00                0.2328                  0.698400
[7,]           4.00                0.1744                  0.697600
[8,]           5.00                0.1394                  0.697000

The right-most column is the product of pressure and temperature, and it is not constant.  However, are the differences between these values significant, or could it be due to some random variation (perhaps round-off error)?

Here is the scatter plot of the pressure-volume product with respect to pressure.

These points don’t look like they are on a horizontal line!  Let’s analyze these data using normal linear least-squares regression in R.

## Webinar – Advanced Predictive Modelling for Manufacturing

The company that I work for, Predictum, is about to begin a free webinar series on statistics and analytics, and I will present the first one on Tuesday, May 14, at 2 pm EDT.  This first webinar will focus on how partial least squares regression can be used as a predictive modelling technique; the data sets are written in the context of manufacturing, but it is definitely to all industries that need techniques beyond basic statistical tools like linear regression for predictive modelling.  JMP, a software that Predictum uses extensively, will be used to illustrate how partial least squares regression can be implemented.  This presentation will not be heavy in mathematical detail, so it will be accessible to a wide audience, including statisticians, analysts, managers, and executives.

Attend my company’s free webinar to listen to me talking about advanced predictive modelling and partial least squares regression!

To register for this free webinar, visit the webinar’s registration page on Webex.

## Presentation Slides – Overcoming Multicollinearity and Overfitting with Partial Least Squares Regression in JMP and SAS

My slides on partial least squares regression at the Toronto Area SAS Society (TASS) meeting on September 14, 2012, can be found here.

#### My Presentation on Partial Least Squares Regression

My first presentation to Toronto Area SAS Society (TASS) was delivered on September 14, 2012.  I introduced a supervised learning/predictive modelling technique called partial least squares (PLS) regression; I showed how normal linear least squares regression is often problematic when used with big data because of multicollinearity and overfitting, explained how partial least squares regression overcomes these limitations, and illustrated how to implement it in SAS and JMP.  I also highlighted the variable importance for projection (VIP) score that can be used to conduct variable selection with PLS regression; in particular, I documented its effectiveness as a technique for variable selection by comparing some key journal articles on this issue in academic literature.

The green line is an overfitted classifier.  Not only does it model the underlying trend, but it also models the noise (the random variation) at the boundary.  It separates the blue and the red dots perfectly for this data set, but it will classify very poorly on a new data set from the same population.

## How to Calculate a Partial Correlation Coefficient in R: An Example with Oxidizing Ammonia to Make Nitric Acid

#### Introduction

Today, I will talk about the math behind calculating partial correlation and illustrate the computation in R with an example involving the oxidation of ammonia to make nitric acid using a built-in data set in R called stackloss.  In a separate post, I will also share an R function that I wrote to estimate partial correlation.  In a later post, I will discuss the interpretation of the partial correlation coefficient at length.

I read Pages 234-237 in Section 6.6 of “Discovering Statistics Using R” by Andy Field, Jeremy Miles, and Zoe Field to learn about partial correlation.  They used a data set called “Exam Anxiety.dat” available from their companion web site (look under “6 Correlation”) to illustrate this concept; they calculated the partial correlation coefficient between exam anxiety and revision time while controlling for exam score.  As I discuss further below, the plot between the 2 above residuals helps to illustrate the calculation of partial correlation coefficients.  This plot makes intuitive sense; if you take more time to study for an exam, you tend to have less exam anxiety, so there is a negative correlation between revision time and exam anxiety.

They used a function called pcor() in a package called “ggm”; however, I suspect that this package is no longer working properly, because it depends on a deprecated package called “RBGL” (i.e. “RBGL” is no longer available in CRAN).  See this discussion thread for further information.  Thus, I wrote my own R function to illustrate partial correlation.

Partial correlation is the correlation between 2 random variables while holding other variables constant.  To calculate the partial correlation between X and Y while holding Z constant (or controlling for the effect of Z, or averaging out Z),

## Checking the Goodness of Fit of the Poisson Distribution in R for Alpha Decay by Americium-241

#### Introduction

Today, I will discuss the alpha decay of americium-241 and use R to model the number of emissions from a real data set with the Poisson distribution.  I was especially intrigued in learning about the use of Am-241 in smoke detectors, and I will elaborate on this clever application.  I will then use the Pearson chi-squared test to check the goodness of fit of my model.  The R script for the full analysis is given at the end of the post; there is a particularly useful code for superscripting the mass number of a chemical isotope in the title of a plot.  While there are many examples of superscripts in plot titles and axes that can be found on the web, none showed how to put the superscript before a text.  I hope that this and other tricks in this script are of use to you.

Smoke Detector with Americium-241

## Presentation Slides: Machine Learning, Predictive Modelling, and Pattern Recognition in Business Analytics

I recently delivered a presentation entitled “Using Advanced Predictive Modelling and Pattern Recognition in Business Analytics” at the Statistical Society of Canada’s (SSC’s) Southern Ontario Regional Association (SORA) Business Analytics Seminar Series.  In this presentation, I

- discussed how traditional statistical techniques often fail in analyzing large data sets

- defined and described machine learning, supervised learning, unsupervised learning, and the many classes of techniques within these fields, as well as common examples in business analytics to illustrate these concepts

- introduced partial least squares regression and bootstrap forest (or random forest) as two examples of supervised learning (0r predictive modelling) techniques that can effectively overcome the common failures of traditional statistical techniques and can be easily implemented in JMP

- illustrated how partial least squares regression and bootstrap forest were successfully used to solve some major problems for 2 different clients at Predictum, where I currently work as a statistician

## How do Dew and Fog Form? Nature at Work with Temperature, Vapour Pressure, and Partial Pressure

In the early morning, especially here in Canada, I often see dew – water droplets formed by the condensation of water vapour on outside surfaces, like windows, car roofs, and leaves of trees.  I also sometimes see fog – water droplets or ice crystals that are suspended in air and often blocking visibility at great distances.  Have you ever wondered how they form?  It turns out that partial pressure, vapour pressure and temperature are the key phenomena at work.

Dew ( and Fog )

Source: Wikimedia

## Checking for Normality with Quantile Ranges and the Standard Deviation

#### Introduction

I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before.  This turns out to be a good way to check for normality in a data set.

In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), talk about the method that I learned from Trosset’s book, then build upon this method by possibly coming up with a new way to check for normality.  I have not fully established this idea, so I welcome your thoughts and ideas.

## Estimating the Decay Rate and the Half-Life of DDT in Trout – Applying Simple Linear Regression with Logarithmic Transformation

This blog post uses a function and a script written in R that were displayed in an earlier blog post.

#### Introduction

This is the second of a series of blog posts about simple linear regression; the first was written recently on some conceptual nuances and subtleties about this model.  In this blog post, I will use simple linear regression to analyze a data set with a logarithmic transformation and discuss how to make inferences on the regression coefficients and the means of the target on the original scale.  The data document the decay of dichlorodiphenyltrichloroethane (DDT) in trout in Lake Michigan; I found it on Page 49 in the book ”Elements of Environmental Chemistry” by Ronald A. Hites.  Future posts will also be written on the chemical aspects of this topic, including the environmental chemistry of DDT and exponential decay in chemistry and, in particular, radiochemistry.

Dichlorodiphenyltrichloroethane (DDT)

Source: Wikimedia Commons

A serious student of statistics or a statistician re-learning the fundamentals like myself should always try to understand the math and the statistics behind a software’s built-in function rather than treating it like a black box.  This is especially worthwhile for a basic yet powerful tool like simple linear regression.  Thus, instead of simply using the lm() function in R, I will reproduce the calculations done by lm() with my own function and script (posted earlier on my blog) to obtain inferential statistics on the regression coefficients.  However, I will not write or explain the math behind the calculations; they are shown in my own function with very self-evident variable names, in case you are interested.  The calculations are arguably the most straightforward aspects of linear regression, and you can easily find the derivations and formulas on the web, in introductory or applied statistics textbooks, and in regression textbooks.

## My Own R Function and Script for Simple Linear Regression – An Illustration with Exponential Decay of DDT in Trout

Here is the function that I wrote for doing simple linear regression, as alluded to in my blog post about simple linear regression on log-transformed data on the decay of DDT concentration in trout in Lake Michigan.  My goal was to replicate the 4 columns of the output from applying summary() to the output of lm().

To use this file and this script,

1) I saved this file as “simple linear regression.r”.

2) In the same folder, I saved a script called “DDT trout regression.r” that used this function to implement simple linear regression on the log-transformed DDT data.

3) I changed the working directory to this folder containing the function and the script.

4) I made sure “DDT trout regression.r” used the source() function to call my user-defined function for simple linear regression.

5) I ran “DDT trout regression.r”.

## Some Subtle and Nuanced Concepts about Simple Linear Regression

#### Introduction

This blog post will focus on some conceptual foundations of simple linear regression, a very common technique in statistics and a precursor for understanding multiple linear regression.  I will expose and clarify many nuances and subtleties that I did not fully absorb until my Master’s degree in statistics at the University of Toronto.

Future posts will explore the other important aspects of linear regression, including estimating its regression coefficients, statistical inference on the regression coefficients, checking the model’s assumptions, confidence bands, confidence intervals vs. prediction intervals, and analysis of variance.

## Discovering Argon with the 2-Sample t-Test

I learned about Lord Rayleigh’s discovery of argon in my 2nd-year analytical chemistry class while reading “Quantitative Chemical Analysis” by Daniel Harris.  (William Ramsay was also responsible for this discovery.)  This is one of my favourite stories in chemistry; it illustrates how diligence in measurement can lead to an elegant and surprising discovery.  I find no evidence that Rayleigh and Ramsay used statistics to confirm their findings; their paper was published 13 years before Gosset published about the t-test.  Thus, I will use a 2-sample t-test in R to confirm their result.

Photos of Lord Rayleigh

Source: Wikimedia Commons