# Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

#### Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

Previous posts in this series on EDA include

Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!

#### What is a Quantile-Quantile Plot?

A quantile-quantile plot, or Q-Q plot, is a plot of the sorted quantiles of one data set against the sorted quantiles of another data set.  It is used to visually inspect the similarity between the underlying distributions of 2 data sets.  Each point (x, y) is a plot of a quantile of one distribution along the vertical axis (y-axis) against the corresponding quantile of the other distribution along the horizontal axis (x-axis).  If the 2 distributions are similar, then the points would lie close to the identity line, y = x.

The sample sizes of the 2 data sets do not have to be equal.

• If the 2 sample sizes are equal, then the Q-Q plot simply plots the sorted data of one data set against the sorted data of the other data set.
• If the 2 sample sizes are different, then the quantiles are usually selected to correspond to the sorted values from the smaller data set.  The quantiles for the larger data set are then interpolated.  Thus, it is possible that none of the data from the larger data set are used in the Q-Q plot.
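The unequal-sample-size case can be sketched in a few lines of R. This is only an illustration under stated assumptions: the two samples (`small` and `large`) are simulated rather than taken from any real data set, and the probabilities $k/(n+1)$ are one common convention among several.

```r
set.seed(1)                      # make the simulated data reproducible
small = rnorm(20)                # hypothetical smaller sample
large = rnorm(200)               # hypothetical larger sample

n.small = length(small)

# cumulative probabilities assigned to the sorted values of the smaller sample
p = (1:n.small) / (n.small + 1)

# the quantiles of the smaller sample are simply its sorted values
q.small = sort(small)

# the quantiles of the larger sample are interpolated at the same probabilities
q.large = quantile(large, probs = p, names = FALSE)

plot(q.small, q.large,
     xlab = 'Quantiles of Smaller Sample',
     ylab = 'Interpolated Quantiles of Larger Sample',
     main = 'Q-Q Plot for 2 Samples of Unequal Size')
abline(0, 1)                     # identity line, y = x
```

Because `quantile()` interpolates between neighbouring order statistics by default, the plotted values for the larger sample need not coincide with any of its observed values.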

The quantiles of the 2 data sets can be observed or theoretical.  When the quantiles from a real data set are plotted against the corresponding quantiles from a theoretical distribution, the resulting Q-Q plot serves as a visual tool for inspecting how closely the data set fits the chosen theoretical distribution.  (In an earlier post on checking for normality in a data set, I mentioned this as a useful visual technique to complement analytical techniques.  Q-Q plots can, of course, be used for any theoretical distribution, not just the normal distribution.)  Later in this post, I will illustrate this commonly used tool with the “Ozone” data from the built-in “airquality” data set in R.

#### Constructing Quantile-Quantile Plots to Check Goodness of Fit

The following steps will build a Q-Q plot to check how well a data set fits a particular theoretical distribution.  This is the most common purpose for Q-Q plots.

1. Sort the data from smallest to largest.  These are the sample quantiles.
2. Compute $n$ evenly spaced points on the interval $(0, 1)$, where $n$ is the sample size.  These points are the cumulative probabilities for the quantiles.  They will be used to calculate the quantiles for the theoretical distribution.  Notice that I have purposefully used an open interval to exclude 0 and 1 as possible points; some distributions have negative infinity, positive infinity, or both as the bound(s), so assigning a finite quantile as the 0%-quantile or 100%-quantile for such distributions would be invalid.  The most commonly used method of choosing these $n$ evenly spaced points is $\frac{k}{n+1}, k = 1, \ldots, n$.
3. Using the probabilities chosen in Step 2, compute the corresponding theoretical quantiles for the chosen theoretical distribution.
4. Plot the sorted sample quantiles against the sorted theoretical quantiles.
5. If the plotted points fall close to the identity line ($y = x$), then it is evidence to suggest that the sample data fit the chosen theoretical distribution well.
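The 5 steps above can be wrapped into a small helper function. This is a minimal sketch, not a standard R function: the name `qq.points`, its arguments, and the simulated exponential data are all made up for illustration.

```r
# hypothetical helper: returns the points of a Q-Q plot for the data x
# against a theoretical distribution with quantile function qfun
qq.points = function(x, qfun, ...) {
  x = x[!is.na(x)]                 # drop missing values
  n = length(x)
  sample.q = sort(x)               # Step 1: sorted data are the sample quantiles
  p = (1:n) / (n + 1)              # Step 2: n probabilities in (0, 1)
  theoretical.q = qfun(p, ...)     # Step 3: theoretical quantiles
  data.frame(theoretical = theoretical.q, sample = sample.q)
}

set.seed(42)
x = rexp(100, rate = 2)            # simulated data from a known distribution
pts = qq.points(x, qexp, rate = 2)

# Step 4: plot the sample quantiles against the theoretical quantiles
plot(pts$theoretical, pts$sample,
     xlab = 'Theoretical Quantiles from Exponential Distribution',
     ylab = 'Sample Quantiles',
     main = 'Exponential Quantile-Quantile Plot of Simulated Data')

# Step 5: add the identity line for visual comparison
abline(0, 1)
```

Since the simulated data really do come from the hypothesized distribution here, the points should fall near the identity line, with the largest deviations expected in the sparse right tail.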

There has been vigorous debate in the statistical community about how the probabilities in Step 2 should be chosen; read Rick Wicklin’s blog post and Wikipedia’s entry on Q-Q plots for more details.  However, as the sample size increases, the differences between these methods decrease, and the resulting Q-Q plots are indistinguishable at large sample sizes.
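To get a feel for how little the choice matters at large sample sizes, here is a rough comparison of 2 conventions, $\frac{k}{n+1}$ and $\frac{k - 0.5}{n}$, alongside R's built-in `ppoints()`.  The helper name `max.gap` is made up for this sketch, and these are only a few of the conventions in use.

```r
# hypothetical helper: largest difference between the probabilities
# k/(n+1) and (k - 0.5)/n for a sample of size n
max.gap = function(n) {
  k = 1:n
  max(abs(k / (n + 1) - (k - 0.5) / n))
}

gap.small = max.gap(10)      # small sample
gap.large = max.gap(1000)    # large sample
print(c(n10 = gap.small, n1000 = gap.large))

# ppoints() is R's built-in generator of such probabilities;
# for n <= 10, it uses (k - 3/8)/(n + 1/4), and (k - 1/2)/n otherwise
n = 10
p.builtin = ppoints(n)
p.by.hand = ((1:n) - 3/8) / (n + 1/4)
print(all.equal(p.builtin, p.by.hand))
```

The gap between the two sets of probabilities shrinks roughly in proportion to $1/n$, which is why the resulting plots look alike for large samples.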

#### Quantile-Quantile Plots in Action: Checking the Distribution of New York’s Ozone Data

My recent series on exploratory data analysis makes extensive use of the “Ozone” data from R’s built-in data set “airquality”, which contains air pollution data for New York.  I will now use Q-Q plots to assess the distribution of the “Ozone” data.

First, let’s extract the data and calculate some basic summary statistics.

```r
##### Quantile-Quantile Plots of Ozone Pollution Data
##### By Eric Cai - The Chemical Statistician

# clear all variables
rm(list = ls(all.names = TRUE))

# view first 6 entries of the "airquality" data frame
head(airquality)

# extract "Ozone" data vector
ozone = airquality$Ozone

# sample size of "ozone"
length(ozone)

# summary of "ozone"
summary(ozone)

# remove missing values from "ozone"
ozone = ozone[!is.na(ozone)]

# having removed missing values, find the number of non-missing values in "ozone"
n = length(ozone)

# calculate mean, variance and standard deviation of "ozone"
mean.ozone = mean(ozone)
var.ozone = var(ozone)
sd.ozone = sd(ozone)
```

Now, let’s set the $n$ points in the interval $(0, 1)$ for the $n$ equi-probable point-wise probabilities, each of which is assigned to the correspondingly ranked quantile.  (The smallest probability is assigned to the smallest quantile, and the largest probability is assigned to the largest quantile.)  These probabilities will be used to calculate the quantiles for each hypothesized theoretical distribution.

```r
# set n points in the interval (0,1)
# use the formula k/(n+1), for k = 1,..,n
# this is a vector of the n probabilities
probabilities = (1:n)/(n+1)
```

Since “ozone” is a continuous variable, let’s try fitting it to the normal distribution, the most commonly used distribution for continuous variables.  Notice my use of the qnorm() function to calculate the quantiles from the normal distribution.  Specifically, I used the sample mean and the sample standard deviation of “ozone” to specify the parameters of the normal distribution.

```r
# calculate normal quantiles using mean and standard deviation from "ozone"
# (missing values were already removed above)
normal.quantiles = qnorm(probabilities, mean = mean.ozone, sd = sd.ozone)
```

Finally, let’s plot the theoretical quantiles on the horizontal axis and the sample quantiles on the vertical axis.  Notice my use of the abline() function to add the identity line.  The two arguments call for a line with an intercept of 0 and a slope of 1.
```r
# normal quantile-quantile plot for "ozone"
png('INSERT YOUR DIRECTORY PATH HERE')
plot(sort(normal.quantiles), sort(ozone),
     xlab = 'Theoretical Quantiles from Normal Distribution',
     ylab = 'Sample Quantiles of Ozone',
     main = 'Normal Quantile-Quantile Plot of Ozone')
abline(0, 1)
dev.off()
```

Here is the resulting plot:

The plotted points do not fall closely onto the identity line, so the data do not seem to come from the normal distribution.

Is there another distribution that would work better?  Recall from my earlier post on kernel density estimation that the “ozone” data are right-skewed.  A gamma distribution with a small shape parameter tends to be right-skewed, so let’s try this instead.  Notice my use of the sample mean and sample variance to estimate the shape and the scale parameters.  A gamma distribution has mean $\text{shape} \times \text{scale}$ and variance $\text{shape} \times \text{scale}^2$, so matching these to the sample mean and sample variance gives $\text{shape} = \text{mean}^2 / \text{variance}$ and $\text{scale} = \text{variance} / \text{mean}$.

```r
# calculate gamma quantiles using mean and variance from "ozone" to calculate shape and scale parameters
gamma.quantiles = qgamma(probabilities, shape = mean.ozone^2/var.ozone, scale = var.ozone/mean.ozone)

# gamma quantile-quantile plot for "ozone"
png('INSERT YOUR DIRECTORY PATH HERE')
plot(sort(gamma.quantiles), sort(ozone),
     xlab = 'Theoretical Quantiles from Gamma Distribution',
     ylab = 'Sample Quantiles of Ozone',
     main = 'Gamma Quantile-Quantile Plot of Ozone')
abline(0, 1)
dev.off()
```

Here is the resulting plot.

Clearly, the data fit the gamma distribution better!  Of course, this is no guarantee that this particular gamma distribution is the best distribution for this data set.  It does not even guarantee that the gamma distribution is the best family of distributions for this data set.  Nonetheless, it is a useful tool to visualize the goodness-of-fit of a data set to a distribution.

R has functions for quickly producing Q-Q plots; they are qqnorm(), qqline(), and qqplot().  These are good functions to use.
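For reference, here is a minimal sketch of how those built-in functions might be applied to the same data.  The variable names `shape.hat`, `scale.hat` and `gamma.q` are made up for this example, and ppoints() is used for the probabilities, so the result may differ slightly from a plot built with $k/(n+1)$.

```r
# extract "Ozone" and remove missing values
ozone = airquality$Ozone
ozone = ozone[!is.na(ozone)]
n = length(ozone)

# normal Q-Q plot: sample quantiles against STANDARD normal quantiles
qqnorm(ozone, main = 'Normal Quantile-Quantile Plot of Ozone')

# reference line through the first and third quartiles (not the identity line)
qqline(ozone)

# gamma Q-Q plot: qqplot() accepts any 2 vectors,
# so supply the theoretical gamma quantiles directly
shape.hat = mean(ozone)^2 / var(ozone)
scale.hat = var(ozone) / mean(ozone)
gamma.q = qgamma(ppoints(n), shape = shape.hat, scale = scale.hat)

qqplot(gamma.q, ozone,
       xlab = 'Theoretical Quantiles from Gamma Distribution',
       ylab = 'Sample Quantiles of Ozone',
       main = 'Gamma Quantile-Quantile Plot of Ozone')
abline(0, 1)
```

Note that qqnorm() puts standard normal quantiles on the horizontal axis, so the 2 axes are on different scales and the points are judged against the line from qqline() rather than against $y = x$.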
I built the above Q-Q plots using more rudimentary functions because

• I wanted to use R’s rudimentary functions to illustrate the 5 steps of creating a Q-Q plot.
• My method ensures that the sample quantiles and the theoretical quantiles are on the same scale.

#### References

http://onlinestatbook.com Project Leader: David M. Lane, Rice University.

### 3 Responses to Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

1. anspiess says:

Using the ‘fitDistr’ function of the ‘propagate’ package gives me the following order of best fitted distributions to ‘ozone’:

```r
library(propagate)
res <- fitDistr(ozone)
res$aic
```

```
              Distribution       AIC
16              Johnson SU -838.1165
 4              Log-normal -834.4381
11 Generalized Trapezoidal -828.3748
12                   Gamma -812.3913
19                 4P Beta -803.8396
18              3P Weibull -802.6010
 8              Triangular -801.4067
14                 Laplace -799.1498
 3      Generalized normal -797.8691
15                  Gumbel -797.0433
13                  Cauchy -796.9815
 5       Scaled/shifted t- -795.3024
 6                Logistic -783.7429
 1                  Normal -775.1892
 2           Skewed-normal -773.1892
 9             Trapezoidal -760.0612
 7                 Uniform -720.5523
10 Curvilinear Trapezoidal -718.7252
20                 Arcsine -694.8609
21               von Mises -578.6548
17              Johnson SB -566.5763
```

with Johnson SU and log-normal in first and second place. Haven't checked the Q-Q plots for those, though…

Cheers,
Andrej

• Thanks for taking the initiative to conduct this analysis and share the results with us, Andrej! I appreciate you mentioning this function in the past, and it’s nice to see it in action!

I like to use multiple tools to explore a data set, and it’s best to combine the wisdom of multiple tools to gain that exploratory inference. Thanks for demonstrating another tool that I will add to my bag of tricks!