Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.  In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

[Figure: Histogram of Ozone Pollution Data]

Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!

What is a Histogram?

A histogram is a plot of the counts or proportions of data in disjoint intervals along the support set.  Note that the actual values of the data are not used; a datum’s presence in a particular interval (or “bin”) is merely tallied, and a histogram displays vertical bars with heights representing the tally or the proportion of the sample size in each bin.
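To make the tallying idea concrete, here is a minimal sketch in R.  The data values and the bin boundaries are hypothetical, chosen only for illustration; cut() and table() are base R functions.

```r
# Toy data (hypothetical values, chosen only for illustration)
x <- c(2, 3, 7, 8, 8, 12, 14, 19)

# Four disjoint bins of equal width: (0, 5], (5, 10], (10, 15], (15, 20]
bins <- cut(x, breaks = c(0, 5, 10, 15, 20))

# Tally how many data fall into each bin; the exact values are no longer used
counts <- table(bins)
counts

# Proportion of the sample size in each bin
proportions <- counts / length(x)
proportions
```

The bar heights of a histogram of x would be either the counts or the proportions computed here.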

A histogram is useful for visualizing the distribution of the data and answering common and basic questions about a data set, such as whether the distribution is skewed, how many modes it has, and whether there are outliers.

Different histograms can be made from the same data set because of:

  1. Different numbers of intervals
  2. Different starting points (lower limit of first interval) and end points (upper limit of last interval)
  3. Different rules for assigning which data belong to which intervals – these rules are usually set in terms of the boundaries of the intervals
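As a rough illustration of the second and third reasons, the sketch below builds histograms of the “Ozone” data without plotting them and compares the counts.  The interval width of 10 ppb and the starting points 0 and -5 are my own choices for illustration.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Reason 2: same interval width (10 ppb), different starting points
h_start0 <- hist(ozone, breaks = seq(0, 170, by = 10), plot = FALSE)
h_start5 <- hist(ozone, breaks = seq(-5, 175, by = 10), plot = FALSE)
h_start0$counts
h_start5$counts

# Reason 3: same boundaries, different rule for data that fall on a boundary
h_right <- hist(ozone, breaks = seq(0, 170, by = 10), right = TRUE, plot = FALSE)   # intervals (a, b]
h_left  <- hist(ozone, breaks = seq(0, 170, by = 10), right = FALSE, plot = FALSE)  # intervals [a, b)
rbind(right_closed = h_right$counts, left_closed = h_left$counts)
```

Every version still accounts for all of the data; only the allocation of the data to the intervals changes.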

The first reason accounts for much bigger differences than the last two reasons.  Let’s go through the steps of constructing a histogram first, then discuss these differences afterward.

How to Construct a Histogram

Here is a typical way of creating a histogram; again, for reasons just stated above, this is not the only way.

1.     Find the range (i.e. the maximum minus the minimum, X_{(n)} - X_{(1)}) of the data; denote the range as R

2.     Divide the range by some arbitrary number (i.e. the number of intervals or “bins” that you want in the histogram); denote this number as B

3.     Use R and B to set the boundaries of the intervals:

  • The first interval has the minimum, X_{(1)}, as the left boundary and X_{(1)} + R/B as the right boundary.
  • The second interval has X_{(1)} + R/B as the left boundary and X_{(1)} + 2R/B as the right boundary.
  • The Jth interval has X_{(1)} + (J-1)R/B as the left boundary and X_{(1)} + JR/B as the right boundary.
  • If J = B (i.e. if the Jth interval is the last interval), then the right boundary is just the maximum, X_{(n)}.
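Steps 1 to 3 can be sketched in R as follows.  The choice B = 8 is arbitrary and only for illustration; the variable names are mine.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Step 1: the range R = X_(n) - X_(1)
x_min <- min(ozone)
x_max <- max(ozone)
R <- x_max - x_min

# Step 2: the number of intervals (arbitrary choice for this sketch)
B <- 8

# Step 3: the B + 1 boundaries X_(1) + J * R / B for J = 0, 1, ..., B
boundaries <- x_min + (0:B) * R / B
boundaries

# The interval width is R / B
R / B
```

The first boundary is the minimum and the last boundary is the maximum, as described in the bullets above.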

4.     Choose a rule on how to assign the data into the intervals such that every datum must be assigned to one and only one interval; here is a common rule:

  • For the first interval, include all points less than or equal to X_{(1)} + R/B.  This ensures that the minimum is included in the first interval.
  • For all other intervals, include all points greater than X_{(1)} + (J-1)R/B and less than or equal to X_{(1)} + JR/B.
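One way to apply this rule in R is with cut(), using right = TRUE for right-closed intervals and include.lowest = TRUE so that the minimum lands in the first interval.  This is a sketch with an arbitrary B = 8, not the only way to do it.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Boundaries from minimum to maximum with B equal-width intervals (B is arbitrary here)
B <- 8
boundaries <- seq(min(ozone), max(ozone), length.out = B + 1)

# Right-closed intervals (a, b]; include.lowest = TRUE closes the first interval on the left
interval <- cut(ozone, breaks = boundaries, right = TRUE, include.lowest = TRUE)

# Every datum is assigned to exactly one interval, so there are no NAs
anyNA(interval)

# Without include.lowest = TRUE, the minimum is left unassigned (NA)
interval_without_min <- cut(ozone, breaks = boundaries, right = TRUE)
sum(is.na(interval_without_min))
```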

5.     Count the number of data that fall into each interval.

6.     Depending on the type of vertical axis that you want, plot vertical bars representing the sample counts or sample proportions of data that fall into each interval.
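Steps 5 and 6 can be sketched as below, again with an arbitrary B = 8.  The sketch also checks the hand-made tally against hist(), and shows a detail worth knowing: with freq = FALSE, hist() plots densities (proportion divided by interval width, so that the bar areas sum to 1) rather than raw proportions.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Boundaries and assignment rule as in Steps 1 to 4 (B is arbitrary here)
B <- 8
boundaries <- seq(min(ozone), max(ozone), length.out = B + 1)
interval <- cut(ozone, breaks = boundaries, right = TRUE, include.lowest = TRUE)

# Step 5: count the data in each interval
counts <- as.vector(table(interval))
proportions <- counts / length(ozone)

# hist() uses the same rule by default (right = TRUE, include.lowest = TRUE)
h <- hist(ozone, breaks = boundaries, plot = FALSE)
all(h$counts == counts)

# With freq = FALSE, the bar heights are densities: proportion / interval width
all.equal(h$density, proportions / diff(boundaries))

# Step 6: plot the counts on the vertical axis
hist(ozone, breaks = boundaries, freq = TRUE, xlab = 'Ozone (ppb)', main = 'Histogram of Ozone Pollution Data')
```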

This method uses the minimum as the lower limit of the first interval and the maximum as the upper limit of the last interval.  However, these limits can also be widened for convenience.  For example, the histogram for the “Ozone” data set (shown later in this post) uses 0 as the lower limit of the first interval; this is a sensible choice, since ozone concentration is non-negative, and the lowest concentration in this data set is 1 ppb.  The lesson to take away is to look at your data and use your judgment.
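Here is a small sketch of widening the limits.  The lower limit of 0 comes from the discussion above; the upper limit of 170 and the interval width of 10 ppb are my own assumptions for illustration.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Observed minimum and maximum
range(ozone)

# Widened limits: start at 0 and end at a round number above the maximum
breaks_wide <- seq(0, 170, by = 10)
h_wide <- hist(ozone, breaks = breaks_wide, plot = FALSE)

# The break points must still cover all of the data
h_wide$breaks
sum(h_wide$counts) == length(ozone)
```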

For Step #3, I have seen some textbooks that set the starting point as X_{(1)} - 0.5 for data sets with integer data; this ensures that no boundary is an integer and prevents any datum from falling exactly on a boundary.
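A sketch of this half-unit offset, assuming an integer interval width (10 ppb here, my own choice) so that every boundary ends in .5:

```r
# "Ozone" data with missing values removed; the values are whole numbers
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Start half a unit below the minimum and use an integer interval width
width <- 10
start <- min(ozone) - 0.5
n_bins <- ceiling((max(ozone) + 0.5 - start) / width)
breaks_half <- start + (0:n_bins) * width
breaks_half

# No datum falls exactly on a boundary
any(ozone %in% breaks_half)

# So the rule for boundary points no longer matters
h_right <- hist(ozone, breaks = breaks_half, right = TRUE, plot = FALSE)
h_left  <- hist(ozone, breaks = breaks_half, right = FALSE, plot = FALSE)
identical(h_right$counts, h_left$counts)
```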

Choosing the Number of Intervals or the Interval Width

The above procedure is straightforward except for one aspect: how to choose the number of intervals.  (This is equivalent to choosing the interval width, since the range divided by the number of intervals equals the interval width.)

A histogram with too few intervals will hide key patterns in the distribution.

[Figure: Histogram of Ozone Pollution Data, Too Few Intervals]

A histogram with too many intervals will show too much noise about the data and obscure the underlying pattern.

[Figure: Histogram of Ozone Pollution Data, Too Many Intervals]

After trying different numbers of intervals, I produced the following histogram, which best shows the distribution without too much of the noise that gives a “choppy” appearance.

[Figure: Histogram of Ozone Pollution Data]

This histogram shows a few key attributes about the distribution of the “Ozone” data.

  • It is right-skewed with the mode at about 15 ppb and a slight rise again at about 70 ppb
  • It is unimodal
  • There are some outliers near the high end
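A few numerical summaries can complement these visual impressions.  This is only a rough sketch: comparing the mean with the median is an informal check for skewness, and the 1.5 x IQR whisker rule used by boxplot.stats() is one common convention for flagging outliers, not a definition taken from this post.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Five-number summary and mean
summary(ozone)

# In a right-skewed distribution, the mean is typically pulled above the median
mean(ozone) > median(ozone)

# Points beyond the boxplot whiskers (1.5 x IQR convention; an assumption of this sketch)
flagged <- boxplot.stats(ozone)$out
flagged
```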

There is no “best” rule for choosing the number of intervals; in my experience, it’s best to try multiple numbers of intervals and choose the one number that best shows the underlying pattern that you aim to capture.

Several rules of thumb suggest a number of intervals, and hist() in R exposes some of them through the “breaks” option: Sturges’ rule (the default), Scott’s rule and the Freedman-Diaconis rule.  I encourage you to read the “Details” section in the documentation for the hist() function in R for more information.
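For reference, here is a short sketch of how these rules can be queried.  The nclass.*() functions are in the grDevices package, which is loaded by default; each returns a suggested number of intervals, and hist() accepts the rule’s name as a character string in “breaks”.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Suggested numbers of intervals under three rules of thumb
n_sturges <- nclass.Sturges(ozone)  # ceiling(log2(n) + 1)
n_scott   <- nclass.scott(ozone)    # based on the standard deviation
n_fd      <- nclass.FD(ozone)       # based on the interquartile range
c(Sturges = n_sturges, Scott = n_scott, FD = n_fd)

# The same rules can be named directly in hist()
h_fd <- hist(ozone, breaks = "FD", plot = FALSE)
length(h_fd$counts)
```

These are starting points only; as noted above, trying several numbers of intervals is still worthwhile.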

R Code for Producing Histograms

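Here is the R code to generate the above plots.  I used the “breaks” option to suggest the number of intervals in each histogram; when “breaks” is a single number, hist() treats it only as a suggestion and chooses “pretty” break points, so the number of bars drawn may differ from the number requested.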

In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.  Stay tuned!

##### Exploratory Data Analysis of Ozone Pollution Data in New York
##### By Eric Cai - The Chemical Statistician
# clear all variables
rm(list = ls(all.names = TRUE))

# extract "Ozone" data vector (it contains NAs, which hist() drops)
ozone <- airquality$Ozone

# Notes on the hist() calls below:
# - a single number for "breaks" is only a suggestion; hist() picks "pretty" break points
# - freq = FALSE plots densities (bar areas sum to 1), so the y-axis is labelled 'Density'

# histogram with too few intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too few intervals.png')
hist(ozone, breaks = 3, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Density', main = 'Histogram of Ozone Pollution Data\nToo Few Intervals')
dev.off()

# histogram with too many intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too many intervals.png')
hist(ozone, breaks = 25, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Density', main = 'Histogram of Ozone Pollution Data\nToo Many Intervals')
dev.off()

# histogram
png('INSERT YOUR DIRECTORY PATH HERE/histogram.png')
hist(ozone, breaks = 15, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Density', main = 'Histogram of Ozone Pollution Data')
dev.off()
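
Because a single number passed to “breaks” is only a suggestion, it can be useful to check how many intervals hist() actually used.  The sketch below does that for the three values used above, and then shows one way to force an exact number of intervals by supplying the break points explicitly; the limits 0 and 170 are my own choice for illustration.

```r
# "Ozone" data with missing values removed
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Requested versus actual numbers of intervals
requested <- c(3, 15, 25)
actual <- sapply(requested, function(b) {
  length(hist(ozone, breaks = b, plot = FALSE)$counts)
})
data.frame(requested, actual)

# Forcing exactly 15 intervals by giving 16 explicit break points
exact_breaks <- seq(0, 170, length.out = 16)
h_exact <- hist(ozone, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)
```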

2 Responses to Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

  1. Fernando says:

    After examining your histogram, I thought it might have two modes, not one. Could the second, higher-valued mode be due to a different set of conditions that affect the area? An example of different conditions could be two source regions from which the wind ushers clean or polluted air into New York, such as clean air from the Atlantic vs. polluted air from the mainland.

    • Good observation, Fernando. I define the mode as the global maximizer of the relative frequency function; thus, according to my definition, the local maximizer at 75 ppb is not a mode. However, some people define local maximizers as modes.

      If a local maximum is just slightly less than the global maximum, then it would be sensible to consider that respective local maximizer as a mode, too. However, it is pretty clear that the mode at 25 ppb is more than twice as large as the local maximum at 75 ppb.

      As for the cause of this local maximum, it’s best not to make any causal inferences just from a histogram or any exploratory data analysis. It is, however, good to consider those possible causes as predictors in a regression model in further analyses.
