Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data
July 9, 2013
Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set. I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R. In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.
Previous posts in this series on EDA include
- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R
Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!
What is a Histogram?
A histogram is a plot of the counts or proportions of data in disjoint intervals along the support set. Note that the actual values of the data are not used; a datum’s presence in a particular interval (or “bin”) is merely tallied, and a histogram displays vertical bars with heights representing the tally or the proportion of the sample size in each bin.
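To make the tallying idea concrete, here is a minimal sketch in R. The data values and the bin edges are made up for illustration only; `cut()` assigns each datum to a bin and `table()` does the tallying.

```r
# Toy data (hypothetical values, chosen only for illustration)
x <- c(2, 3, 7, 8, 8, 12, 14, 19)

# Three disjoint bins: [0, 5], (5, 10], (10, 20]
edges <- c(0, 5, 10, 20)
bins <- cut(x, breaks = edges, include.lowest = TRUE)

# Tally the data in each bin, then convert the tallies to proportions
counts <- table(bins)
proportions <- counts / length(x)

print(counts)
print(proportions)
```

The heights of the bars in a histogram would be either `counts` or `proportions`, depending on the vertical axis chosen.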
A histogram is useful for visualizing the distribution of the data and answering the following common and basic questions about a data set.
- Is the distribution symmetric, left-skewed or right-skewed?
- How many modes does the distribution have?
- Are there any outliers?
Different histograms can be made from the same data set because of:
- Different numbers of intervals
- Different starting points (lower limit of first interval) and end points (upper limit of last interval)
- Different rules for assigning which data belong to which intervals – these rules are usually set in terms of the boundaries of the intervals
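The third reason can be seen with a small sketch. The toy data below are hypothetical and deliberately include values that sit exactly on a boundary; the `right` argument of hist() switches between right-closed and left-closed intervals.

```r
# Toy data (hypothetical); 5 and 10 fall exactly on bin boundaries
x <- c(1, 5, 5, 10, 12)
edges <- c(0, 5, 10, 15)

# Right-closed intervals: [0, 5], (5, 10], (10, 15]
right_closed <- hist(x, breaks = edges, right = TRUE, plot = FALSE)$counts

# Left-closed intervals: [0, 5), [5, 10), [10, 15]
left_closed <- hist(x, breaks = edges, right = FALSE, plot = FALSE)$counts

print(right_closed)
print(left_closed)
```

The same data and the same boundaries give different tallies, purely because of the rule for points on a boundary.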
The first reason accounts for much bigger differences than the last 2 reasons. Before discussing these differences, let's go through the steps of constructing a histogram.
How to Construct a Histogram
Here is a typical way of creating a histogram; again, for reasons just stated above, this is not the only way.
1. Find the range (i.e. the maximum minus the minimum, $x_{\max} - x_{\min}$) of the data; denote the range as $R$.
2. Divide the range by some arbitrary number (i.e. the number of intervals or “bins” that you want in the histogram); denote this number as $B$.
3. Use $R$ and $B$ to set the boundaries of the intervals:
- The first interval has the minimum, $x_{\min}$, as the left boundary and $x_{\min} + R/B$ as the right boundary.
- The second interval has $x_{\min} + R/B$ as the left boundary and $x_{\min} + 2R/B$ as the right boundary.
- The $J$th interval has $x_{\min} + (J-1)R/B$ as the left boundary and $x_{\min} + JR/B$ as the right boundary.
- If $J = B$ (i.e. if the $J$th interval is the last interval), then the right boundary is just the maximum, $x_{\max}$.
4. Choose a rule on how to assign the data into the intervals such that every datum must be assigned to one and only one interval; here is a common rule:
- For the first interval, include all points less than or equal to $x_{\min} + R/B$. This ensures that the minimum is included in the first interval.
- For all other intervals, include all points greater than $x_{\min} + (J-1)R/B$ and less than or equal to $x_{\min} + JR/B$.
5. Count the number of data that fall into each interval.
6. Depending on the type of vertical axis that you want, plot vertical bars representing the sample counts or sample proportions of data that fall into each interval.
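The steps above can be sketched by hand in R. This is only an illustration under some assumptions: the variable names (`R` for the range, `B` for the number of intervals, `x_min`, `x_max`) are mine, and the choice of 8 intervals is arbitrary. The sketch covers Steps 1 to 5 and computes the proportions needed for Step 6.

```r
# Use the non-missing "Ozone" values from the built-in airquality data
x <- airquality$Ozone
x <- x[!is.na(x)]

# Step 1: range of the data
x_min <- min(x)
x_max <- max(x)
R <- x_max - x_min

# Step 2: an arbitrary number of intervals (8 is just an example)
B <- 8

# Step 3: interval boundaries x_min, x_min + R/B, ..., x_max
edges <- x_min + (0:B) * R / B
edges[B + 1] <- x_max  # guard against floating-point drift in the last edge

# Step 4: first interval is closed on both sides, the rest are (left, right]
bins <- cut(x, breaks = edges, include.lowest = TRUE, right = TRUE)

# Step 5: count the data in each interval
counts <- as.vector(table(bins))

# Heights for Step 6 if proportions are wanted instead of counts
proportions <- counts / length(x)

print(data.frame(left = edges[-(B + 1)], right = edges[-1], counts, proportions))
```

Calling `barplot(proportions, space = 0)` would then draw the bars; `hist(x, breaks = edges)` should give the same tallies, since its default rule also uses right-closed intervals and includes the minimum in the first one.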
This method uses the minimum as the lower limit of the first interval and the maximum as the upper limit of the last interval. However, these limits can also be widened for convenience. For example, the histogram for the “Ozone” data set (shown later in this post) uses 0 as the lower limit of the first interval; this is a sensible choice, since ozone concentration is non-negative, and the lowest concentration in this data set is 1 ppb. The lesson to take away is to look at your data and use your judgment.
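As a small sketch of widening the limits, the boundaries can be passed to hist() as an explicit vector. The interval width of 10 ppb and the upper limit of 170 ppb are my own choices for illustration, not necessarily the ones behind the plot in this post.

```r
ozone <- airquality$Ozone

# Start at 0 rather than at the minimum (1 ppb) and end just past the maximum
edges <- seq(0, 170, by = 10)
h <- hist(ozone, breaks = edges, plot = FALSE)

print(range(ozone, na.rm = TRUE))  # smallest and largest observed values
print(h$breaks)                    # boundaries actually used
print(h$counts)                    # tally in each interval
```

Because the boundaries are given explicitly, hist() uses them as they are; missing values are dropped before tallying.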
For Step #3, I have seen some textbooks that set the starting point at the minimum minus 0.5 for data sets with integer data; when the interval width is also an integer, this ensures that no boundary is an integer and prevents any datum from falling exactly on a boundary.
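Here is a minimal sketch of that half-unit offset, using hypothetical integer data and an integer interval width of 2 (both are assumptions for illustration).

```r
# Hypothetical integer data
x <- c(1, 2, 2, 3, 5, 5, 6, 8)
width <- 2

# Start half a unit below the minimum, so the edges are 0.5, 2.5, 4.5, ...
n_bins <- ceiling((max(x) - min(x) + 1) / width)
edges <- seq(from = min(x) - 0.5, by = width, length.out = n_bins + 1)

# No datum sits on a boundary, so the boundary rule no longer matters
on_boundary <- any(x %in% edges)
counts_right <- hist(x, breaks = edges, right = TRUE, plot = FALSE)$counts
counts_left <- hist(x, breaks = edges, right = FALSE, plot = FALSE)$counts

print(edges)
print(on_boundary)
print(counts_right)
print(counts_left)
```

Both boundary rules give identical tallies here, which is the point of the offset.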
Choosing the Number of Intervals or the Interval Width
The above procedure is straightforward except for one aspect: how to choose the number of intervals. (This is equivalent to choosing the interval width, since the range divided by the number of intervals equals the interval width.)
A histogram with too few intervals will hide key patterns in the distribution.
A histogram with too many intervals will show too much noise about the data and obscure the underlying pattern.
After trying different numbers of intervals, I produced the following histogram, which best shows the distribution without too much of the noise that gives a “choppy” appearance.
This histogram shows a few key attributes about the distribution of the “Ozone” data.
- It is right-skewed with the mode at about 15 ppb and a slight rise again at about 70 ppb
- It is unimodal
- There are some outliers near the high end
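These visual impressions can be cross-checked roughly with a few numbers. This is only a sketch: a mean above the median is a hint of right skew rather than proof, and the outlier rule used here is the box plot convention from `boxplot.stats()`, not something read off the histogram.

```r
# Non-missing "Ozone" values
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]

# A positive difference between mean and median hints at right skew
skew_hint <- mean(ozone) - median(ozone)

# Points beyond the box plot whiskers (1.5 times the spread of the hinges)
outliers <- boxplot.stats(ozone)$out

print(c(mean = mean(ozone), median = median(ozone)))
print(skew_hint)
print(outliers)
```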
There is no “best” rule for choosing the number of intervals; in my experience, it’s best to try multiple numbers of intervals and choose the one number that best shows the underlying pattern that you aim to capture.
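There are some guidelines that suggest a number of intervals, and the “breaks” option of hist() accepts the name of such a rule (Sturges' rule is the default; Scott's rule and the Freedman-Diaconis rule are also available); I encourage you to read the “Details” section in the documentation for the hist() function in R for more information.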
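As a quick sketch, the numbers of intervals suggested by these rules can be computed directly with the `nclass.*` helper functions that ship with R. The results are starting points to try, not answers.

```r
# Non-missing "Ozone" values
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]

# Number of intervals suggested by each rule
suggested <- c(
  Sturges = nclass.Sturges(ozone),
  Scott = nclass.scott(ozone),
  FD = nclass.FD(ozone)
)
print(suggested)

# A rule can also be passed to hist() by name
h_fd <- hist(ozone, breaks = "FD", plot = FALSE)
print(length(h_fd$counts))
```

The number of bars hist() ends up with can differ from the rule's suggestion, because hist() rounds the boundaries to convenient values.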
R Code for Producing Histograms
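Here is the R code to generate the above plots. I used the “breaks” option to request a number of intervals in each histogram. Note that hist() treats a single number as a suggestion only and rounds the boundaries to “pretty” values, so the number of bars drawn can differ from the number requested.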
```r
##### Exploratory Data Analysis of Ozone Pollution Data in New York
##### By Eric Cai - The Chemical Statistician

# clear all variables
rm(list = ls(all.names = TRUE))

# extract "Ozone" data vector (missing values are dropped by hist())
ozone <- airquality$Ozone

# Note: freq = FALSE draws densities (bar areas sum to 1),
# i.e. the proportion in each interval divided by the interval width

# histogram with too few intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too few intervals.png')
hist(ozone, breaks = 3, freq = FALSE, xlab = 'Ozone (ppb)',
     ylim = c(0, 0.025), ylab = 'Relative Frequency',
     main = 'Histogram of Ozone Pollution Data\nToo Few Intervals')
dev.off()

# histogram with too many intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too many intervals.png')
hist(ozone, breaks = 25, freq = FALSE, xlab = 'Ozone (ppb)',
     ylim = c(0, 0.025), ylab = 'Relative Frequency',
     main = 'Histogram of Ozone Pollution Data\nToo Many Intervals')
dev.off()

# histogram
png('INSERT YOUR DIRECTORY PATH HERE/histogram.png')
hist(ozone, breaks = 15, freq = FALSE, xlab = 'Ozone (ppb)',
     ylim = c(0, 0.025), ylab = 'Relative Frequency',
     main = 'Histogram of Ozone Pollution Data')
dev.off()
```
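One detail worth checking when using the “breaks” option: a single number is only a suggestion to hist(), which rounds the boundaries to convenient values. The sketch below compares the boundaries hist() picks for `breaks = 3` with an explicit vector of boundaries that forces exactly 3 equal-width intervals; the explicit boundaries are my own construction for illustration.

```r
ozone <- airquality$Ozone

# Single number: hist() picks "pretty" boundaries near the requested count
suggested <- hist(ozone, breaks = 3, plot = FALSE)
print(suggested$breaks)
print(length(suggested$counts))

# Explicit boundaries: exactly 3 equal-width intervals from min to max
lo <- min(ozone, na.rm = TRUE)
hi <- max(ozone, na.rm = TRUE)
exact <- hist(ozone, breaks = seq(lo, hi, length.out = 4), plot = FALSE)
print(exact$breaks)
print(length(exact$counts))
```

If the exact number of intervals matters, passing the boundaries as a vector is the more predictable route.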