# Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

July 9, 2013

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set. I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R. In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R

**Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!**

#### What is a Histogram?

A histogram is a plot of the **counts** or **proportions** of data in disjoint intervals along the support set. **Note that the actual values of the data are not used**; a datum’s presence in a particular interval (or “bin”) is merely tallied, and a histogram displays vertical bars with heights representing the tally or the proportion of the sample size in each bin.

A histogram is useful for visualizing the distribution of the data and answering the following common and basic questions about a data set.

- Is the distribution symmetric, left-skewed or right-skewed?
- How many modes does the distribution have?
- Are there any outliers?

Different histograms can be made from the same data set because of:

- Different numbers of intervals
- Different starting points (lower limit of first interval) and end points (upper limit of last interval)
- Different rules for assigning which data belong to which intervals – these rules are usually set in terms of the boundaries of the intervals

The first reason accounts for much bigger differences than the last two. However, let’s walk through the steps of constructing a histogram first, then discuss these differences afterward.
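To make the last two reasons concrete, here is a minimal sketch that tallies the “Ozone” data with the same interval width but different boundary rules and starting points. The breakpoints (0 to 180 in steps of 20, and the same grid shifted by 10) are arbitrary choices made for this illustration; `right` and `plot` are arguments of `hist()`.

```r
# Drop missing values; the "Ozone" column contains NAs
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]

# Same breakpoints, two different boundary rules
breaks <- seq(0, 180, by = 20)
right_closed <- hist(ozone, breaks = breaks, right = TRUE, plot = FALSE)   # intervals (a, b]
left_closed  <- hist(ozone, breaks = breaks, right = FALSE, plot = FALSE)  # intervals [a, b)

# Same rule and width, but a shifted starting point
shifted <- hist(ozone, breaks = seq(-10, 170, by = 20), plot = FALSE)

# One row of counts per variant; each row sums to the sample size
rbind(right_closed = right_closed$counts,
      left_closed  = left_closed$counts,
      shifted      = shifted$counts)
```

Any datum that falls exactly on a boundary moves to a neighbouring interval when the rule changes, so the first two rows can differ only in those cases.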

#### How to Construct a Histogram

Here is a typical way of creating a histogram; again, for reasons just stated above, this is not the only way.

1. Find the range (i.e. the maximum minus the minimum, $x_{\max} - x_{\min}$) of the data; denote the range as $R$.

2. Divide the range by some arbitrary number (i.e. the number of intervals or “bins” that you want in the histogram); denote this number as $B$.

3. Use $R$ and $B$ to set the boundaries of the intervals:

- The first interval has the minimum, $x_{\min}$, as the left boundary and $x_{\min} + R/B$ as the right boundary.
- The second interval has $x_{\min} + R/B$ as the left boundary and $x_{\min} + 2R/B$ as the right boundary.
- The J*th* interval has $x_{\min} + (J-1)R/B$ as the left boundary and $x_{\min} + JR/B$ as the right boundary.
- If J = B (i.e. if the J*th* interval is the last interval), then the right boundary is just the maximum, $x_{\max}$.

4. Choose a rule on how to assign the data into the intervals such that every datum must be assigned to one and only one interval; here is a common rule:

- For the first interval, include all points less than or equal to $x_{\min} + R/B$. This ensures that the minimum is included in the first interval.
- For all other intervals (the J*th* interval with J > 1), include all points greater than $x_{\min} + (J-1)R/B$ and less than or equal to $x_{\min} + JR/B$.

5. Count the number of data that fall into each interval.

6. Depending on the type of vertical axis that you want, plot vertical bars representing the sample counts or sample proportions of data that fall into each interval.

This method uses the minimum as the lower limit of the first interval and the maximum as the upper limit of the last interval. However, these limits can also be widened for convenience. For example, the histogram for the “Ozone” data set (shown later in this post) uses 0 as the lower limit of the first interval; this is a sensible choice, since ozone concentration is non-negative, and the lowest concentration in this data set is 1 ppb. The lesson to take away is to look at your data and use your judgment.

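For Step #3, I have seen some textbooks that set the starting point at a non-integer value slightly below the minimum (for example, the minimum minus 0.5) for data sets with integer data; with an integer interval width, this ensures that no boundary is an integer and prevents any datum from falling exactly on a boundary.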
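The six steps above can be sketched by hand in R. This is only an illustration under stated assumptions: the variable names (`R` for the range, `B` for the number of intervals, `boundaries`, `bins`) and the choice of `B = 8` are mine, and `cut()` with `right = TRUE` and `include.lowest = TRUE` stands in for the assignment rule in Step #4.

```r
ozone <- airquality$Ozone
x <- ozone[!is.na(ozone)]          # missing values cannot be assigned to an interval

B <- 8                             # number of intervals (an arbitrary choice)
x_min <- min(x)
x_max <- max(x)
R <- x_max - x_min                 # Step 1: range
width <- R / B                     # Step 2: interval width

# Step 3: boundaries x_min, x_min + R/B, ..., x_max
boundaries <- x_min + (0:B) * width
boundaries[B + 1] <- x_max         # guard against floating-point drift in the last boundary

# Steps 4 and 5: intervals of the form (a, b], with the minimum kept in the first one
bins <- cut(x, breaks = boundaries, right = TRUE, include.lowest = TRUE)
counts <- as.vector(table(bins))
proportions <- counts / length(x)

# Step 6: bars of counts (pass `proportions` instead for sample proportions)
barplot(counts, names.arg = levels(bins), las = 2, space = 0,
        ylab = 'Count', main = 'Manually Constructed Histogram')
```

Passing the same `boundaries` to `hist(x, breaks = boundaries)` should give the same tallies, since its default settings use the same closed-on-the-right rule.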

#### Choosing the Number of Intervals or the Interval Width

The above procedure is straightforward except for one aspect: how to choose the number of intervals. (This is equivalent to choosing the interval width, since the range divided by the number of intervals equals the interval width.)

A histogram with too few intervals will hide key patterns in the distribution.

A histogram with too many intervals will show too much noise about the data and obscure the underlying pattern.

After trying different numbers of intervals, I produced the following histogram, which best shows the distribution without too much of the noise that gives a “choppy” appearance.

This histogram shows a few key attributes about the distribution of the “Ozone” data.

- It is right-skewed with the mode at about 15 ppb and a slight rise again at about 70 ppb
- It is unimodal
- There are some outliers near the high end
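Readings like these can be cross-checked numerically from the object that `hist()` returns. The sketch below is one way to do so, assuming the same `breaks = 15` setting as the plot; the helper names `modal_mid` and `local_max` are mine.

```r
ozone <- airquality$Ozone
h <- hist(ozone, breaks = 15, plot = FALSE)

# Midpoint of the tallest bar, i.e. the global maximizer of the binned density
modal_mid <- h$mids[which.max(h$density)]

# Interior bars that are taller than both neighbours (candidate local maxima)
d <- h$density
n <- length(d)
local_max <- which(d[2:(n - 1)] > d[1:(n - 2)] & d[2:(n - 1)] > d[3:n]) + 1

modal_mid            # location of the mode, to the resolution of the bins
h$mids[local_max]    # locations of any interior local maxima
```

The answers are only as precise as the interval width, so they will shift if the number of intervals changes.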

**There is no “best” rule for choosing the number of intervals; in my experience, it’s best to try multiple numbers of intervals and choose the one number that best shows the underlying pattern that you aim to capture.**

**There are some guidelines that suggest an optimal number of intervals in the “breaks” option**; I encourage you to read the “Details” section in the documentation for the hist() function in R for more information.
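As a brief illustration of those guidelines, the three rules named in the `hist()` documentation are also available as functions in the grDevices package. This sketch just prints what each one suggests for the “Ozone” data; treat the results as starting points, not as the “right” answer.

```r
ozone <- airquality$Ozone
x <- ozone[!is.na(ozone)]   # the nclass.* functions expect data without NAs

# Suggested numbers of intervals from three rules
suggested <- c(Sturges = nclass.Sturges(x),
               Scott   = nclass.scott(x),
               FD      = nclass.FD(x))
suggested

# The same rules can be requested by name through the "breaks" option
h_fd <- hist(x, breaks = 'FD', plot = FALSE)
length(h_fd$counts)
```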

#### R Code for Producing Histograms

Here is the R code to generate the above plots. I used the “breaks” option to set the number of intervals in each histogram.

**In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots. Stay tuned!**

```r
##### Exploratory Data Analysis of Ozone Pollution Data in New York
##### By Eric Cai - The Chemical Statistician

# clear all variables
rm(list = ls(all.names = TRUE))

# extract "Ozone" data vector
ozone = airquality$Ozone

# histogram with too few intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too few intervals.png')
hist(ozone, breaks = 3, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data\nToo Few Intervals')
dev.off()

# histogram with too many intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too many intervals.png')
hist(ozone, breaks = 25, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data\nToo Many Intervals')
dev.off()

# histogram
png('INSERT YOUR DIRECTORY PATH HERE/histogram.png')
hist(ozone, breaks = 15, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data')
dev.off()
```
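One caveat worth checking: according to the `hist()` documentation, a single number passed to “breaks” is only a suggestion, because the breakpoints are rounded to “pretty” values. The short sketch below reports how many intervals each of the three settings above actually produces; the loop and its output format are my own additions.

```r
ozone <- airquality$Ozone

for (b in c(3, 15, 25)) {
  h <- hist(ozone, breaks = b, plot = FALSE)
  cat('requested:', b,
      ' actual intervals:', length(h$counts),
      ' interval width:', diff(h$breaks)[1], '\n')
}
```

If an exact number of intervals matters, pass a vector of breakpoints to “breaks” instead of a single number.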

After examining your histogram, I thought it might have two modes, not one. Could the second, higher-valued mode be due to a different set of conditions that affect the area? An example of different conditions could be two source regions from which wind ushers clean or polluted air into New York, such as clean air from the Atlantic vs. polluted air from the mainland.

Good observation, Fernando. I define the mode as the *global* maximizer of the relative frequency function; thus, according to my definition, the *local* maximizer at 75 ppb is not a mode. However, some people define local maximizers as modes. If a local maximum is just slightly less than the global maximum, then it would be sensible to consider that respective local maximizer as a mode, too. However, it is pretty clear that the relative frequency at the mode (25 ppb) is more than twice as large as the local maximum at 75 ppb.

As for the cause of this local maximum, it’s best *not* to make any causal inferences just from a histogram or any exploratory data analysis. It is, however, good to consider those possible causes as predictors in a regression model in further analyses.