Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

August 12, 2013 6 Comments

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series. I will define and calculate the 5-number summary in 2 different ways that are commonly used in R. (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.) I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function. This post expands on a recent comment that I made to correct an error in the post on box plots.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.5     6.0     6.0     8.5    11.0 

Why do these 2 methods of calculating the 5-number summary in R give different results? Read the rest of this post to find out the answer!

What is a 5-Number Summary?

A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set. It consists of the data set’s

minimum
1st quartile
median
3rd quartile
maximum

This is a simple but very useful way of summarizing your data for several reasons.

the median gives a measure of the centre of the data
the minimum and maximum give the range of the data
the 1st and 3rd quartiles give a sense of the spread of the data, especially when compared to the minimum, maximum, and median

2 Different Ways to Get the 5-Number Summary in R

There are 2 functions that are commonly used to calculate the 5-number summary in R: fivenum() and summary().

I have discovered a subtle but important difference in the way the 5-number summary is calculated between these two functions.

Here is an instance when they provide the same output.

> x = seq(1, 9, by = 2)
> x
[1] 1 3 5 7 9
> fivenum(x)
[1] 1 3 5 7 9
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 

Here is an instance when they provide different output.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.5     6.0     6.0     8.5    11.0 

*fivenum() does not have an argument for controlling the number of digits in its output, while summary() has the “digits” argument for doing so (it sets the number of significant digits). You may need to use this argument in summary() to get more digits when comparing its output with fivenum()’s output.

Notice that x has an odd number of data, while y has an even number of data. The 2 functions gave the same output for x, but different 1st and 3rd quartiles for y. What causes this difference?
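One way to check that this pattern is not peculiar to x and y is to repeat the comparison for several sample sizes. The sketch below is only illustrative: compare_quartiles() is a helper name made up here, and it assumes evenly spaced data like x and y. It calls quantile() with type = 7 directly (the method behind summary(), as discussed in the next section) so that rounding in summary()'s printed output cannot affect the comparison.

```r
# Hypothetical helper (not part of base R): do fivenum() and quantile(type = 7)
# give the same 1st and 3rd quartiles for n evenly spaced data?
compare_quartiles <- function(n) {
  v <- seq(1, by = 2, length.out = n)          # 1, 3, 5, ... (n values)
  hinges <- fivenum(v)[c(2, 4)]                # 2nd and 4th entries of fivenum()
  type7 <- unname(quantile(v, probs = c(0.25, 0.75), type = 7))
  isTRUE(all.equal(hinges, type7))
}

sizes <- 5:10
agree <- sapply(sizes, compare_quartiles)
data.frame(n = sizes, agree = agree)
```

For these evenly spaced data, the two methods should agree for n = 5, 7, 9 and disagree for n = 6, 8, 10.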

The Difference Between fivenum() and summary()

The difference between fivenum() and summary() lies in the lack of universal agreement on how the 1st and 3rd quartiles should be calculated.

Here is how fivenum() calculates the 1st and 3rd quartiles.

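1. Sort your data from smallest to largest.
2. Find the median. If your data set has an odd number of data, then the median is the middle datum, with equal numbers of data above and below it. If your data set has an even number, n, of data, the median is the average of the (n/2)th and (n/2 + 1)th ordered data.
3. Find the set, L, of data in the lower half: the data below the median, plus the median itself when the number of data is odd. The 1st quartile is the median of L.
4. Find the set, U, of data in the upper half: the data above the median, plus the median itself when the number of data is odd. The 3rd quartile is the median of U.

These quartiles are what the fivenum() documentation calls Tukey's hinges. For x above, L = {1, 3, 5} and U = {5, 7, 9}, so the 1st and 3rd quartiles are 3 and 7, matching fivenum(x).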
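The steps for fivenum() can be sketched in a few lines of R. This is a minimal sketch and not the source code of fivenum(); five_by_halves() is a name invented for illustration, and it assumes a non-empty numeric vector without missing values. Note that when the number of data is odd, the median is placed in both halves, which is what reproduces fivenum()'s output for x.

```r
# Hypothetical helper: 5-number summary via medians of the lower and upper halves
five_by_halves <- function(v) {
  v <- sort(v)
  n <- length(v)
  half <- ceiling(n / 2)             # size of each half; for odd n the median is shared
  lower <- v[1:half]                 # lower half of the sorted data
  upper <- v[(n - half + 1):n]       # upper half of the sorted data
  c(min(v), median(lower), median(v), median(upper), max(v))
}

x <- seq(1, 9, by = 2)
y <- seq(1, 11, by = 2)
five_by_halves(x)   # 1 3 5 7 9
five_by_halves(y)   # 1 3 6 9 11
```

Both results can be compared against fivenum(x) and fivenum(y).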

summary() uses the quantile() function to calculate the 25% and 75% quantiles as the 1st and 3rd quartiles. Thus, let’s discuss how quantile() calculates quantiles. (See “Terminology Clarifications” near the end of this post on the definitions of quantile and percentile.)

Statisticians do not universally agree on how quantiles should be calculated (Hyndman and Fan, 1996). The quantile() function’s documentation shows 9 different ways to calculate quantiles; Type 7, the default, is the one used by summary(). Here is how Type 7 works:

1. Sort the data, $X_1, \ldots, X_n$, from smallest to largest. Denote the order statistics as $X_{(1)}, \ldots, X_{(n)}$.
2. Assign the minimum, $X_{(1)}$, as the 0% quantile and the maximum, $X_{(n)}$, as the 100% quantile.
3. The position of the q% quantile along the ordered data is $j = 1 + (n-1) \times q/100$, where n is the sample size. For example, the position of the 0% quantile is $1 + (n-1) \times 0/100 = 1$; this is the first number along the ordered data, so the 0% quantile is the minimum.
4. If the position, $j$, from Step #3 is an integer, then simply extract the $j$th ordered datum from the list of ordered data; this is the q% quantile.
5. If the position, $j$, from Step #3 is not an integer, then find the 2 integers immediately below and above $j$. Denote these integers as $i$ and $k$, respectively. To be precise,

$i = \lfloor j \rfloor, \quad k = \lceil j \rceil$

Then interpolate linearly between the $i$th and $k$th ordered data:

$\text{q\%-quantile} = X_{(i)} + (j-i)(X_{(k)} - X_{(i)})$
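The Type 7 steps translate almost line for line into R. The function below is a sketch written for this post's notation, not the code inside quantile(); the name quantile_type7() is made up, q is a percentage between 0 and 100, and the input is assumed to be a non-empty numeric vector without missing values.

```r
# Hypothetical helper: q% quantile by the Type 7 rule described above
quantile_type7 <- function(v, q) {
  v <- sort(v)                       # Step 1: order the data
  n <- length(v)
  j <- 1 + (n - 1) * q / 100         # Step 3: position along the ordered data
  i <- floor(j)                      # Step 5: integer just below j
  k <- ceiling(j)                    #         integer just above j
  # When j is an integer, i == k and this reduces to v[j] (Step 4)
  v[i] + (j - i) * (v[k] - v[i])
}

y <- seq(1, 11, by = 2)
quantile_type7(y, 25)   # 3.5
quantile_type7(y, 75)   # 8.5
```

The results can be checked against quantile(y, probs = c(0.25, 0.75), type = 7).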

Distinguishing fivenum() and summary() – An Example

Consider again the data set y.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11

Let’s follow the above steps for summary() and find the 1st quartile accordingly.

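1. y is already sorted in ascending order.
2. $Y_{(1)} = 1, Y_{(n)} = Y_{(6)} = 11.$
3. The position of the 25% quantile is $1 + (6 - 1) \times 25/100 = 2.25$.
4. This position is not an integer, so we cannot simply extract the 2.25th ordered datum from y.
5. The integers immediately below and above 2.25 are

$\lfloor 2.25 \rfloor = 2, \quad \lceil 2.25 \rceil = 3$

so the 1st quartile is

$25\%\text{-quantile} = Y_{(2)} + (2.25 - 2)(Y_{(3)} - Y_{(2)}) = 3 + 0.25(5 - 3) = 3.5$

This matches the 1st quartile of 3.5 returned by summary(y). By contrast, fivenum() takes the median of the lower half of y, {1, 3, 5}, which is 3.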
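The same arithmetic can be repeated in R for the 3rd quartile, alongside the fivenum() counterparts for y. This is only a check of the hand calculation; the variable names are arbitrary, and the halves y[1:3] and y[4:6] are hard-coded for this 6-value data set.

```r
y <- seq(1, 11, by = 2)
n <- length(y)

# summary()'s way (Type 7): the 75% quantile sits at position 4.75,
# three quarters of the way from the 4th to the 5th ordered datum
j3 <- 1 + (n - 1) * 75 / 100
q3_type7 <- y[4] + (j3 - 4) * (y[5] - y[4])

# fivenum()'s way: medians of the lower half {1, 3, 5} and upper half {7, 9, 11}
q1_hinge <- median(y[1:3])
q3_hinge <- median(y[4:6])

q3_type7                 # 8.5
c(q1_hinge, q3_hinge)    # 3 9
```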

Conclusion

The R functions fivenum() and summary() use different methods to calculate the 5-number summary. Given the complexity of summary()’s method and the ease of calculation and interpretation of fivenum()’s method, I encourage using fivenum(), and I will use it from now on in my blog posts.

I asked about the differences between these 2 methods by initiating a discussion thread called “5-Number Summaries in R” in the LinkedIn group “R Programming”. I thank Marco Biffino and Mukul Mehta for sharing valuable contributions in this thread. I also thank David Maxwell and Allan Reese from the Centre for Environment, Fisheries and Aquaculture Science in Great Britain for noting and explaining these issues in personal emails with me.

Terminology Clarifications

*Here is the definition of percentile that I learned in my introductory statistics class (STAT 270 at Simon Fraser University):

The pth percentile** of a data set sorted from smallest to largest is the value such that p percent of the data are at or below this value. The quartiles are special percentiles; the 1st quartile is the 25th percentile, and the 3rd quartile is the 75th percentile. The median is also a quartile – it is the 50th percentile.

**The terms quantile and percentile denote essentially the same thing. However, percentile refers to the percentage of the data at or below its value, while quantile refers to the fraction of data at or below its value. In the context of probability distributions and cumulative distribution functions (CDFs), I see “quantile” being used all the time, and rarely see “percentile” being used. (In the context of CDFs, the quantile for a given probability is the value of the random variable at which the CDF equals that probability; equivalently, it is the inverse CDF evaluated at that probability.) Nonetheless, they do mean the same thing.
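A small R illustration of this terminology, offered as a sketch: quantile() expects fractions in its probs argument rather than percentages, and for the standard normal distribution (chosen here only as an example) qnorm() is the inverse of the CDF pnorm().

```r
y <- seq(1, 11, by = 2)

# The 25th percentile is the 0.25 quantile: probs takes a fraction, not 25
q25 <- unname(quantile(y, probs = 0.25))

# For a continuous distribution, the quantile function undoes the CDF
p <- pnorm(1.5)       # fraction of a standard normal at or below 1.5
back <- qnorm(p)      # recovers 1.5, up to floating-point error

q25
back
```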

References

Ross Ihaka. Lecture slides on quantiles for Statistics 787 at the University of Auckland.
John Verzani. “simpleR – Using R for Introductory Statistics”.
Rob J. Hyndman and Yanan Fan. “Sample Quantiles in Statistical Packages”. The American Statistician, Vol. 50, No. 4 (November 1996), pp. 361-365.


6 Responses to Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

tipsforbiostat says:

October 16, 2017 at 4:03 pm

Great summary. Just fyi, you can do it both with quartile method and Tukey’s hinge method in this web-based five number summary calculator: http://www.icalcu.com/stat/fivenum.html

Adam says:

September 26, 2019 at 1:40 am

How do you calculate the IQR using the ‘fivenum()’ function in R. I have used the ‘IQR()’ function but from testing this is done using the same data produced by the ‘summary()’ data. Do you know of a line of code to do it, or can that only be done manually?

- Eric Cai - The Chemical Statistician says:
  
  September 30, 2019 at 4:10 pm
  
  Hi Adam,
  
  I cannot find a one-line code to compute the IQR based on fivenum()’s output. Thus, I encourage you to write your own function to do so.
  
  Eric
  

The Chemical Statistician
