## Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series.  I will define and calculate the 5-number summary in 2 different ways that are commonly used in R.  (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.)  I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function.  This post expands on a recent comment that I made to correct an error in the post on box plots.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     6.0     6.0     8.5    11.0

Why do these 2 methods of calculating the 5-number summary in R give different results?  Read the rest of this post to find out the answer!
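As a rough sketch of where the discrepancy comes from: as far as R's documentation describes it, summary() takes its quartiles from quantile() with the default type = 7, while fivenum() returns Tukey's hinges, which are the medians of the lower and upper halves of the sorted data (both halves include the median when the sample size is odd). The variable names below are my own, chosen only for illustration.

```r
y <- seq(1, 11, by = 2)
n <- length(y)
y_sorted <- sort(y)

# Quartiles as summary() reports them: quantile() with its default type = 7,
# which interpolates linearly between order statistics
q7 <- quantile(y, probs = c(0.25, 0.75), type = 7, names = FALSE)

# Hinges as fivenum() reports them: medians of the lower and upper halves
lower_half <- y_sorted[1:ceiling(n / 2)]
upper_half <- y_sorted[(floor(n / 2) + 1):n]
hinges <- c(median(lower_half), median(upper_half))

q7       # 3.5 8.5
hinges   # 3 9
```

For this data set the hinges are actual data values, while the type 7 quartiles fall between data values.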

Previous posts in this series on EDA include

## Using the Golden Section Search Method to Minimize the Sum of Absolute Deviations

#### Introduction

Recently, I introduced the golden section search method, a special way to save computation time by modifying the bisection method with the golden ratio, and I illustrated how to minimize a cusped function with this script.  I also wrote an R function to implement this method and an R script to apply this method with an example.  Today, I will apply this method to a statistical topic: minimizing the sum of absolute deviations with the median.

While reading Page 148 (Section 6.3) in Michael Trosset’s “An Introduction to Statistical Inference and Its Applications”, I learned 2 basic, simple, yet interesting theorems.

If X is a random variable with a population mean $\mu$ and a population median $q_2$, then

a) $\mu$ minimizes the function $f(c) = E[(X - c)^2]$

b) $q_2$ minimizes the function $h(c) = E(|X - c|)$
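To make the two statements concrete, here is a small numerical check on a made-up discrete random variable, for which both expectations can be computed exactly. The values, the probabilities, and all names in the code are hypothetical; the probabilities were chosen so that the population median is unique.

```r
# Hypothetical discrete random variable X (values and probabilities are made up)
x <- c(0, 1, 2, 10)
p <- c(0.1, 0.3, 0.4, 0.2)

mu <- sum(p * x)                      # population mean
q2 <- x[which(cumsum(p) >= 0.5)[1]]   # population median (unique for this p)

f <- function(center) sum(p * (x - center)^2)    # E[(X - c)^2]
h <- function(center) sum(p * abs(x - center))   # E(|X - c|)

# Evaluate both functions on a grid of candidate values of c
grid <- seq(0, 10, by = 0.1)
c_f <- grid[which.min(sapply(grid, f))]   # grid minimizer of f
c_h <- grid[which.min(sapply(grid, h))]   # grid minimizer of h

c(mu = mu, c_f = c_f)   # both 3.1
c(q2 = q2, c_h = c_h)   # both 2
```

A grid search only illustrates the theorems for this one distribution; it is not a proof.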

I won’t prove these theorems in this blog post (perhaps later), but I want to use the golden section search method to show a result similar to b):

c) The sample median, $\tilde{m}$, minimizes the function

$g(c) = \sum_{i=1}^{n} |X_i - c|$.

This is not surprising, of course, since

- $|X - c|$ is just a function of the random variable $X$

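- by the law of large numbers (for independent, identically distributed $X_i$ with a finite mean), the sample average converges to the expectation:

$\lim_{n\to \infty}\frac{1}{n}\sum_{i=1}^{n} |X_i - c| = E(|X - c|)$

Thus, if the median minimizes $E(|X - c|)$, then, intuitively, it minimizes $\lim_{n\to \infty}\frac{1}{n}\sum_{i=1}^{n} |X_i - c|$.  For any fixed $n$, dividing by $n$ does not change where the minimum occurs, so minimizing the average is the same as minimizing $g(c)$.  Let’s show this with the golden section search method, and let’s explore any differences that may arise between data sets with an odd number and an even number of observations.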
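The following is a minimal sketch of that experiment, not the function from my earlier post: `golden_section_min()` and `sum_abs_dev()` are hypothetical helpers written for this illustration, and the two data sets are made up.

```r
# g(c): sum of absolute deviations of the data x from a candidate center
sum_abs_dev <- function(center, x) sum(abs(x - center))

# Hypothetical helper: golden section search for a minimizer of a
# unimodal function f on the interval [lower, upper]
golden_section_min <- function(f, lower, upper, tol = 1e-8) {
  ratio <- (sqrt(5) - 1) / 2               # reciprocal of the golden ratio
  x1 <- upper - ratio * (upper - lower)    # left interior point
  x2 <- lower + ratio * (upper - lower)    # right interior point
  f1 <- f(x1)
  f2 <- f(x2)
  while (upper - lower > tol) {
    if (f1 < f2) {
      # a minimizer lies in [lower, x2]; the old x1 becomes the new x2
      upper <- x2
      x2 <- x1
      f2 <- f1
      x1 <- upper - ratio * (upper - lower)
      f1 <- f(x1)
    } else {
      # a minimizer lies in [x1, upper]; the old x2 becomes the new x1
      lower <- x1
      x1 <- x2
      f1 <- f2
      x2 <- lower + ratio * (upper - lower)
      f2 <- f(x2)
    }
  }
  (lower + upper) / 2
}

x_odd  <- c(2, 4, 5, 9, 13)        # odd n: sample median is 5
x_even <- c(2, 4, 5, 9, 13, 20)    # even n: sample median is 7

m_odd <- golden_section_min(function(center) sum_abs_dev(center, x_odd),
                            min(x_odd), max(x_odd))
m_even <- golden_section_min(function(center) sum_abs_dev(center, x_even),
                             min(x_even), max(x_even))
```

With an odd number of observations, $g(c)$ has a single cusp at the sample median, so `m_odd` should land at 5 up to the tolerance. With an even number of observations, $g(c)$ is flat between the two middle values (5 and 9 here), so every point in that interval attains the same minimum; the search returns one such point, which need not equal the conventional median of 7.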