# Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R. Read the rest of this post to learn what an empirical CDF is and how to produce this plot!

#### What is an Empirical Cumulative Distribution Function?

An empirical cumulative distribution function (CDF) is a non-parametric estimator of the underlying CDF of a random variable.  It assigns a probability of $1/n$ to each datum, orders the data from smallest to largest in value, and calculates the sum of the assigned probabilities up to and including each datum.  The result is a step function that increases by $1/n$ at each datum (or by a multiple of $1/n$ where several data share the same value).

The empirical CDF is usually denoted by $\hat{F}_n(x)$ or $\hat{P}_n(X \leq x)$, and is defined as

$\hat{F}_n(x) = \hat{P}_n(X \leq x) = n^{-1}\sum_{i=1}^{n} I(x_i \leq x)$,

where $I()$ is the indicator function.  It has 2 possible values: 1 if the event inside the brackets occurs, and 0 if not.

$I(x_i \leq x) = \begin{cases} 1,&\text{when }x_i \leq x\\ 0,&\text{when }x_i > x \end{cases}$

Essentially, to calculate the value of $\hat{F}_n(x)$ at $x$,

1. count the number of data less than or equal to $x$
2. divide the number found in Step #1 by the total number of data in the sample
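The two steps above can be sketched directly in R. This is only an illustration: the sample values are made up, and `manual.ecdf` is a hypothetical helper name, not a function from base R. The result is compared against R's built-in ecdf() as a sanity check.

```r
# a small made-up sample (hypothetical values, for illustration only)
x.sample <- c(2.1, 0.4, 3.7, 0.4, 1.5)

# hand-rolled empirical CDF (manual.ecdf is a hypothetical helper name)
# for each x0: count the data <= x0, then divide by the sample size;
# mean() of a logical vector does both steps at once
manual.ecdf <- function(x, data) {
  sapply(x, function(x0) mean(data <= x0))
}

# evaluate at a few points, including below the minimum and above the maximum
x.grid <- c(0, 0.4, 1.5, 2.1, 5)
manual.values <- manual.ecdf(x.grid, x.sample)

# the built-in ecdf() returns a function that can be evaluated at the same points
builtin.values <- ecdf(x.sample)(x.grid)

print(manual.values)
print(all.equal(manual.values, builtin.values))
```

Note that the tied value 0.4 appears twice in the sample, so the empirical CDF jumps by $2/5$ there rather than $1/5$.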

#### Why is the Empirical Cumulative Distribution Useful in Exploratory Data Analysis?

The empirical CDF is useful because

• it approximates the true CDF well if the sample size (the number of data) is large, and knowing the distribution is helpful for statistical inference
• a plot of the empirical CDF can be visually compared to known CDFs of frequently used distributions to check if the data came from one of those common distributions
• it can visually display “how fast” the CDF increases to 1; plotting key quantiles like the quartiles can be useful to “get a feel” for the data
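As a rough sketch of the visual checks listed above, the following overlays a candidate CDF on the empirical CDF and marks the sample quartiles. It assumes simulated standard normal data and uses the standard normal CDF, pnorm(), as the candidate; with real data, the candidate distribution and its parameters would be your own choice.

```r
# simulated data (an assumption for this sketch; substitute your own sample)
set.seed(1)
normal.numbers <- rnorm(100)
normal.ecdf <- ecdf(normal.numbers)

# plot the empirical CDF
plot(normal.ecdf, main = 'Empirical CDF vs. Standard Normal CDF',
     xlab = 'x', ylab = 'Cumulative Probability')

# overlay the candidate CDF as a dashed curve for visual comparison
curve(pnorm(x), add = TRUE, lty = 2)

# mark the sample quartiles with dotted vertical lines
sample.quartiles <- quantile(normal.numbers, probs = c(0.25, 0.5, 0.75))
abline(v = sample.quartiles, lty = 3)

print(sample.quartiles)
```

If the step function tracks the dashed curve closely, the candidate distribution is plausible; a systematic gap suggests trying another distribution.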

#### Some Mathematical Statistics of the Empirical Distribution Function

Some appealing properties of the empirical CDF can be obtained from mathematical statistics.

1) For a fixed $x$, $I(X_i \leq x)$ is a Bernoulli random variable that equals 1 with probability $F(x)$.  Thus, its expected value is $E[I(X_i \leq x)] = P(X_i \leq x) = F(x)$,

which means that $I(X_i \leq x)$ is an unbiased estimator of $F(x)$ for a fixed $x$.  Also note that its variance is $V[I(X_i \leq x)] = F(x)[1 - F(x)]$.

2) The sum of these $n$ Bernoulli random variables, $n\hat{F}_n(x)$, is a binomial random variable with parameters $n$ and $F(x)$, provided that the data are independent and identically distributed.  Thus, $E[\hat{F}_n(x)] = F(x)$, so $\hat{F}_n(x)$ is also an unbiased estimator of $F(x)$.

Also note that $V[\hat{F}_n(x)] = n^{-1}F(x)[1 - F(x)]$.
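As a short worked derivation of these two moments, assume that $X_1, \ldots, X_n$ are independent and identically distributed with CDF $F$ (an assumption that this post makes only implicitly).  Linearity of expectation gives

$E[\hat{F}_n(x)] = n^{-1}\sum_{i=1}^{n} E[I(X_i \leq x)] = n^{-1} \cdot nF(x) = F(x)$,

and independence makes all of the covariance terms zero, so the variance of the sum is the sum of the variances:

$V[\hat{F}_n(x)] = n^{-2}\sum_{i=1}^{n} V[I(X_i \leq x)] = n^{-2} \cdot nF(x)[1 - F(x)] = n^{-1}F(x)[1 - F(x)]$.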

Thus, for a fixed $x$ with $0 < F(x) < 1$ and for $n > 1$, $\hat{F}_n(x)$ has a lower variance than $I(X_i \leq x)$.
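The unbiasedness and variance results can be checked informally by simulation. The sketch below is not a proof; the sample size, the evaluation point, and the number of replications are arbitrary choices made for illustration, and the data are assumed to come from the standard normal distribution so that the true $F(x)$ is available from pnorm().

```r
# arbitrary illustrative settings (assumptions, not from the post)
set.seed(1)
n <- 50          # sample size of each simulated data set
x0 <- 0.5        # fixed point at which the empirical CDF is evaluated
n.sims <- 10000  # number of simulated data sets

# true CDF value at x0 for the standard normal distribution
true.F <- pnorm(x0)

# value of the empirical CDF at x0 for each simulated sample
ecdf.at.x0 <- replicate(n.sims, mean(rnorm(n) <= x0))

# compare simulated mean and variance with the theoretical values
simulated.mean <- mean(ecdf.at.x0)
simulated.var <- var(ecdf.at.x0)
theoretical.var <- true.F * (1 - true.F) / n

print(c(simulated.mean = simulated.mean, true.F = true.F))
print(c(simulated.var = simulated.var, theoretical.var = theoretical.var))
```

The simulated mean should sit close to $F(x)$ and the simulated variance close to $n^{-1}F(x)[1 - F(x)]$, with the agreement improving as the number of replications grows.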

3) By the Glivenko-Cantelli theorem, $\hat{F}_n(x)$ is a consistent estimator of $F(x)$.  In fact, $\hat{F}_n$ converges uniformly to $F$: $\sup_x |\hat{F}_n(x) - F(x)| \to 0$ with probability 1 as $n \to \infty$.
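To get a feel for uniform convergence, here is a minimal sketch that computes the largest vertical gap between the empirical CDF and the true CDF for increasing sample sizes. It assumes standard normal data so that the true CDF is pnorm(); `sup.distance` is a hypothetical helper name, and the sample sizes are arbitrary.

```r
set.seed(1)

# largest vertical gap between the empirical CDF of n standard normal
# draws and the true standard normal CDF
# (sup.distance is a hypothetical helper name)
sup.distance <- function(n) {
  x <- sort(rnorm(n))
  F.true <- pnorm(x)
  # the empirical CDF only changes at the data, so the largest gap occurs
  # at a datum: check the height just after each jump (i/n) and the
  # height just before it ((i - 1)/n)
  max(pmax(abs((1:n) / n - F.true), abs((0:(n - 1)) / n - F.true)))
}

# arbitrary illustrative sample sizes
sample.sizes <- c(10, 100, 1000, 10000)
distances <- sapply(sample.sizes, sup.distance)

print(data.frame(n = sample.sizes, sup.distance = distances))
```

The gap generally shrinks as $n$ grows, although for any single run it need not decrease at every step because each sample is random.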

Here is the code for generating the plot of the empirical CDF of the random standard normal numbers; the plot is given again after the code.  For the sake of brevity, I will describe in detail how to generate this and other plots of empirical CDFs in a separate post; in fact, I will show 2 different ways of doing so in R!

##### Empirical Distribution Function
##### By Eric Cai - The Chemical Statistician
# set the seed for consistent replication of random numbers
set.seed(1)

# generate 100 random numbers from the standard normal distribution
normal.numbers = rnorm(100)

# empirical normal CDF of the 100 normal random numbers
normal.ecdf = ecdf(normal.numbers)

# plot normal.ecdf (notice that the only argument needed is normal.ecdf)
# use png() and dev.off() to print this plot to your chosen folder
png('INSERT YOUR DIRECTORY PATH HERE/ecdf standard normal.png')

plot(normal.ecdf, xlab = 'Quantiles of Random Standard Normal Numbers', ylab = '', main = 'Empirical Cumulative Distribution\nStandard Normal Quantiles')

# add label to y-axis with mtext()
# side = 2 denotes the left vertical axis
# line = 2.5 sets the position of the label
mtext(text = expression(hat(F)[n](x)), side = 2, line = 2.5)

dev.off()
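Beyond plotting, the object returned by ecdf() is itself a function, so it can be evaluated at any point. As a brief sketch using the same simulated data as the code above (the names `jump.points` and `jump.heights` are hypothetical), knots() recovers the x-values where the steps occur, and evaluating the function at those points gives the corresponding heights.

```r
# same simulated data as in the plotting code
set.seed(1)
normal.numbers <- rnorm(100)
normal.ecdf <- ecdf(normal.numbers)

# evaluate the empirical CDF at chosen points
print(normal.ecdf(0))          # proportion of the sample <= 0
print(normal.ecdf(c(-1, 1)))   # proportions <= -1 and <= 1

# knots() returns the sorted unique data values, where the jumps occur
jump.points <- knots(normal.ecdf)

# heights of the empirical CDF at those jump points
jump.heights <- normal.ecdf(jump.points)

print(head(data.frame(x = jump.points, Fn = jump.heights)))
```

The height at the largest datum is always 1, which is a quick way to check that an empirical CDF was computed over the whole sample.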

### 13 Responses to Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

1. zak says:

what if there is a difference between the empirical and the normal?

• Eric Cai - The Chemical Statistician says:

Good question, Zak! If there is a difference between the empirical CDF and the normal CDF, then there is reason to suspect that another distribution would fit better, and it would be a good idea to plot the CDFs of other distributions. As shown in my next post on plotting the empirical CDFs of the ozone data, there is reason to suspect that the ozone data do not come from a normal distribution. In a later post on histograms, I will show how I eventually found another distribution that fits these data better. (If you read my previous posts in this EDA series carefully, you will find out what that distribution is!)

2. hypergeometric says:
• Eric Cai - The Chemical Statistician says:

Hello Jan,

Thank you for sharing this monograph with us. Could you please tell us what you would like to convey by posting it?

Thanks,

Eric

• hypergeometric says:

Yes, sorry for the terse response.

A complaint I often have about people presenting results with empirical distribution functions is that they believe that, given the sample, the EDF is the definitive EDF, a good approximation to the “true CDF” of the distribution. In fact, the EDF is a random variable just like any other statistic and, so, it is subject to variability. One of the things this implies is if two processes are plotted on the same graph, both represented by EDFs, it is generally not a simple matter to determine if one is “better than” another. In fact, that is an inference and a decision in itself.

While I prefer a Bayesian approach to the question, I have not found much written up on determining whether or not one EDF curve is strictly stochastically greater than another. The paper by Uusitalo made a beginning at this, although not from a Bayesian perspective but, rather, that of extreme values. They used to have a package available in R’s CRAN.

A related problem of interest is the notion of having multivariate EDF. While there are straightforward ways of defining this, e.g., http://www.icesi.edu.co/CRAN/web/packages/mecdf/vignettes/mecdf.pdf (package mecdf has unfortunately been withdrawn from CRAN), I’m interested in one which can be readily used in conjunction with Skilling’s nested sampling procedures, e.g., http://dx.doi.org/10.1111/j.1365-2966.2007.12353.x

• Eric Cai - The Chemical Statistician says:

Good point, Jan. I hope that I have not given the impression that the ECDF is a definitively good approximation of the CDF in my blog post; if I have, please do tell me how I can remove that confusion.

I am not familiar with inferences on ECDFs, including comparisons between 2 ECDFs. I welcome anyone to share information about this, both from a frequentist perspective and a Bayesian perspective.

I aim to expand my series on exploratory data analysis to cover multivariate data. Please stay tuned!

Thanks for sharing!

Eric

• hypergeometric says:

The troubles in practice are often more perceptual and conceptual than theoretical. I’m working with a customer today who is comparing performance of several processes and was doing it using ECDFs. Up to the 0.8 quantile, their preferred process is kicking the rest of them all over the field, but it turns out it does REALLY BADLY on the remaining 0.2, putting it in 3rd place. The Bayesian posterior shows that, since the probability mass gets put up into that long tail and so there is less at the better regions of performance. But if you look at the lower portion, it looks wonderful.

3. Joel says:

Thanks for breaking down the ECDF for us. Is there a way to extract the x,y vectors from ecdf() in R? I’ve run it with a large vector of data (~10,000 values) and see that it creates an ECDF with about 150 points. I’d like to extract those points and compare them with other representations of the CDF.

Thanks.
Joel

5. Sam says:

Thanks. In my ECDF plot, the data points are distributed up to .84 only instead of reaching 1. Is my plot correct?