Eric’s Enlightenment for Monday, April 20, 2015

  1. John D. Cook explains why 0! is defined to be equal to 1.  This is also an excellent post on how definitions are created in mathematics.
  2. Why are GDP estimates often unreliable?  (Jonathan Jones wrote this report for Britain, but it’s likely applicable to all countries.)
  3. Rick Wicklin shares useful code for counting the number of missing and non-missing observations in a data set in SAS.
  4. A potentially game-changing breakthrough in artificial photosynthesis may be able to solve the world’s carbon emission problem…

Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R


Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R.  (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

  1. using the built-in ecdf() and plot() functions in R
  2. calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R.  Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.
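As a minimal sketch of the two approaches listed above (not necessarily the code from the full post), the following assumes that missing ozone values are simply dropped and that the cumulative probability of the i-th ordered value is i/n. The variable names are illustrative.

```r
# Extract the ozone measurements and drop the missing values (NA)
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]
n <- length(ozone)

# Way 1: ecdf() returns a step function, and plot() draws it
ozone.ecdf <- ecdf(ozone)
plot(ozone.ecdf,
     xlab = "Ozone (ppb)",
     ylab = "Cumulative proportion",
     main = "Empirical CDF of Ozone via ecdf()")

# Way 2: plot cumulative probabilities against the ordered data
# (assumption: the i-th smallest value gets probability i/n)
ozone.ordered <- sort(ozone)
cum.prob <- (1:n) / n
plot(ozone.ordered, cum.prob,
     xlab = "Ozone (ppb)",
     ylab = "Cumulative proportion",
     main = "Empirical CDF of Ozone via ordered data")
```

With tied observations, the second plot shows several points at the same ozone value, while ecdf() jumps straight to the largest of those probabilities. Other plotting positions, such as (i - 0.5)/n, are also in common use.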

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs.  That plot will be compared to the plots of the empirical CDF of the ozone data to check whether the ozone data plausibly came from a normal distribution.
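One hedged way to make such a comparison in a single plot is to overlay a normal CDF on the empirical CDF of the ozone data. This sketch assumes a normal distribution with the sample mean and standard deviation of the ozone data. It is a variation on comparing against a simulated standard normal sample, not the method of the full post.

```r
# Ozone data with missing values removed
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]

# Parameters of the reference normal distribution
# (assumption: use the sample mean and standard deviation)
m <- mean(ozone)
s <- sd(ozone)

# Empirical CDF of the ozone data
ozone.ecdf <- ecdf(ozone)
plot(ozone.ecdf,
     xlab = "Ozone (ppb)",
     ylab = "Cumulative proportion",
     main = "Empirical CDF of Ozone vs. Normal CDF")

# Overlay the CDF of the reference normal distribution
curve(pnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)

# Largest vertical gap between the two curves at the observed values
max.gap <- max(abs(ozone.ecdf(ozone) - pnorm(ozone, mean = m, sd = s)))
max.gap
```

A systematic departure of the step function from the smooth curve, rather than small random wiggles around it, would suggest that a normal distribution is a poor description of the data.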


Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City


This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.
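As a brief illustration of why the missing values matter, here is a minimal sketch of computing common summary statistics for the ozone data. It assumes the missing values are ignored through the na.rm argument. The particular selection of statistics is illustrative and may differ from the full post.

```r
# Ozone measurements, including missing values (NA)
ozone <- airquality$Ozone

# Count the missing and non-missing observations
n.missing <- sum(is.na(ozone))
n.observed <- sum(!is.na(ozone))

# Without na.rm = TRUE, most summary functions return NA
mean(ozone)

# Common summary statistics, ignoring the missing values
ozone.stats <- c(
  mean   = mean(ozone, na.rm = TRUE),
  median = median(ozone, na.rm = TRUE),
  sd     = sd(ozone, na.rm = TRUE),
  min    = min(ozone, na.rm = TRUE),
  max    = max(ozone, na.rm = TRUE),
  iqr    = IQR(ozone, na.rm = TRUE)
)
ozone.stats

# summary() reports the five-number summary, the mean,
# and the number of NA values in one call
summary(ozone)
```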


