Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City
May 19, 2013
This is the first of a series of posts on exploratory data analysis (EDA). This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R. This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets. In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.
The Original “airquality” Data Set
I used the “Ozone” vector in the “airquality” data set that is built into R. It’s always a good idea to get a sense of what a data table looks like by using the head() function; by default, it shows the first 6 rows.
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
*I have manually added spaces between the columns for ease of viewing.
To extract the “Ozone” vector, just use the $ symbol.
> # extract "Ozone" data vector
> ozone = airquality$Ozone
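As a side note, the $ symbol is not the only way to pull a column out of a data frame. The sketch below compares it with double and single brackets; the variable names (ozone_dollar, ozone_bracket, ozone_df) are my own, chosen only for illustration.

```r
# "airquality" ships with base R (datasets package), so nothing needs to be loaded.
ozone_dollar  <- airquality$Ozone        # the form used in this post
ozone_bracket <- airquality[["Ozone"]]   # same vector, column name given as a string
ozone_df      <- airquality["Ozone"]     # single brackets keep a one-column data frame

identical(ozone_dollar, ozone_bracket)   # TRUE: both return the same vector
is.data.frame(ozone_df)                  # TRUE: still a data frame, not a vector
```

The double-bracket form is handy when the column name is stored in a variable; for interactive work, $ is usually the quickest.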
Counting the Number of Observations
I initially thought that counting the sample size would simply involve using the length() function.
> # sample size of "ozone"
> length(ozone)
[1] 153
However, the summary() function showed that it contains missing values.
> summary(ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   18.00   31.50   42.13   63.25  168.00      37
Notice the last column; “NA” stands for “Not Available”, and this output shows that there are 37 missing values.
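One way to double-check the count reported by summary() is to tabulate the result of is.na(), which flags each element as missing or not. This is only a quick sketch; na_flags and na_counts are names I made up for the example.

```r
ozone <- airquality$Ozone

# is.na() returns a logical vector: TRUE where the value is missing
na_flags <- is.na(ozone)

# table() tallies the FALSE (observed) and TRUE (missing) entries
na_counts <- table(na_flags)
na_counts   # FALSE: 116, TRUE: 37
```

The TRUE count matches the 37 shown under “NA's”, and the two counts add up to the full length of 153.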
I found 3 different ways to find the number of non-missing values in “ozone”. The last one is simplest.
> # 3 ways to find number of non-missing values in "ozone"
> length(ozone[is.na(ozone) == FALSE])
[1] 116
> length(ozone[!is.na(ozone)])
[1] 116
> sum(!is.na(ozone))
[1] 116
This last function, sum(), takes advantage of the fact that R treats TRUE (or “T”) as 1 and FALSE (or “F”) as 0 in arithmetic. Thus, it adds up the 1’s in the vector !is.na(ozone) to get the number of non-missing values.
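To see the coercion at work, here is a small sketch on a made-up vector (x and present are illustrative names, not part of the ozone data):

```r
# A made-up vector with two missing values
x <- c(4, NA, 7, NA, 1)

present <- !is.na(x)   # TRUE FALSE TRUE FALSE TRUE
as.integer(present)    # 1 0 1 0 1: how the logicals look once coerced
sum(present)           # 3: the number of non-missing values
mean(present)          # 0.6: the proportion of non-missing values
```

The same trick with mean() gives the share of non-missing values, which can be a convenient way to report how complete a variable is.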
Calculating the Summary Statistics
The summary() output above already shows the mean that is calculated after removing the missing values. If you try to use the mean() function to calculate the mean, you will get this strange result:
> mean(ozone)
[1] NA
This is the result of the missing values (the NA’s) being included in the calculation: by default, mean() returns NA if any value in the vector is missing. To compute the mean without the missing values, set the “na.rm” argument to TRUE.
> # calculate mean of "ozone" by excluding missing values
> mean(ozone, na.rm = TRUE)
[1] 42.12931
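The var() and sd() functions for the variance and the standard deviation accept the same argument.
> var(ozone, na.rm = TRUE)
[1] 1088.201
> sd(ozone, na.rm = TRUE)
[1] 32.98788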
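To make these numbers less of a black box, the sketch below recomputes the mean, variance and standard deviation by hand on the non-missing values and compares them with the built-in functions. It assumes the usual sample formulas (variance with an n - 1 denominator, standard deviation as its square root); the names x, n, x_bar, s2 and s are my own.

```r
ozone <- airquality$Ozone

x <- ozone[!is.na(ozone)]   # keep only the non-missing values
n <- length(x)              # 116

x_bar <- sum(x) / n                      # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)    # sample variance (n - 1 denominator)
s     <- sqrt(s2)                        # sample standard deviation

# Each comparison returns TRUE: na.rm = TRUE is equivalent to dropping the NA's first
all.equal(x_bar, mean(ozone, na.rm = TRUE))
all.equal(s2, var(ozone, na.rm = TRUE))
all.equal(s, sd(ozone, na.rm = TRUE))
```

Note that the sample size used in these formulas is 116, the number of non-missing values, not the 153 returned by length(ozone).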