Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

Introduction

This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.

The Original “airquality” Data Set

I used the “Ozone” vector in the “airquality” data set that is built into R.  It’s always a good idea to get a sense of what a data table looks like by using the head() function; by default, it shows the first 6 rows.

> head(airquality)
     Ozone  Solar.R   Wind     Temp   Month   Day
1    41     190       7.4      67     5       1
2    36     118       8.0      72     5       2
3    12     149       12.6     74     5       3
4    18     313       11.5     62     5       4
5    NA      NA       14.3     56     5       5
6    28      NA       14.9     66     5       6

*I have manually added spaces between the columns for ease of viewing.

To extract the “Ozone” vector, just use the $ symbol.

> # extract "Ozone" data vector
> ozone = airquality$Ozone
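
As a small aside, the $ symbol is not the only way to pull a column out of a data frame. The sketch below compares it with two other standard forms of base R indexing; the variable names (ozone_dollar and so on) are made up for illustration and are not used elsewhere in this post.

```r
# Three equivalent ways to extract the Ozone column from the built-in data frame.
# The variable names here are illustrative only.
ozone_dollar  <- airquality$Ozone        # $ operator, as used in this post
ozone_bracket <- airquality[["Ozone"]]   # double-bracket indexing by name
ozone_index   <- airquality[, "Ozone"]   # matrix-style indexing, dropped to a vector

# All three give the same plain vector
same_as_bracket <- identical(ozone_dollar, ozone_bracket)
same_as_index   <- identical(ozone_dollar, ozone_index)
print(c(same_as_bracket, same_as_index))
```

The double-bracket form is handy when the column name is stored in a variable, since $ only accepts a literal name.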

Counting the Number of Observations

I initially thought that counting the sample size would simply involve using the length() function.

> # sample size of "ozone"
> length(ozone)
[1] 153

However, the summary() function showed that it contains missing values.

> summary(ozone)
   Min.    1st Qu.     Median    Mean      3rd Qu.    Max.       NA's 
   1.00    18.00       31.50     42.13     63.25      168.00     37

Notice the last column; “NA” stands for “Not Available”, and this output shows that there are 37 missing values.
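
The individual pieces of the summary() output can also be computed one at a time. The following is a minimal sketch using quantile() and mean() from base R; unlike summary(), both need na.rm = TRUE to skip the missing values. The names q and ozone_mean are just illustrative.

```r
# Reproduce the five quantiles and the mean reported by summary(ozone).
ozone <- airquality$Ozone

# quantile() stops with an error on missing values unless na.rm = TRUE;
# by default it returns the 0%, 25%, 50%, 75% and 100% quantiles.
q <- quantile(ozone, na.rm = TRUE)
print(q)

# mean() returns NA on missing values unless na.rm = TRUE
ozone_mean <- mean(ozone, na.rm = TRUE)
print(round(ozone_mean, 2))
```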

I found 3 different ways to find the number of non-missing values in “ozone”.  The last one is simplest.

> # 3 ways to find number of non-missing values in "ozone"
> length(ozone[is.na(ozone) == F])
[1] 116
> length(ozone[!is.na(ozone)])
[1] 116
> sum(!is.na(ozone))
[1] 116

The last approach, sum(), takes advantage of the fact that R coerces logical values to numbers in arithmetic: TRUE (or its shorthand T) becomes 1 and FALSE (or F) becomes 0.  Thus, summing the logical vector !is.na(ozone) counts its TRUE entries, which is the number of non-missing values.
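
To make the coercion concrete, here is a small sketch on a made-up vector (the values in x are invented for illustration and do not come from “airquality”). It also shows that mean() applied to the same kind of logical vector gives a proportion.

```r
# A toy vector with two missing values; not real data.
x <- c(4, NA, 7, NA, 10)

is.na(x)                 # FALSE  TRUE FALSE  TRUE FALSE
as.integer(!is.na(x))    # 1 0 1 0 1, the values that sum() adds up

n_present    <- sum(!is.na(x))   # number of non-missing values
n_missing    <- sum(is.na(x))    # number of missing values
prop_missing <- mean(is.na(x))   # proportion of missing values

print(c(n_present, n_missing, prop_missing))
```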

Calculating the Summary Statistics

The summary() output above already shows the mean that is calculated after removing the missing values.  If you try to use the mean() function to calculate the mean, you will get this strange result:

> mean(ozone)
[1] NA

This happens because mean() does not drop missing values by default, and any arithmetic that involves an NA returns NA.  To compute the mean without the missing values, set the “na.rm” argument to TRUE (T is a shorthand for TRUE).

> # calculate mean of "ozone" by excluding missing values
> mean(ozone, na.rm = T)
[1] 42.12931
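
The way NA propagates through arithmetic can be seen on a tiny example. This is only a sketch with an invented vector x; it also shows that na.rm = TRUE is equivalent to subsetting out the missing values by hand.

```r
# A toy vector with one missing value; not real data.
x <- c(4, NA, 7)

sum_with_na  <- sum(x)    # NA: the missing value contaminates the sum
mean_with_na <- mean(x)   # NA for the same reason

mean_na_rm  <- mean(x, na.rm = TRUE)   # drops the NA first: (4 + 7) / 2
mean_subset <- mean(x[!is.na(x)])      # same thing, done by hand

print(c(mean_na_rm, mean_subset))
```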

The same argument is needed for the var() and sd() functions when calculating the variance and the standard deviation.

> var(ozone, na.rm = T)
[1] 1088.201
> sd(ozone, na.rm = T)
[1] 32.98788
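
As a check on what var() and sd() compute, the sketch below recalculates both by hand from the non-missing values. It relies on the fact that R's var() is the sample variance, which divides by n - 1 rather than n; the names obs, var_by_hand and sd_by_hand are illustrative.

```r
# Recompute the sample variance and standard deviation by hand.
ozone <- airquality$Ozone
obs   <- ozone[!is.na(ozone)]   # keep only the non-missing values
n     <- length(obs)            # number of non-missing values

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((obs - mean(obs))^2) / (n - 1)
sd_by_hand  <- sqrt(var_by_hand)

# Compare with the built-in functions
print(all.equal(var_by_hand, var(ozone, na.rm = TRUE)))
print(all.equal(sd_by_hand, sd(ozone, na.rm = TRUE)))
```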