Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

May 19, 2013

Introduction

This is the first of a series of posts on exploratory data analysis (EDA). This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R. This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets. In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.

The Original “airquality” Data Set

I used the “Ozone” vector in the “airquality” data set that is built into R. It’s always a good idea to get a sense of what a data table looks like by using the head() function; by default, it shows the first 6 rows.

> head(airquality)
     Ozone  Solar.R   Wind     Temp   Month   Day
1    41     190       7.4      67     5       1
2    36     118       8.0      72     5       2
3    12     149       12.6     74     5       3
4    18     313       11.5     62     5       4
5    NA      NA       14.3     56     5       5
6    28      NA       14.9     66     5       6

*I have manually added spaces between the columns for ease of viewing.
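
Beyond head(), a couple of other base R functions give a quick overview of a data frame. The following is a minimal sketch using only functions that ship with base R; the choice of 10 rows is arbitrary and just for illustration.

```r
# Dimensions of the data frame: number of rows, then number of columns
dim(airquality)            # 153 rows, 6 columns

# Show the first 10 rows instead of the default 6
head(airquality, n = 10)

# Compact overview of each column's type and its first few values
str(airquality)
```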

To extract the “Ozone” vector, just use the $ symbol.

> # extract "Ozone" data vector
> ozone = airquality$Ozone
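
The $ symbol is not the only way to pull a column out of a data frame. As a small sketch, the variable names below (ozone_dollar and so on) are made up for illustration; all three forms should return the same vector.

```r
# Three equivalent ways to extract the "Ozone" column as a vector
ozone_dollar  <- airquality$Ozone
ozone_bracket <- airquality[["Ozone"]]
ozone_index   <- airquality[, "Ozone"]

# Confirm that the results are identical
identical(ozone_dollar, ozone_bracket)   # TRUE
identical(ozone_dollar, ozone_index)     # TRUE
```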

Counting the Number of Data

I initially thought that counting the sample size would simply involve using the length() function.

> # sample size of "ozone"
> length(ozone)
[1] 153

However, the summary() function showed that it contains missing values.

> summary(ozone)
   Min.    1st Qu.     Median    Mean      3rd Qu.    Max.       NA's 
   1.00    18.00       31.50     42.13     63.25      168.00     37

Notice the last column; “NA” stands for “Not Available”, and this output shows that there are 37 missing values.

I found 3 different ways to find the number of non-missing values in “ozone”. The last one is simplest.

> # 3 ways to find number of non-missing values in "ozone"
> length(ozone[is.na(ozone) == F])
[1] 116
> length(ozone[!is.na(ozone)])
[1] 116
> sum(!is.na(ozone))
[1] 116

This last approach takes advantage of the fact that R coerces the logical value TRUE (or its shorthand T) to 1 and FALSE (or F) to 0 in arithmetic. Thus, sum() counts the TRUE values in the vector !is.na(ozone), which gives the number of non-missing values.
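
The same coercion of logical values can be used to count the missing values directly, to get the proportion that is missing, and to check every column of the data frame at once. This is a sketch with base R functions only; the variable names (n_missing, n_observed, prop_missing) are hypothetical.

```r
ozone <- airquality$Ozone

# is.na() returns a logical vector; sum() coerces TRUE to 1 and FALSE to 0
n_missing  <- sum(is.na(ozone))     # 37
n_observed <- sum(!is.na(ozone))    # 116

# The two counts add up to the full length of the vector
n_missing + n_observed == length(ozone)   # TRUE

# The mean of a logical vector is the proportion of TRUE values
prop_missing <- mean(is.na(ozone))

# Number of missing values in every column of the data frame
colSums(is.na(airquality))
```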

Calculating the Summary Statistics

The summary() output above already shows the mean that is calculated after removing the missing values. If you try to use the mean() function to calculate the mean, you will get this strange result:

> mean(ozone)
[1] NA

This happens because mean() does not ignore missing values by default; if any element of the vector is NA, the result is NA. To compute the mean without the missing values, use the “na.rm” argument.

> # calculate mean of "ozone" by excluding missing values
> mean(ozone, na.rm = T)
[1] 42.12931
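
To see why the NA appears and what “na.rm” does, here is a minimal sketch that reproduces the mean by hand. The names mean_narm and mean_manual are made up for illustration. It spells out TRUE rather than T, since T is an ordinary variable that can be reassigned, while TRUE is a reserved word.

```r
ozone <- airquality$Ozone

# Arithmetic involving NA yields NA, so the sum and the mean are both NA
NA + 1          # NA
sum(ozone)      # NA
mean(ozone)     # NA

# na.rm = TRUE drops the missing values before computing
mean_narm <- mean(ozone, na.rm = TRUE)

# Equivalent by hand: sum of the observed values over the number observed
mean_manual <- sum(ozone, na.rm = TRUE) / sum(!is.na(ozone))

all.equal(mean_narm, mean_manual)   # TRUE
```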

This is also needed for the var() and sd() functions when calculating the variance and the standard deviation.

> var(ozone, na.rm = T)
[1] 1088.201
> sd(ozone, na.rm = T)
[1] 32.98788
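
The other quantities reported by summary() can also be computed one at a time, and they take the same “na.rm” argument. The sketch below uses base R functions only; the commented values are the ones shown in the summary() output earlier in this post.

```r
ozone <- airquality$Ozone

# Median, minimum and maximum of the non-missing values
median(ozone, na.rm = TRUE)    # 31.5
range(ozone, na.rm = TRUE)     # 1 and 168

# First and third quartiles, and the interquartile range between them
quantile(ozone, probs = c(0.25, 0.75), na.rm = TRUE)   # 18.00 and 63.25
IQR(ozone, na.rm = TRUE)

# The standard deviation is the square root of the variance
all.equal(sd(ozone, na.rm = TRUE), sqrt(var(ozone, na.rm = TRUE)))   # TRUE
```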


5 Responses to Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

nishant analyst says:

November 25, 2014 at 2:49 am

Reblogged this on nishant@analyst.

Katharine says:

September 29, 2017 at 7:23 am

Wow! Eric! Thank you for explaining things so well, somehow the way you explain things works for my poor old brain!

- Katharine says:
  
  September 29, 2017 at 7:26 am
  
  Also… I am trying to make one of those fabulous violin plots that you make here: https://www.r-bloggers.com/exploratory-data-analysis-combining-box-plots-and-kernel-density-plots-into-violin-plots-for-ozone-pollution-data/ for some very simple data. I am not getting any errors, but the plots are not appearing… rookie mistake? I wondered if you can give me a hint 🙂
  
  - Eric Cai - The Chemical Statistician says:
    
    September 29, 2017 at 3:25 pm
    
    Hi Katherine,
    
    It’s hard to diagnose what is wrong without reading your script. Please share your code in a comment under my blog post on violin plots.
    
    Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data
    
    Try also searching phrases like “R plot not showing” in Google – do any of those results help?
- Eric Cai - The Chemical Statistician says:
  
  September 29, 2017 at 3:19 pm
  
  You’re very welcome, Katherine!
  