Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

Introduction

Data in R are often stored in data frames, because a data frame can hold columns of different types.  (In this sense, data frames are more general than matrices, because a matrix can store only one type of data.)  Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.  I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, which lets me show different ways of exploring these 2 types of data.
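To make the difference between data frames and matrices concrete, here is a minimal sketch.  The objects "df" and "m" and their contents are made up for illustration; they are not part of InsectSprays.

```r
# A data frame keeps a separate type for each column.
df <- data.frame(id = 1:3, group = c("a", "b", "c"))
sapply(df, class)

# A matrix holds a single type, so combining numbers and strings
# coerces every entry to character.
m <- cbind(id = 1:3, group = c("a", "b", "c"))
typeof(m)   # "character"
```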

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis.

Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns).  The output is a vector.

> dim(InsectSprays)
[1] 72  2

Use nrow() and ncol() to get the number of rows and number of columns, respectively.  You can get the same information by extracting the first and second element of the output vector from dim(). 

> nrow(InsectSprays)   # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)   # same as dim(InsectSprays)[2]
[1] 2
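As a quick check of the relationship between dim(), nrow() and ncol(), the following sketch stores the output of dim() in a variable (the name "d" is my own choice) and compares it with the other two functions.

```r
d <- dim(InsectSprays)

# dim() returns an integer vector of length 2: rows, then columns.
is.vector(d)
length(d)

# Combining nrow() and ncol() reproduces the same vector.
identical(d, c(nrow(InsectSprays), ncol(InsectSprays)))   # TRUE
```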

Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6.  These are good commands for obtaining an intuitive idea of what the data look like without revealing the entire data set, which could have millions of rows and thousands of columns.

> head(InsectSprays, n = 5)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A

Let s be the number of observations.  If you use a negative number for the “n” option in head(), you will obtain the first s+n observations; in other words, all but the last |n| observations.  In the following example, since s = 72 and n = -62, the command will return the first 10 observations; the calculation is

s+n = 72 + (-62) = 10.

> head(InsectSprays, n = -62)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
6     12     A
7     10     A
8     23     A
9     17     A
10    20     A

Analogously, if you use a negative number for the “n” option in tail(), you will get the last s+n observations.  For example, the following command will return the last 10 observations.

> tail(InsectSprays, n = -62)
   count spray  
63    15     F
64    22     F
65    15     F
66    16     F
67    13     F
68    10     F
69    26     F
70    26     F
71    24     F
72    13     F
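The s+n rule for negative values of “n” can be checked directly.  This is a small sketch using the same numbers as above; the variable names "s", "n", "first_rows" and "last_rows" are mine.

```r
s <- nrow(InsectSprays)   # 72 observations
n <- -62

# A negative n drops |n| observations from the far end,
# which leaves s + n = 10 observations.
first_rows <- head(InsectSprays, n = n)
last_rows  <- tail(InsectSprays, n = n)

# Both results match a call with the positive value s + n.
identical(first_rows, head(InsectSprays, n = s + n))
identical(last_rows,  tail(InsectSprays, n = s + n))
```

Both comparisons should return TRUE, so a negative “n” is simply a convenient way of saying “everything except the last (or first) 62 rows”.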

The names() function will return the column headers.

> names(InsectSprays)
[1] "count" "spray"

The str() function displays many useful pieces of information at once, including the dimensions and column names shown above and the type of data in each column.  In this example, “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories or levels.

> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
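If you only want the type of each column, and in a form that you can reuse in later code, one possible approach is to apply class() to every column with sapply().  The variable name "col_types" is my own.

```r
# One class per column, returned as a named character vector.
col_types <- sapply(InsectSprays, class)
col_types

# The result can be used programmatically, e.g. to pick out the factor columns.
names(col_types)[col_types == "factor"]   # "spray"
```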

To obtain all of the categories or levels of a categorical variable, use the levels() function.

> levels(InsectSprays$spray)
[1] "A" "B" "C" "D" "E" "F"
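Two related functions are worth knowing here: nlevels() counts the levels, and table() counts the observations in each level.  The sketch below also shows that levels() returns NULL for a variable that is not a factor; the variable name "spray_counts" is mine.

```r
# Number of levels of the factor.
nlevels(InsectSprays$spray)            # 6

# Number of observations in each level.
spray_counts <- table(InsectSprays$spray)
spray_counts

# A numeric column has no levels, so levels() returns NULL.
levels(InsectSprays$count)             # NULL
```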

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together.  For a continuous (numeric) variable like “count”, it returns the 5-number summary along with the mean.  (Read my previous post to learn how fivenum() and summary() return different 5-number summaries.)  If there are any missing values (denoted by “NA”), it also provides a count of them.  In this example, there are no missing values for “count”, so there is no display for the number of NA’s.  For a categorical variable like “spray”, it returns the levels and the number of observations in each level.

> summary(InsectSprays)
     count       spray
 Min.   : 0.00   A:12
 1st Qu.: 3.00   B:12
 Median : 7.00   C:12
 Mean   : 9.50   D:12
 3rd Qu.:14.25   E:12
 Max.   :26.00   F:12
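Since InsectSprays has no missing values, the sketch below works on a copy with one value deliberately set to NA, just to show how summary() reports it.  The copy "sprays_na" is hypothetical and is not part of the original data; the first two lines simply put summary() and fivenum() side by side for the “count” column.

```r
# summary() on a single numeric column: six values (5-number summary plus mean).
summary(InsectSprays$count)

# fivenum() returns five values and uses hinges rather than quartiles.
fivenum(InsectSprays$count)

# Hypothetical copy with one missing value, for illustration only.
sprays_na <- InsectSprays
sprays_na$count[1] <- NA

# The output now includes an extra "NA's" entry with the count of missing values.
summary(sprays_na$count)
```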

Are there any other functions for exploring data frames that you like?  If so, please share them in the comments!

6 Responses to Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

  1. Pete Withers says:

    Is there any chance you could help me with this problem? I have a data frame containing a series of blood pressure readings (bp)
    I want to count the number of readings falling into particular ranges eg how many bp readings in range 101-105, 106-110, 111-115 etc Pretty much what I would want to do if I was making my own histogram.
    Could you point me in the direction of some possible R functions I could investigate please?

