Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame
August 19, 2013
Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis. I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.
If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!
This post continues a recent series on exploratory data analysis. Previous posts in this series include
- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R
- Conceptual foundations of histograms and how to plot them in R
- Combining histograms and density plots in R
- The 5-number summary: Differences between fivenum() and summary() in R
Useful Functions for Exploring Data Frames
Use dim() to obtain the dimensions of the data frame (number of rows and number of columns). The output is a vector.
> dim(InsectSprays)
[1] 72  2
Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second element of the output vector from dim().
> nrow(InsectSprays) # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays) # same as dim(InsectSprays)[2]
[1] 2
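To check this equivalence directly, you can compare the elements of dim() with nrow() and ncol(). This is a minimal sketch; the variable name “d” is only for illustration.

```r
# Store the dimensions once: an integer vector of (rows, columns)
d <- dim(InsectSprays)

# Each comparison checks one element of dim() against its shortcut function
identical(d[1], nrow(InsectSprays))
identical(d[2], ncol(InsectSprays))
```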
Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6. These are good commands for obtaining an intuitive idea of what the data look like without revealing the entire data set, which could have millions of rows and thousands of columns.
> head(InsectSprays, n = 6)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
Let s be the number of observations. If you use a negative number for the “n” option in head(), you will obtain the first s+n observations. In the following example, since s = 72 and n = -62, the following command will return the first 10 observations; the calculation is
s+n = 72 + (-62) = 10.
> head(InsectSprays, n = -62)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
6     12     A
7     10     A
8     23     A
9     17     A
10    20     A
Analogously, if you use a negative number for the “n” option in tail(), you will get the last s+n observations. For example, the following command will return the last 10 observations.
> tail(InsectSprays, n = -62)
   count spray
63    15     F
64    22     F
65    15     F
66    16     F
67    13     F
68    10     F
69    26     F
70    26     F
71    24     F
72    13     F
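The s+n rule for negative values of “n” can be checked with a short sketch. The names “s”, “n”, “first_rows” and “last_rows” are only for illustration.

```r
s <- nrow(InsectSprays)   # number of observations (72)
n <- -62                  # negative n: drop 62 rows from the far end

first_rows <- head(InsectSprays, n = n)   # keeps the first s + n rows
last_rows  <- tail(InsectSprays, n = n)   # keeps the last s + n rows

# Both subsets should contain s + n rows
nrow(first_rows) == s + n
nrow(last_rows) == s + n

# The negative-n calls should match the equivalent positive-n calls
identical(first_rows, head(InsectSprays, n = s + n))
identical(last_rows, tail(InsectSprays, n = s + n))
```

A negative “n” is handy when you know how many rows you want to skip rather than how many you want to keep.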
The names() function will return the column headers.
> names(InsectSprays)
[1] "count" "spray"
The str() function returns many useful pieces of information, including the dimensions and column names shown above and the type of data in each column. In this example, “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories or levels.
> str(InsectSprays)
'data.frame':	72 obs. of  2 variables:
 $ count: num  10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
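If you only need the column types as a value you can reuse, rather than the printed display from str(), one possible approach is to apply class() to each column with sapply(). This goes slightly beyond the functions listed in this post, and the name “col_types” is only for illustration.

```r
# Apply class() to every column; the result is a named character vector
col_types <- sapply(InsectSprays, class)
col_types

# Pick out the names of the factor (categorical) columns
names(col_types)[col_types == "factor"]
```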
To obtain all of the categories or levels of a categorical variable, use the levels() function.
> levels(InsectSprays$spray)
[1] "A" "B" "C" "D" "E" "F"
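Two related base R functions may also be useful here: nlevels() counts the levels, and table() counts the observations in each level. A brief sketch, with “spray_counts” as an illustrative name:

```r
# Number of distinct levels in the factor
nlevels(InsectSprays$spray)

# Number of observations in each level
spray_counts <- table(InsectSprays$spray)
spray_counts
```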
When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like “count”, it returns the 5-number summary plus the mean. (Read my previous post to learn how fivenum() and summary() return different 5-number summaries.) If there are any missing values (denoted by “NA” for a particular datum), it also provides a count of them. In this example, there are no missing values for “count”, so the number of NA’s is not displayed. For a categorical variable like “spray”, it returns the levels and the number of observations in each level.
> summary(InsectSprays)
     count       spray
 Min.   : 0.00   A:12
 1st Qu.: 3.00   B:12
 Median : 7.00   C:12
 Mean   : 9.50   D:12
 3rd Qu.:14.25   E:12
 Max.   :26.00   F:12
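Since InsectSprays has no missing values, the NA count never shows up in the output above. To see it, here is a small sketch that works on a copy of the data and sets two counts to NA. The copy “sprays_na” and the choice of rows are hypothetical and purely for illustration.

```r
# Work on a copy so the built-in data set is left untouched
sprays_na <- InsectSprays

# Hypothetical: pretend the first two counts were not recorded
sprays_na$count[c(1, 2)] <- NA

# The numeric summary now carries an extra "NA's" entry
count_summary <- summary(sprays_na$count)
count_summary

# The factor summary is unchanged: observations per level
summary(sprays_na$spray)
```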
Are there any other functions for exploring data frames that you like? If so, please share them in the comments!