August 19, 2013
Data in R are often stored in data frames, because a data frame can hold columns of different types. (In this sense, data frames are more general than matrices, because a matrix can only store one type of data.) Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis. I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.
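As a small sketch of the difference between data frames and matrices, the snippet below checks the class of each column in InsectSprays and then converts the data frame to a matrix. The variable name `m` is just an illustrative choice; the functions used (sapply(), class(), as.matrix(), typeof()) are all part of base R.

```r
# Each column of a data frame keeps its own class
sapply(InsectSprays, class)

# A matrix can hold only one type, so converting this mixed data frame
# coerces every value to character
m <- as.matrix(InsectSprays)
typeof(m)   # "character"
```

Because the numeric counts become text after the conversion, it is usually better to keep mixed data in a data frame.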
If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!
This post continues a recent series on exploratory data analysis. Previous posts in this series include:
- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R
- Conceptual foundations of histograms and how to plot them in R
- Combining histograms and density plots in R
- The 5-number summary: Differences between fivenum() and summary() in R
Useful Functions for Exploring Data Frames
Use dim() to obtain the dimensions of the data frame. The output is an integer vector of length 2: the number of rows first, then the number of columns.
> dim(InsectSprays)
[1] 72  2
Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second elements of the output vector from dim().
> nrow(InsectSprays) # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays) # same as dim(InsectSprays)[2]
[1] 2
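The equivalence between nrow()/ncol() and the elements of dim() can be checked directly. This is only a minimal sketch; the variable names `d`, `n_rows` and `n_cols` are my own choices for illustration.

```r
# Store the dimensions once, then pull out each element by position
d <- dim(InsectSprays)
n_rows <- d[1]   # 72
n_cols <- d[2]   # 2

# Both comparisons should return TRUE
identical(n_rows, nrow(InsectSprays))
identical(n_cols, ncol(InsectSprays))
```

Storing the dimensions in variables like this can be handy later, for example when looping over rows or checking that no observations were lost after cleaning the data.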