# Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

August 19, 2013

#### Introduction

Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis. I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.
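Here is a minimal sketch of the difference between matrices and data frames. The objects `m` and `df` are hypothetical names made up for this illustration, and I set `stringsAsFactors = TRUE` only so that the text column becomes a factor, as it is in InsectSprays.

```r
# A matrix holds a single type, so mixing numbers and text
# coerces every element to character.
m <- cbind(count = c(10, 7), spray = c("A", "B"))
typeof(m)            # "character"

# A data frame keeps a separate type for each column.
df <- data.frame(count = c(10, 7), spray = c("A", "B"),
                 stringsAsFactors = TRUE)
sapply(df, class)    # count is "numeric", spray is "factor"
```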

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis. Previous posts in this series include

- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R
- Conceptual foundations of histograms and how to plot them in R
- Combining histograms and density plots in R
- The 5-number summary: Differences between fivenum() and summary() in R

#### Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns). The output is a vector.

    > dim(InsectSprays)
    [1] 72  2

Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second element of the output vector from dim().

    > nrow(InsectSprays) # same as dim(InsectSprays)[1]
    [1] 72
    > ncol(InsectSprays) # same as dim(InsectSprays)[2]
    [1] 2

Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6. These are good commands for obtaining an intuitive idea of what the data look like without revealing the entire data set, which could have millions of rows and thousands of columns.

    > head(InsectSprays, n = 5)
      count spray
    1    10     A
    2     7     A
    3    20     A
    4    14     A
    5    14     A

Let s be the number of observations. If you use a negative number for the “n” option in head(), you will obtain the first s+n observations; in other words, the last |n| observations are dropped. In the following example, since s = 72 and n = -62, the following command will return the first 10 observations; the calculation is

s+n = 72 + (-62) = 10.

    > head(InsectSprays, n = -62)
       count spray
    1     10     A
    2      7     A
    3     20     A
    4     14     A
    5     14     A
    6     12     A
    7     10     A
    8     23     A
    9     17     A
    10    20     A

Analogously, if you use a negative number for the “n” option in tail(), you will get the last s+n observations. For example, the following command will return the last 10 observations.

    > tail(InsectSprays, n = -62)
       count spray
    63    15     F
    64    22     F
    65    15     F
    66    16     F
    67    13     F
    68    10     F
    69    26     F
    70    26     F
    71    24     F
    72    13     F
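If you want to convince yourself of the s+n rule, a quick check along these lines may help. The variable names `s`, `n`, `first_rows` and `last_rows` are my own choices for this sketch.

```r
s <- nrow(InsectSprays)   # 72 observations
n <- -62

# head() with a negative n drops the last |n| rows,
# leaving the first s + n rows.
first_rows <- head(InsectSprays, n = n)
nrow(first_rows)          # 10

# tail() with a negative n drops the first |n| rows,
# leaving the last s + n rows.
last_rows <- tail(InsectSprays, n = n)
rownames(last_rows)       # "63" through "72"
```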

The names() function will return the column headers.

    > names(InsectSprays)
    [1] "count" "spray"

The str() function displays many useful pieces of information at once, including the dimensions and column names shown above, the type of data in each column, and the first few values. In this example, “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories or levels.

    > str(InsectSprays)
    'data.frame':   72 obs. of  2 variables:
     $ count: num  10 7 20 14 14 12 10 23 17 20 ...
     $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
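If you only need the type of each column as a value that you can reuse in later code, one possible approach is to apply class() to every column with sapply(). The name `col_classes` is just a label I chose for this sketch.

```r
# Apply class() to each column of the data frame;
# the result is a named character vector.
col_classes <- sapply(InsectSprays, class)
col_classes

# Pick out the names of the categorical (factor) columns.
names(col_classes)[col_classes == "factor"]   # "spray"
```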

To obtain all of the categories or levels of a categorical variable, use the levels() function.

    > levels(InsectSprays$spray)
    [1] "A" "B" "C" "D" "E" "F"
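Two related base R functions can complement levels(): nlevels() counts the levels, and table() counts the observations in each level. This is only a small sketch, and `spray_counts` is a name I made up for it.

```r
# Number of levels of the factor
nlevels(InsectSprays$spray)       # 6

# Number of observations in each level
spray_counts <- table(InsectSprays$spray)
spray_counts                      # 12 observations for each of A to F
```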

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like “count”, it returns the 5-number summary (minimum, first quartile, median, third quartile and maximum) together with the mean. (Read my previous post to learn how fivenum() and summary() return different 5-number summaries.) If there are any missing values (denoted by “NA” for a particular datum), it would also provide a count for them. In this example, there are no missing values for “count”, so there is no display for the number of NA’s. For a categorical variable like “spray”, it returns the levels and the number of data in each level.

    > summary(InsectSprays)
         count       spray
     Min.   : 0.00   A:12
     1st Qu.: 3.00   B:12
     Median : 7.00   C:12
     Mean   : 9.50   D:12
     3rd Qu.:14.25   E:12
     Max.   :26.00   F:12
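InsectSprays has no missing values, so here is a rough sketch of how the NA count shows up. I copy the data frame into a hypothetical object called `sprays_na` and blank out one value purely for illustration; the real data set is left untouched.

```r
sprays_na <- InsectSprays       # work on a copy
sprays_na$count[1] <- NA        # artificial missing value, for illustration only

# summary() now reports an extra "NA's" entry for "count"
count_summary <- summary(sprays_na$count)
names(count_summary)

# A direct count of the missing values in each column
colSums(is.na(sprays_na))       # count: 1, spray: 0
```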

**Are there any other functions for exploring data frames that you like? If so, please share them in the comments!**


Is there any chance you could help me with this problem? I have a data frame containing a series of blood pressure readings (bp)

I want to count the number of readings falling into particular ranges, e.g. how many bp readings are in the ranges 101-105, 106-110, 111-115, etc. Pretty much what I would want to do if I was making my own histogram.

Could you point me in the direction of some possible R functions I could investigate please?

Hi Pete,

1) Create a new variable that categorizes bp.

2) Then, use count() in the ‘dplyr’ package to get the frequency table.

3) If you want to plot the absolute frequencies, use the barplot() function to get a bar plot.

4) If you want to plot the relative frequencies, divide the absolute frequencies by the total count, and THEN use a bar plot.

I just wrote a blog post on how to use count().

https://chemicalstatistician.wordpress.com/2015/02/03/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/
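As a rough sketch of steps 1 to 4, here is one way to do it in base R. Note that it uses cut() and table() rather than count() from ‘dplyr’, and that the readings in `bp` are made up for illustration and assumed to be whole numbers; `bp_range`, `freq` and `rel_freq` are names I chose for this example.

```r
# Hypothetical blood pressure readings (not real data)
bp <- c(101, 104, 106, 110, 111, 113, 115, 108, 102, 109)

# Step 1: categorize bp. Breaks at 100, 105, 110 and 115 give the
# right-closed intervals (100,105], (105,110] and (110,115].
bp_range <- cut(bp, breaks = seq(100, 115, by = 5),
                labels = c("101-105", "106-110", "111-115"))

# Step 2: absolute frequencies of each range
freq <- table(bp_range)
freq

# Step 4: relative frequencies (absolute frequencies / total count)
rel_freq <- prop.table(freq)
rel_freq
```

Passing `freq` or `rel_freq` to barplot() would then draw the bar plot described in steps 3 and 4.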

Many thanks for your help – just what I needed!

You’re welcome, Pete! You also just gave me a good idea for a new blog post!
