# How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

#### Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

#### The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

```> head(subset(mtcars, select = 'gear'))
gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Valiant              3```

What are the possible values of “gear”?  Let’s use the factor() function to find out.

```> factor(mtcars\$gear)
 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5```

The cars in this data set have either 3, 4 or 5 forward gears.  How many cars are there for each number of forward gears?

#### Why the table() function does not work well

The table() function in Base R does give the counts of a categorical variable, but the output is not a data frame – it’s a table, and it’s not easily accessible like a data frame.

```> w = table(mtcars\$gear)
> w
3 4 5
15 12 5

> class(w)
 "table"```

You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name.

```> t = as.data.frame(w)
> t
Var1 Freq
1   3    15
2   4    12
3   5    5```

You can correct this problem with the names() function.

```> names(t) = 'gear'
> t
gear Freq
1   3    15
2   4    12
3   5    5```

I finally have what I want, but that took several functions to accomplish.  Is there an easier way?

Update: As David commented, you can use the “dnn” option in the table() function to add the column name.

#### count() to the Rescue!  (With Complements to the “plyr” Package)

Thankfully, there is an easier way – it’s the count() function in the “plyr” package.  If you don’t already have the “plyr” package, install it first by running the command

` install.packages('plyr')`

Then, call its library, and the count() function will be ready for use.

```> library(plyr)
> count(mtcars, 'gear')
gear      freq
1      3         15
2      4         12
3      5         5
> y = count(mtcars, 'gear')
> y
gear      freq
1      3         15
2      4         12
3      5         5
> class(y)
 "data.frame"```

As the class() function confirms, this output is indeed a data frame!

### 30 Responses to How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

1. Pingback: Condensed News | Data Analytics & R

2. Kent Russell says:

Just a couple of other ways to do this.

table(mtcars\$gear,dnn=”gear”)
xtabs(~gear,mtcars)

With dplyr which will be way faster than plyr.

library(dplyr)
mtcars %>%
group_by( gear ) %>%
tally

3. neuwirthe says:

As much as I appreciate plyr, and even more dplyr, there are solutions in base R.
These are not, however, oneliners.

One issue is mtcars\$gear is a vector and does not have attributes any more.
So table cannot be expected to preserve the name.

as.data.frame(table(subset(mtcars,select=’gear’)))

should work, but does not preserve the variable name.
This is surprising since

as.data.frame(table(subset(mtcars,select=c(‘gear’,’cyl’))))

works as needed.
The problem seems to be that dataframes with just one variable are not treated
as they should be in table.

Since table has the argument dnn for dimension names,
one can use the ugly solution

as.data.frame(table(subset(mtcars,select=’gear’),dnn=’gear’))

Perhaps you should post the problem with names not being preserved to r-devel.
It might be considered a bug.

4. neuwirthe says:

library(dplyr)
mtcars %>%
group_by(gear) %>%
summarise(count=n())

is my preferred solution

• m devlin (@dev_dmu) says:

library(‘dplyr’)
library(‘tidyr’)

# in dplyr count() does group_by and summarise by n() or sum(n) impicitly:
# the fun begins when counting more than 1 variable or
# combined with mutate or spread (from tidyr)

mtcars %>% count(gear) %>% mutate(prop = n/sum(n))

# same as:
mtcars %>% count(am, gear) %>% mutate(prop = prop.table(n)) # !!! it’s always margin 1

mtcars %>% count(am, gear) %>% mutate(prop = n/sum(n))
mtcars %>% count(am, gear) %>% spread(gear, n)

• Ankit Banal says:

what is n denotes here

5. stevenvannoy says:

Thanks, an easy solution to a common task.

6. David says:

This requires 1 simple line of code with table:

data.frame(table(mtcars\$gear, dnn=’gear’))

I don’t see why you would want to install/load plyr if this is all you’re doing.

• Eric Cai - The Chemical Statistician says:

I should have checked the documentation for table() to learn the “dnn” option – thanks for sharing this, David.

While the above does work, it’s still longer, less intuitive and more cumbersome than count(mtcars, ‘gear’). In a later blog post, I will discuss the advantages of count() for displaying cross-tabulations in the list format – and these advantages over table() will be magnified.

Stay tuned, and thanks for commenting.

• krzysztofprzygodzki says:

For me this one is more intuitive, because you declare names in this way in matrix(), data.frame().

as.data.frame.table( table( gear = mtcars\$gear))
gear Freq
1 3 15
2 4 12
3 5 5

7. Minato Nakazawa says:

The same result can be generated by the follows:
as.data.frame(xtabs(~gear, data=mtcars))

• Eric Cai - The Chemical Statistician says:

I didn’t know this, Minato – I always just used xtabs() for cross-tabulations, not one-way frequencies. Thanks!

I still find count(mtcars, ‘gear’) to be shorter and more intuitive.

8. Jesus Castagnetto says:

An alternative solution to the one presented, would be to use dplyr:

> library(dplyr)

> ft % group_by(gear) %>% summarise(freq=n())

> ft
Source: local data frame [3 x 2]

gear freq
1 3 15
2 4 12
3 5 5﻿

9. edwcarney says:

I think that plyr::count is worth knowing, and I’m happy to have come across Eric’s suggestion. It also will take a formula as the second parameter. This generates a flat list instead of a crosstabulation table.

Here’s the result for count(mtcars,gear~disp)

gear cyl freq
1 3 4 1
2 3 6 2
3 3 8 12
4 4 4 8
5 4 6 4
6 5 4 2
7 5 6 1
8 5 8 2

Try count(mtcars,gear~cyl+disp) to see the benefit of this over base::table.

10. Wojciech Sobala says:

with(mtcars, as.data.frame(table(gear)))

11. KCEric says:

Is there a way to do this but to have it generate a distribution %, like prop.table would give you?

For example:

gear freq
1 3 15
2 4 12
3 5 5

Would become:

gear freq-%
1 3 47%
2 4 37%
3 5 15.5%

• Eric Cai - The Chemical Statistician says:

Hi KCEric,

I can’t find a function that can do this, but I wrote one for you. “x” is the data frame that you produce from count().

library(plyr)

y = count(mtcars, ‘gear’)

prop = function(x)
{
x\$prop = x\$freq/sum(x\$freq)
x\$perc = x\$freq/sum(x\$freq)*100

return(x)
}

> prop(y)
gear freq prop perc
1 3 15 0.46875 46.875
2 4 12 0.37500 37.500
3 5 5 0.15625 15.625

• KCEric says:

Thank you so much! This was extremely helpful! I had been searching for a solution for a month because I’m very new to R.

• Eric Cai - The Chemical Statistician says:

You’re welcome, KCEric. Glad that I could help!

12. Stefan Fritsch says:

Hi,

the current situation in R is that you should use either dplyr or data.table for most data manipulation tasks. dplyr’s been mentioned already but with half a dozen variations so I’ll repeat the shortest solutions. 🙂

# dplyr:

library(dplyr)
count(mtcars, gear)

# data.table:

# Only necessary if you don’t use a data.table already
library(data.table)
dtcars <- as.data.table(mtcars)

dtcars[, .N,, gear]

## Percentages:

library(magrittr)
percentify % {.*100} %>% round(1) %>% paste0(“%”)

# dplyr:

mtcars %>%
count(gear) %>%
mutate(prop = n / sum(n), perc = percentify(prop))

# data.table:

dtcars[, .N,, gear][, prop := N/sum(N)][, perc := percentify(prop)] %>% print

(you need neither the pipe nor the print but the first makes it more readable and without the second the output is silent).

Ciao,
Stefan

13. Thomas says:

Is there any way to specify a frequency threshold? Let’s say I have a categorical variable ‘word’ that has 5 possible words and I want to find out which words only occur once. Is this possible?

I ask because I’m a linguist, and when we analyze naturally occurring speech we can end up with a data set containing over 300 unique words.

• Eric Cai - The Chemical Statistician says:

Hi Thomas,

Since the output of count() is an object that you can manipulate, you can then use a variety of functions to subset the output based on your desired condition. Here is an example that continues from my blog post. I will restrict the output to only the gear that occurs 5 times.

> y = count(mtcars, ‘gear’)
> y
gear freq
1 3 15
2 4 12
3 5 5
> subset(y, freq == 5)
gear freq
3 5 5

Hope this helps!

Eric

14. Cass says:

really clear explanation – thanks!

15. Ned says:

Thanks guys, Old post but hopefully I can gain from your experience. is there a way to convert a frequency table back to it’s original form

• Eric Cai - The Chemical Statistician says:

Hi Ned,

I’m sorry for the late reply. Try this:

> w = data.frame(value = y\$gear, freq = y\$freq)
> u = rep(w\$value, w\$freq)
> u
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5

• Eric Cai - The Chemical Statistician says:

Note that it is impossible to reverse the frequency table to assign each value to the original identifier. The best that you can do is get a vector of the values, like what I did above.