How to Get the Frequency Table of a Categorical Variable as a Data Frame in R
February 3, 2015 32 Comments
Introduction
One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.
I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.
The Example Data Set – mtcars
Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.
> head(subset(mtcars, select = 'gear')) gear Mazda RX4 4 Mazda RX4 Wag 4 Datsun 710 4 Hornet 4 Drive 3 Hornet Sportabout 3 Valiant 3
What are the possible values of “gear”? Let’s use the factor() function to find out.
> factor(mtcars$gear) [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4 Levels: 3 4 5
The cars in this data set have either 3, 4 or 5 forward gears. How many cars are there for each number of forward gears?
Why the table() function does not work well
The table() function in Base R does give the counts of a categorical variable, but the output is not a data frame – it’s a table, and it’s not easily accessible like a data frame.
> w = table(mtcars$gear) > w 3 4 5 15 12 5 > class(w) [1] "table"
You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name.
> t = as.data.frame(w) > t Var1 Freq 1 3 15 2 4 12 3 5 5
You can correct this problem with the names() function.
> names(t)[1] = 'gear' > t gear Freq 1 3 15 2 4 12 3 5 5
I finally have what I want, but that took several functions to accomplish. Is there an easier way?
Update: As David commented, you can use the “dnn” option in the table() function to add the column name.
count() to the Rescue! (With Complements to the “plyr” Package)
Thankfully, there is an easier way – it’s the count() function in the “plyr” package. If you don’t already have the “plyr” package, install it first by running the command
install.packages('plyr')
Then, call its library, and the count() function will be ready for use.
> library(plyr) > count(mtcars, 'gear') gear freq 1 3 15 2 4 12 3 5 5 > y = count(mtcars, 'gear') > y gear freq 1 3 15 2 4 12 3 5 5 > class(y) [1] "data.frame"
As the class() function confirms, this output is indeed a data frame!
Pingback: Condensed News | Data Analytics & R
Just a couple of other ways to do this.
table(mtcars$gear,dnn=”gear”)
xtabs(~gear,mtcars)
With dplyr which will be way faster than plyr.
library(dplyr)
mtcars %>%
group_by( gear ) %>%
tally
As much as I appreciate plyr, and even more dplyr, there are solutions in base R.
These are not, however, oneliners.
One issue is mtcars$gear is a vector and does not have attributes any more.
So table cannot be expected to preserve the name.
as.data.frame(table(subset(mtcars,select=’gear’)))
should work, but does not preserve the variable name.
This is surprising since
as.data.frame(table(subset(mtcars,select=c(‘gear’,’cyl’))))
works as needed.
The problem seems to be that dataframes with just one variable are not treated
as they should be in table.
Since table has the argument dnn for dimension names,
one can use the ugly solution
as.data.frame(table(subset(mtcars,select=’gear’),dnn=’gear’))
Perhaps you should post the problem with names not being preserved to r-devel.
It might be considered a bug.
Let me add:
library(dplyr)
mtcars %>%
group_by(gear) %>%
summarise(count=n())
is my preferred solution
library(‘dplyr’)
library(‘tidyr’)
# in dplyr count() does group_by and summarise by n() or sum(n) impicitly:
# the fun begins when counting more than 1 variable or
# combined with mutate or spread (from tidyr)
mtcars %>% count(gear) %>% mutate(prop = n/sum(n))
# same as:
mtcars %>% count(am, gear) %>% mutate(prop = prop.table(n)) # !!! it’s always margin 1
mtcars %>% count(am, gear) %>% mutate(prop = n/sum(n))
mtcars %>% count(am, gear) %>% spread(gear, n)
what is n denotes here
Thanks, an easy solution to a common task.
This requires 1 simple line of code with table:
data.frame(table(mtcars$gear, dnn=’gear’))
I don’t see why you would want to install/load plyr if this is all you’re doing.
I should have checked the documentation for table() to learn the “dnn” option – thanks for sharing this, David.
While the above does work, it’s still longer, less intuitive and more cumbersome than count(mtcars, ‘gear’). In a later blog post, I will discuss the advantages of count() for displaying cross-tabulations in the list format – and these advantages over table() will be magnified.
Stay tuned, and thanks for commenting.
For me this one is more intuitive, because you declare names in this way in matrix(), data.frame().
as.data.frame.table( table( gear = mtcars$gear))
gear Freq
1 3 15
2 4 12
3 5 5
The same result can be generated by the follows:
as.data.frame(xtabs(~gear, data=mtcars))
I didn’t know this, Minato – I always just used xtabs() for cross-tabulations, not one-way frequencies. Thanks!
I still find count(mtcars, ‘gear’) to be shorter and more intuitive.
An alternative solution to the one presented, would be to use dplyr:
> library(dplyr)
> ft % group_by(gear) %>% summarise(freq=n())
> ft
Source: local data frame [3 x 2]
gear freq
1 3 15
2 4 12
3 5 5
Hmm, code got mangled somehow, so I put it at the URL: https://gist.github.com/jmcastagnetto/e0f1a0cde76bfac2206d
I think that plyr::count is worth knowing, and I’m happy to have come across Eric’s suggestion. It also will take a formula as the second parameter. This generates a flat list instead of a crosstabulation table.
Here’s the result for count(mtcars,gear~disp)
gear cyl freq
1 3 4 1
2 3 6 2
3 3 8 12
4 4 4 8
5 4 6 4
6 5 4 2
7 5 6 1
8 5 8 2
Try count(mtcars,gear~cyl+disp) to see the benefit of this over base::table.
Yes! That will come in another blog post, but thanks for sharing, Edward!
with(mtcars, as.data.frame(table(gear)))
Is there a way to do this but to have it generate a distribution %, like prop.table would give you?
For example:
gear freq
1 3 15
2 4 12
3 5 5
Would become:
gear freq-%
1 3 47%
2 4 37%
3 5 15.5%
Hi KCEric,
I can’t find a function that can do this, but I wrote one for you. “x” is the data frame that you produce from count().
library(plyr)
y = count(mtcars, ‘gear’)
prop = function(x)
{
x$prop = x$freq/sum(x$freq)
x$perc = x$freq/sum(x$freq)*100
return(x)
}
> prop(y)
gear freq prop perc
1 3 15 0.46875 46.875
2 4 12 0.37500 37.500
3 5 5 0.15625 15.625
Thank you so much! This was extremely helpful! I had been searching for a solution for a month because I’m very new to R.
You’re welcome, KCEric. Glad that I could help!
Hi,
the current situation in R is that you should use either dplyr or data.table for most data manipulation tasks. dplyr’s been mentioned already but with half a dozen variations so I’ll repeat the shortest solutions. 🙂
# dplyr:
library(dplyr)
count(mtcars, gear)
# data.table:
# Only necessary if you don’t use a data.table already
library(data.table)
dtcars <- as.data.table(mtcars)
dtcars[, .N,, gear]
## Percentages:
library(magrittr)
percentify % {.*100} %>% round(1) %>% paste0(“%”)
# dplyr:
mtcars %>%
count(gear) %>%
mutate(prop = n / sum(n), perc = percentify(prop))
# data.table:
dtcars[, .N,, gear][, prop := N/sum(N)][, perc := percentify(prop)] %>% print
(you need neither the pipe nor the print but the first makes it more readable and without the second the output is silent).
Ciao,
Stefan
Thanks, Stefan! I really like dplyr, too, and I appreciate you taking the time to write this example!
Is there any way to specify a frequency threshold? Let’s say I have a categorical variable ‘word’ that has 5 possible words and I want to find out which words only occur once. Is this possible?
I ask because I’m a linguist, and when we analyze naturally occurring speech we can end up with a data set containing over 300 unique words.
Hi Thomas,
Since the output of count() is an object that you can manipulate, you can then use a variety of functions to subset the output based on your desired condition. Here is an example that continues from my blog post. I will restrict the output to only the gear that occurs 5 times.
> y = count(mtcars, ‘gear’)
> y
gear freq
1 3 15
2 4 12
3 5 5
> subset(y, freq == 5)
gear freq
3 5 5
Hope this helps!
Eric
really clear explanation – thanks!
You’re welcome, Cass!
Thanks guys, Old post but hopefully I can gain from your experience. is there a way to convert a frequency table back to it’s original form
Hi Ned,
I’m sorry for the late reply. Try this:
> w = data.frame(value = y$gear, freq = y$freq)
> u = rep(w$value, w$freq)
> u
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5
Note that it is impossible to reverse the frequency table to assign each value to the original identifier. The best that you can do is get a vector of the values, like what I did above.
Is there a way to count values based on a vector with item levels? I’m thinking about a syntax such as
count(mtcars,’gear’,c(0,3,4))
which should count the values of gear 0 (of which there is none!), gear 3 and gear 4.
Thanks in advance, Martin
Hi Martin – I don’t know of such a method, but I think that it would be easier for you to get the full output of count(), THEN filter for the particular values that you seek. Does that make sense?