← Free Seminar on Sports Analytics + Free Food @ Boston Pizza in New Westminster: 7 pm, Friday, January 23, 2015

The advantages of using count() to get N-way frequency tables as data frames in R →

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

February 3, 2015 32 Comments

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

What are the possible values of “gear”? Let’s use the factor() function to find out.

> factor(mtcars$gear)
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5

The cars in this data set have either 3, 4 or 5 forward gears. How many cars are there for each number of forward gears?

Why the table() function does not work well

The table() function in Base R does give the counts of a categorical variable, but the output is not a data frame – it’s a table, and it’s not easily accessible like a data frame.

> w = table(mtcars$gear)
> w
3 4 5 
15 12 5 

> class(w)
[1] "table"

You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name.

> t = as.data.frame(w)
> t
    Var1 Freq
1   3    15
2   4    12
3   5    5

You can correct this problem with the names() function.

> names(t)[1] = 'gear'
> t
    gear Freq
1   3    15
2   4    12
3   5    5

I finally have what I want, but that took several functions to accomplish. Is there an easier way?

Update: As David commented, you can use the “dnn” option in the table() function to add the column name.

count() to the Rescue! (With Complements to the “plyr” Package)

Thankfully, there is an easier way – it’s the count() function in the “plyr” package. If you don’t already have the “plyr” package, install it first by running the command

 install.packages('plyr')

Then, call its library, and the count() function will be ready for use.

> library(plyr)
> count(mtcars, 'gear')
       gear      freq
1      3         15
2      4         12
3      5         5
> y = count(mtcars, 'gear')
> y
       gear      freq
1      3         15
2      4         12
3      5         5
> class(y)
[1] "data.frame"

As the class() function confirms, this output is indeed a data frame!

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with categorical variable, class(), count, data frame, factor, frequency table, install.packages(0, mtcars, names(), plyr, R, R programming, subset, table()

32 Responses to How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Pingback: Condensed News | Data Analytics & R
Kent Russell says:

February 4, 2015 at 6:23 am

Just a couple of other ways to do this.

table(mtcars$gear,dnn=”gear”)
xtabs(~gear,mtcars)

With dplyr which will be way faster than plyr.

library(dplyr)
mtcars %>%
group_by( gear ) %>%
tally

Reply
neuwirthe says:

February 4, 2015 at 7:23 am

As much as I appreciate plyr, and even more dplyr, there are solutions in base R.
These are not, however, oneliners.

One issue is mtcars$gear is a vector and does not have attributes any more.
So table cannot be expected to preserve the name.

as.data.frame(table(subset(mtcars,select=’gear’)))

should work, but does not preserve the variable name.
This is surprising since

as.data.frame(table(subset(mtcars,select=c(‘gear’,’cyl’))))

works as needed.
The problem seems to be that dataframes with just one variable are not treated
as they should be in table.

Since table has the argument dnn for dimension names,
one can use the ugly solution

as.data.frame(table(subset(mtcars,select=’gear’),dnn=’gear’))

Perhaps you should post the problem with names not being preserved to r-devel.
It might be considered a bug.

Reply
neuwirthe says:

February 4, 2015 at 7:33 am

Let me add:

library(dplyr)
mtcars %>%
group_by(gear) %>%
summarise(count=n())

is my preferred solution

Reply
- m devlin (@dev_dmu) says:
  
  February 5, 2015 at 12:06 pm
  
  library(‘dplyr’)
  library(‘tidyr’)
  
  # in dplyr count() does group_by and summarise by n() or sum(n) impicitly:
  # the fun begins when counting more than 1 variable or
  # combined with mutate or spread (from tidyr)
  
  mtcars %>% count(gear) %>% mutate(prop = n/sum(n))
  
  # same as:
  mtcars %>% count(am, gear) %>% mutate(prop = prop.table(n)) # !!! it’s always margin 1
  
  mtcars %>% count(am, gear) %>% mutate(prop = n/sum(n))
  mtcars %>% count(am, gear) %>% spread(gear, n)
  
  Reply
- Ankit Banal says:
  
  May 21, 2018 at 8:59 am
  
  what is n denotes here
  
  Reply
stevenvannoy says:

February 4, 2015 at 7:54 am

Thanks, an easy solution to a common task.

Reply
David says:

February 4, 2015 at 7:55 am

This requires 1 simple line of code with table:

data.frame(table(mtcars$gear, dnn=’gear’))

I don’t see why you would want to install/load plyr if this is all you’re doing.

Reply
- Eric Cai - The Chemical Statistician says:
  
  February 4, 2015 at 10:29 am
  
  I should have checked the documentation for table() to learn the “dnn” option – thanks for sharing this, David.
  
  While the above does work, it’s still longer, less intuitive and more cumbersome than count(mtcars, ‘gear’). In a later blog post, I will discuss the advantages of count() for displaying cross-tabulations in the list format – and these advantages over table() will be magnified.
  
  Stay tuned, and thanks for commenting.
  
  Reply
- krzysztofprzygodzki says:
  
  February 4, 2015 at 11:33 am
  
  For me this one is more intuitive, because you declare names in this way in matrix(), data.frame().
  
  as.data.frame.table( table( gear = mtcars$gear))
  gear Freq
  1 3 15
  2 4 12
  3 5 5
  
  Reply
Minato Nakazawa says:

February 4, 2015 at 8:02 am

The same result can be generated by the follows:
as.data.frame(xtabs(~gear, data=mtcars))

Reply
- Eric Cai - The Chemical Statistician says:
  
  February 4, 2015 at 10:30 am
  
  I didn’t know this, Minato – I always just used xtabs() for cross-tabulations, not one-way frequencies. Thanks!
  
  I still find count(mtcars, ‘gear’) to be shorter and more intuitive.
  
  Reply
Jesus Castagnetto says:

February 4, 2015 at 9:01 am

An alternative solution to the one presented, would be to use dplyr:

> library(dplyr)

> ft % group_by(gear) %>% summarise(freq=n())

> ft
Source: local data frame [3 x 2]

gear freq
1 3 15
2 4 12
3 5 5

Reply
- Jesus Castagnetto says:
  
  February 4, 2015 at 9:05 am
  
  Hmm, code got mangled somehow, so I put it at the URL: https://gist.github.com/jmcastagnetto/e0f1a0cde76bfac2206d
  
  Reply
edwcarney says:

February 5, 2015 at 10:27 am

I think that plyr::count is worth knowing, and I’m happy to have come across Eric’s suggestion. It also will take a formula as the second parameter. This generates a flat list instead of a crosstabulation table.

Here’s the result for count(mtcars,gear~disp)

gear cyl freq
1 3 4 1
2 3 6 2
3 3 8 12
4 4 4 8
5 4 6 4
6 5 4 2
7 5 6 1
8 5 8 2

Try count(mtcars,gear~cyl+disp) to see the benefit of this over base::table.

Reply
- Eric Cai - The Chemical Statistician says:
  
  February 5, 2015 at 12:34 pm
  
  Yes! That will come in another blog post, but thanks for sharing, Edward!
  
  Reply
Wojciech Sobala says:

February 5, 2015 at 12:14 pm

with(mtcars, as.data.frame(table(gear)))

Reply
KCEric says:

February 11, 2015 at 10:57 am

Is there a way to do this but to have it generate a distribution %, like prop.table would give you?

For example:

gear freq
1 3 15
2 4 12
3 5 5

Would become:

gear freq-%
1 3 47%
2 4 37%
3 5 15.5%

Reply
- Eric Cai - The Chemical Statistician says:
  
  February 11, 2015 at 9:59 pm
  
  Hi KCEric,
  
  I can’t find a function that can do this, but I wrote one for you. “x” is the data frame that you produce from count().
  
  library(plyr)
  
  y = count(mtcars, ‘gear’)
  
  prop = function(x)
  {
  x$prop = x$freq/sum(x$freq)
  x$perc = x$freq/sum(x$freq)*100
  
  return(x)
  }
  
  > prop(y)
  gear freq prop perc
  1 3 15 0.46875 46.875
  2 4 12 0.37500 37.500
  3 5 5 0.15625 15.625
  
  Reply
  - KCEric says:
    
    February 26, 2015 at 1:37 pm
    
    Thank you so much! This was extremely helpful! I had been searching for a solution for a month because I’m very new to R.
  - Eric Cai - The Chemical Statistician says:
    
    February 26, 2015 at 2:59 pm
    
    You’re welcome, KCEric. Glad that I could help!
Stefan Fritsch says:

January 21, 2016 at 5:28 am

Hi,

the current situation in R is that you should use either dplyr or data.table for most data manipulation tasks. dplyr’s been mentioned already but with half a dozen variations so I’ll repeat the shortest solutions. 🙂

# dplyr:

library(dplyr)
count(mtcars, gear)

# data.table:

# Only necessary if you don’t use a data.table already
library(data.table)
dtcars <- as.data.table(mtcars)

dtcars[, .N,, gear]

## Percentages:

library(magrittr)
percentify % {.*100} %>% round(1) %>% paste0(“%”)

# dplyr:

mtcars %>%
count(gear) %>%
mutate(prop = n / sum(n), perc = percentify(prop))

# data.table:

dtcars[, .N,, gear][, prop := N/sum(N)][, perc := percentify(prop)] %>% print

(you need neither the pipe nor the print but the first makes it more readable and without the second the output is silent).

Ciao,
Stefan

Reply
- Eric Cai - The Chemical Statistician says:
  
  January 23, 2016 at 4:08 pm
  
  Thanks, Stefan! I really like dplyr, too, and I appreciate you taking the time to write this example!
  
  Reply
Thomas says:

April 23, 2017 at 5:18 pm

Is there any way to specify a frequency threshold? Let’s say I have a categorical variable ‘word’ that has 5 possible words and I want to find out which words only occur once. Is this possible?

I ask because I’m a linguist, and when we analyze naturally occurring speech we can end up with a data set containing over 300 unique words.

Reply
- Eric Cai - The Chemical Statistician says:
  
  April 24, 2017 at 1:50 pm
  
  Hi Thomas,
  
  Since the output of count() is an object that you can manipulate, you can then use a variety of functions to subset the output based on your desired condition. Here is an example that continues from my blog post. I will restrict the output to only the gear that occurs 5 times.
  
  > y = count(mtcars, ‘gear’)
  > y
  gear freq
  1 3 15
  2 4 12
  3 5 5
  > subset(y, freq == 5)
  gear freq
  3 5 5
  
  Hope this helps!
  
  Eric
  
  Reply
Cass says:

November 22, 2017 at 5:54 am

really clear explanation – thanks!

Reply
- Eric Cai - The Chemical Statistician says:
  
  December 1, 2017 at 5:12 pm
  
  You’re welcome, Cass!
  
  Reply
Ned says:

December 4, 2017 at 12:11 pm

Thanks guys, Old post but hopefully I can gain from your experience. is there a way to convert a frequency table back to it’s original form

Reply
- Eric Cai - The Chemical Statistician says:
  
  December 15, 2017 at 1:53 pm
  
  Hi Ned,
  
  I’m sorry for the late reply. Try this:
  
  > w = data.frame(value = y$gear, freq = y$freq)
  > u = rep(w$value, w$freq)
  > u
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5
  
  Reply
- Eric Cai - The Chemical Statistician says:
  
  December 15, 2017 at 5:38 pm
  
  Note that it is impossible to reverse the frequency table to assign each value to the original identifier. The best that you can do is get a vector of the values, like what I did above.
  
  Reply
Martin says:

December 23, 2020 at 11:03 am

Is there a way to count values based on a vector with item levels? I’m thinking about a syntax such as

count(mtcars,’gear’,c(0,3,4))

which should count the values of gear 0 (of which there is none!), gear 3 and gear 4.

Thanks in advance, Martin

Reply
- Eric Cai - The Chemical Statistician says:
  
  January 6, 2021 at 9:08 am
  
  Hi Martin – I don’t know of such a method, but I think that it would be easier for you to get the full output of count(), THEN filter for the particular values that you seek. Does that make sense?
  
  Reply

	Eric Cai - The Chemi… on Convert multiple variables bet…
	Jack on Convert multiple variables bet…
	Eric Cai - The Chemi… on Getting the names, types, form…
	Emily V on Getting the names, types, form…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Convert multiple variables bet…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Exploratory Data Analysis: Com…
	CK on Exploratory Data Analysis: Com…
	Eric Cai - The Chemi… on Video Tutorial: Breaking Down…

The Chemical Statistician

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Introduction

The Example Data Set – mtcars

Why the table() function does not work well

count() to the Rescue! (With Complements to the “plyr” Package)

32 Responses to How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Your thoughtful comments are much appreciated! Cancel reply

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories

The Chemical Statistician

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Introduction

The Example Data Set – mtcars

Why the table() function does not work well

count() to the Rescue! (With Complements to the “plyr” Package)

Share this:

Related

32 Responses to How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Your thoughtful comments are much appreciated! Cancel reply

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories