How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

What are the possible values of “gear”?  Let’s use the factor() function to find out.

> factor(mtcars$gear)
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5

The cars in this data set have either 3, 4 or 5 forward gears.  How many cars are there for each number of forward gears?

Why the table() function does not work well

The table() function in Base R does give the counts of a categorical variable, but the output is not a data frame – it’s a table, and it’s not easily accessible like a data frame.

> w = table(mtcars$gear)
> w
3 4 5 
15 12 5 

> class(w)
[1] "table"

You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name.

> t = as.data.frame(w)
> t
    Var1 Freq
1   3    15
2   4    12
3   5    5

You can correct this problem with the names() function.

> names(t)[1] = 'gear'
> t
    gear Freq
1   3    15
2   4    12
3   5    5

I finally have what I want, but that took several functions to accomplish.  Is there an easier way?

Update: As David commented, you can use the “dnn” option in the table() function to add the column name.

count() to the Rescue!  (With Complements to the “plyr” Package)

Thankfully, there is an easier way – it’s the count() function in the “plyr” package.  If you don’t already have the “plyr” package, install it first by running the command

 install.packages('plyr')

Then, call its library, and the count() function will be ready for use.

> library(plyr)
> count(mtcars, 'gear')
       gear      freq
1      3         15
2      4         12
3      5         5
> y = count(mtcars, 'gear')
> y
       gear      freq
1      3         15
2      4         12
3      5         5
> class(y)
[1] "data.frame"

As the class() function confirms, this output is indeed a data frame!

22 Responses to How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

  1. Pingback: Condensed News | Data Analytics & R

  2. Kent Russell says:

    Just a couple of other ways to do this.

    table(mtcars$gear,dnn=”gear”)
    xtabs(~gear,mtcars)

    With dplyr which will be way faster than plyr.

    library(dplyr)
    mtcars %>%
    group_by( gear ) %>%
    tally

  3. neuwirthe says:

    As much as I appreciate plyr, and even more dplyr, there are solutions in base R.
    These are not, however, oneliners.

    One issue is mtcars$gear is a vector and does not have attributes any more.
    So table cannot be expected to preserve the name.

    as.data.frame(table(subset(mtcars,select=’gear’)))

    should work, but does not preserve the variable name.
    This is surprising since

    as.data.frame(table(subset(mtcars,select=c(‘gear’,’cyl’))))

    works as needed.
    The problem seems to be that dataframes with just one variable are not treated
    as they should be in table.

    Since table has the argument dnn for dimension names,
    one can use the ugly solution

    as.data.frame(table(subset(mtcars,select=’gear’),dnn=’gear’))

    Perhaps you should post the problem with names not being preserved to r-devel.
    It might be considered a bug.

  4. neuwirthe says:

    Let me add:

    library(dplyr)
    mtcars %>%
    group_by(gear) %>%
    summarise(count=n())

    is my preferred solution

    • library(‘dplyr’)
      library(‘tidyr’)

      # in dplyr count() does group_by and summarise by n() or sum(n) impicitly:
      # the fun begins when counting more than 1 variable or
      # combined with mutate or spread (from tidyr)

      mtcars %>% count(gear) %>% mutate(prop = n/sum(n))

      # same as:
      mtcars %>% count(am, gear) %>% mutate(prop = prop.table(n)) # !!! it’s always margin 1

      mtcars %>% count(am, gear) %>% mutate(prop = n/sum(n))
      mtcars %>% count(am, gear) %>% spread(gear, n)

  5. stevenvannoy says:

    Thanks, an easy solution to a common task.

  6. David says:

    This requires 1 simple line of code with table:

    data.frame(table(mtcars$gear, dnn=’gear’))

    I don’t see why you would want to install/load plyr if this is all you’re doing.

    • I should have checked the documentation for table() to learn the “dnn” option – thanks for sharing this, David.

      While the above does work, it’s still longer, less intuitive and more cumbersome than count(mtcars, ‘gear’). In a later blog post, I will discuss the advantages of count() for displaying cross-tabulations in the list format – and these advantages over table() will be magnified.

      Stay tuned, and thanks for commenting.

    • For me this one is more intuitive, because you declare names in this way in matrix(), data.frame().

      as.data.frame.table( table( gear = mtcars$gear))
      gear Freq
      1 3 15
      2 4 12
      3 5 5

  7. The same result can be generated by the follows:
    as.data.frame(xtabs(~gear, data=mtcars))

  8. An alternative solution to the one presented, would be to use dplyr:

    > library(dplyr)

    > ft % group_by(gear) %>% summarise(freq=n())

    > ft
    Source: local data frame [3 x 2]

    gear freq
    1 3 15
    2 4 12
    3 5 5

  9. edwcarney says:

    I think that plyr::count is worth knowing, and I’m happy to have come across Eric’s suggestion. It also will take a formula as the second parameter. This generates a flat list instead of a crosstabulation table.

    Here’s the result for count(mtcars,gear~disp)

    gear cyl freq
    1 3 4 1
    2 3 6 2
    3 3 8 12
    4 4 4 8
    5 4 6 4
    6 5 4 2
    7 5 6 1
    8 5 8 2

    Try count(mtcars,gear~cyl+disp) to see the benefit of this over base::table.

  10. Wojciech Sobala says:

    with(mtcars, as.data.frame(table(gear)))

  11. KCEric says:

    Is there a way to do this but to have it generate a distribution %, like prop.table would give you?

    For example:

    gear freq
    1 3 15
    2 4 12
    3 5 5

    Would become:

    gear freq-%
    1 3 47%
    2 4 37%
    3 5 15.5%

  12. Hi,

    the current situation in R is that you should use either dplyr or data.table for most data manipulation tasks. dplyr’s been mentioned already but with half a dozen variations so I’ll repeat the shortest solutions.🙂

    # dplyr:

    library(dplyr)
    count(mtcars, gear)

    # data.table:

    # Only necessary if you don’t use a data.table already
    library(data.table)
    dtcars <- as.data.table(mtcars)

    dtcars[, .N,, gear]

    ## Percentages:

    library(magrittr)
    percentify % {.*100} %>% round(1) %>% paste0(“%”)

    # dplyr:

    mtcars %>%
    count(gear) %>%
    mutate(prop = n / sum(n), perc = percentify(prop))

    # data.table:

    dtcars[, .N,, gear][, prop := N/sum(N)][, perc := percentify(prop)] %>% print

    (you need neither the pipe nor the print but the first makes it more readable and without the second the output is silent).

    Ciao,
    Stefan

Your thoughtful comments are much appreciated!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: