# Use unique() instead of levels() to find the possible values of a factor in R

March 10, 2018

*In a previous version of this blog post, I incorrectly wrote that “Species” is a character variable. Instead, it is a factor. I thank the readers who corrected me in the comments.*

When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable. However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its factor “Species”. Here are the possible values of “Species”, as shown by the levels() function.

    > levels(iris$Species)
    [1] "setosa"     "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”. I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

    > iris2 = subset(iris, Species != 'setosa')
    > table(iris2$Species)
        setosa versicolor  virginica
             0         50         50
    > levels(iris2$Species)
    [1] "setosa"     "versicolor" "virginica"

The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.

According to the user G5W on Stack Overflow, this is the desired behaviour of the levels() function. Here is my interpretation of the intent of the creators of base R: The possible values of a factor are fundamental attributes of that variable, which should not be altered because of changes in the data.
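To make the "attribute" interpretation concrete, here is a minimal sketch (my own illustration, not taken from the Stack Overflow thread) showing that the levels are stored as an attribute of the factor, separate from the integer codes that hold the data:

```r
# Subsetting removes rows but leaves the "levels" attribute untouched.
iris2 <- subset(iris, Species != "setosa")

# The factor carries its levels as an attribute.
attributes(iris2$Species)

# The attribute is the same as in the original data set.
same_levels <- identical(levels(iris2$Species), levels(iris$Species))
same_levels

# The underlying integer codes that actually occur after the subset.
codes_present <- sort(unique(as.integer(iris2$Species)))
codes_present
```

Only the codes for the second and third levels remain in the data, while the attribute still lists all three levels.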

Obviously, this can cause a lot of confusion and produce wrong information. Based on a comment on LinkedIn by Jack Davis, I will use the unique() function to find the possible values of a factor. Here is the result.

    > unique(iris2$Species)
    [1] versicolor virginica
    Levels: setosa versicolor virginica

This is the output that I expect; “setosa” does not appear in the resulting vector. However, unique() still shows the original levels, which include “setosa”; that’s a nice feature.
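If a plain character vector is more convenient than a factor, one possible approach is to wrap the result in as.character(). The sketch below also uses setdiff() to list the levels that are declared but no longer present; the variable names `present` and `declared` are my own.

```r
iris2 <- subset(iris, Species != "setosa")

# Values that actually occur in the data, as a plain character vector.
present <- as.character(unique(iris2$Species))
present

# Values that the factor declares as possible.
declared <- levels(iris2$Species)

# Levels that are declared but absent from the data.
unused <- setdiff(declared, present)
unused
```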

As the above thread on Stack Overflow suggested, you can use the droplevels() function to remove any levels that no longer exist in the subset.

    > iris3 = droplevels(iris2)
    > levels(iris3$Species)
    [1] "versicolor" "virginica"

I thank my colleagues Layne Newhouse, Jack Davis, and Dmity Shopin for their valuable discussion about this on LinkedIn.

You may wish to actually give an example where this “can cause a lot of confusion and produce wrong information”. For those who’ve been using R for a long time this might seem perfectly natural and reasonable. In what way can it cause confusion or wrong answers?

Hi Ista,

I created a new data set called “iris2”, and it does not have “setosa” as a possible value for “Species”. If I use levels() to find the possible values of “Species”, then I would incorrectly obtain “setosa” as a possible value. That would be problematic.

I welcome your feedback about why this behaviour of levels() is desirable.

The term “possible value” is key. Removing all “setosa” rows from the data frame just removes those instances, but other instances might be present in a different subset of the data, say. Keeping a common set of all possible levels can make comparison easier.

The unique() function simply returns the unique instances that are present in the particular slice of data you are looking at, but not necessarily all of the “possible values” that a variable could take. It’s a key distinction.
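A rough sketch of why a common set of levels helps with comparison: the two subsets below (the Sepal.Length cut-offs are arbitrary choices for illustration) contain different species, yet their tables line up column by column because both factors still declare all three levels.

```r
# Two hypothetical slices of the data with different species mixes.
small <- subset(iris, Sepal.Length < 5)
large <- subset(iris, Sepal.Length > 7)

# Each table has one entry per declared level, including zero counts,
# so the rows can be stacked without any re-alignment.
counts <- rbind(small = table(small$Species),
                large = table(large$Species))
counts
```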

When you say “character”, don’t you really mean “factors?” The former would return NULL if you tried to call `levels()` on it (which is another reason why it might be good to call `unique()`, as that works with both string and factor data).
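A short sketch of that difference, assuming a character copy of the column (the name `species_chr` is made up for this example):

```r
# A character version of the Species column.
species_chr <- as.character(iris$Species)

# Character vectors have no levels attribute, so this is NULL.
chr_levels <- levels(species_chr)
chr_levels

# unique() works on both types.
unique(species_chr)    # character vector
unique(iris$Species)   # factor, printed together with its levels
```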

By the by, if you want to drop unused factor levels when you subset a data frame, then you need to wrap the `droplevels` function around it:

`iris2 <- droplevels(subset(iris, Species != 'setosa'))`

Thanks for pointing out my mistake, Joe – I have corrected my blog post accordingly.

I have seen the droplevels() function. I find the unique() function to be easier, though I welcome your feedback about why I should use droplevels() instead.

Hi, in case it is useful to you: that happens because Species is a factor (and not a character). Indeed, if you run `i <- iris; i$Species <- as.character(i$Species); levels(i$Species)`, you will get NULL. Loosely speaking, `base::levels` will give you all the possible values that the categorical variable can assume. I personally prefer to use `dplyr::distinct` to get unique values of a categorical variable from a data frame, since it behaves consistently with both factors and characters.

Thanks for pointing out my mistake, Mic. I have corrected my blog post accordingly.

I like your solution using dplyr::distinct – thanks for sharing it!

One little related trick that might be helpful in situations like this: to remove empty factor levels, just use factor(), like this:

`iris2$Species <- factor(iris2$Species)`
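As a quick check, the sketch below compares this trick with droplevels() on the same subset. For this data set both give the same levels and the same values; I have not verified that the two always agree for every factor.

```r
iris2 <- subset(iris, Species != "setosa")

# Re-creating the factor keeps only the values that are present.
via_factor <- factor(iris2$Species)

# droplevels() also works on a single factor, not just a data frame.
via_droplevels <- droplevels(iris2$Species)

levels(via_factor)
levels(via_droplevels)

# Same values in the same order.
all(as.character(via_factor) == as.character(via_droplevels))
```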

Ah, that’s helpful. Thanks, Stuart!

There is a fundamental difference between the possible values and the present values. `levels()` is the appropriate solution if you want to know the possible values; `unique()` is fine if you just want to know the present values.

Only change the possible values for good reason, for instance because you know the setosa species will not be part of your current research scope anymore. If you are going to fit a model to that categorical variable and it is going to be dummy-coded, you could be faced with an unknown new column and the associated errors if the next dataset you are going to apply your model to does happen to contain the setosa value again. If they can be known, are constant and form a sufficiently small set, explicitly declaring all possible values as levels is A Good Thing.
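The following is a minimal sketch of the problem described above. The choice of lm() and of Sepal.Length as the response is mine, purely for illustration. Dropping "setosa" changes the dummy-coded columns, and a model fitted on the reduced levels cannot predict for data in which "setosa" reappears.

```r
iris2 <- subset(iris, Species != "setosa")   # "setosa" kept as an empty level
iris3 <- droplevels(iris2)                   # "setosa" removed from the levels

# Dummy-coded design matrices differ in their columns.
cols_kept    <- colnames(model.matrix(~ Species, data = iris2))
cols_dropped <- colnames(model.matrix(~ Species, data = iris3))
cols_kept
cols_dropped

# A model fitted after dropping the level...
fit <- lm(Sepal.Length ~ Species, data = iris3)

# ...fails when new data contains the dropped level again.
msg <- tryCatch(
  {
    predict(fit, newdata = iris)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The error is caught with tryCatch() so that the script keeps running; `msg` holds the error text reported by predict().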

What other ways are there to drop unwanted data, besides droplevels()?

Thanks for sharing the alternative way of checking levels. I tried levels(x) and it didn’t work; however, unique(x) worked. Is it possible to get the number of levels?

Hi Juma,

You can always apply the length() function to the output of unique(). For example, length(unique(iris$Species)) will give you the value 3, as desired.
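Note that the two counts can differ once a subset has unused levels. A small sketch using the “iris2” subset from the post; nlevels() is the base R shortcut for counting declared levels:

```r
iris2 <- subset(iris, Species != "setosa")

# Number of declared levels (includes the unused "setosa").
n_declared <- nlevels(iris2$Species)
n_declared
length(levels(iris2$Species))   # same count, written out

# Number of values actually present in the data.
n_present <- length(unique(iris2$Species))
n_present
```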

Eric