An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

  • They can be useful predictor variables in statistical models.
  • They can reduce the amount of memory required to store the data set.
  • They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.  In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY STATE COUNTY ACTUAL PREDICT
U.S.A. California $987.36 $692.24
U.S.A. California $1,782.96 $568.48
U.S.A. California $32.64 $16.32
U.S.A. California $1,825.12 $756.16
U.S.A. California $750.72 $723.52

 

PRODTYPE PRODUCT YEAR QUARTER MONTH MONYR
FURNITURE SOFA 1995 1 Jan JAN95
FURNITURE SOFA 1995 1 Feb FEB95
FURNITURE SOFA 1995 1 Mar MAR95
FURNITURE SOFA 1995 2 Apr APR95
FURNITURE SOFA 1995 2 May MAY95

Let’s use PROC SQL to find the number of unique values of STATE in this data set.  If you run,

proc sql;
     select count(distinct(STATE))
     from   SASHELP.PRDSAL2;
quit;

you will find that the answer is 16.  (Readers who are familiar with geography in North America know that some of these “States” are actually Canadian provinces or Mexican states.  I think that the creator of the data set used STATE in a malleable sense for brevity, so please don’t be alarmed by this incorrect usage.)

The traditional way of creating indicator variables would be to write code like this:

data sales1;
     set sashelp.prdsal2;

     if state = 'California'
          then California = 1;
     else California = 0;
run;

However, there are 16 states in this data set, so writing 16 blocks of code like this will be cumbersome, error-prone, and inefficient.  The objective of this tutorial is to create indicator variables for the states in an automated way that is fast, easy, and efficient.

Pretending to run logistic regression to get indicator variables

If you read SAS usage note #23217, then you will learn that PROC LOGISTIC creates a design matrix for the categorical covariates in the CLASS statement.  There are several design matrices that are possible, and they are chosen in the coding scheme, which is set by the PARAM option in the CLASS statement.  If you use logistic regression in SAS regularly, then you are likely familiar with such design matrices.  To accomplish our goal, I will pretend to run logistic regression for the purpose of creating the indicator variables in the design matrix.  I don’t actually care about the results of the logistic regression; I just want the design matrix.  For our purpose, we MUST specify PARAM = GLM as the parametrization in the CLASS statement, because this enforces the use of dummy coding.  (Note that the default coding scheme in PROC LOGISTIC is effect coding.)

Let’s use PROC LOGISTIC to create this design matrix.   You can actually use any target variable in the MODEL statement (even a numeric one!), but the procedure will run faster if you use a character variable with a minimal number of classes.  I will pretend to run logistic regression with COUNTRY as the target variable, and STATE as the predictor variable.  I will use the NOPRINT statement to avoid any unnecessary output, and I will specify the name of the design matrix (i.e. the data set of indicator variables) as “indicators”.  Again, I will use PARAM = GLM in the CLASS statement to get the dummy coding in the design matrix; this is absolutely crucial for getting the indicator variables.

proc logistic
     data = sashelp.prdsal2
          noprint
          outdesign = indicators;
     class STATE / param = glm;
     model COUNTRY = STATE;
run;

Let’s print the first 5 observations of our design matrix, just to view what it looks like.  For brevity, I will show only the first 5 columns.  I strongly encourage you to use PROC PRINT to view all columns in SAS.

COUNTRY Intercept STATEBaja_California_Norte STATEBritish_Columbia STATECalifornia
U.S.A. 1 0 0 1
U.S.A. 1 0 0 1
U.S.A. 1 0 0 1
U.S.A. 1 0 0 1
U.S.A. 1 0 0 1

Although no variable exists to identify the rows, you can be assured that each row in this design matrix corresponds to the original row in SASHELP.PRDSAL2.  We can take advantage of this property to merge them together using the DATA STEP.  In the code below, I will merge SASHELP.PRDSAL2 with INDICATORS.  I will also remove “Intercept” and “COUNTRY” from the INDICATORS, because they are not needed.

data furniture_sales;
     merge sashelp.prdsal2
           indicators (drop = Intercept COUNTRY);
run;

Checking our answers

To show the correct indication in the design matrix, I will draw a random sample of 10 records from FURNITURE_SALES using PROC SURVEYSELECT, and I will specify the seed for you to replicate my result.

proc surveyselect
     data = furniture_sales
          out = furniture_sales_sample
          noprint
          seed = 719
          n = 5;
run;

If you use PROC FREQ, you will easily find that this sample contains the states Campeche, Michoacan, Ontario, and Washington.  Let’s print the STATE variable plus just the indicator variables for those 4 states.

proc print
     data = furniture_sales_sample noobs;
     var state STATECampeche STATEMichoacan STATEOntario STATEWashington;
run;
STATE STATECampeche STATEMichoacan STATEOntario STATEWashington
Washington 0 0 0 1
Campeche 1 0 0 0
Michoacan 0 1 0 0
Ontario 0 0 1 0
Ontario 0 0 1 0

Notice how the dummy variables are correct in their indication of the states.  You can now use this data set for all kinds of data analysis and statistical modelling!

 

Your thoughtful comments are much appreciated!