An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

April 24, 2018

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

They can be useful predictor variables in statistical models.
They can reduce the amount of memory required to store the data set.
They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process. This is especially true if there are many categorical variables, or if a categorical variable has many categories. In this tutorial, I will show an easy and efficient way to create indicator variables in SAS. I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library. Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately. (I encourage you to view this data set yourself using PROC PRINT in SAS.)

COUNTRY	STATE	ACTUAL	PREDICT
U.S.A.	California	$987.36	$692.24
U.S.A.	California	$1,782.96	$568.48
U.S.A.	California	$32.64	$16.32
U.S.A.	California	$1,825.12	$756.16
U.S.A.	California	$750.72	$723.52

PRODTYPE	PRODUCT	YEAR	QUARTER	MONTH	MONYR
FURNITURE	SOFA	1995	1	Jan	JAN95
FURNITURE	SOFA	1995	1	Feb	FEB95
FURNITURE	SOFA	1995	1	Mar	MAR95
FURNITURE	SOFA	1995	2	Apr	APR95
FURNITURE	SOFA	1995	2	May	MAY95

Let’s use PROC SQL to find the number of unique values of STATE in this data set. If you run

proc sql;
     select count(distinct(STATE))
     from   SASHELP.PRDSAL2;
quit;

you will find that the answer is 16. (Readers who are familiar with the geography of North America know that some of these “States” are actually Canadian provinces or Mexican states. I think that the creator of the data set used STATE in a loose sense for brevity, so please don’t be alarmed by this imprecise usage.)
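To see which 16 values are behind that count, a small variation on the query above lists them. This is only a sketch; it uses nothing beyond the SASHELP.PRDSAL2 data set and the STATE variable already mentioned.

```sas
/* List the distinct values of STATE in alphabetical order */
proc sql;
     select distinct STATE
     from   SASHELP.PRDSAL2
     order by STATE;
quit;
```

Each value in this list is a level for which an indicator variable will be needed.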

The traditional way of creating indicator variables would be to write code like this:

data sales1;
     set sashelp.prdsal2;

     if state = 'California'
          then California = 1;
     else California = 0;
run;

However, there are 16 states in this data set, so writing 16 blocks of code like this will be cumbersome, error-prone, and inefficient. The objective of this tutorial is to create indicator variables for the states in an automated way that is fast, easy, and efficient.
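For comparison, the manual approach can be shortened a little, because SAS evaluates a comparison to 1 when it is true and 0 when it is false. The sketch below uses that shorthand for two states; the output data set name SALES2 is just a placeholder that I chose for illustration.

```sas
data sales2;
     set sashelp.prdsal2;

     /* A comparison in parentheses evaluates to 1 (true) or 0 (false) */
     California = (state = 'California');
     Ontario    = (state = 'Ontario');
run;
```

Even with this shorthand, you would still need to type one line per state and spell every value correctly, so the manual approach does not scale well.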

Pretending to run logistic regression to get indicator variables

If you read SAS usage note #23217, then you will learn that PROC LOGISTIC creates a design matrix for the categorical covariates in the CLASS statement. Several design matrices are possible; which one you get depends on the coding scheme, which is set by the PARAM option in the CLASS statement. If you use logistic regression in SAS regularly, then you are likely familiar with such design matrices. To accomplish our goal, I will pretend to run logistic regression for the purpose of creating the indicator variables in the design matrix. I don’t actually care about the results of the logistic regression; I just want the design matrix. For our purpose, we MUST specify PARAM = GLM as the parametrization in the CLASS statement, because this enforces the use of dummy coding, with one indicator column for every level of the variable. (Note that the default coding scheme in PROC LOGISTIC is effect coding.)

Let’s use PROC LOGISTIC to create this design matrix. You can actually use any target variable in the MODEL statement (even a numeric one!), but the procedure will run faster if you use a character variable with a minimal number of classes. I will pretend to run logistic regression with COUNTRY as the target variable, and STATE as the predictor variable. I will use the NOPRINT option to suppress any unnecessary output, and I will use the OUTDESIGN option to name the design matrix (i.e. the data set of indicator variables) “indicators”. Again, I will use PARAM = GLM in the CLASS statement to get the dummy coding in the design matrix; this is absolutely crucial for getting the indicator variables.

proc logistic
     data = sashelp.prdsal2
          noprint
          outdesign = indicators;
     class STATE / param = glm;
     model COUNTRY = STATE;
run;
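Since the model results are thrown away anyway, it may be worth skipping the model fitting altogether. The sketch below is a variation on the call above, not part of the original technique: it adds the OUTDESIGNONLY option to the PROC LOGISTIC statement, which asks the procedure to create only the OUTDESIGN data set. I am assuming that your release of SAS/STAT supports this option; please check your documentation if the procedure rejects it.

```sas
proc logistic
     data = sashelp.prdsal2
          noprint
          outdesign = indicators
          outdesignonly;              /* create the design matrix, skip the fitting */
     class STATE / param = glm;       /* dummy coding: one column per level */
     model COUNTRY = STATE;
run;
```

The resulting INDICATORS data set should have the same layout as before; the only intended difference is that no model is fitted.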

Let’s print the first 5 observations of our design matrix, just to view what it looks like. For brevity, I will show only the first 5 columns. I strongly encourage you to use PROC PRINT to view all columns in SAS.

COUNTRY	Intercept	STATECalifornia
U.S.A.	1	1
U.S.A.	1	1
U.S.A.	1	1
U.S.A.	1	1
U.S.A.	1	1

Although no variable exists to identify the rows, you can be assured that each row in this design matrix corresponds to the row in the same position in SASHELP.PRDSAL2. We can take advantage of this property to combine them with a one-to-one merge in the DATA step, i.e. a MERGE statement with no BY statement. In the code below, I will merge SASHELP.PRDSAL2 with INDICATORS. I will also drop “Intercept” and “COUNTRY” from INDICATORS, because they are not needed.

data furniture_sales;
     merge sashelp.prdsal2
           indicators (drop = Intercept COUNTRY);
run;
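A one-to-one merge does not complain if the two data sets have different numbers of observations; it simply fills the gaps with missing values. As a quick sanity check, you could compare the row counts of the two inputs. This is a minimal sketch; the column aliases N_ORIGINAL and N_INDICATORS are names that I made up for readability.

```sas
/* The two counts should be identical for the row-by-row merge to be valid */
proc sql;
     select count(*) as n_original   from sashelp.prdsal2;
     select count(*) as n_indicators from indicators;
quit;
```

If the two counts differ, the rows no longer line up, and the merged data set should not be trusted.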

Checking our answers

To check that the indicators are correct, I will draw a random sample of 5 records from FURNITURE_SALES using PROC SURVEYSELECT, and I will specify the seed so that you can replicate my result.

proc surveyselect
     data = furniture_sales
          out = furniture_sales_sample
          noprint
          seed = 719
          n = 5;
run;

If you use PROC FREQ, you will easily find that this sample contains the states Campeche, Michoacan, Ontario, and Washington. Let’s print the STATE variable plus just the indicator variables for those 4 states.

proc print
     data = furniture_sales_sample noobs;
     var state STATECampeche STATEMichoacan STATEOntario STATEWashington;
run;

STATE	STATECampeche	STATEMichoacan	STATEOntario	STATEWashington
Washington	0	0	0	1
Campeche	1	0	0	0
Michoacan	0	1	0	0
Ontario	0	0	1	0
Ontario	0	0	1	0

Notice how the dummy variables are correct in their indication of the states. You can now use this data set for all kinds of data analysis and statistical modelling!
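A random sample checks only a handful of rows. If you want to check an indicator across the whole data set, one possibility is to cross-tabulate STATE against that indicator. The sketch below does this for STATECalifornia, the column name shown in the design matrix earlier; the same pattern should work for any of the other indicators.

```sas
/* Cross-tabulate the original variable against one of its indicators */
proc freq data = furniture_sales;
     tables STATE * STATECalifornia / norow nocol nopercent;
run;
```

If the indicator is correct, the column for STATECalifornia = 1 should contain a non-zero count only in the California row, and every other state should fall entirely in the column for STATECalifornia = 0.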
