Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observationsHowever, there is a feature about PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
     input Name $ 
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181

proc sort
     data = heights
     by Name;

proc print
     data = heights;
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT statement.

If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

Read more of this post


A macro to execute PROC TTEST for multiple binary grouping variables in SAS (and sorting t-test statistics by their absolute values)

In SAS, you can perform PROC TTEST for multiple numeric variables in the same procedure.  Here is an example using the built-in data set SASHELP.BASEBALL; I will compare the number of at-bats and number of walks between the American League and the National League.

proc ttest
     data =;
     class League;
     var nAtBat nBB; 
     ods select ttests;

Here are the resulting tables.

Method Variances DF t Value Pr > |t|
Pooled Equal 320 2.05 0.0410
Satterthwaite Unequal 313.66 2.06 0.04

Method Variances DF t Value Pr > |t|
Pooled Equal 320 0.85 0.3940
Satterthwaite Unequal 319.53 0.86 0.3884


What if you want to perform PROC TTEST for multiple grouping (a.k.a. classification) variables?  You cannot put more than one variable in the CLASS statement, so you would have to run PROC TTEST separately for each binary grouping variable.  If you do put LEAGUE and DIVISION in the same CLASS statement, here is the resulting log.

1303 proc ttest
1304 data =;
1305 class league division;
ERROR 22-322: Expecting ;.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
1306 var natbat;
1307 ods select ttests;
1308 run;


There is no syntax in PROC TTEST to use multiple grouping variables at the same time, so this tutorial provides a macro to do so.  There are several nice features about my macro:

  1. It allows you to use multiple grouping variables at the same time.
  2. It sorts the t-test statistics by their absolute values within each grouping variable.
  3. It shows the name of each continuous variable in the output table, unlike the above output.

Here is its basic skeleton.

Read more of this post

Use ODS EXCLUDE ALL to suppress printing output in SAS while producing output data sets

I regularly produce output data sets from a SAS procedure, such as getting the variable names from a data set in PROC CONTENTS.  In these instances, I often wish to suppress any printing of the output in HTML or TXT.  Such printing of the results is often unnecessary, and it can cost a lot of time and memory.

Some SAS procedures have the NOPRINT option that suppresses the printing of output, but this is limiting in several ways:

  1. Some SAS procedures do NOT have the NOPRINT option.  PROC TTEST is a prominent example.  I checked the high-performance procedures like PROC HPFOREST (random forest) and PROC HPSVM (support vector machine), and I could not find the NOPRINT option for these procedures.
  2. I cannot use ODS OUTPUT to produce output data sets while invoking the NOPRINT option.  Here is an example.

Read more of this post

Convert multiple variables between character and numeric formats in SAS


I often get data that are coded as character, but are actually meant to be numeric.  Thus, converting them into the correct variable types is a common task, and SAS Note #24590 shows how to do so.  However, I recently needed to do hundreds of these conversions, so I wanted some code to accomplish this quickly and accurately.  This tutorial shows how to do so.

Let’s consider this small data set in SAS as an example.  They are hypothetical statistics of 3 players from a basketball game.

data basketball1;
     input jersey points $ rebounds $ assists $;
21 10 14 1
4  11 3  12
23 29 4  5

The 3 performance metrics (points, rebounds, and assists) are clearly numeric, but they are currently coded as character.  (You can use PROC CONTENTS to confirm this if needed.)

The jersey number is really a character variable, because its magnitude has no real-life meaning.  The National Basketball Association (NBA) allows “00” as a possible jersey number.  (Robert Parish wore this jersey number; he won 4 NBA championships and reached the Naismith Basketball Hall of Fame.)  If you code “00” as a numeric variable, then it will render as “0”.  Thus, for NBA jersey numbers, it is best to save it as a character variable.

I can convert these variables into the correct types using the following code.  Note that I chose “2.” for the length of “JERSEY”, because I know that jersey numbers in the NBA have, at most, 2 digits.

data basketball2;
     set basketball1;
     jersey2 = put(jersey, 2.);
     drop jersey;
     rename jersey2 = jersey;

     points2 = input(points, 8.);
     drop points;
     rename points2 = points;

     rebounds2 = input(rebounds, 8.);
     drop rebounds;
     rename rebounds2 = rebounds;

     assists2 = input(assists, 8.);
     drop assists;
     rename assists2 = assists;


Despite this success, the above code can be very cumbersome when I need to do this for many variables, and this situation arose in my job recently.  In this tutorial, I will show a fast way of doing these conversions for many variables at once.  I will use this BASKETBALL1 data set as an example, and I will convert POINTS, REBOUNDS, and ASSISTS from character to numeric simultaneously.

Read more of this post

A macro to automate the creation of indicator variables in SAS

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS.  This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding.  I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line.  If you have not read my previous post on this topic, then I strongly encourage you to do that first.  Don’t use this macro blindly.

Here is the macro.  The key steps are

  1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
  2. Merge the original data with the newly created indicator variables
  3. Delete the “INDICATORS” data set, which was created in an intermediate step
%macro create_indicators(input_data, target, covariates, output_data);

proc logistic
     data = &input_data
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;

data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);

proc datasets 
     library = work
     delete indicators;


I will use the built-in data set SASHELP.CARS to illustrate the use of my macro.  As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables.  I will do that here for the variables TYPE, MAKE, and ORIGIN.

Read more of this post

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS


In statistics and biostatistics, the creation of binary indicators is a very useful practice.

  • They can be useful predictor variables in statistical models.
  • They can reduce the amount of memory required to store the data set.
  • They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.  In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

U.S.A. California $987.36 $692.24
U.S.A. California $1,782.96 $568.48
U.S.A. California $32.64 $16.32
U.S.A. California $1,825.12 $756.16
U.S.A. California $750.72 $723.52



Read more of this post

Sort a data set by ascending or descending variables using PROC SORT in SAS

Consider the built-in data set SASHELP.CLASS in SAS.  Here are the first 5 observations from PROC PRINT.

Obs Name Sex Age Height Weight
1 Joyce F 11 51.3 50.5
2 Thomas M 11 57.5 85.0
3 James M 12 57.3 83.0
4 Jane F 12 59.8 84.5
5 John M 12 59.0 99.5

As you can clearly see, they are NOT sorted by weight.  Here is how you can sort the data set by weight using PROC SORT.

Read more of this post

Remove leading blanks when creating macro variables using PROC SQL in SAS

I regularly use PROC SQL to create macro variables in SAS, and I recently noticed a strange phenomenon when resolving a macro variable within double quotation marks in the title of a plot.  Thankfully, I was able to replicate this problem using the SASHELP.BASEBALL data set, which is publicly available.  I was then able to send the code and the strange result to SAS Technical Support for their examination.

proc sql;
     select count(name)
     into   :hitters_100plusHR
     where  CrHome > 100;

proc sgplot
     data =;
     histogram Salary;
     title1 'Distribution of salaries';
     title2 "Restricted to the &hitters_100plusHR hitters with more than 100 career home runs";


Here is the resulting plot.  Notice the extra spaces before “72” in the title of the plot.

SAS Technical Support informed me that

  • this problem is commonly known.
  • there is no way of predicting when it will occur
  • for now, the best way to deal with it is to remove the leading blanks using one of several ways.

Read more of this post

A SAS macro to automatically label variables using another data set


When I write SAS programs, I usually export the analytical results into an output that a client will read.  I often cannot show the original variable names in these outputs; there are 2 reasons for this:

  • The maximal length of a SAS variable’s name is 32 characters, whereas the description of the variable can be much longer.  This is the case for my current job in marketing analytics.
  • Only letters, numbers, and underscores are allowed in a SAS variable’s name.  Spaces and special characters are not allowed.  Thus, if a variable’s name is quite long and complicated to describe, then the original variable name would be not suitable for presentation or awkward to read.  It may be so abbreviated that it is devoid of practical meaning.

This is why labelling variables can be a good idea.  However, I usually label variables manually in a DATA step or within PROC SQL, which can be very slow and prone to errors.  I recently worked on a data set with 193 variables, most of which require long descriptions to understand what they mean.  Labelling them individually and manually was not a realistic method, so I sought an automated or programmatic way to do so.

Read more of this post

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting.  I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems.  This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data setI will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements.  To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data =
     out = bvars (keep = name type);

Read more of this post

Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
     set sashelp.class;
     * define a new character variable to classify someone as tall or short;
     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';

* print the results for the first 5 rows;
proc print
     data = c1 (obs = 5);

Here is the result:

Obs Name Sex Age Height Weight height_class
1 Alfred M 14 69.0 112.5 Tall
2 Alice F 13 56.5 84.0 Shor
3 Barbara F 13 65.3 98.0 Tall
4 Carol F 14 62.8 102.5 Tall
5 Henry M 14 63.5 102.5 Tall

What happened?  Why does the word “Short” render as “Shor”?

Read more of this post

Getting the names, types, formats, lengths, and labels of variables in a SAS data set

After reading my blog post on getting the variable names of a SAS data set, a reader named Robin asked how to get the formats as well.  I asked SAS Technical Support for help, and a consultant named Jerry Leonard provided a beautiful solution using PROC SQL.  Besides the names and formats of the variables, it also gives the types, lengths, and labels.  Here is an example of how to do so with the CLASS data set in the built-in SASHELP library.

* add formats and labels to 3 of the variables in the CLASS data set;
data class;                                                      
       set sashelp.class;                                            
            age 8.  
            weight height 8.2 
            name $15.;          
            age = 'Age'
            weight = 'Weight'
            height = 'Height';

* extract the variable information using PROC SQL; 
proc sql 
       create table class_info as 
       select libname as library, 
              memname as data_set, 
              name as variable_name, 
       from dictionary.columns                                       
       where libname = 'WORK' and memname = 'CLASS';                     
       /* libname and memname values must be upper case  */         
* print the resulting table;
proc print 
       data = class_info;                                            

Here is the result of that PROC PRINT step in the Results Viewer.  Notice that it also has the type, length, format, and label of each variable.

Obs library data_set variable_name type length format label
1 WORK CLASS Name char 8 $15.
2 WORK CLASS Sex char 1
3 WORK CLASS Age num 8 8. Age
4 WORK CLASS Height num 8 8.2 Height
5 WORK CLASS Weight num 8 8.2 Weight

Thank you, Jerry, for sharing your tip!

Sorting correlation coefficients by their magnitudes in a SAS macro

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, X and Y, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for \rho is the Pearson correlation coefficient, which is defined as the sample covariance between X and Y divided by the product of their sample standard deviations.  Since there is a common factor of

\frac{1}{n - 1}

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}} .


In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

  1. computing the correlation coefficients
  2. obtaining their absolute values
  3. sorting them by their absolute values.

Read more of this post

How to Extract a String Between 2 Characters in R and SAS


I recently needed to work with date values that look like this:

Jan 23/2
Aug 5/20
Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash.  I needed to think about how to program this carefully in both R and SAS, because

  1. the length of the day could be 1 or 2 characters long
  2. I needed a code that adapted to this varying length from observation to observation
  3. there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS.  I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

Read more of this post

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
data dataset2;
      input id $ group $ gender $ age;
      111 A Male 11
      999 C Male 15

First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1, and I will print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2.  The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.

Read more of this post

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicated observations, but that requires sorting the data set first before using the DATA step to do the separation.  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

  • DUPLICATES for storing the duplicated observations
  • SINGLES for storing the unique observations
proc sort
     data = sashelp.adomsg
          out = duplicates
          uniqueout = singles
     by msgid;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.28 seconds
cpu time 0.00 seconds

Note that the number of observations in WORK.DUPLICATES and WORK.SINGLES add to 459, the total number of observations in the original data set.

In addition to Fareeza, I also thank CB for sharing this tip.

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University on last Friday to share my career advice with its students and professors.  I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill.  If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R“.  It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010.  It is a good resource for learning about Excel in general, and there is no background required.  I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

I like to document tips and tricks for R and SAS that I like to use often, especially if I struggled to find them on the Internet.  I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Using PROC SGPLOT to Produce Box Plots with Contrasting Colours in SAS

I previously explained the statistics behind box plots and how to produce them in R in a very detailed tutorial.  I also illustrated how to produce side-by-side box plots with contrasting patterns in R.

Here is an example of how to make box plots in SAS using the VBOX statement in PROC SGPLOT.  I modified the built-in data set SASHELP.CLASS to generate one that suits my needs.

The PROC TEMPLATE statement specifies the contrasting colours to be used for different classes.  I also include code for exportingthe result into a PDF file using ODS PDF.  (I used varying shades of grey to allow the contrast to be shown when printed in black and white.)


Read more of this post

Getting a List of the Variable Names of a SAS Data Set

Update on 2017-04-15: I have written a new blog post that obtains the names, types, formats, lengths, and labels of variables in a SAS data set.  This uses PROC SQL instead of PROC CONTENTS.  I thank Robin for suggesting this topic in the comments and Jerry Leonard from SAS Technical Support for teaching me this method.


Getting a list of the variable names of a data set is a fairly common and useful task in data analysis and manipulation, but there is actually no procedure or function that will do this directly in SAS.  After some diligent searching on the Internet, I found a few tricks within PROC CONTENTS do accomplish this task.

Here is an example involving the built-in data set SASHELP.CLASS.  The ultimate goal is to create a new data set called “variable_names” that contains the variable names in one column.

The results of PROC CONTENTS can be exported into a new data set.  I will call this data set “data_info”, and it contains just 2 variables that we need – “name” and “varnum“.

Read more of this post

Getting All Duplicates of a SAS Data Set


A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates.  It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations.  I have written a new blog post to illustrate her solution.

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options.  However, the output data set from such a procedure does not have the first of each set of duplicates.  Here is an example.

Read more of this post