Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observationsHowever, there is a feature about PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
     input Name $ 
           Age 
           Height;
     datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

proc sort
     data = heights
          nouniquekey;
     by Name;
run;

proc print
     data = heights;
run;
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT statement.

If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

Read more of this post

Advertisements

Use ODS EXCLUDE ALL to suppress printing output in SAS while producing output data sets

I regularly produce output data sets from a SAS procedure, such as getting the variable names from a data set in PROC CONTENTS.  In these instances, I often wish to suppress any printing of the output in HTML or TXT.  Such printing of the results is often unnecessary, and it can cost a lot of time and memory.

Some SAS procedures have the NOPRINT option that suppresses the printing of output, but this is limiting in several ways:

  1. Some SAS procedures do NOT have the NOPRINT option.  PROC TTEST is a prominent example.  I checked the high-performance procedures like PROC HPFOREST (random forest) and PROC HPSVM (support vector machine), and I could not find the NOPRINT option for these procedures.
  2. I cannot use ODS OUTPUT to produce output data sets while invoking the NOPRINT option.  Here is an example.

Read more of this post

Convert multiple variables between character and numeric formats in SAS

Introduction

I often get data that are coded as character, but are actually meant to be numeric.  Thus, converting them into the correct variable types is a common task, and SAS Note #24590 shows how to do so.  However, I recently needed to do hundreds of these conversions, so I wanted some code to accomplish this quickly and accurately.  This tutorial shows how to do so.

Let’s consider this small data set in SAS as an example.  They are hypothetical statistics of 3 players from a basketball game.

data basketball1;
     input jersey points $ rebounds $ assists $;
     datalines;
21 10 14 1
4  11 3  12
23 29 4  5
;
run;

The 3 performance metrics (points, rebounds, and assists) are clearly numeric, but they are currently coded as character.  (You can use PROC CONTENTS to confirm this if needed.)

The jersey number is really a character variable, because its magnitude has no real-life meaning.  The National Basketball Association (NBA) allows “00” as a possible jersey number.  (Robert Parish wore this jersey number; he won 4 NBA championships and reached the Naismith Basketball Hall of Fame.)  If you code “00” as a numeric variable, then it will render as “0”.  Thus, for NBA jersey numbers, it is best to save it as a character variable.

I can convert these variables into the correct types using the following code.  Note that I chose “2.” for the length of “JERSEY”, because I know that jersey numbers in the NBA have, at most, 2 digits.

data basketball2;
     set basketball1;
 
     jersey2 = put(jersey, 2.);
     drop jersey;
     rename jersey2 = jersey;

     points2 = input(points, 8.);
     drop points;
     rename points2 = points;

     rebounds2 = input(rebounds, 8.);
     drop rebounds;
     rename rebounds2 = rebounds;

     assists2 = input(assists, 8.);
     drop assists;
     rename assists2 = assists;
run;

 

Despite this success, the above code can be very cumbersome when I need to do this for many variables, and this situation arose in my job recently.  In this tutorial, I will show a fast way of doing these conversions for many variables at once.  I will use this BASKETBALL1 data set as an example, and I will convert POINTS, REBOUNDS, and ASSISTS from character to numeric simultaneously.

Read more of this post

A macro to automate the creation of indicator variables in SAS

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS.  This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding.  I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line.  If you have not read my previous post on this topic, then I strongly encourage you to do that first.  Don’t use this macro blindly.

Here is the macro.  The key steps are

  1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
  2. Merge the original data with the newly created indicator variables
  3. Delete the “INDICATORS” data set, which was created in an intermediate step
%macro create_indicators(input_data, target, covariates, output_data);

proc logistic
     data = &input_data
          noprint
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;
run;


data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);
run;


proc datasets 
     library = work
          noprint;
     delete indicators;
run;

%mend;

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro.  As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables.  I will do that here for the variables TYPE, MAKE, and ORIGIN.

Read more of this post

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

  • They can be useful predictor variables in statistical models.
  • They can reduce the amount of memory required to store the data set.
  • They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.  In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY STATE COUNTY ACTUAL PREDICT
U.S.A. California $987.36 $692.24
U.S.A. California $1,782.96 $568.48
U.S.A. California $32.64 $16.32
U.S.A. California $1,825.12 $756.16
U.S.A. California $750.72 $723.52

 

PRODTYPE PRODUCT YEAR QUARTER MONTH MONYR
FURNITURE SOFA 1995 1 Jan JAN95
FURNITURE SOFA 1995 1 Feb FEB95
FURNITURE SOFA 1995 1 Mar MAR95
FURNITURE SOFA 1995 2 Apr APR95
FURNITURE SOFA 1995 2 May MAY95

Read more of this post

Remove leading blanks when creating macro variables using PROC SQL in SAS

I regularly use PROC SQL to create macro variables in SAS, and I recently noticed a strange phenomenon when resolving a macro variable within double quotation marks in the title of a plot.  Thankfully, I was able to replicate this problem using the SASHELP.BASEBALL data set, which is publicly available.  I was then able to send the code and the strange result to SAS Technical Support for their examination.

proc sql;
     select count(name)
     into   :hitters_100plusHR
     from   sashelp.baseball
     where  CrHome > 100;
quit;


proc sgplot
     data = sashelp.baseball;
     histogram Salary;
     title1 'Distribution of salaries';
     title2 "Restricted to the &hitters_100plusHR hitters with more than 100 career home runs";
run;

 

Here is the resulting plot.  Notice the extra spaces before “72” in the title of the plot.

SAS Technical Support informed me that

  • this problem is commonly known.
  • there is no way of predicting when it will occur
  • for now, the best way to deal with it is to remove the leading blanks using one of several ways.

Read more of this post

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting.  I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems.  This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data setI will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements.  To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data = sashelp.baseball
          noprint
     out = bvars (keep = name type);
run;

Read more of this post

Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
     set sashelp.class;
 
     * define a new character variable to classify someone as tall or short;
     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';
run;


* print the results for the first 5 rows;
proc print
     data = c1 (obs = 5);
run;

Here is the result:

Obs Name Sex Age Height Weight height_class
1 Alfred M 14 69.0 112.5 Tall
2 Alice F 13 56.5 84.0 Shor
3 Barbara F 13 65.3 98.0 Tall
4 Carol F 14 62.8 102.5 Tall
5 Henry M 14 63.5 102.5 Tall

What happened?  Why does the word “Short” render as “Shor”?

Read more of this post

New Job at the Bank of Montreal in Toronto

I have accepted an offer from the Bank of Montreal to become a Manager of Operational Risk Analytics and Modelling at its corporate headquarter office in Toronto.  Thus, I have resigned from my job at the British Columbia Cancer Agency.  I will leave Vancouver at the end of December, 2015, and start my new job at the beginning of January, 2016.

I have learned some valuable skills and met some great people here in Vancouver over the past 2 years.  My R programming skills have improved a lot, especially in text processing.  My SAS programming skills have improved a lot, and I began a new section on my blog to SAS programming as a result of what I learned.  I volunteered and delivered presentations for the Vancouver SAS User Group (VanSUG) – once on statistical genetics, and another on sampling strategies in analytical chemistry, ANOVA, and PROC TRANSPOSE.  I have thoroughly enjoyed meeting some smart and helpful people at the Data Science, Machine Learning, and R Programming Meetups.

I lived in Toronto from 2011 to 2013 while pursuing my Master’s degree in statistics at the  University of Toronto and working as a statistician at Predicum.  I look forward to re-connecting with my colleagues there.

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicated observations, but that requires sorting the data set first before using the DATA step to do the separation.  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

  • DUPLICATES for storing the duplicated observations
  • SINGLES for storing the unique observations
proc sort
     data = sashelp.adomsg
          out = duplicates
          uniqueout = singles
          nouniquekey;
     by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.28 seconds
cpu time 0.00 seconds

Note that the number of observations in WORK.DUPLICATES and WORK.SINGLES add to 459, the total number of observations in the original data set.

In addition to Fareeza, I also thank CB for sharing this tip.

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University on last Friday to share my career advice with its students and professors.  I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill.  If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R“.  It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010.  It is a good resource for learning about Excel in general, and there is no background required.  I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

I like to document tips and tricks for R and SAS that I like to use often, especially if I struggled to find them on the Internet.  I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Performing Logistic Regression in R and SAS

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

  1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
  2. treatment – Whether or not the patient completed an anger control treatment program.
  3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

Read more of this post

Vancouver SAS User Group Meeting – Wednesday, November 26, 2014, at Holiday Inn Vancouver-Centre (West Broadway)

I am pleased to have recently joined the executive organizing team of the Vancouver SAS User Group.  We hold meetings twice per year to allow Metro Vancouver users of all kinds of SAS products to share their knowledge, tips and advice with others.  These events are free to attend, but registration is required.

SAS Logo - The Power to Know

Our next meeting will be held on Wednesday, November 26, 2014.  Starting from 8:30 am, a free breakfast will be served while registration takes place.  The session will begin at 9:00 am and end at 12:30 pm with a prize draw.

Please note that there is a new location for this meeting: the East and Centre Ballrooms at Holiday Inn Vancouver-Centre at 711 West Broadway in Vancouver.  We will also experiment with holding a half-day session by ending at 12:30 pm at this meeting.  Visit our web site for more information and to register for this free event!

If you will attend this event, please feel free to come and say “Hello”!

Read the rest of this post for the full agenda!

Read more of this post

Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

Read more of this post

Useful Options For Every SAS Program – Lessons and Resources from Dr. Jerry Brunner

Introduction

Today, I want to share some useful options to put at the beginning of every SAS program that I write.  These options will make the practicality of using SAS much easier.  My applied statistics professor from the University of Toronto, Jerry Brunner*, taught me some of these options when I first learned SAS in our class, and I’m grateful for that.  In later instances of using SAS in team projects, I have met SAS programmers who were delightfully surprised by the existence of these options and desperately wished that they had learned them earlier.  I hope that they will help you with your SAS programming.  I have also learned some useful options by posting questions on the SAS Support Communities online forum.

 

Clearing Output

After running your SAS program many times to test and debug, you will have accumulated numerous pages of old and useless output and log.  Scrolling through and searching for the desired portion to read in either file can be tedious and difficult.  Thus, it’s really helpful to have the option of clearing all of the output and the log whenever you run your script.  I put the following commands on top of every one of my SAS scripts.

/* 
Useful Options For Every SAS Program 
- With Some Tips Learned From Dr. Jerry Brunner
by Eric Cai - The Chemical Statistician
*/

dm 'cle log; cle out;';
ods html closeods html;

dm 'odsresults; clear';
ods listing close;
ods listing;

Read more of this post