Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting.  I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems.  This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data setI will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements.  To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data = sashelp.baseball
          noprint
     out = bvars (keep = name type);
run;

Read more of this post

Advertisements

Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
     set sashelp.class;
 
     * define a new character variable to classify someone as tall or short;
     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';
run;


* print the results for the first 5 rows;
proc print
     data = c1 (obs = 5);
run;

Here is the result:

Obs Name Sex Age Height Weight height_class
1 Alfred M 14 69.0 112.5 Tall
2 Alice F 13 56.5 84.0 Shor
3 Barbara F 13 65.3 98.0 Tall
4 Carol F 14 62.8 102.5 Tall
5 Henry M 14 63.5 102.5 Tall

What happened?  Why does the word “Short” render as “Shor”?

Read more of this post

Getting the names, types, formats, lengths, and labels of variables in a SAS data set

After reading my blog post on getting the variable names of a SAS data set, a reader named Robin asked how to get the formats as well.  I asked SAS Technical Support for help, and a consultant named Jerry Leonard provided a beautiful solution using PROC SQL.  Besides the names and formats of the variables, it also gives the types, lengths, and labels.  Here is an example of how to do so with the CLASS data set in the built-in SASHELP library.

* add formats and labels to 3 of the variables in the CLASS data set;
data class;                                                      
       set sashelp.class;                                            
       format 
            age 8.  
            weight height 8.2 
            name $15.;          
       label 
            age = 'Age'
            weight = 'Weight'
            height = 'Height';
run;                                                             
                  

* extract the variable information using PROC SQL; 
proc sql 
       noprint;                                                
       create table class_info as 
       select libname as library, 
              memname as data_set, 
              name as variable_name, 
              type, 
              length, 
              format, 
              label       
       from dictionary.columns                                       
       where libname = 'WORK' and memname = 'CLASS';                     
       /* libname and memname values must be upper case  */         
quit;                                                          
                   
 
* print the resulting table;
proc print 
       data = class_info;                                            
run;

Here is the result of that PROC PRINT step in the Results Viewer.  Notice that it also has the type, length, format, and label of each variable.

Obs library data_set variable_name type length format label
1 WORK CLASS Name char 8 $15.
2 WORK CLASS Sex char 1
3 WORK CLASS Age num 8 8. Age
4 WORK CLASS Height num 8 8.2 Height
5 WORK CLASS Weight num 8 8.2 Weight

Thank you, Jerry, for sharing your tip!

Sorting correlation coefficients by their magnitudes in a SAS macro

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, X and Y, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for \rho is the Pearson correlation coefficient, which is defined as the sample covariance between X and Y divided by the product of their sample standard deviations.  Since there is a common factor of

\frac{1}{n - 1}

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}} .

 

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

  1. computing the correlation coefficients
  2. obtaining their absolute values
  3. sorting them by their absolute values.

Read more of this post

Analyst Finder – A Free Job-Matching Service for Statisticians, Data Scientists, Database Managers and Data Analysts

If you are a statistician, data scientist, database manager, or data analyst, then consider using Analyst Finder for your next job search.  It is a web site that connects job seekers in data analytics with employers.  The service is free for job seekers, and it earns money by charging companies and recruiters a small fee to find qualified candidates through its job-matching service.

To register for this service as a job seeker, you simply need to complete a check list of skills and preferences.  It’s quick and easy to do, and you can change this list whenever your wish to update your qualifications.

af

 

Employers and recruiters can register for this service to search for qualified candidates, and the fees are displayed on the web site.

This company was founded by Art Tabachneck, a former president of the Toronto Area SAS Society and a veteran analytics professional.  In case you’re interested in learning more about him, SAS has a profile about him in recognition of his expertise in SAS programming and his contributions to online support communities and user groups.

New Job at the Bank of Montreal in Toronto

I have accepted an offer from the Bank of Montreal to become a Manager of Operational Risk Analytics and Modelling at its corporate headquarter office in Toronto.  Thus, I have resigned from my job at the British Columbia Cancer Agency.  I will leave Vancouver at the end of December, 2015, and start my new job at the beginning of January, 2016.

I have learned some valuable skills and met some great people here in Vancouver over the past 2 years.  My R programming skills have improved a lot, especially in text processing.  My SAS programming skills have improved a lot, and I began a new section on my blog to SAS programming as a result of what I learned.  I volunteered and delivered presentations for the Vancouver SAS User Group (VanSUG) – once on statistical genetics, and another on sampling strategies in analytical chemistry, ANOVA, and PROC TRANSPOSE.  I have thoroughly enjoyed meeting some smart and helpful people at the Data Science, Machine Learning, and R Programming Meetups.

I lived in Toronto from 2011 to 2013 while pursuing my Master’s degree in statistics at the  University of Toronto and working as a statistician at Predicum.  I look forward to re-connecting with my colleagues there.

Vancouver SAS User Group Meeting – Wednesday, November 4, 2015

I am excited to present at the next Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.  I will illustrate data transposition and ANOVA in SAS and JMP using potato chips and analytical chemistry.  Come and check it out!  The following agenda contains all of the presentations, and you can register for this meeting on the SAS Canada web site.  This meeting is free, and a free breakfast will be served in the morning.

 

Update: My slides from this presentation have been posted on the VanSUG web site.

 

Date: Wednesday, November 4, 2015

Place:

Ballroom West and Centre

Holiday Inn – Vancouver Centre

711 West Broadway, Vancouver, BC

V5Z 3Y2

(604) 879-0511

Agenda:

8:30am – 9:00am: Registration

9:00am – 9:20am: Introductions and SAS Update – Matt Malczewski, SAS Canada

9:20am – 9:40am: Lessons On Transposing Data, Sampling & ANOVA in SAS & JMP – Eric Cai, Cancer Surveillance & Outcomes, BC Cancer Agency

9.40am – 10.20am: Make SAS Enterprise Guide Your Own – John Ladds, Statistics Canada

10:20am – 10:30am: A Beginner’s Experience Using SAS – Kim Burrus, Cancer Surveillance & Outcomes, BC Cancer Agency

10:30am – 11:00am: Networking Break

11:00am – 11.20am: Using SAS for Simple Calculations – Jay Shurgold, Rick Hansen Institute

11:20am – 11:50am: Yes, We Can… Save SAS Formats – John Ladds, Statistics Canada

11:50am – 12:20pm: Reducing Customer Attrition with Predictive Analytics – Nate Derby, Stakana Analytics

12:20pm – 12:30pm: Evaluations, Prize Draw & Closing Remarks

If you would like to be notified of upcoming SAS User Group Meetings in Vancouver, please subscribe to the Vancouver SAS User Group Distribution List.

How to Extract a String Between 2 Characters in R and SAS

Introduction

I recently needed to work with date values that look like this:

mydate
Jan 23/2
Aug 5/20
Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash.  I needed to think about how to program this carefully in both R and SAS, because

  1. the length of the day could be 1 or 2 characters long
  2. I needed a code that adapted to this varying length from observation to observation
  3. there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS.  I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

Read more of this post

Eric’s Enlightenment for Friday, May 29, 2015

  1. P2N3: An aromatic ion made of just phosphorous and nitrogen.  (Yes, aromaticity can be entirely inorganic!)
  2. Using 3-D printing and plastics to make prosthetics.
  3. David Beckwroth and Scott Sumner talk at length about reforming monetary policy with NGDP targeting in this video interview/seminar.
  4. Anky Lai gives a nice introduction to PROC TABULATE (PDF document) – an alternative to PROC FREQ and PROC MEANS in SAS.  Check out her awesome code samples for generating nicely formatted tables and exporting them conveniently into spreadsheets in Excel!

Eric’s Enlightenment for Wednesday, May 6, 2015

  1. Moldova has mysteriously lost one-eighth of its GDP, possibly to fraudulent loans.
  2. Kai Brothers was diagnosed with HIV in 1989, but did not show any symptoms for 25 years.  Does he have a natural defense against HIV?  Now that he is starting to show symptoms, should he start taking anti-retroviral drugs and deny scientists the chance to look for that natural defense in his blood?
  3. Use the VVALUE function in SAS to convert formatted values of a variable into new values of a character variable.
  4. Alex Reinhart diligently compiled and explained a list of major “egregious statistical fallacies regularly committed in the name of science”.  Check them out on his web site and in his book entitled “Statistics Done Wrong“.  I highly recommend reading the section entitled “The p value and the base rate fallacy“.

Eric’s Enlightenment for Friday, May 1, 2015

  1. PROC GLIMMIX Contrasted with Other SAS Statistical Procedures for Regression (including GENMOD, MIXED, NLMIXED, LOGISTIC and CATMOD).
  2. Lee-Ping Wang et al. recently developed the nanoreactor, “a computer model that can not only determine all the possible products of the Urey-Miller experiment, but also detail all the possible chemical reactions that lead to their formation”.  What an exciting development!  It “incorporates physics and machine learning to discover all the possible ways that your chemicals might react, and that might include reactions or mechanisms we’ve never seen before”.  Here is the original paper.
  3. A Quora thread on the best examples of the Law of Unintended Consequences
  4. In a 2-minute video, Alex Tabarrok argues why software patents should be eliminated.

Eric’s Enlightenment for Friday, April 24, 2015

  1. Anna Katherine Barnett-Hart wrote an empirical study of how collateralized debt obligations contributed to $542 billion in losses suffered by financial institutions during the sub-prime mortgage crisis.  This was her honours thesis for her Bachelor of Arts degree at Harvard.  Here is her interview about her work with Dylan Ratigan.
  2. Instead of money, Toyota offered its engineers to help The Food Bank for New York City to improve its operations.  Thanks to their guidance, the wait time for dinner was cut from 90 minutes to 18 minutes.
  3. Unusual, simple, yet effective: Dispensing with loyalty cards or rewards programs, Pret A Manger spontaneously offers free food to reward loyal customers.
  4. A fast and simple macro for getting the sample size of a data set in SAS.  (No need to scroll through PROC CONTENTS in case you have many variables.)
  5. Jon Chui provides some useful pictorial guides for interpreting infrared spectra and proton nuclear magnetic resonance spectra.

Eric’s Enlightenment for Monday, April 20, 2015

  1. John D. Cook explains why 0! is defined to be equal to 1.  This is also an excellent post on how definitions are created in mathematics.
  2. Why are GDP estimates often unreliable?  (Jonathan Jones wrote this report for Britain, but it’s likely applicable to all countries.)
  3. Rick Wicklin shares useful code for counting the number of missing and non-missing observations in a data set in SAS.
  4. A potentially game-changing breakthrough in artificial photosynthesis may be able to solve the world’s carbon emission problem…

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;
data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;

First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1, and I will print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2.  The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.

Read more of this post

Separating Unique and Duplicate Observations Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicate observations, but that requires sorting the data set first before using the DATA step to do the separation.  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

  • DUPLICATES for storing the duplicate observations
  • SINGLES for storing the unique observations
proc sort
     data = sashelp.adomsg
          out = duplicates
          uniqueout = singles
          nouniquekey;
     by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.28 seconds
cpu time 0.00 seconds

Note that the number of observations in WORK.DUPLICATES and WORK.SINGLES add to 459, the total number of observations in the original data set.

In addition to Fareeza, I also thank CB for sharing this tip.

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University on last Friday to share my career advice with its students and professors.  I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill.  If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R“.  It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010.  It is a good resource for learning about Excel in general, and there is no background required.  I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

I like to document tips and tricks for R and SAS that I like to use often, especially if I struggled to find them on the Internet.  I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Using PROC SGPLOT to Produce Box Plots with Contrasting Colours in SAS

I previously explained the statistics behind box plots and how to produce them in R in a very detailed tutorial.  I also illustrated how to produce side-by-side box plots with contrasting patterns in R.

Here is an example of how to make box plots in SAS using the VBOX statement in PROC SGPLOT.  I modified the built-in data set SASHELP.CLASS to generate one that suits my needs.

The PROC TEMPLATE statement specifies the contrasting colours to be used for different classes.  I also include code for exportingthe result into a PDF file using ODS PDF.  (I used varying shades of grey to allow the contrast to be shown when printed in black and white.)

boxplots

Read more of this post

Getting a List of the Variable Names of a SAS Data Set

Update on 2017-04-15: I have written a new blog post that obtains the names, types, formats, lengths, and labels of variables in a SAS data set.  This uses PROC SQL instead of PROC CONTENTS.  I thank Robin for suggesting this topic in the comments and Jerry Leonard from SAS Technical Support for teaching me this method.

 

Getting a list of the variable names of a data set is a fairly common and useful task in data analysis and manipulation, but there is actually no procedure or function that will do this directly in SAS.  After some diligent searching on the Internet, I found a few tricks within PROC CONTENTS do accomplish this task.

Here is an example involving the built-in data set SASHELP.CLASS.  The ultimate goal is to create a new data set called “variable_names” that contains the variable names in one column.

The results of PROC CONTENTS can be exported into a new data set.  I will call this data set “data_info”, and it contains just 2 variables that we need – “name” and “varnum“.

Read more of this post

Getting All Duplicates of a SAS Data Set

Introduction

A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates.  It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations.  I have written a new blog post to illustrate her solution.

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options.  However, the output data set from such a procedure does not have the first of each set of duplicates.  Here is an example.

Read more of this post