A SAS macro to automatically label variables using another data set

Introduction

When I write SAS programs, I usually export the analytical results into an output that a client will read.  I often cannot show the original variable names in these outputs; there are 2 reasons for this:

  • The maximal length of a SAS variable’s name is 32 characters, whereas the description of the variable can be much longer.  This is the case for my current job in marketing analytics.
  • Only letters, numbers, and underscores are allowed in a SAS variable’s name.  Spaces and special characters are not allowed.  Thus, if a variable’s name is quite long and complicated to describe, then the original variable name would be not suitable for presentation or awkward to read.  It may be so abbreviated that it is devoid of practical meaning.

This is why labelling variables can be a good idea.  However, I usually label variables manually in a DATA step or within PROC SQL, which can be very slow and prone to errors.  I recently worked on a data set with 193 variables, most of which require long descriptions to understand what they mean.  Labelling them individually and manually was not a realistic method, so I sought an automated or programmatic way to do so.

Read more of this post

Getting the names, types, formats, lengths, and labels of variables in a SAS data set

After reading my blog post on getting the variable names of a SAS data set, a reader named Robin asked how to get the formats as well.  I asked SAS Technical Support for help, and a consultant named Jerry Leonard provided a beautiful solution using PROC SQL.  Besides the names and formats of the variables, it also gives the types, lengths, and labels.  Here is an example of how to do so with the CLASS data set in the built-in SASHELP library.

* add formats and labels to 3 of the variables in the CLASS data set;
data class;                                                      
       set sashelp.class;                                            
       format 
            age 8.  
            weight height 8.2 
            name $15.;          
       label 
            age = 'Age'
            weight = 'Weight'
            height = 'Height';
run;                                                             
                  

* extract the variable information using PROC SQL; 
proc sql 
       noprint;                                                
       create table class_info as 
       select libname as library, 
              memname as data_set, 
              name as variable_name, 
              type, 
              length, 
              format, 
              label       
       from dictionary.columns                                       
       where libname = 'WORK' and memname = 'CLASS';                     
       /* libname and memname values must be upper case  */         
quit;                                                          
                   
 
* print the resulting table;
proc print 
       data = class_info;                                            
run;

Here is the result of that PROC PRINT step in the Results Viewer.  Notice that it also has the type, length, format, and label of each variable.

Obs library data_set variable_name type length format label
1 WORK CLASS Name char 8 $15.
2 WORK CLASS Sex char 1
3 WORK CLASS Age num 8 8. Age
4 WORK CLASS Height num 8 8.2 Height
5 WORK CLASS Weight num 8 8.2 Weight

Thank you, Jerry, for sharing your tip!

Adding Labels to Points in a Scatter Plot in R

What’s the Scatter?

A scatter plot displays the values of 2 variables for a set of data, and it is a very useful way to visualize data during exploratory data analysis, especially (though not exclusively) when you are interested in the relationship between a predictor variable and a target variable.  Sometimes, such data come with categorical labels that have important meanings, and the visualization of the relationship can be enhanced when these labels are attached to the data.

It is common practice to use a legend to label data that belong to a group, as I illustrated in a previous post on bar charts and pie charts.  However, what if every datum has a unique label, and there are many data in the scatter plot?  A legend would add unnecessary clutter in such situations.  Instead, it would be useful to write the label of each datum near its point in the scatter plot. I will show how to do this in R, illustrating the code with a built-in data set called LifeCycleSavings.

Read more of this post