data manipulation | The Chemical Statistician

A SAS macro to automatically label variables using another data set

March 4, 2018 4 Comments

Introduction

When I write SAS programs, I usually export the analytical results into an output that a client will read. I often cannot show the original variable names in these outputs; there are 2 reasons for this:

The maximal length of a SAS variable’s name is 32 characters, whereas the description of the variable can be much longer. This is the case for my current job in marketing analytics.
Only letters, numbers, and underscores are allowed in a SAS variable’s name. Spaces and special characters are not allowed. Thus, if a variable’s name is quite long and complicated to describe, then the original variable name would be not suitable for presentation or awkward to read. It may be so abbreviated that it is devoid of practical meaning.

This is why labelling variables can be a good idea. However, I usually label variables manually in a DATA step or within PROC SQL, which can be very slow and prone to errors. I recently worked on a data set with 193 variables, most of which require long descriptions to understand what they mean. Labelling them individually and manually was not a realistic method, so I sought an automated or programmatic way to do so.

Read more of this post

Filed under Data Analysis, SAS Programming, Tutorials Tagged with data, data analysis, data manipulation, data processing, label, labeling, labelling, labels, macro, SAS, sas macro

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

September 8, 2017 Leave a comment

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting. I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems. This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data set. I will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements. To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data = sashelp.baseball
          noprint
     out = bvars (keep = name type);
run;

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, SAS Programming, Statistics Tagged with applied statistics, correlation, correlation coefficient, data analysis, data manipulation, into statement, macro, macro variable, PROC SQL, programming, SAS, sas programming, SQL, statistics

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

April 17, 2015 2 Comments

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them. By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;

data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;

First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2. I will export this set of observations into a data set called mismatches1, and I will print it for your viewing. The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2. The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.

Read more of this post

Filed under Data Analysis, SAS Programming, Tutorials Tagged with data analysis, data manipulation, PROC SQL, SAS, SQL, uncommon observations, uncommon rows

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

April 10, 2015 5 Comments

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step. My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicated observations, but that requires sorting the data set first before using the DATA step to do the separation. If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS. It has an identifier called MSGID, and there are duplicates by MSGID. Let’s create 2 data sets out of SASHELP.ADOMSG:

DUPLICATES for storing the duplicated observations
SINGLES for storing the unique observations

proc sort
     data = sashelp.adomsg
          out = duplicates
          uniqueout = singles
          nouniquekey;
     by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.28 seconds
cpu time 0.00 seconds

Note that the number of observations in WORK.DUPLICATES and WORK.SINGLES add to 459, the total number of observations in the original data set.

In addition to Fareeza, I also thank CB for sharing this tip.

Filed under Data Analysis, SAS Programming Tagged with data analysis, data manipulation, duplicate, first.variable, last.variable, nouniquekey, PROC SORT, SAS, sas programming, sorting, unique, uniqueout

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

February 23, 2015 Leave a comment

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University on last Friday to share my career advice with its students and professors. I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill. If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R“. It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010. It is a good resource for learning about Excel in general, and there is no background required. I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

I like to document tips and tricks for R and SAS that I like to use often, especially if I struggled to find them on the Internet. I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Filed under Data Analysis, R programming, SAS Programming Tagged with data analysis, data manipulation, Excel, R, R programming, SAS, sas programming

Getting a List of the Variable Names of a SAS Data Set

January 6, 2015 6 Comments

Update on 2017-04-15: I have written a new blog post that obtains the names, types, formats, lengths, and labels of variables in a SAS data set. This uses PROC SQL instead of PROC CONTENTS. I thank Robin for suggesting this topic in the comments and Jerry Leonard from SAS Technical Support for teaching me this method.

Getting a list of the variable names of a data set is a fairly common and useful task in data analysis and manipulation, but there is actually no procedure or function that will do this directly in SAS. After some diligent searching on the Internet, I found a few tricks within PROC CONTENTS do accomplish this task.

Here is an example involving the built-in data set SASHELP.CLASS. The ultimate goal is to create a new data set called “variable_names” that contains the variable names in one column.

The results of PROC CONTENTS can be exported into a new data set. I will call this data set “data_info”, and it contains just 2 variables that we need – “name” and “varnum“.

Read more of this post

Filed under SAS Programming, Tutorials Tagged with data analysis, data manipulation, data processing, noprint, PROC CONTENTS, PROC SORT, SAS, varnum

Getting All Duplicates of a SAS Data Set

January 5, 2015 9 Comments

Introduction

A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates. It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations. I have written a new blog post to illustrate her solution.

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options. However, the output data set from such a procedure does not have the first of each set of duplicates. Here is an example.

Read more of this post

Filed under SAS Programming, Tutorials Tagged with data manipulation, data processing, duplicate, duplicates, dupout, first.variable, last.variable, nodupkey, PROC FREQ, PROC PRINT, PROC SORT, SAS

Useful Functions in R for Manipulating Text Data

February 27, 2014 12 Comments

Introduction

In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.

Obtaining Basic Information about Character Variables

In R, I often work with text data in the form of character variables. To check if a variable is a character variable, use the is.character() function.

> year = 2014
> is.character(year)
[1] FALSE

If a variable is not a character variable, you can convert it to a character variable using the as.character() function.

> year.char = as.character(year)
> is.character(year.char)
[1] TRUE

A basic piece of information about a character variable is the number of characters that exist in this string. Use the nchar() function to obtain this information.

> nchar(year.char)
[1] 4

How to Find a Job in Statistics – Advice for Students and Recent Graduates

January 13, 2014 34 Comments

Introduction

A graduate student in statistics recently asked me for advice on how to find a job in our industry. I’m happy to share my advice about this, and I hope that my advice can help you to find a satisfying job and develop an enjoyable career. My perspectives would be most useful to students and recent graduates because of my similar but unique background; I graduated only 1.5 years ago from my Master’s degree in statistics at the University of Toronto, and I volunteered as a career advisor at Simon Fraser University during my Bachelor’s degree. My advice will reflect my experience in finding a job in Toronto, but you can probably find parallels in your own city.

Most of this post focuses on soft skills that are needed to find any job; I dive specifically into advice for statisticians in the last section. Although the soft skills are general and not specific to statisticians, many employers, veteran statisticians, and professors have told me that students and recent graduates would benefit from the focus on soft skills. Thus, I discuss them first and leave the statistics-specific advice till the end.

	Eric Cai - The Chemi… on Convert multiple variables bet…
	Jack on Convert multiple variables bet…
	Eric Cai - The Chemi… on Getting the names, types, form…
	Emily V on Getting the names, types, form…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Convert multiple variables bet…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Exploratory Data Analysis: Com…
	CK on Exploratory Data Analysis: Com…
	Eric Cai - The Chemi… on Video Tutorial: Breaking Down…

The Chemical Statistician

A SAS macro to automatically label variables using another data set

Introduction

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

Getting a List of the Variable Names of a SAS Data Set

Getting All Duplicates of a SAS Data Set

Introduction

The Wrong Way to Obtain Duplicates in SAS

Useful Functions in R for Manipulating Text Data

Introduction

Obtaining Basic Information about Character Variables

Read more of this post

How to Find a Job in Statistics – Advice for Students and Recent Graduates

Introduction

Read more of this post

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories