Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observationsHowever, there is a feature about PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
     input Name $ 
           Age 
           Height;
     datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

proc sort
     data = heights
          nouniquekey;
     by Name;
run;

proc print
     data = heights;
run;
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT statement.

If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

To avoid this awful problem, I encourage you to include an OUT option in the PROC SORT statement to create an output data set – whether you are using the NODUPKEY to get unique observations or NOUNIQUEKEY to get duplicated observations.  This preserves your original data set and creates a new data set to reflect what you seek from PROC SORT.

data heights;
     input Name $ 
           Age 
           Height;
     datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

proc sort
     data = heights
          out = heights_duplicated
          uniqueout = heights_unique
          nouniquekey;
     by Name;
run;

proc print
     data = heights_duplicated;
     title 'heights_duplicated';
run;
title;

Here is the output of that PROC PRINT statement.

heights_duplicated
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181
Note that the original data set, HEIGHTS, remains unchanged.
proc print 
     data = heights;
     title 'heights'; 
run;
title;
heights
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Bob 14 172
4 Cam 13 163
5 Cam 17 181

 

Advertisements

Your thoughtful comments are much appreciated!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: