Separating Unique and Duplicate Observations Using PROC SORT in SAS 9.3 and Newer Versions
April 10, 2015 4 Comments
As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step. My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicate observations, but that requires sorting the data set first before using the DATA step to do the separation. If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.
There is a data set called ADOMSG in the SASHELP library that is built into SAS. It has an identifier called MSGID, and there are duplicates by MSGID. Let’s create 2 data sets out of SASHELP.ADOMSG:
- DUPLICATES for storing the duplicate observations
- SINGLES for storing the unique observations
proc sort data = sashelp.adomsg out = duplicates uniqueout = singles nouniquekey; by msgid; run;
Here is the log:
NOTE: There were 459 observations read from the data set SASHELP.ADOMSG. NOTE: 300 observations with unique key values were deleted. NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables. NOTE: The data set WORK.SINGLES has 300 observations and 6 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.28 seconds cpu time 0.00 seconds
Note that the number of observations in WORK.DUPLICATES and WORK.SINGLES add to 459, the total number of observations in the original data set.
In addition to Fareeza, I also thank CB for sharing this tip.