Getting All Duplicates of a SAS Data Set
January 5, 2015 9 Comments
A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates. It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.
*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations. I have written a new blog post to illustrate her solution.
The Wrong Way to Obtain Duplicates in SAS
You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options. However, the output data set from such a procedure does not have the first of each set of duplicates. Here is an example.
data test; input var1 $; cards; 111 222 333 444 444 444 555 555 666 666 666 777 888 888 999 999 999; run;
The data set test has 5 sets of duplicates. A simple PROC FREQ shows the frequency counts of each possible value of var1.
proc freq data = test; tables var1 / nocum nocol norow nopercent; run;
Here is the WRONG way to get the duplicates from test.
proc sort data = test nodupkey dupout = dups; by var1; run;
proc print data = dups; run;
The PROC FREQ of the data set dups will show a different frequency count than that of test as shown previously.
When you use nodupkey and dupout in PROC SORT, you will remove the first of each set of duplicates in the output data set. That’s why each duplicated value has a frequency in dups that is one less than its frequency in test.
One* Right Way to Obtain Duplicates in SAS
To obtain ALL duplicates of a data set, you can take advantage of first.variable and last.variable. Here is the code to do it with the above example data set of test; you will get both the single observations and the duplicate observations. Note that you need to sort the data set by var1 first.
proc sort data = test; by var1; run; data single dup; set test; by var1; if first.var1 and last.var1 then output single; else output dup; run;
Update: As Robert Matthews commented, you can alter this code to just obtain the duplicates – this would be more efficient if you have many single observations and few duplicate observations, and you don’t care about the singles.
data dup; set test; by var1; if first.var1 = 0 or last.var1 = 0; run;
Explaining the SAS Code
Here is the logic behind the code. We are producing 2 data sets: single for the observations that appear only once, and dup for the duplicate observations.
If first.var1 and last.var1 both equal to “1” for the same observation, then there is no duplicate, and this observation belongs to single.
Otherwise, it’s a duplicate, and this observation belongs to dup.
Here is what you get as the output data set.
Notice that it has the correct number of duplicates for each value.
*I have written a new blog post to illustrate another way to obtain duplicates and unique observations in SAS 9.3 and newer versions using PROC SORT’s new NOUNIQUEKEY option.