Getting All Duplicates of a SAS Data Set
January 5, 2015
Introduction
A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates. It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.
*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations. I have written a new blog post to illustrate her solution.
The Wrong Way to Obtain Duplicates in SAS
You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options. However, the output data set from such a procedure does not have the first of each set of duplicates. Here is an example.
data test;
input var1 $;
cards;
111
222
333
444
444
444
555
555
666
666
666
777
888
888
999
999
999
;
run;
The data set test has 5 sets of duplicates. A simple PROC FREQ shows the frequency counts of each possible value of var1.
proc freq data = test;
  tables var1 / nocum nocol norow nopercent;
run;
var1 | Frequency |
---|---|
111 | 1 |
222 | 1 |
333 | 1 |
444 | 3 |
555 | 2 |
666 | 3 |
777 | 1 |
888 | 2 |
999 | 3 |
Here is the WRONG way to get the duplicates from test.
proc sort data = test nodupkey dupout = dups;
  by var1;
run;
proc print data = dups; run;
Obs | var1 |
---|---|
1 | 444 |
2 | 444 |
3 | 555 |
4 | 666 |
5 | 666 |
6 | 888 |
7 | 999 |
8 | 999 |
A PROC FREQ of the data set dups shows different frequency counts from those of test shown previously.
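For reference, that frequency table can be produced with the same PROC FREQ call as before, pointed at dups:

```sas
proc freq data = dups;
  tables var1 / nocum nocol norow nopercent;
run;
```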
var1 | Frequency |
---|---|
444 | 2 |
555 | 1 |
666 | 2 |
888 | 1 |
999 | 2 |
When you use nodupkey and dupout in PROC SORT, you will remove the first of each set of duplicates in the output data set. That’s why each duplicated value has a frequency in dups that is one less than its frequency in test.
One* Right Way to Obtain Duplicates in SAS
To obtain ALL duplicates of a data set, you can take advantage of first.variable and last.variable. Here is the code to do it with the above example data set of test; you will get both the single observations and the duplicate observations. Note that you need to sort the data set by var1 first.
proc sort data = test;
  by var1;
run;

data single dup;
  set test;
  by var1;
  if first.var1 and last.var1 then output single;
  else output dup;
run;
Update: As Robert Matthews commented, you can alter this code to just obtain the duplicates – this would be more efficient if you have many single observations and few duplicate observations, and you don’t care about the singles.
data dup;
  set test;
  by var1;
  if first.var1 = 0 or last.var1 = 0;
run;
Explaining the SAS Code
Here is the logic behind the code. We are producing 2 data sets: single for the observations that appear only once, and dup for the duplicate observations.
If first.var1 and last.var1 are both equal to 1 for the same observation, then there is no duplicate, and that observation belongs in single.
Otherwise, it’s a duplicate, and this observation belongs to dup.
Here is the dup data set that you get as output.
Obs | var1 |
---|---|
1 | 444 |
2 | 444 |
3 | 444 |
4 | 555 |
5 | 555 |
6 | 666 |
7 | 666 |
8 | 666 |
9 | 888 |
10 | 888 |
11 | 999 |
12 | 999 |
13 | 999 |
Notice that it has the correct number of duplicates for each value.
*I have written a new blog post to illustrate another way to obtain duplicates and unique observations in SAS 9.3 and newer versions using PROC SORT’s new NOUNIQUEKEY option.
Try the NOUNIQUEKEY option in PROC SORT, available in SAS 9.3+:
proc sort data=test nouniquekey out=want;
by var1;
run;
Thanks, Fareeza – that worked!
It’s still nice to have an option that works for earlier versions of SAS.
Thanks for sharing!
We have not yet upgraded and I am still using SAS 9.2. I don’t like the proposed example because it does not deal with multiple key variables. Let’s say I want to “dedupe” on v1, v2, v3 and keep the observation with the “best” score. And let’s assume that the “best” score is the highest. Either of the following will work, and, in my experience, take about the same amount of time to run.
PROC SORT DATA=DATA OUT=OUT ;
  BY V1 V2 V3 DESCENDING SCORE ;
RUN ;

PROC SORT DATA=OUT NODUPKEY ;
  BY V1 V2 V3 ;
RUN ;
ALSO:
DATA OUT ; SET OUT ;
  BY V1 V2 V3 DESCENDING SCORE ;
  IF FIRST.V3 ; /* keeps the highest score within each V1-V2-V3 group */
RUN ;
ALSO:
PROC SQL ;
CREATE TABLE OUT AS
SELECT
V1
, V2
, V3
, MAX(SCORE) AS SCORE
FROM DATA
GROUP BY V1, V2, V3
; QUIT ;
You could also do the following if you have a huge number of single observations (and don’t want to create a temporary dataset for them) and a relatively small number of dups.
data dups;
set test; by var1;
if first.var1=0 or last.var1=0;
run;
To extend upon @Fareeza's comment, if you want to split the unique and the duplicate observations into two different data sets, you can also do it in a single step:
proc sort data=test nouniquekey uniqueout=singles out=dups;
  by var1;
run;
The PROC SORT statement starts the procedure. Observations in data set TEST are sorted in ascending order by VAR1.
The NOUNIQUEKEY option deletes any observation where the value of VAR1 is unique (i.e., occurs only once).
The UNIQUEOUT= option stores the deleted observations with unique values in the specified data set (SINGLES).
The OUT= option stores all other observations (i.e., those with non-unique values of VAR1) in the specified data set (DUPS).
If interested, check out the paper "Dealing with Duplicates" (324-2013, 2013).
When I want to see what dups are in a data set I use:
proc sort data=org;
  by key1 key2 key3;
run;

data dups;
  set org;
  by key1 key2 key3;
  if not first.key3 or not last.key3;
run;
Pretty close to the same.
Hi … the data step method requires a SORT if it’s not sorted according to the BY variable (in this case, VAR1).
How about SQL, NO SORT required …
proc sql;
  create table dups as
  select * from test
  where var1 in
    (select var1 from test group by var1 having count(var1) gt 1);
quit;
Robert, CB, Paul and Mike – thank you all for your valuable comments! This is great!
When you need to review other fields of the records with duplicates (ex Patient ID), you can use Proc Summary, merge the counts back in the field you are checking for dups, and keep only those with frequencies greater than 1. (I’ve worked with cancer reporting abstracts where the tumor registrars needed these lists to identify records to merge/consolidate, or to create new tumor records).