Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS
April 17, 2015 2 Comments
A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them. By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.
Let’s create 2 data sets.
data dataset1; input id $ group $ gender $ age; cards; 111 A Male 11 111 B Male 11 222 D Male 12 333 E Female 13 666 G Female 14 999 A Male 15 999 B Male 15 999 C Male 15 ; run;
data dataset2; input id $ group $ gender $ age; cards; 111 A Male 11 999 C Male 15 ; run;
First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2. I will export this set of observations into a data set called mismatches1, and I will print it for your viewing. The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2. The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.
proc sql; create table mismatches1 as select * from dataset1 where id not in ( select id from dataset2 ); quit;
proc print data = mismatches1; title 'Mismatches on Agency ID'; run;
Here is the resulting data set. As expected, the observations with ID values “222”, “333” and “666” are uncommon between the 2 data sets, so they were exported into mismatches1.
|Mismatches on Agency ID|
Sometimes, I have 2 identifier variables, and I want to find the uncommon observations as identified by both of those variables. Here is an example of how to do this with the variables “ID” and “GROUP” in our 2 data sets. I will export this second set of observations into a data set called mismatches2, and I will print it for your viewing. Here is the logic of the code:
- find the IDs in dataset2 such that ID and GROUP are common between dataset1 and dataset2 – let’s call this set of IDs X
- find the IDs in dataset1 such that do not exist in X
proc sql; create table mismatches2 as select * from dataset1 d1 where id not in ( select id from dataset2 d2 where d1.id = d2.id and d1.group = d2.group ); quit;
proc print data = mismatches2; title 'Mismatches on Agency ID and Group'; run;
Here is the resulting data set. Notice that, as expected, 111-B, 999-A and 999-B are also mismatches.
|Mismatches on Agency ID and Group|