Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;
data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;

First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1, and I will print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2.  The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.

proc sql;
      create table mismatches1 as
      select * from dataset1
      where id
      not in
      (
           select id from dataset2 
      );
quit;
proc print
      data = mismatches1;
      title 'Mismatches on Agency ID';
run;

Here is the resulting data set.  As expected, the observations with ID values “222”, “333” and “666” are uncommon between the 2 data sets, so they were exported into mismatches1.

Mismatches on Agency ID
Obs id group gender age
1 222 D Male 12
2 333 E Female 13
3 666 G Female 14

Sometimes, I have 2 identifier variables, and I want to find the uncommon observations as identified by both of those variables.  Here is an example of how to do this with the variables “ID” and “GROUP” in our 2 data sets.  I will export this second set of observations into a data set called mismatches2, and I will print it for your viewing.  Here is the logic of the code:

  • find the IDs in dataset2 such that ID and GROUP are common between dataset1 and dataset2 – let’s call this set of IDs X
  • find the IDs in dataset1 such that do not exist in X
proc sql;
      create table mismatches2 as
      select * from dataset1 d1
      where id not in
      (
           select id from dataset2 d2 
           where d1.id = d2.id
                and 
           d1.group = d2.group
      );
quit;
proc print
      data = mismatches2;
      title 'Mismatches on Agency ID and Group';
run;

Here is the resulting data set.  Notice that, as expected, 111-B, 999-A and 999-B are also mismatches.

Mismatches on Agency ID and Group
Obs id group gender age
1 111 B Male 11
2 222 D Male 12
3 333 E Female 13
4 666 G Female 14
5 999 A Male 15
6 999 B Male 15

2 Responses to Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

  1. wbkrebs says:

    Your method for finding unique observations won’t work if you have two keys and there are values of the first key which are in both data sets for some values of the second key and only the first data set for other values.

    However, the following will work for that case:

    proc sql;
    create table mismatches2 as
    select * from dataset1 d1
    ( select id, group from dataset1
    except
    select id, group from dataset2 ) gl
    where d1.id = gl.id and d1.group = gl.group;
    quit;

    • Hi Wbkrebs,

      Thanks for taking the time to point this out and share the code. Could you please provide an example of this problem and how your code solves it? I struggle to understand what the problem is, though I am very interested in learning the limitations of my code.

      Many thanks,

      Eric

Your thoughtful comments are much appreciated!