April 17, 2015 Leave a comment
A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them. By “uncommon rows”, I mean rows whose identifier variable exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.
Let’s create 2 data sets.
data dataset1; input id $ group $ gender $ age; cards; 111 A Male 11 111 B Male 11 222 D Male 12 333 E Female 13 666 G Female 14 999 A Male 15 999 B Male 15 999 C Male 15 ; run;
data dataset2; input id $ group $ gender $ age; cards; 111 A Male 11 999 C Male 15 ; run;
First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2. I will export this set of observations into a data set called mismatches1, and I will print it for your viewing. The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2. The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.