Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier variable exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
input id $ group $ gender $ age;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
;
run;

data dataset2;
input id $ group $ gender $ age;
cards;
111 A Male 11
999 C Male 15
;
run;



First, let’s identify the observations in dataset1 whose ID variable values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1, and I will print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not in the IDs in dataset2.  The code “select *” ensures that all columns from dataset1 are used to create the data set in mismatches1.
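Based on that logic, the code can be sketched as follows:

proc sql;
create table mismatches1 as
select *
from dataset1
where id not in
     (select id
      from dataset2);
quit;

proc print data = mismatches1;
run;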

Separating Unique and Duplicate Variables Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicate observations, but that requires sorting the data set first before using the DATA step to do the separation.  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

• DUPLICATES for storing the duplicate observations
• SINGLES for storing the unique observations

proc sort
     data = sashelp.adomsg
     out = duplicates
     uniqueout = singles
     nouniquekey;
by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.28 seconds
cpu time 0.00 seconds

Note that the numbers of observations in WORK.DUPLICATES and WORK.SINGLES add up to 459, the total number of observations in the original data set.

In addition to Fareeza, I also thank CB for sharing this tip.

Seminar by Tiff Macklem: Lessons Learned from the Global Financial Crisis – Monday, March 30, 2015

I look forward to attending an upcoming seminar in Vancouver by Tiff Macklem on how he helped to manage the global financial crisis of 2008 while serving as the Senior Deputy Governor of the Bank of Canada.  He is now the Dean of the Rotman School of Management at the University of Toronto.  This is an event for alumni of the University of Toronto and their guests.

Monday, March 30, 2015

6:30 PM to 8:30 PM

Metropolitan Room – Terminal City Club

837 West Hastings Street

V6C 1B6

If you will attend this seminar, please feel free to come and say “Hello”!

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University last Friday to share my career advice with its students and professors.  I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill.  If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R“.  It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010.  It is a good resource for learning about Excel in general, and there is no background required.  I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

I like to document tips and tricks for R and SAS that I like to use often, especially if I struggled to find them on the Internet.  I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Career Seminar at Department of Statistics and Actuarial Science, Simon Fraser University: 1:30 – 2:20 pm, Friday, February 20, 2015

I am very pleased to be invited to speak to the faculty and students in the Department of Statistics and Actuarial Science at Simon Fraser University on this upcoming Friday.  I look forward to sharing my career advice and answering questions from the students about how to succeed in a career in statistics.  If you will attend this seminar, please feel free to come and say “Hello”!

The advantages of using count() to get N-way frequency tables as data frames in R

Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
   gear
cyl  3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
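In contrast, the count() function from the “plyr” package returns the same frequencies in the list format – as a data frame with one row per observed combination of the 2 variables.  (Note that combinations with zero counts, such as cyl = 8 and gear = 4, are omitted from the output.)

> library(plyr)
> count(mtcars, c('cyl', 'gear'))
  cyl gear freq
1   4    3    1
2   4    4    8
3   4    5    2
4   6    3    2
5   6    4    4
6   6    5    1
7   8    3   12
8   8    5    2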

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                  gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3
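As a preview of the solution, the count() function in the “plyr” package returns the frequency table of “gear” directly as a data frame, with one column for the values of “gear” and one column for the frequencies:

> library(plyr)
> count(mtcars, 'gear')
  gear freq
1    3   15
2    4   12
3    5    5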

Free Seminar on Sports Analytics + Free Food @ Boston Pizza in New Westminster: 7 pm, Friday, January 23, 2015

I will attend the following seminar at 7 pm on Friday, January 23, 2015.  It will be held in a private dining room at Boston Pizza (1045 Columbia Street in New Westminster, British Columbia).  This seminar is part of Café Scientifique, an ongoing series of public lectures from my alma mater, Simon Fraser University.

If you will attend, please come and say “Hello”!

Reserve your free seat by emailing: cafe_scientifique@sfu.ca
**Note that there is no accent above the “e” in this address.

Friday, January 23, 2015

Speaker: Dr. Tim Swartz, Professor, Department of Statistics & Actuarial Science, Simon Fraser University

Research interest: My general interest is statistical computing. Most of my work attempts to take advantage of the power of modern computing machinery to solve real statistical problems. The area where I have devoted a lot of attention is the integration problem arising in Bayesian applications. Lately, my interest in statistics in sport has grown to consume a fair bit of my time, perhaps too much of my time.

Topic: Sports Analytics

Sports analytics has become an important area of emphasis for professional sports teams in their attempt to obtain a competitive edge. The discussion will revolve around recent work that Dr. Swartz has conducted in sports analytics such as the optimal time to pull a goalie in hockey, insights into home team advantage and the value of draft positions in major league soccer.

Café Scientifique is a series of informal discussions connecting research to important issues of interest to the community.  Enjoy light snacks and refreshments while engaging with cutting-edge, award-winning researchers from Simon Fraser University’s (SFU) Faculty of Science.

Using PROC SGPLOT to Produce Box Plots with Contrasting Colours in SAS

I previously explained the statistics behind box plots and how to produce them in R in a very detailed tutorial.  I also illustrated how to produce side-by-side box plots with contrasting patterns in R.

Here is an example of how to make box plots in SAS using the VBOX statement in PROC SGPLOT.  I modified the built-in data set SASHELP.CLASS to generate one that suits my needs.

The PROC TEMPLATE statement specifies the contrasting colours to be used for different classes.  I also include code for exporting the result into a PDF file using ODS PDF.  (I used varying shades of grey to allow the contrast to be shown when printed in black and white.)
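Here is a minimal sketch of the VBOX statement and the ODS PDF export; for illustration, it uses the unmodified SASHELP.CLASS data set and omits the PROC TEMPLATE colour modifications.

ods pdf file = 'box_plots.pdf';

proc sgplot data = sashelp.class;
     vbox height / category = sex;
run;

ods pdf close;

The CATEGORY= option produces one box per value of the categorical variable (here, one box of heights for each sex).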

Getting a List of the Variable Names of a SAS Data Set

Getting a list of the variable names of a data set is a fairly common and useful task in data analysis and manipulation, but there is actually no procedure or function that will do this directly in SAS.  After some diligent searching on the Internet, I found a few tricks within PROC CONTENTS to accomplish this task.

Here is an example involving the built-in data set SASHELP.CLASS.  The ultimate goal is to create a new data set called “variable_names” that contains the variable names in one column.

The results of PROC CONTENTS can be exported into a new data set.  I will call this data set “data_info”, and it contains just 2 variables that we need – “name” and “varnum“.
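Here is a sketch of that approach.  The NOPRINT option suppresses the usual printed output of PROC CONTENTS, and a DATA step then keeps just the “name” column.  (PROC CONTENTS lists the variables in alphabetical order in its output data set, so sorting by “varnum” restores the original column order.)

proc contents
     data = sashelp.class
     out = data_info (keep = name varnum)
     noprint;
run;

proc sort data = data_info;
by varnum;
run;

data variable_names;
set data_info (keep = name);
run;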

Getting All Duplicates of a SAS Data Set

Introduction

A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates.  It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates.

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options.  However, the output data set from such a procedure does not have the first of each set of duplicates.  Here is an example.
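Here is a sketch of that mistaken approach, using the built-in SASHELP.ADOMSG data set (which has duplicates by its identifier, MSGID) as an example:

proc sort
     data = sashelp.adomsg
     out = nodups
     nodupkey
     dupout = duplicates;
by msgid;
run;

The DUPOUT= data set captures only the second and subsequent occurrences of each duplicated MSGID; the first occurrence of each set of duplicates remains in NODUPS, so DUPLICATES does not contain all of the duplicate observations.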

Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

The Chemical Statistician Received 10,000 Views in November, 2014!

I am very pleased to announce that, for the first time in this blog’s history, The Chemical Statistician received 10,000 views in a month!  The total view count for November, 2014, was 10,094.  What a milestone!

Thank you very much to all of my readers for visiting and contributing to this blog in the comments.  Your enthusiastic support makes my efforts all the more worthwhile, so thank you!

I will rest from blogging for much of December.  Enjoy your holidays, and see you in 2015!

Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.

If

$E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,$

implies that

$P_\theta \{g[T(\mathbf{X})] = 0\} = 1 \ \ \forall \ \theta,$

then $T(\mathbf{X})$ is said to be complete.  To deconstruct this esoteric mathematical statement,

1. let $g(t)$ be a measurable function
2. if you want to use $g[T(\mathbf{X})]$ to form an unbiased estimator of the zero function,
3. and if the only such function is almost surely equal to the zero function,
4. then $T(\mathbf{X})$ is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.
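As a quick illustrative example (my addition, not part of the original lesson): suppose that $X_1, X_2, ..., X_n$ are independent $\text{Bernoulli}(\theta)$ random variables with $0 < \theta < 1$, and let $T(\mathbf{X}) = \sum_{i=1}^{n} X_i$.  Then $T \sim \text{Binomial}(n, \theta)$, and

$E_\theta \{g[T(\mathbf{X})]\} = \sum_{t=0}^{n} g(t) \binom{n}{t} \theta^t (1 - \theta)^{n-t} = 0 \ \ \forall \ \theta.$

Dividing by $(1 - \theta)^n$ turns the left-hand side into a polynomial in $\theta \div (1 - \theta)$; a polynomial that vanishes on an entire interval must have all of its coefficients equal to 0, so $g(t) \binom{n}{t} = 0$ for every $t$.  Hence, $g(t) = 0$ for $t = 0, 1, ..., n$, and $T(\mathbf{X})$ is complete.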

*This above definition holds for discrete and continuous random variables.

Organic Chemistry Lesson of the Day – The 2 Conformational Isomers of Ethane

The simplest case of conformational isomerism belongs to ethane, C2H6.

Newman projections of the 2 conformational isomers of ethane.

Image courtesy of Mr.Holmium via Wikimedia.

In the Newman projections above, you can see that the dihedral angle between any 2 vicinal hydrogens plays a key role in the stability of ethane.  In particular, there are 2 kinds of extrema in a plot of the change in Gibbs free energy vs. the dihedral angle:

• The minimum is attained when the dihedral angle is $180 \times (2n + 1) \div 3$ degrees, where $n$ is any integer $(n = 0, \pm 1, \pm 2, \pm 3, ...)$.  In other words, the vicinal hydrogens are as far apart from each other as possible.  This conformation is called the staggered conformation.
• The maximum is attained when the dihedral angle is $180 \times (2n) \div 3$ degrees, where $n$ is any integer $(n = 0, \pm 1, \pm 2, \pm 3, ...)$.  In other words, the vicinal hydrogens are as close to each other as possible.  This conformation is called the eclipsed conformation.

The stability of ethane is dependent on this dihedral angle.

• If the vicinal hydrogens are far part from each other (in a staggered conformation, for example), then there is less torsional strain* between the 2 carbon-hydrogen bonds, resulting in more stability.
• If the vicinal hydrogens are close to each other (in an eclipsed conformation, for example), then there is greater torsional strain* between the 2 carbon-hydrogen bonds resulting in less stability.

*In my undergraduate education, I learned that the greater stability of the staggered conformation is due to less torsional (steric) strain.  However, Vojislava Pophristic & Lionel Goodman (2001) argued that the effect is actually due to the stabilizing effect of hyperconjugation.  Song et al. (2005) and Mo and Gao (2007) rebutted this argument in separate publications.  You can find these articles by searching for “ethane hyperconjugation steric strain” on Google Scholar.

References

• Pophristic, V., & Goodman, L. (2001). Hyperconjugation not steric repulsion leads to the staggered structure of ethane. Nature, 411(6837), 565-568.
• Song, L., Lin, Y., Wu, W., Zhang, Q., & Mo, Y. (2005). Steric strain versus hyperconjugative stabilization in ethane congeners. The Journal of Physical Chemistry A, 109(10), 2310-2316.
• Mo, Y., & Gao, J. (2007). Theoretical analysis of the rotational barrier of ethane. Accounts of Chemical Research, 40(2), 113-119.

Performing Logistic Regression in R and SAS

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.
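In R, a model like this can be fitted with the glm() function.  Here is a minimal sketch, assuming that the data have been imported into a data frame called heart.attack (a name that I am using just for illustration):

> heart.attack.logistic = glm(ha2 ~ treatment + anxiety, data = heart.attack, family = binomial(link = 'logit'))
> summary(heart.attack.logistic)

The family = binomial(link = 'logit') argument is what makes this a logistic regression rather than a normal linear regression.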

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

Organic and Inorganic Chemistry Lesson of the Day – Conformational Isomers (or Conformers)

Conformational isomerism is a special type of stereoisomerism that arises from the rotation of a single bond.  Specifically, 2 molecules are conformational isomers (or conformers) if they can be interconverted exclusively by the rotation of a single bond.  This type of isomerism differs from configurational stereoisomerism, whose isomers can only be interconverted by breaking certain bonds and reattaching* them to produce different 3-dimensional orientations.  Examples of configurational isomers include enantiomers, diastereomers, cis/trans isomers and meso isomers.

Different conformers are notable for having different stabilities, depending on the electrostatic interactions between the substituents along the single bond of interest.  I will talk about these differences in greater depth in future Chemistry Lessons of the Day.

*Such reattachment of the bonds must not result in different connectivities (or sequence of bonds); otherwise, that would result in structural isomers.

Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has supplemented this by writing on his own blog to show that the median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

Vancouver SAS User Group Meeting – Wednesday, November 26, 2014, at Holiday Inn Vancouver-Centre (West Broadway)

I am pleased to have recently joined the executive organizing team of the Vancouver SAS User Group.  We hold meetings twice per year to allow Metro Vancouver users of all kinds of SAS products to share their knowledge, tips and advice with others.  These events are free to attend, but registration is required.

Our next meeting will be held on Wednesday, November 26, 2014.  Starting from 8:30 am, a free breakfast will be served while registration takes place.  The session will begin at 9:00 am and end at 12:30 pm with a prize draw.

Please note that there is a new location for this meeting: the East and Centre Ballrooms at Holiday Inn Vancouver-Centre at 711 West Broadway in Vancouver.  We will also experiment with holding a half-day session by ending at 12:30 pm at this meeting.  Visit our web site for more information and to register for this free event!

If you will attend this event, please feel free to come and say “Hello”!

Read the rest of this post for the full agenda!

Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, $X_1, X_2, ..., X_n$, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all $n$ variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, $T(\textbf{X})$, such that

1. $T(\textbf{X})$ is a sufficient statistic
2. if $U(\textbf{X})$ is any other sufficient statistic, then there exists a function $g$ such that

$T(\textbf{X}) = g[U(\textbf{X})].$

Note that, if there exists a one-to-one function $h$ such that

$T(\textbf{X}) = h[U(\textbf{X})],$

then $T(\textbf{X})$ and $U(\textbf{X})$ are equivalent.
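As a quick illustrative example (my addition, not part of the original lesson): suppose that $X_1, X_2, ..., X_n$ are independent $\text{Normal}(\theta, 1)$ random variables.  By the factorization theorem, the sample mean $\bar{X}$ is a sufficient statistic for $\theta$, and the likelihood ratio for 2 data sets $\textbf{x}$ and $\textbf{y}$ is

$f_\theta(\textbf{x}) \div f_\theta(\textbf{y}) = \exp \left[ n \theta (\bar{x} - \bar{y}) - \frac{1}{2} \sum_{i=1}^{n} (x_i^2 - y_i^2) \right].$

This ratio is free of $\theta$ if and only if $\bar{x} = \bar{y}$, which is the standard criterion for showing that $\bar{X}$ is a minimally sufficient statistic for $\theta$.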