Eric’s Enlightenment for Friday, May 29, 2015

  1. P2N3: An aromatic ion made of just phosphorus and nitrogen.  (Yes, aromaticity can be entirely inorganic!)
  2. Using 3-D printing and plastics to make prosthetics.
  3. David Beckworth and Scott Sumner talk at length about reforming monetary policy with NGDP targeting in this video interview/seminar.
  4. Anky Lai gives a nice introduction to PROC TABULATE (PDF document) – an alternative to PROC FREQ and PROC MEANS in SAS.  Check out her awesome code samples for generating nicely formatted tables and exporting them conveniently into spreadsheets in Excel!

Getting All Duplicates of a SAS Data Set


A common task in data manipulation is to obtain all observations that appear multiple times in a data set – in other words, to obtain the duplicates.  It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates and unique observations.  I have written a new blog post to illustrate her solution.

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the NODUPKEY and DUPOUT= options.  However, the DUPOUT= data set does not contain the first observation of each set of duplicates; NODUPKEY keeps that observation in the main output data set.  Here is an example.
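To make the pitfall concrete outside of SAS, here is a minimal Python sketch (not the SAS code from the full post).  It assumes a keep-first split, in which the first record of each key is kept and later records with the same key are diverted, and contrasts the diverted records with the full set of duplicates.  The records and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (id, value). Keys 1 and 3 appear more than once.
records = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")]


def split_first_vs_rest(rows):
    """Mimic a keep-first de-duplication: the first row per key is kept,
    and every later row with the same key goes to a separate list."""
    seen = set()
    kept, diverted = [], []
    for row in rows:
        key = row[0]
        if key in seen:
            diverted.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, diverted


def all_duplicates(rows):
    """Return every row whose key occurs more than once, including the first."""
    counts = Counter(row[0] for row in rows)
    return [row for row in rows if counts[row[0]] > 1]


kept, diverted = split_first_vs_rest(records)
dups = all_duplicates(records)

print("diverted:", diverted)   # lacks the first row of each duplicated key
print("all duplicates:", dups)  # includes the first row of each duplicated key
```

The diverted list is always missing one row per duplicated key, which is why a keep-first split alone cannot return all duplicates.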

Read more of this post

The Chi-Squared Test of Independence – An Example in Both R and SAS


The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, X and Y, the chi-squared test of independence assesses whether the data provide evidence of a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y
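As a rough illustration of the arithmetic behind this test, here is a minimal Python sketch (not the R or SAS code from the full post).  It computes the expected counts under independence from the marginal totals, then Pearson's chi-squared statistic and its degrees of freedom.  The 2-by-2 table and the function name are hypothetical.

```python
def chi_squared_independence(table):
    """Pearson's chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence:
            # (row total * column total) / grand total
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical observed counts for 2 categorical variables with 2 levels each.
observed = [[10, 20],
            [30, 40]]

statistic, df = chi_squared_independence(observed)
print(statistic, df)
```

A large statistic relative to the chi-squared distribution with the computed degrees of freedom is evidence against the null hypothesis of independence.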

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.

Read more of this post

Presentation on Statistical Genetics at Vancouver SAS User Group – Wednesday, May 28, 2014

I am excited and delighted to be invited to present at the Vancouver SAS User Group’s next meeting.  I will provide an introduction to statistical genetics; specifically, I will

  • define basic terminology in genetics
  • explain the Hardy-Weinberg equilibrium in detail
  • illustrate how Pearson’s chi-squared goodness-of-fit test can be used in PROC FREQ in SAS to check the Hardy-Weinberg equilibrium
  • illustrate how the Newton-Raphson algorithm can be used for maximum likelihood estimation in PROC IML in SAS
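For readers who want a preview of the goodness-of-fit idea, here is a minimal Python sketch (not the PROC FREQ code from the presentation).  It assumes a single locus with 2 alleles, estimates the allele frequency from the genotype counts, and compares the observed counts to those expected under the Hardy-Weinberg equilibrium.  The genotype counts and the function name are hypothetical; the test uses 1 degree of freedom because 1 parameter (the allele frequency) is estimated from the 3 genotype categories.

```python
import math


def hardy_weinberg_chi_squared(n_AA, n_Aa, n_aa):
    """Pearson's chi-squared goodness-of-fit test of Hardy-Weinberg
    proportions for one locus with alleles A and a."""
    n = n_AA + n_Aa + n_aa

    # Estimated frequency of allele A: each AA carries 2 copies, each Aa carries 1.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p

    observed = [n_AA, n_Aa, n_aa]
    expected = [n * p ** 2, 2 * n * p * q, n * q ** 2]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Upper-tail probability of a chi-squared variable with 1 degree of
    # freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return p, statistic, p_value


# Hypothetical genotype counts.
p_hat, statistic, p_value = hardy_weinberg_chi_squared(30, 50, 20)
print(p_hat, statistic, p_value)
```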

You can register for this meeting here.  The meeting’s coordinates are

9:00am – 3:00pm

Wednesday, May 28th, 2014

BC Cancer Agency Research Centre

675 West 10th Avenue

Vancouver, BC


If you attend this meeting, please feel free to come up and say “Hello!”  I look forward to meeting you!