Eric’s Enlightenment for Thursday, April 30, 2015

  1. Simon Jackman from Stanford University provides some simple examples of obtaining the posterior distribution using conjugate priors.  If you are new to Bayesian statistics and need to develop intuition for the basic ideas, work through the math in these examples with pen and paper.  (A small worked example follows this list.)
  2. Did you know that there are plastics that conduct electricity?  In fact, Alan J. Heeger, Alan G. MacDiarmid and Hideki Shirakawa won the 2000 Nobel Prize in Chemistry for their work on this fascinating subject.
  3. Jared Niemi provides a nice video introduction to mixed-effects models.  I highly encourage you to work through the math with pen and paper.
  4. Alberto Cairo adds a healthy dose of caution about the recent advent of data-driven journalism.  He emphasizes problems like confusing correlation with causation, ecological fallacies, and drawing conclusions based on small sample sizes or unrepresentative samples.
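For item 1, here is a minimal worked example of my own (not taken from Jackman’s notes) showing why conjugacy is convenient: with a beta prior on a binomial success probability, the posterior is again a beta distribution, so the update amounts to adding counts to the prior parameters.

\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad s \mid \theta \sim \mathrm{Binomial}(n, \theta)

p(\theta \mid s) \;\propto\; \theta^{s}(1-\theta)^{n-s}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \;=\; \theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1}

\Rightarrow\; \theta \mid s \sim \mathrm{Beta}(\alpha+s,\; \beta+n-s)

For example, a uniform Beta(1, 1) prior combined with 7 successes in 10 trials yields a Beta(8, 4) posterior.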

Eric’s Enlightenment for Wednesday, April 29, 2015

  1. Anscombe’s quartet is a collection of 4 data sets that have almost identical summary statistics but look very different when plotted.  They illustrate the importance of visualizing your data before plugging them into a statistical model.
  2. A potential geochemical explanation for the existence of Blood Falls, an outflow of saltwater tainted with iron(III) oxide at the snout of the Taylor Glacier in Antarctica.  Here is the original Nature paper by Jill Mikucki et al.
  3. Jonathan Rothwell and Siddharth Kulkarni from the Brookings Institution use a value-added approach to rank 2-year and 4-year post-secondary institutions in the USA.  Some of the top-ranked universities by this measure are lesser known schools like Colgate University, Rose-Hulman Institute of Technology, and Carleton College.  I would love to see something similar for Canada!
  4. Heather Krause from Datassist provides tips on how to avoid (accidentally) lying with your data.  Do read the linked sources of further information!

Eric’s Enlightenment for Tuesday, April 28, 2015

  1. On a yearly basis, the production of almonds in California uses more water than businesses and residences in San Francisco and Los Angeles combined.  Alex Tabarrok explains why.
  2. How patient well-being and patient satisfaction become conflicting objectives in hospitals – a case study of a well-intended policy with deadly consequences.  (HT: Frances Woolley – with a thought about academia.)
  3. Contrary to a long-held presumption about the stability of DNA in mature cells, Huimei Yu et al. show that neurons use DNA methylation to rewrite their DNA throughout each day.  This is done to adjust the brain to different activity levels as its function changes over time.
  4. Alex Yakubovitch provides a tutorial on regular expressions (patterns that define sets of strings) and how to use them in R.

Eric’s Enlightenment for Friday, April 24, 2015

  1. Anna Katherine Barnett-Hart wrote an empirical study of how collateralized debt obligations contributed to $542 billion in losses suffered by financial institutions during the sub-prime mortgage crisis.  This was her honours thesis for her Bachelor of Arts degree at Harvard.  Here is her interview about her work with Dylan Ratigan.
  2. Instead of donating money, Toyota offered its engineers’ expertise to help The Food Bank for New York City improve its operations.  Thanks to their guidance, the wait time for dinner was cut from 90 minutes to 18 minutes.
  3. Unusual, simple, yet effective: Dispensing with loyalty cards or rewards programs, Pret A Manger spontaneously offers free food to reward loyal customers.
  4. A fast and simple macro for getting the sample size of a data set in SAS.  (No need to scroll through PROC CONTENTS if you have many variables – see the sketch after this list.)
  5. Jon Chui provides some useful pictorial guides for interpreting infrared spectra and proton nuclear magnetic resonance spectra.
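For item 4, the linked macro’s exact code is not reproduced here, but a common way to grab the number of observations without opening PROC CONTENTS output uses the OPEN, ATTRN, and CLOSE functions.  The macro name nobs below is my own placeholder, not necessarily the one in the linked post.

%macro nobs(ds);
     %local dsid n rc;
     %let dsid = %sysfunc(open(&ds));
     %let n = %sysfunc(attrn(&dsid, nlobs));   /* NLOBS = number of logical (non-deleted) observations */
     %let rc = %sysfunc(close(&dsid));
     &n
%mend nobs;

%put Sample size of SASHELP.CLASS: %nobs(sashelp.class);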

Eric’s Enlightenment for Thursday, April 23, 2015

  1. Reaching the NBA Finals has been much more difficult in the Western Conference than in the Eastern Conference in the past 15 years.
  2. In terms of points above an average shooter per 100 shots, Kyle Korver ranks first in 2014-2015 with +30.4 points.  DeAndre Jordan ranks second with +17.4 points.  (Incredible!)
  3. Evan Soltas evaluates “the rent hypothesis” – the claim that a larger share of income in recent years is unearned gains.  (More rigorously, rent is “a payment for a resource in excess of its opportunity cost, one that instead reflects market power”.)  This is Evan’s most-read article.
  4. A research team led by Junjiu Huang from Sun Yat-Sen University (中山大学) has successfully “edited the genes of human embryos using a new technique called CRISPR”.  Carl Zimmer provides some background.  (HT: Tyler Cowen.)

Eric’s Enlightenment for Wednesday, April 22, 2015

  1. Frances Woolley’s useful reading list on tax policy for Canadians with disabilities.
  2. Jeff Rosenthal asked a seemingly simple yet subtle question about uncorrelated normal random variables.
  3. A great catalogue of colours with their names in R – very useful for data visualization!
  4. Paul Crutzen’s proposed scheme to inject sulfur dioxide into the stratosphere – the resulting sulfate aerosols would reflect sunlight to counteract global warming, but he carefully weighed the serious pros and cons of this risky approach.

Eric’s Enlightenment for Tuesday, April 21, 2015

  1. The standard Gibbs free energy change for the conversion of water from a liquid to a gas is positive.  Why does water still evaporate at room temperature?  Very good answer on Chemistry Stack Exchange.  (A sketch of the key equation follows this list.)
  2. The Difference Between Clustered, Longitudinal, and Repeated Measures Data.  Good blog post by Karen Grace-Martin.
  3. 25 easy and inexpensive ways to clean household appliances using simple (and non-toxic) household products.
  4. A nice person named Alex kindly transcribed the notes for all of Andrew Ng’s video lectures in his course on machine learning at Coursera.
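For item 1, here is a sketch of the reasoning (my own summary of the standard argument, not the Stack Exchange answer verbatim): the standard free energy change refers to water vapour at a partial pressure of 1 bar, whereas the actual driving force depends on the real partial pressure of water in the air.

\Delta G = \Delta G^{\circ} + RT \ln\!\left(\frac{p_{\mathrm{H_2O}}}{p^{\circ}}\right)

At room temperature the equilibrium vapour pressure of water is only about 0.03 bar, so whenever the surrounding air holds less water vapour than that, the logarithm is negative enough to make ΔG < 0 and evaporation proceeds spontaneously.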

Eric’s Enlightenment for Monday, April 20, 2015

  1. John D. Cook explains why 0! is defined to be equal to 1.  This is also an excellent post on how definitions are created in mathematics.
  2. Why are GDP estimates often unreliable?  (Jonathan Jones wrote this report for Britain, but it’s likely applicable to all countries.)
  3. Rick Wicklin shares useful code for counting the number of missing and non-missing observations in a data set in SAS.  (A simpler alternative for numeric variables is sketched after this list.)
  4. A potentially game-changing breakthrough in artificial photosynthesis may be able to solve the world’s carbon emission problem…
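For item 3, Rick Wicklin’s own code is at the link; for numeric variables alone, a quick alternative is PROC MEANS with the N and NMISS statistics.  SASHELP.HEART below is just an example data set, not one from the linked post.

proc means data = sashelp.heart n nmiss;
run;

Character variables need a different route, such as PROC FREQ with a format that groups values into missing and non-missing.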

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;
data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;

First, let’s identify the observations in dataset1 whose ID values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1 and print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not among the IDs in dataset2.  The “select *” clause ensures that all columns from dataset1 are carried into mismatches1.
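The full code appears in the rest of the post; as a rough sketch of the query described above (the exact code in the post may differ), the mismatched rows can be pulled with a subquery:

proc sql;
     create table mismatches1 as
     select *
     from dataset1
     where id not in (select id from dataset2);
quit;

proc print data = mismatches1;
run;

With the data sets above, this returns the rows with IDs 222, 333, and 666, since those IDs never appear in dataset2.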

Read more of this post

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicated observations, but that requires sorting the data set before using a DATA step to do the separation.  (A sketch of that older two-step approach appears at the end of this post for comparison.)  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

  • DUPLICATES for storing the duplicated observations
  • SINGLES for storing the unique observations
proc sort
     data = sashelp.adomsg     /* input data set */
     out = duplicates          /* receives observations whose MSGID value is duplicated */
     uniqueout = singles       /* receives observations whose MSGID value is unique */
     nouniquekey;              /* removes unique-key observations from the OUT= data set */
     by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.28 seconds
      cpu time            0.00 seconds

Note that the numbers of observations in WORK.DUPLICATES and WORK.SINGLES add up to 459 (159 + 300), the total number of observations in the original data set.
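For comparison, the older two-step approach mentioned at the top of this post – sort first, then separate with FIRST.variable and LAST.variable in a DATA step – might look roughly like this (a sketch, not the exact code from the previous post):

proc sort data = sashelp.adomsg out = sorted;
     by msgid;
run;

data singles duplicates;
     set sorted;
     by msgid;
     if first.msgid and last.msgid then output singles;   /* MSGID appears exactly once */
     else output duplicates;                               /* MSGID appears more than once */
run;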

In addition to Fareeza, I also thank CB for sharing this tip.