Eric’s Enlightenment for Thursday, April 30, 2015

  1. Simon Jackman from Stanford University provides some simple examples of obtaining the posterior distribution using conjugate priors.  If you are new to Bayesian statistics and need to develop intuition for the basic ideas, work through the math in these examples with pen and paper.  (A small worked example follows this list.)
  2. Did you know that there are plastics that conduct electricity?  In fact, Alan J. Heeger, Alan G. MacDiarmid and Hideki Shirakawa won the 2000 Nobel Prize in Chemistry for their work on this fascinating subject.
  3. Jared Niemi provides a nice video introduction to mixed-effects models.  I highly encourage you to work through the math with pen and paper.
  4. Alberto Cairo adds a healthy dose of caution about the recent advent of data-driven journalism.  He emphasizes problems like confusing correlation with causation, ecological fallacies, and drawing conclusions based on small sample sizes or unrepresentative samples.
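For item 1, here is a minimal worked example of my own (not taken from Jackman’s notes) showing why conjugacy is convenient: with a beta prior on a binomial success probability, the posterior is again a beta distribution, so the update amounts to adding counts to the prior parameters.

\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad s \mid \theta \sim \mathrm{Binomial}(n, \theta)

p(\theta \mid s) \;\propto\; \theta^{s}(1-\theta)^{n-s}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \;=\; \theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1}

\Rightarrow\; \theta \mid s \sim \mathrm{Beta}(\alpha+s,\; \beta+n-s)

For example, a uniform Beta(1, 1) prior combined with 7 successes in 10 trials yields a Beta(8, 4) posterior.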

Eric’s Enlightenment for Wednesday, April 29, 2015

  1. Anscombe’s quartet is a collection of 4 data sets that have almost identical summary statistics but look very different when plotted.  They illustrate the importance of visualizing your data before plugging them into a statistical model.
  2. A potential geochemical explanation for the existence of Blood Falls, an outflow of saltwater tainted with iron(III) oxide at the snout of the Taylor Glacier in Antarctica.  Here is the original Nature paper by Jill Mikucki et al.
  3. Jonathan Rothwell and Siddharth Kulkarni from the Brookings Institution use a value-added approach to rank 2-year and 4-year post-secondary institutions in the USA.  Some of the top-ranked universities by this measure are lesser known schools like Colgate University, Rose-Hulman Institute of Technology, and Carleton College.  I would love to see something similar for Canada!
  4. Heather Krause from Datassist provides tips on how to avoid (accidentally) lying with your data.  Do read the linked sources of further information!

Eric’s Enlightenment for Tuesday, April 28, 2015

  1. On a yearly basis, the production of almonds in California uses more water than businesses and residences in San Francisco and Los Angeles combined.  Alex Tabarrok explains why.
  2. How patient well-being and patient satisfaction become conflicting objectives in hospitals – a case study of a well-intended policy with deadly consequences.  (HT: Frances Woolley – with a thought about academia.)
  3. Contrary to a long-held presumption about the stability of DNA in mature cells, Huimei Yu et al. show that neurons use DNA methylation to rewrite their DNA throughout each day.  This is done to adjust the brain to different activity levels as its function changes over time.
  4. Alex Yakubovitch provides a tutorial on regular expressions (patterns that define sets of strings) and how to use them in R.

Eric’s Enlightenment for Friday, April 24, 2015

  1. Anna Katherine Barnett-Hart wrote an empirical study of how collateralized debt obligations contributed to $542 billion in losses suffered by financial institutions during the sub-prime mortgage crisis.  This was her honours thesis for her Bachelor of Arts degree at Harvard.  Here is her interview about her work with Dylan Ratigan.
  2. Instead of donating money, Toyota offered its engineers’ expertise to help The Food Bank for New York City improve its operations.  Thanks to their guidance, the wait time for dinner was cut from 90 minutes to 18 minutes.
  3. Unusual, simple, yet effective: Dispensing with loyalty cards or rewards programs, Pret A Manger spontaneously offers free food to reward loyal customers.
  4. A fast and simple macro for getting the sample size of a data set in SAS.  (No need to scroll through PROC CONTENTS if you have many variables – see the sketch after this list.)
  5. Jon Chui provides some useful pictorial guides for interpreting infrared spectra and proton nuclear magnetic resonance spectra.
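For item 4, the linked macro’s exact code is not reproduced here, but a common way to grab the number of observations without opening PROC CONTENTS output uses the OPEN, ATTRN, and CLOSE functions.  The macro name nobs below is my own placeholder, not necessarily the one in the linked post.

%macro nobs(ds);
     %local dsid n rc;
     %let dsid = %sysfunc(open(&ds));
     %let n = %sysfunc(attrn(&dsid, nlobs));   /* NLOBS = number of logical (non-deleted) observations */
     %let rc = %sysfunc(close(&dsid));
     &n
%mend nobs;

%put Sample size of SASHELP.CLASS: %nobs(sashelp.class);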

Eric’s Enlightenment for Thursday, April 23, 2015

  1. Reaching the NBA Finals has been much more difficult in the Western Conference than in the Eastern Conference in the past 15 years.
  2. In terms of points above an average shooter per 100 shots, Kyle Korver ranks first in 2014-2015 with +30.4 points.  DeAndre Jordan ranks second with +17.4 points.  (Incredible!)
  3. Evan Soltas evaluates “the rent hypothesis” – the claim that a larger share of income in recent years is unearned gains.  (More rigorously, rent is “a payment for a resource in excess of its opportunity cost, one that instead reflects market power”.)  This is Evan’s most-read article.
  4. A research team led by Junjiu Huang from Sun Yat-Sen University (中山大学) has successfully “edited the genes of human embryos using a new technique called CRISPR”.  Carl Zimmer provides some background.  (HT: Tyler Cowen.)

Eric’s Enlightenment for Wednesday, April 22, 2015

  1. Frances Woolley’s useful reading list on tax policy for Canadians with disabilities.
  2. Jeff Rosenthal asked a seemingly simple yet subtle question about uncorrelated normal random variables.
  3. A great catalogue of colours with their names in R – very useful for data visualization!
  4. Paul Crutzen’s proposed scheme to inject sulfur dioxide into the stratosphere – the resulting sulfate aerosols would reflect sunlight to counteract global warming, but he carefully weighed the serious pros and cons of this risky approach.

Eric’s Enlightenment for Tuesday, April 21, 2015

  1. The standard Gibbs free energy change for the conversion of water from a liquid to a gas is positive.  Why does water still evaporate at room temperature?  Very good answer on Chemistry Stack Exchange.  (A sketch of the key equation follows this list.)
  2. The Difference Between Clustered, Longitudinal, and Repeated Measures Data.  Good blog post by Karen Grace-Martin.
  3. 25 easy and inexpensive ways to clean household appliances using simple (and non-toxic) household products.
  4. A nice person named Alex kindly transcribed the notes for all of Andrew Ng’s video lectures in his course on machine learning at Coursera.
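For item 1, here is a sketch of the reasoning (my own summary of the standard argument, not the Stack Exchange answer verbatim): the standard free energy change refers to water vapour at a partial pressure of 1 bar, whereas the actual driving force depends on the real partial pressure of water in the air.

\Delta G = \Delta G^{\circ} + RT \ln\!\left(\frac{p_{\mathrm{H_2O}}}{p^{\circ}}\right)

At room temperature the equilibrium vapour pressure of water is only about 0.03 bar, so whenever the surrounding air holds less water vapour than that, the logarithm is negative enough to make ΔG < 0 and evaporation proceeds spontaneously.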

Eric’s Enlightenment for Monday, April 20, 2015

  1. John D. Cook explains why 0! is defined to be equal to 1.  This is also an excellent post on how definitions are created in mathematics.
  2. Why are GDP estimates often unreliable?  (Jonathan Jones wrote this report for Britain, but it’s likely applicable to all countries.)
  3. Rick Wicklin shares useful code for counting the number of missing and non-missing observations in a data set in SAS.  (A simpler alternative for numeric variables is sketched after this list.)
  4. A potentially game-changing breakthrough in artificial photosynthesis may be able to solve the world’s carbon emission problem…
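For item 3, Rick Wicklin’s own code is at the link; for numeric variables alone, a quick alternative is PROC MEANS with the N and NMISS statistics.  SASHELP.HEART below is just an example data set, not one from the linked post.

proc means data = sashelp.heart n nmiss;
run;

Character variables need a different route, such as PROC FREQ with a format that groups values into missing and non-missing.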

Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS

A common task in data analysis is to compare 2 data sets and determine the uncommon rows between them.  By “uncommon rows”, I mean rows whose identifier value exists in one data set but not the other. In this tutorial, I will demonstrate how to do so using PROC SQL.

Let’s create 2 data sets.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;
data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;

First, let’s identify the observations in dataset1 whose ID values don’t exist in dataset2.  I will export this set of observations into a data set called mismatches1 and print it for your viewing.  The logic of the code is simple – find the IDs in dataset1 that are not among the IDs in dataset2.  The “select *” clause ensures that all columns from dataset1 are carried into mismatches1.
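The full code appears in the rest of the post; as a rough sketch of the query described above (the exact code in the post may differ), the mismatched rows can be pulled with a subquery:

proc sql;
     create table mismatches1 as
     select *
     from dataset1
     where id not in (select id from dataset2);
quit;

proc print data = mismatches1;
run;

With the data sets above, this returns the rows with IDs 222, 333, and 666, since those IDs never appear in dataset2.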

Read more of this post

Separating Unique and Duplicated Observations Using PROC SORT in SAS 9.3 and Newer Versions

As Fareeza Khurshed commented in my previous blog post, there is a new option in SAS 9.3 and later versions that allows sorting and the identification of duplicates to be done in one step.  My previous trick uses FIRST.variable and LAST.variable to separate the unique observations from the duplicated observations, but that requires sorting the data set before using a DATA step to do the separation.  (A sketch of that older two-step approach appears at the end of this post for comparison.)  If you have SAS 9.3 or a newer version, here is an example of doing it in one step using PROC SORT.

There is a data set called ADOMSG in the SASHELP library that is built into SAS.  It has an identifier called MSGID, and there are duplicates by MSGID.  Let’s create 2 data sets out of SASHELP.ADOMSG:

  • DUPLICATES for storing the duplicated observations
  • SINGLES for storing the unique observations
proc sort
     data = sashelp.adomsg     /* input data set */
     out = duplicates          /* receives observations whose MSGID value is duplicated */
     uniqueout = singles       /* receives observations whose MSGID value is unique */
     nouniquekey;              /* removes unique-key observations from the OUT= data set */
     by msgid;
run;

Here is the log:

NOTE: There were 459 observations read from the data set SASHELP.ADOMSG.
NOTE: 300 observations with unique key values were deleted.
NOTE: The data set WORK.DUPLICATES has 159 observations and 6 variables.
NOTE: The data set WORK.SINGLES has 300 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.28 seconds
      cpu time            0.00 seconds

Note that the numbers of observations in WORK.DUPLICATES and WORK.SINGLES add up to 459 (159 + 300), the total number of observations in the original data set.
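For comparison, the older two-step approach mentioned at the top of this post – sort first, then separate with FIRST.variable and LAST.variable in a DATA step – might look roughly like this (a sketch, not the exact code from the previous post):

proc sort data = sashelp.adomsg out = sorted;
     by msgid;
run;

data singles duplicates;
     set sorted;
     by msgid;
     if first.msgid and last.msgid then output singles;   /* MSGID appears exactly once */
     else output duplicates;                               /* MSGID appears more than once */
run;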

In addition to Fareeza, I also thank CB for sharing this tip.