## Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data $\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let $t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.

If $E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,$

implies that $P_\theta \{g[T(\mathbf{X})] = 0\} = 1, \ \ \forall \ \theta,$

then $T(\mathbf{X})$ is said to be complete.  To deconstruct this esoteric mathematical statement,

1. let $g(t)$ be a measurable function;
2. suppose that $g[T(\mathbf{X})]$ is an unbiased estimator of zero, i.e. its expected value is 0 for every value of $\theta$;
3. if the only such functions are those that equal zero almost surely,
4. then $T(\mathbf{X})$ is a complete statistic.
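As a concrete check of this definition, here is a short worked example. It is my own illustration rather than part of the original lesson, and it assumes that $X_1, X_2, ..., X_n$ are independent Bernoulli($\theta$) variables with $0 < \theta < 1$, so that $T(\mathbf{X}) = \sum_{i=1}^n X_i$ has a Binomial($n, \theta$) distribution. The symbol $r$ is introduced only for this derivation.

```latex
\begin{align*}
E_\theta\{g[T(\mathbf{X})]\}
  &= \sum_{t=0}^{n} g(t) \binom{n}{t} \theta^{t} (1-\theta)^{n-t} \\
  &= (1-\theta)^{n} \sum_{t=0}^{n} g(t) \binom{n}{t} r^{t},
  \qquad r = \frac{\theta}{1-\theta} \in (0, \infty).
\end{align*}
% If the expectation is 0 for every theta in (0, 1), then the polynomial
% \sum_t g(t) \binom{n}{t} r^t is 0 for every r > 0.
% A polynomial of degree at most n with infinitely many roots has all
% coefficients equal to 0, so g(t) \binom{n}{t} = 0, i.e. g(t) = 0,
% for t = 0, 1, ..., n.
```

Under these assumptions, the only function $g$ with zero expectation for every $\theta$ is the one that is zero at every possible value of $T(\mathbf{X})$, so $P_\theta \{g[T(\mathbf{X})] = 0\} = 1$ and the sum is a complete statistic.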

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*The above definition holds for both discrete and continuous random variables.

## Organic Chemistry Lesson of the Day – The 2 Conformational Isomers of Ethane

The simplest case of conformational isomerism belongs to ethane, C2H6.

Image courtesy of Mr.Holmium via Wikimedia.

In the Newman projections above, you can see that the dihedral angle between any 2 vicinal hydrogens plays a key role in the stability of ethane.  In particular, there are 2 extrema in that plot of the change in Gibbs free energy vs. the dihedral angle:

• The minimum is attained when the dihedral angle is $180 \times (2n + 1) \div 3$ degrees, where $n$ is any integer $(n = 0, \pm 1, \pm 2, \pm 3, ...)$.  In other words, the vicinal hydrogens are as far apart from each other as possible.  This conformation is called the staggered conformation.
• The maximum is attained when the dihedral angle is $180 \times (2n) \div 3$ degrees, where $n$ is any integer $(n = 0, \pm 1, \pm 2, \pm 3, ...)$.  In other words, the vicinal hydrogens are as close to each other as possible.  This conformation is called the eclipsed conformation.
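The two angle formulas above can be tabulated with a few lines of Python. This is only a sketch to make the arithmetic concrete; the function names are my own, and the classification assumes an idealized ethane molecule where only exact multiples of 60 degrees are labelled.

```python
def staggered_angle(n: int) -> float:
    """Dihedral angle (degrees) at an energy minimum: 180 * (2n + 1) / 3."""
    return 180 * (2 * n + 1) / 3


def eclipsed_angle(n: int) -> float:
    """Dihedral angle (degrees) at an energy maximum: 180 * (2n) / 3."""
    return 180 * (2 * n) / 3


def conformation(dihedral_degrees: float) -> str:
    """Label a dihedral angle as staggered, eclipsed, or intermediate."""
    # The energy profile repeats every 120 degrees, so reduce the angle first.
    remainder = dihedral_degrees % 120
    if remainder == 0:
        return "eclipsed"
    if remainder == 60:
        return "staggered"
    return "intermediate"


if __name__ == "__main__":
    for n in range(-1, 3):
        print(n, staggered_angle(n), eclipsed_angle(n))
```

For $n = 0, 1, 2$, the staggered angles come out as 60, 180 and 300 degrees, and the eclipsed angles as 0, 120 and 240 degrees; other values of $n$ repeat the same conformations.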

The stability of ethane is dependent on this dihedral angle.

• If the vicinal hydrogens are far apart from each other (in a staggered conformation, for example), then there is less torsional strain* between the 2 carbon-hydrogen bonds, resulting in more stability.
• If the vicinal hydrogens are close to each other (in an eclipsed conformation, for example), then there is greater torsional strain* between the 2 carbon-hydrogen bonds, resulting in less stability.

*In my undergraduate education, I learned that the greater stability of the staggered conformation is due to less torsional (steric) strain.  However, Pophristic and Goodman (2001) argued that the effect is actually due to the stabilizing effect of hyperconjugation.  Song et al. (2005) and Mo and Gao (2007) rebutted this argument in separate publications.  For more information, search for “ethane hyperconjugation steric strain” on Google Scholar and read these articles.

#### References

• Pophristic, V., & Goodman, L. (2001). Hyperconjugation not steric repulsion leads to the staggered structure of ethane. Nature, 411(6837), 565-568.
• Song, L., Lin, Y., Wu, W., Zhang, Q., & Mo, Y. (2005). Steric strain versus hyperconjugative stabilization in ethane congeners. The Journal of Physical Chemistry A, 109(10), 2310-2316.
• Mo, Y., & Gao, J. (2007). Theoretical analysis of the rotational barrier of ethane. Accounts of Chemical Research, 40(2), 113-119.

## Performing Logistic Regression in R and SAS

#### Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

#### The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.
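With these 3 variables, the logistic regression model can be written down explicitly. The form below is a sketch that assumes both predictors enter as main effects with no interaction, which this excerpt does not confirm; the symbols $p_i$, $\beta_0$, $\beta_1$ and $\beta_2$ are my own notation for patient $i$'s probability of a second heart attack and the regression coefficients.

```latex
\begin{align*}
p_i &= P(\text{ha2}_i = 1 \mid \text{treatment}_i, \text{anxiety}_i), \\
\log\left(\frac{p_i}{1 - p_i}\right)
  &= \beta_0 + \beta_1 \, \text{treatment}_i + \beta_2 \, \text{anxiety}_i .
\end{align*}
```

Both R and SAS estimate the coefficients by maximum likelihood on the log-odds scale, so a positive coefficient corresponds to higher odds of a second heart attack.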

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

## Organic and Inorganic Chemistry Lesson of the Day – Conformational Isomers (or Conformers)

Conformational isomerism is a special type of stereoisomerism that arises from the rotation of a single bond.  Specifically, 2 molecules are conformational isomers (or conformers) if they can be interconverted exclusively by the rotation of a single bond.  This type of isomerism differs from configurational stereoisomerism, whose isomers can only be interconverted by breaking certain bonds and reattaching* them to produce different 3-dimensional orientations.  Examples of configurational isomers include enantiomers, diastereomers, cis/trans isomers and meso isomers.

Different conformers are notable for having different stabilities, depending on the electrostatic interactions between the substituents along the single bond of interest.  I will talk about these differences in greater depth in future Chemistry Lessons of the Day.

*Such reattachment of the bonds must not result in different connectivities (or sequence of bonds); otherwise, that would result in structural isomers.

## Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has since written a post on his own blog to show why this is the case.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

## Vancouver SAS User Group Meeting – Wednesday, November 26, 2014, at Holiday Inn Vancouver-Centre (West Broadway)

I am pleased to have recently joined the executive organizing team of the Vancouver SAS User Group.  We hold meetings twice per year to allow Metro Vancouver users of all kinds of SAS products to share their knowledge, tips and advice with others.  These events are free to attend, but registration is required. Our next meeting will be held on Wednesday, November 26, 2014.  Starting from 8:30 am, a free breakfast will be served while registration takes place.  The session will begin at 9:00 am and end at 12:30 pm with a prize draw.

Please note that there is a new location for this meeting: the East and Centre Ballrooms at Holiday Inn Vancouver-Centre at 711 West Broadway in Vancouver.  We will also experiment with holding a half-day session by ending at 12:30 pm at this meeting.  Visit our web site for more information and to register for this free event!

If you will attend this event, please feel free to come and say “Hello”!

Read the rest of this post for the full agenda!

## Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, $X_1, X_2, ..., X_n$, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all $n$ variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, $T(\textbf{X})$, such that

1. $T(\textbf{X})$ is a sufficient statistic
2. if $U(\textbf{X})$ is any other sufficient statistic, then there exists a function $g$ such that $T(\textbf{X}) = g[U(\textbf{X})].$

Note that, if there exists a one-to-one function $h$ such that $T(\textbf{X}) = h[U(\textbf{X})],$

then $T(\textbf{X})$ and $U(\textbf{X})$ are equivalent.
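A small example may make this notion of equivalence concrete. It is my own illustration, and it assumes a model in which the sample sum is a sufficient statistic (independent Bernoulli($\theta$) data, for instance); the particular choices of $T$, $U$, $h$ and $g$ below are for this example only.

```latex
\begin{align*}
T(\mathbf{X}) &= \sum_{i=1}^{n} X_i, \qquad
U(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} X_i, \\
T(\mathbf{X}) &= h[U(\mathbf{X})] \quad \text{with} \quad h(u) = n u .
\end{align*}
% h is one-to-one, so the sample sum and the sample mean are equivalent:
% each can be recovered exactly from the other.
% By contrast, T(X) = g(X_1, ..., X_n) with g(x_1, ..., x_n) = x_1 + ... + x_n
% is not one-to-one when n >= 2, so the full data set is not equivalent to T.
```

In this setting, the sum and the mean carry exactly the same information, whereas going from the full data set to the sum is a genuine reduction.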

## Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing

#### Introduction

In my job as a Biostatistical Analyst at the British Columbia (BC) Cancer Agency in Vancouver, I recently needed to get the postal codes for the hospitals in BC.  I found a data table of the hospitals with their addresses, but I needed to extract the postal codes from the addresses.  In this tutorial, I will show you some text processing techniques in SAS that I used to extract the postal codes from that raw data file.
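To give a feel for the kind of text processing involved, here is a minimal sketch of the same idea in Python rather than SAS. It is not the code from this post. It assumes that each address ends with a Canadian postal code in the usual letter-digit-letter, digit-letter-digit layout, and it uses a simplified pattern that does not check which letters are valid. The function name and the sample addresses are made up for illustration.

```python
import re
from typing import Optional

# Simplified pattern: letter-digit-letter, optional space, digit-letter-digit.
POSTAL_CODE = re.compile(r"\b([A-Z]\d[A-Z])\s?(\d[A-Z]\d)\b", re.IGNORECASE)


def extract_postal_code(address: str) -> Optional[str]:
    """Return the last postal code found in an address, or None if absent."""
    matches = POSTAL_CODE.findall(address)
    if not matches:
        return None
    # Postal codes normally come at the end of an address, so keep the last match.
    first_half, second_half = matches[-1]
    # Normalize to upper case with a single space in the middle.
    return f"{first_half.upper()} {second_half.upper()}"


if __name__ == "__main__":
    # Hypothetical addresses, not taken from the hospital data table.
    print(extract_postal_code("123 Example Street, Vancouver, BC V5Z 1A1"))
    print(extract_postal_code("456 Sample Road, Victoria BC v8r1j8"))
    print(extract_postal_code("789 No Code Avenue, Kelowna, BC"))
```

Normalizing the case and spacing in one place makes the extracted codes easier to merge with other tables later.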

* This blog post contains information licensed under the Open Government License – British Columbia.

Read the rest of this post to get the SAS code for extracting the postal codes and the final spreadsheet that contains the postal codes of the hospitals in British Columbia!

## Café Scientifique – Materials Science Seminar by Neil Branda on Wednesday, November 19, 2014

If you will attend the following seminar, please do come and say “Hello”!  The event is free, but registration is required!  For more information, visit the SFU Café Scientifique’s web site!

Time: 7:00 – 8:30 pm

Date: Wednesday, November 19, 2014

Place: Boston Pizza, 1045 Columbia Street, New Westminster, BC

Title: It’s a Materials World – From Sticks and Stones to Nanotechnology, how materials have changed our world

Speaker: Neil Branda – Professor of Chemistry at Simon Fraser University, Executive Director of 4D LABS, and Chief Technology Officer of SWITCH Materials

Abstract:

Since the beginning, understanding how materials can be used for specific tasks has resulted in some of the biggest changes to civilizations. Modern society is becoming more and more dependent on the development and use of advanced materials. From the basics to the controversial, how materials have affected the way we live and play will be discussed.

Biography of Speaker:

Dr. Neil Branda is a professor of Chemistry and a Canada Research Chair at Simon Fraser University; the Executive Director of 4D LABS, a research centre for advanced materials and nano-scale devices; the CTO of SWITCH Materials Inc., a company he founded to commercialize his molecular switching technology; and the Founder and Director of the NanoCommunity Canada Research Network, a community of nanotechnology researchers committed to sharing knowledge and working collaboratively to advance applications in medical diagnostics, therapeutics, renewable energy and advanced materials.

## Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data $\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let $t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.  Let $g_\theta(t)$ be the PDF for $T(\mathbf{X})$.

If the conditional PDF $h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]$

is independent of $\theta$, then $T(\mathbf{X})$ is a sufficient statistic for $\theta$.  In other words, $h_\theta(\mathbf{X}) = h(\mathbf{X})$,

and $\theta$ does not appear in $h(\mathbf{X})$.

Intuitively, this means that $T(\mathbf{X})$ contains everything you need to estimate $\theta$, so knowing $T(\mathbf{X})$ (i.e. conditioning $f_\theta(x)$ on $T(\mathbf{X})$) is sufficient for estimating $\theta$.
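The ratio in this definition can be checked numerically. The sketch below is my own illustration and assumes independent Bernoulli($\theta$) data with the sample sum as $T(\mathbf{X})$; the function names are hypothetical. If the sum is sufficient, the ratio of the joint probability to the probability of the observed sum should be the same for every value of $\theta$.

```python
from math import comb, isclose


def joint_pmf(x: tuple, theta: float) -> float:
    """f_theta(x): joint probability of an independent Bernoulli(theta) sample."""
    successes = sum(x)
    return theta ** successes * (1 - theta) ** (len(x) - successes)


def pmf_of_sum(t: int, n: int, theta: float) -> float:
    """g_theta(t): Binomial(n, theta) probability that the sample sum equals t."""
    return comb(n, t) * theta ** t * (1 - theta) ** (n - t)


def conditional_pmf(x: tuple, theta: float) -> float:
    """h_theta(x) = f_theta(x) / g_theta[T(x)], where T(x) is the sample sum."""
    return joint_pmf(x, theta) / pmf_of_sum(sum(x), len(x), theta)


if __name__ == "__main__":
    sample = (1, 0, 1, 0)
    for theta in (0.2, 0.5, 0.7):
        # The ratio simplifies to 1 / comb(4, 2), whatever theta is.
        print(theta, conditional_pmf(sample, theta))
    print(isclose(conditional_pmf(sample, 0.2), conditional_pmf(sample, 0.7)))
```

Algebraically, the powers of $\theta$ and $1 - \theta$ cancel, leaving only the reciprocal of the binomial coefficient, so $\theta$ does not appear in the conditional distribution.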

Often, the sufficient statistic for $\theta$ is a summary statistic of $X_1, X_2, ..., X_n$, such as their

• sample mean
• sample median – removed thanks to comment by Christian Robert (Xi’an)
• sample minimum
• sample maximum

If such a summary statistic is sufficient for $\theta$, then knowing this one statistic is just as useful as knowing all $n$ data for estimating $\theta$.

*The above definition holds for both discrete and continuous random variables.

## Interview with SFU Office of Graduate Studies & Postdoctoral Fellows: Using Social Media to Advance Your Career

Jackie Amsden, the Coordinator of Postdoctoral Fellows & Professional Development Programs in the Office of Graduate Studies & Postdoctoral Fellows at Simon Fraser University (SFU), recently asked me to share my experience in using blogging and social media to advance my career.  I am pleased to have shared my advice with Jackie in an interview, and she summarized our conversation in a blog post.  I am especially delighted to hear that my advice generated valuable discussion about professional development for a new group of graduate students and post-doctoral fellows during their orientation at SFU. Jackie and other members of her team have written a series of blog posts on professional development for graduate students and post-doctoral fellows – check it out!  You can follow Jackie on Twitter @jackiecamsden.

It is always a pleasure to give back to my alma mater and help university students to develop their careers!  Thanks, Jackie!