## Career Seminar at Department of Statistics and Actuarial Science, Simon Fraser University: 1:30 – 2:20 pm, Friday, February 20, 2015

I am very pleased to be invited to speak to the faculty and students in the Department of Statistics and Actuarial Science at Simon Fraser University on this upcoming Friday.  I look forward to sharing my career advice and answering questions from the students about how to succeed in a career in statistics.  If you will attend this seminar, please feel free to come and say “Hello”!

## The advantages of using count() to get N-way frequency tables as data frames in R

#### Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

#### 2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
gear
cyl      3     4     5
4        1     8     2
6        2     4     1
8        12    0     2

## Mathematical Statistics Lesson of the Day – Complete Statistics

The set-up for today’s post mirrors my earlier Statistics Lesson of the Day on sufficient statistics.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.

If

$E_\theta \{g[T(\mathbf{X})]\} = 0, \ \ \forall \ \theta,$

implies that

$P \{g[T(\mathbf{X})]\} = 0] = 1,$

then $T(\mathbf{X})$ is said to be complete.  To deconstruct this esoteric mathematical statement,

1. let $g(t)$ be a measurable function
2. if you want to use $g[T(\mathbf{X})]$ to form an unbiased estimator of the zero function,
3. and if the only such function is almost surely equal to the zero function,
4. then $T(\mathbf{X})$ is a complete statistic.

I will discuss the intuition behind this bizarre definition in a later Statistics Lesson of the Day.

*This above definition holds for discrete and continuous random variables.

## Performing Logistic Regression in R and SAS

#### Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

#### The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  - Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

## Christian Robert Shows that the Sample Median Cannot Be a Sufficient Statistic

I am grateful to Christian Robert (Xi’an) for commenting on my recent Mathematical Statistics Lessons of the Day on sufficient statistics and minimally sufficient statistics.

In one of my earlier posts, he wisely commented that the sample median cannot be a sufficient statistic.  He has supplemented this by writing on his own blog to show that the median cannot be a sufficient statistic.

Thank you, Christian, for your continuing readership and contribution.  It’s a pleasure to learn from you!

## Mathematical Statistics Lesson of the Day – Minimally Sufficient Statistics

In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter.  Indeed, the entire data set, $X_1, X_2, ..., X_n$, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter.  However, using all $n$ variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic.  If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it.

Our saviour for this problem is a minimally sufficient statistic.  This is defined as a statistic, $T(\textbf{X})$, such that

1. $T(\textbf{X})$ is a sufficient statistic
2. if $U(\textbf{X})$ is any other sufficient statistic, then there exists a function $g$ such that

$T(\textbf{X}) = g[U(\textbf{X})].$

Note that, if there exists a one-to-one function $h$ such that

$T(\textbf{X}) = h[U(\textbf{X})],$

then $T(\textbf{X})$ and $U(\textbf{X})$ are equivalent.

## Mathematical Statistics Lesson of the Day – Sufficient Statistics

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$.  Let $f_\theta(x)$ be the probability density function (PDF)* for $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$.  Let $g_\theta(t)$ be the PDF for $T(X)$.

If the conditional PDF

$h_\theta(\mathbf{X}) = f_\theta(x) \div g_\theta[T(\mathbf{X})]$

is independent of $\theta$, then $T(\mathbf{X})$ is a sufficient statistic for $\theta$.  In other words,

$h_\theta(\mathbf{X}) = h(\mathbf{X})$,

and $\theta$ does not appear in $h(\mathbf{X})$.

Intuitively, this means that $T(\mathbf{X})$ contains everything you need to estimate $\theta$, so knowing $T(\mathbf{X})$ (i.e. conditioning $f_\theta(x)$ on $T(\mathbf{X})$) is sufficient for estimating $\theta$.

Often, the sufficient statistic for $\theta$ is a summary statistic of $X_1, X_2, ..., X_n$, such as their

• sample mean
• sample median – removed thanks to comment by Christian Robert (Xi’an)
• sample minimum
• sample maximum

If such a summary statistic is sufficient for $\theta$, then knowing this one statistic is just as useful as knowing all $n$ data for estimating $\theta$.

*This above definition holds for discrete and continuous random variables.

## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

#### Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

## Online index of plots and corresponding R scripts

Dear Readers of The Chemical Statistician,

Joanna Zhao, an undergraduate researcher in the Department of Statistics at the University of British Columbia, produced a visual index of over 100 plots using ggplot2, the R package written by Hadley Wickham.

An example of a plot and its source R code on Joanna Zhao’s catalog.

Click on a thumbnail of any picture in this catalog – you will see the figure AND all of the necessary code to reproduce it.  These plots are from Naomi Robbins‘ book “Creating More Effective Graphs”.

If you

• want to produce an effective plot in R
• roughly know what the plot should look like
• but could really use an example to get started,

then this is a great resource for you!  A related GitHub repository has the code for ALL figures and the infrastructure for Joanna’s Shiny app.

I learned about this resource while working in my job at the British Columbia Cancer Agency; I am fortunate to attend a wonderful seminar series on statistics at the British Columbia Centre for Disease Control, and a colleague from this seminar told me about it.  By sharing this with you, I hope that it will immensely help you with your data visualization needs!

## Mathematics and Mathematical Statistics Lesson of the Day – Convex Functions and Jensen’s Inequality

Consider a real-valued function $f(x)$ that is continuous on the interval $[x_1, x_2]$, where $x_1$ and $x_2$ are any 2 points in the domain of $f(x)$.  Let

$x_m = 0.5x_1 + 0.5x_2$

be the midpoint of $x_1$ and $x_2$.  Then, if

$f(x_m) \leq 0.5f(x_1) + 0.5f(x_2),$

then $f(x)$ is defined to be midpoint convex.

More generally, let’s consider any point within the interval $[x_1, x_2]$.  We can denote this arbitrary point as

$x_\lambda = \lambda x_1 + (1 - \lambda)x_2,$ where $0 < \lambda < 1$.

Then, if

$f(x_\lambda) \leq \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be convex.  If

$f(x_\lambda) < \lambda f(x_1) + (1 - \lambda) f(x_2),$

then $f(x)$ is defined to be strictly convex.

There is a very elegant and powerful relationship about convex functions in mathematics and in mathematical statistics called Jensen’s inequality.  It states that, for any random variable $Y$ with a finite expected value and for any convex function $g(y)$,

$E[g(Y)] \geq g[E(Y)]$.

A function $f(x)$ is defined to be concave if $-f(x)$ is convex.  Thus, Jensen’s inequality can also be stated for concave functions.  For any random variable $Z$ with a finite expected value and for any concave function $h(z)$,

$E[h(Z)] \leq h[E(Z)]$.

In future Statistics Lessons of the Day, I will prove Jensen’s inequality and discuss some of its implications in mathematical statistics.

## Mathematical Statistics Lesson of the Day – The Glivenko-Cantelli Theorem

In 2 earlier tutorials that focused on exploratory data analysis in statistics, I introduced

There is actually an elegant theorem that provides a rigorous basis for using empirical CDFs to estimate the true CDF – and this is true for any probability distribution.  It is called the Glivenko-Cantelli theorem, and here is what it states:

Given a sequence of $n$ independent and identically distributed random variables, $X_1, X_2, ..., X_n$,

$P[\lim_{n \to \infty} \sup_{x \epsilon \mathbb{R}} |\hat{F}_n(x) - F_X(x)| = 0] = 1.$

In other words, the empirical CDF of $X_1, X_2, ..., X_n$ converges uniformly to the true CDF.

My mathematical statistics professor at the University of Toronto, Keith Knight, told my class that this is often referred to as “The First Theorem of Statistics” or the “The Fundamental Theorem of Statistics”.  I think that this is a rather subjective title – the central limit theorem is likely more useful and important – but Page 261 of John Taylor’s An introduction to measure and probability (Springer, 1997) recognizes this attribution to the Glivenko-Cantelli theorem, too.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$.  Remarkably, it quantifies just how often $X$ is far away from $\mu$.  Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$.  It roughly says that the probability of $X$ being $k$ standard deviations away from $\mu$ is at most $k^{-2}$.  Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable $X$, as long as $E(X)$ and $V(X)$ are finite.  (Markov’s inequality requires only $E(X)$ to be finite.)  This is quite a marvelous result!

## Mathematical Statistics Lesson of the Day – Chebyshev’s Inequality

The variance of a random variable $X$ is just an expected value of a function of $X$.  Specifically,

$V(X) = E[(X - \mu)^2], \ \text{where} \ \mu = E(X)$.

Let’s substitute $(X - \mu)^2$ into Markov’s inequality and see what happens.  For convenience and without loss of generality, I will replace the constant $c$ with another constant, $b^2$.

$\text{Let} \ b^2 = c, \ b > 0. \ \ \text{Then,}$

$P[(X - \mu)^2 \geq b^2] \leq E[(X - \mu)^2] \div b^2$

$P[ (X - \mu) \leq -b \ \ \text{or} \ \ (X - \mu) \geq b] \leq V(X) \div b^2$

$P[|X - \mu| \geq b] \leq V(X) \div b^2$

Now, let’s substitute $b$ with $k \sigma$, where $\sigma$ is the standard deviation of $X$.  (I can make this substitution, because $\sigma$ is just another constant.)

$\text{Let} \ k \sigma = b. \ \ \text{Then,}$

$P[|X - \mu| \geq k \sigma] \leq V(X) \div k^2 \sigma^2$

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

This last inequality is known as Chebyshev’s inequality, and it is just a special version of Markov’s inequality.  In a later Statistics Lesson of the Day, I will discuss the motivation and intuition behind it.  (Hint: Read my earlier lesson on the motivation and intuition behind Markov’s inequality.)

## Getting Ready for Mathematical Classes in the New Semester – Guest-Blogging on SFU’s Career Services Informer

The following blog post was slightly condensed for editorial brevity and then published on the Career Services Informer, the official blog of the Career Services Centre at my undergraduate alma mater, Simon Fraser University

As a new Fall semester begins, many students start courses such as math, physics, computing science, engineering and statistics.  These can be tough classes with a rapid progression in workload and difficulty, but steady preparation can mount a strong defense to the inevitable pressure and stress.  Here are some tips to help you to get ready for those classes.

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

$P(X \geq c) \leq E(X) \div c,$ where $c > 0$.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable $X$ “far” away from its “centre” or “central value”?

Intuitively, the “central value” of $X$ is the value that of $X$ that is most commonly (or most frequently) observed.  Thus, as $X$ deviates further and further from its “central value”, we would expect those distant-from-the-centre vales to be less frequently observed.

Recall that the expected value, $E(X)$, is a measure of the “centre” of $X$.  Thus, we would expect that the probability of $X$ being very far away from $E(X)$ is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As $c$ becomes really far away from $E(X)$, the event $X \geq c$ becomes less probable.

You can confirm this by substituting several key values of $c$.

• If $c = E(X)$, then $P[X \geq E(X)] \leq 1$; this is the highest upper bound that $P(X \geq c)$ can get.  This makes intuitive sense; $X$ is going to be frequently observed near its own expected value.

• If $c \rightarrow \infty$, then $P(X \geq \infty) \leq 0$.  By Kolmogorov’s axioms of probability, any probability must be inclusively between $0$ and $1$, so $P(X \geq \infty) = 0$.  This makes intuitive sense; there is no possible way that $X$ can be bigger than positive infinity.

## The Chi-Squared Test of Independence – An Example in Both R and SAS

#### Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, $X$ and $Y$, the chi-squared test of independence determines whether or not there exists a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

$H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y$

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.

## Applied Statistics Lesson of the Day – The Coefficient of Variation

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set.  However, consider the following 2 hypothetical cases:

1. the standard deviation for the incomes of households in Canada is $2,000 2. the standard deviation for the incomes of the 5 major banks in Canada is$2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas$2,000 is not a lot of money for one of the “Big Five” banks.  Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets.  One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes.  I hope that all new statistics students get to learn this alternative measure of dispersion.

## Using Your Vacation to Develop Your Career – Guest Blogging on Simon Fraser University’s Career Services Informer

The following post was originally published on the Career Services Informer.

I recently took a vacation from my former role as a statistician at the BC Centre for Excellence in HIV/AIDS. I did not plan a trip out of town – the spring weather was beautiful in Vancouver, and I wanted to spend time on the things that I like to do in this city. Many obvious things came to mind – walking along beaches, practicing Python programming and catching up with friends – just to name a few.

Yes, Python programming was one of the obvious things on my vacation to-do list, and I understand how ridiculous this may seem to some people. Why tax my brain during a time that is meant for mental relaxation, especially when the weather is great?

## Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

• its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

• its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

## Video Tutorial – Allelic Frequencies Remain Constant From Generation to Generation Under the Hardy-Weinberg Equilibrium

The Hardy-Weinberg law is a fundamental principle in statistical genetics.  If its 7 assumptions are fulfilled, then it predicts that the allelic frequency of a genetic trait will remain constant from generation to generation.  In this new video tutorial in my Youtube channel, I explain the math behind the Hardy-Weinberg theorem.  In particular, I clarify the origin of the connection between allelic frequencies and genotyopic frequencies in the second generation – I have not found a single textbook or web site on this topic that explains this calculation, so I hope that my explanation is helpful to you.

You can also watch the video below the fold!