## Detecting Unfair Dice in Casinos with Bayes’ Theorem

#### Introduction

While reading a bioinformatics textbook, I came across an interesting problem that requires Bayes’ Theorem and some simple R programming.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively.

#### The Problem

The following question is a slightly modified version of Exercise #1.2 on Page 8 in “Biological Sequence Analysis” by Durbin, Eddy, Krogh and Mitchison.

An occasionally dishonest casino uses 2 types of dice.  Of its dice, 97% are fair but 3% are unfair, and a “five” comes up 35% of the time for these unfair dice.  If you pick a die randomly and roll it, how many “fives” in a row would you need to see before it was most likely that you had picked an unfair die?

Read more to learn how to create the following plot and how it invokes Bayes’ Theorem to solve the above problem!
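As a rough preview of the calculation (and not the full post’s code), here is a minimal R sketch that applies Bayes’ Theorem directly; the prior and likelihood values come from the problem statement above, and the variable names are my own.

# Prior probabilities of picking each type of die
prior.unfair = 0.03
prior.fair   = 0.97

# Probability of rolling a "five" with each type of die
p.five.unfair = 0.35
p.five.fair   = 1/6

# Posterior probability of having an unfair die after n "fives" in a row
n = 1:10
posterior.unfair = (prior.unfair * p.five.unfair^n) /
     (prior.unfair * p.five.unfair^n + prior.fair * p.five.fair^n)

# Smallest number of consecutive "fives" for which the unfair die becomes more likely
min(n[posterior.unfair > 0.5])

Plotting posterior.unfair against n gives a visualization along the lines of the plot mentioned above.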

## Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

#### Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

Previous posts in this series on EDA include

Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!
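For readers who want to experiment right away, here is a minimal sketch of a normal Q-Q plot for the “Ozone” data; it is not necessarily the exact code used in the full post.

# Extract the ozone data and remove the missing values
ozone = na.omit(airquality$Ozone)

# Normal quantile-quantile plot with a reference line
qqnorm(ozone, main = 'Normal Q-Q Plot of Ozone in New York')
qqline(ozone)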

## Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

#### Introduction

Data in R are often stored in data frames, because they can store multiple types of data.  (In R, data frames are more general than matrices, because matrices can only store one type of data.)  Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.  I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, which allows me to show different ways of exploring these 2 types of data.

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis.  Previous posts in this series include

#### Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns).  The output is a vector.

> dim(InsectSprays)
[1] 72 2

Use nrow() and ncol() to get the number of rows and number of columns, respectively.  You can get the same information by extracting the first and second element of the output vector from dim().

> nrow(InsectSprays)   # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)   # same as dim(InsectSprays)[2]
[1] 2
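Beyond dim(), nrow(), and ncol(), a few other base R functions that are commonly used for this purpose are sketched below; the exact set covered in the full post may differ.

str(InsectSprays)           # structure: variable names, types, and a preview of the values
head(InsectSprays)          # first 6 rows of the data frame
summary(InsectSprays)       # summary statistics for numeric columns, counts for factors
table(InsectSprays$spray)   # counts of each level of the categorical variable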

## Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series.  I will define and calculate the 5-number summary in 2 different ways that are commonly used in R.  (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.)  I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function.  This post expands on a recent comment that I made to correct an error in the post on box plots.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     6.0     6.0     8.5    11.0

Why do these 2 methods of calculating the 5-number summary in R give different results?  Read the rest of this post to find out the answer!
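Without giving the whole answer away, one way to probe the difference is to call quantile() directly; the sketch below uses quantile()’s default method (type = 7), which is what summary() relies on, whereas fivenum() returns Tukey’s hinges.  This is my own illustration, not code from the post.

y = seq(1, 11, by = 2)

# summary()'s quartiles match quantile()'s default interpolation method (type = 7)
quantile(y, probs = c(0, 0.25, 0.5, 0.75, 1), type = 7)

# fivenum() instead returns the minimum, Tukey's hinges, the median, and the maximum
fivenum(y)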

Previous posts in this series on EDA include

## Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

#### Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with a parametric density curve and a non-parametric kernel density plot.  I will finally answer the question that I have asked (and hinted at answering) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like the plot above!
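As a minimal, self-contained sketch of the overlay idea (assuming the same “Ozone” data; the full post’s code and its chosen parametric distribution may differ):

ozone = na.omit(airquality$Ozone)

# Histogram on the density scale so that density curves can be overlaid
hist(ozone, freq = FALSE, col = 'grey',
     xlab = 'Ozone (ppb)', main = 'Histogram of Ozone with Density Curves')

# Non-parametric kernel density estimate
lines(density(ozone, from = 0), lwd = 2)

# Parametric (normal) density curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(ozone), sd = sd(ozone)), add = TRUE, lty = 2, lwd = 2)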

This is another post in my continuing series on exploratory data analysis (EDA).  Previous posts in this series on EDA include

## Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.  In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!
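A bare-bones version of such a histogram can be produced with hist(); this is only a sketch under the same data assumptions, not the post’s final code.

ozone = na.omit(airquality$Ozone)

# Basic histogram of the ozone concentrations; hist() chooses the bins automatically,
# and the "breaks" option can be used to control them
hist(ozone, breaks = 10,
     xlab = 'Ozone (ppb)', main = 'Histogram of Ozone in New York')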

## Exploratory Data Analysis – Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

Update on July 15, 2013:

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plots should have non-negative support.  This has been corrected in this post by

- defining new variables called max.ozone and max.ozone2

- using the options “from = 0” and “to = max.ozone” or “to = max.ozone2” in the density() function when defining density.ozone and density.ozone2 in the R code (a brief sketch of this correction appears below).
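Here is a minimal sketch of the corrected call for the New York data only; max.ozone2 and density.ozone2 for Ozonopolis would follow the same pattern (the variable names come from the update above, and everything else is illustrative).

ozone = na.omit(airquality$Ozone)
max.ozone = max(ozone)

# Restrict the kernel density estimate to the non-negative support of ozone concentration
density.ozone = density(ozone, from = 0, to = max.ozone)

plot(density.ozone, xlab = 'Ozone (ppb)',
     main = 'Kernel Density Plot of Ozone in New York')

# Rug plot: a tick mark along the horizontal axis for every observation
rug(ozone)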

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation.

#### Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (CDFs), and how to plot empirical CDFs in R.

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!

## Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R.  (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

1. using the built-in ecdf() and plot() functions in R
2. calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R.  Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.
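A condensed sketch of both approaches, with the missing values removed up front (the full post handles these details more carefully):

ozone = na.omit(airquality$Ozone)
n = length(ozone)

# Way 1: the built-in ecdf() and plot() functions
plot(ecdf(ozone), main = 'Empirical CDF of Ozone (ecdf function)')

# Way 2: cumulative probabilities plotted against the ordered data
plot(sort(ozone), (1:n)/n, type = 's',
     xlab = 'Ozone (ppb)', ylab = 'Cumulative Probability',
     main = 'Empirical CDF of Ozone (manual calculation)')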

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs.  That plot will be compared to the plots of the empirical CDFs of the ozone data to check if they came from a normal distribution.

## Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

#### Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.
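A minimal version of that example is sketched below; the seed and the labels are my own choices, so the resulting figure will not match the post’s plot exactly.

set.seed(1)

# 100 random numbers from the standard normal distribution
x = rnorm(100)

# Empirical CDF computed and plotted with the built-in ecdf() function
plot(ecdf(x), main = 'Empirical CDF of 100 Standard Normal Random Numbers')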

Read the rest of this post to learn what an empirical CDF is and how to produce the above plot!

## Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

#### Introduction

Recently, I began a series on exploratory data analysis (EDA), and I have written about descriptive statistics, box plots, and kernel density plots so far.  As previously mentioned in my post on box plots, there is a way to combine box plots and kernel density plots.  This combination results in violin plots, and I will show how to create them in R today.

Continuing from my previous posts on EDA, I will use 2 univariate data sets.  One is the “ozone” data vector that is part of the “airquality” data set that is built into R; this data set contains data on New York’s air pollution.  The other is a simulated data set of ozone pollution in a fictitious city called “Ozonopolis”.  It is important to remember that the ozone data from New York has missing values, and this has created complications that needed to be addressed in previous posts; missing values need to be addressed for violin plots, too, and in a different way than before.

The vioplot() command in the “vioplot” package creates violin plots; the plotting options in this function are different from, and less versatile than, those in other plotting functions that I have used in R.  Thus, I needed to be more creative with the plot(), title(), and axis() functions to create the plots that I want.  Read the details carefully to understand and benefit fully from the code.
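For orientation only, here is the most basic use of vioplot(); the post’s actual plots involve considerably more customization with plot(), title(), and axis(), as noted above.

# install.packages('vioplot')   # if the package is not already installed
library(vioplot)

# Remove the missing values; vioplot() does not accept NAs
ozone = na.omit(airquality$Ozone)

# A basic violin plot: a box plot overlaid with a mirrored kernel density plot
vioplot(ozone, names = 'New York', col = 'grey')
title(main = 'Violin Plot of Ozone Concentration', ylab = 'Ozone (ppb)')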

Read further to learn how to create these violin plots that combine box plots with kernel density plots!  Be careful – the syntax is more complicated than usual!

## Guest Blogging on Simon Fraser University’s Career Services Informer – Opening Doors In Your Job Search With Statistics & Data Analysis

I am very pleased to announce my return as a guest blogger for Simon Fraser University’s (SFU’s) Career Services Informer (CSI), which published my first post earlier this week on how knowing statistics and data analysis can open doors in today’s job market.  During my undergraduate studies at SFU, I volunteered as a Career Peer Educator at its Career Services Centre for 5 years; I was very fortunate to be a part of a great team of fellow volunteers and staff members who provided career advice for students and alumni on writing cover letters and résumés, preparing for job interviews, and networking to build one’s career and search for jobs.  Soon after the Career Services Informer – a career-oriented blog – was created, I began blogging about my work experiences during my undergraduate degree in a series called Eric’s Corner.  This fruitful series lasted for 3 years; I took the next few years to focus on finishing my undergraduate degree and moving to Toronto to pursue my graduate studies – all centred on my very rapid and sharp change toward statistics.

Now that I have completed my Master’s degree in statistics at the University of Toronto and worked at Predictum for over a year, Jo-Anne Nadort, one of my former supervisors in the Career Peer Education program and a Career Advisor in SFU’s Career Services Centre, invited me to blog for the CSI again.  It has been a pleasure to work with CSI’s editor, David Lindskoog, on my first post, which focused on the value of being proficient in statistics and data analysis in today’s information- and data-driven economy.  I will continue to share my advice and experiences about working and succeeding in STEM (science, technology, engineering, and math) in future posts in Eric’s Corner.  Stay tuned!

## Exploratory Data Analysis: Kernel Density Estimation – Conceptual Foundations

For the sake of brevity, this post has been created from the first half of a previous long post on kernel density estimation.  This first half focuses on the conceptual foundations of kernel density estimation.  The second half will focus on constructing kernel density plots and rug plots in R.

#### Introduction

Recently, I began a series on exploratory data analysis; so far, I have written about computing descriptive statistics and creating box plots in R for a univariate data set with missing values.  Today, I will continue this series by introducing the underlying concepts of kernel density estimation, a useful non-parametric technique for visualizing the underlying distribution of a continuous variable.  In the follow-up post, I will show how to construct kernel density estimates and plot them in R.  I will also introduce rug plots and show how they can complement kernel density plots.

A follow-up post will show how to construct these kernel density plots in R!

But first – read the rest of this post to learn the conceptual foundations of kernel density estimation.
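To make the concept concrete, here is a small sketch that computes a Gaussian kernel density estimate “by hand” and compares its shape to the output of R’s density() function; this is my own illustration of the definition, not code from either post.

# Kernel density estimate of f at a point x0, with a Gaussian kernel K and bandwidth h:
# f.hat(x0) = (1 / (n * h)) * sum over i of K((x0 - x_i) / h)
kde.gaussian = function(x0, data, h) {
     sapply(x0, function(x) mean(dnorm((x - data) / h)) / h)
}

ozone = na.omit(airquality$Ozone)
h = bw.nrd0(ozone)   # the default bandwidth rule used by density()

# Evaluate the manual estimate on a grid and compare it to density()'s output
grid = seq(0, max(ozone), length.out = 200)
plot(grid, kde.gaussian(grid, ozone, h), type = 'l',
     xlab = 'Ozone (ppb)', ylab = 'Estimated Density',
     main = 'Gaussian Kernel Density Estimate Computed Manually')
lines(density(ozone, bw = h, from = 0, to = max(ozone)), lty = 2)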

## Don’t Take Good Data for Granted: A Caution for Statisticians

#### Background

Yesterday, I had the pleasure of attending my first Spring Alumni Reunion at the University of Toronto.  (I graduated from its Master of Science program in statistics in 2012.)  There were various events for the alumni: attending interesting lectures, finding out about our school’s newest initiatives, and meeting other alumni in smaller gatherings tailored for particular groups or interests.  The event was very well organized and executed, and I am very appreciative of my alma mater for working so hard to include us in our university’s community beyond graduation.  Most of the attendees graduated 20 or more years ago; I met quite a few who graduated in the 1950s and 1960s.  It was quite interesting to chat with them over lunch and during breaks to learn about what our school was like back then.  (Incidentally, I did not meet anyone who graduated in the last 2 years.)

#### A Thought-Provoking Lecture

My highlight at the reunion event was attending Joseph Wong’s lecture on poverty, governmental welfare programs, developmental economics in poor countries, and social innovation.  (He is a political scientist at UToronto, and you can find videos of him discussing his ideas on YouTube.)  Here are a few of his key ideas that I took away; note that these are my interpretations of what I can remember from the lecture, so they are not transcriptions or even paraphrases of his exact words:

1. Many workers around the world are not documented by official governmental records.  This is especially true in developing countries, where the nature of the employer-employee relationship (e.g. contractual work, temporary work, unreported labour) or the limitations of the survey/sampling methods make many of these “invisible workers” unrepresented.  Wong argues that this leads to inequitable distribution of welfare programs that aim to re-distribute wealth.
2. Social innovation is harnessing knowledge to create an impact.  It often does NOT involve inventing a new technology, but actually combining, re-combining, or arranging existing knowledge and technologies to solve a social problem in an innovative way.  Wong addressed this in further detail in a recent U of T News article.
3. Poor people will not automatically flock to take advantage of a useful product or service just because of a decrease in price.  Sometimes, substantial efforts and intelligence in marketing are needed to increase the quantity demanded.  A good example is the Tata Nano, a small car that was made and sold in India with huge expectations but underwhelming success.
4. Poor people often need to mitigate a lot of risk, and that can have a significant and surprising effect on their behaviour in response to the availability of social innovations.  For example, a poor person may forgo a free medical treatment or diagnostic screening if he/she risks losing a job or a business opportunity by taking the time away from work to get that treatment/screening.  I asked him about the unrealistic assumptions that he often sees in economic models based on his field work, and he noted the absence of risk (e.g. in cost functions) as one such common unrealistic assumption.

#### The Importance of Checking the Quality of the Data

These are all very interesting points to me in their own right.  However, Point #1 is especially important to me as a statistician.  During my Master’s degree, I was warned that most data sets in practice are not immediately ready for analysis, and substantial data cleaning is needed before any analysis can be done; data cleaning can often take 80% of the total amount of time in a project.  I have seen examples of this in my job since finishing my graduate studies a little over a year ago, and I’m sure that I will see more of it in the future.

Even before cleaning the data, it is important to check how the data were collected.  If sampling or experimental methods were used, it is essential to check if they were used or designed properly.  It would be unsurprising to learn that many bureaucrats, policy makers, and elected officials have used unreliable labour statistics to guide all kinds of economic policies on business, investment, finance, welfare, and labour – let alone the other non-economic justifications and factors, like politics, that cloud and distort these policies even further.

We statisticians have a saying about data quality: “garbage in – garbage out”.  If the data are of poor quality, then any insights derived from analyzing those data are useless, regardless of how good the analysis or the modelling technique is.  As a statistician, I cannot take good data for granted, and I aim to be more vigilant about the quality and the source of the data before I begin to analyze them.

## Exploratory Data Analysis: Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

#### Introduction

Last week, I wrote the first post in a series on exploratory data analysis (EDA).  I began by calculating summary statistics on a univariate data set of ozone concentration in New York City in the built-in data set “airquality” in R.  In particular, I talked about how to calculate those statistics when the data set has missing values.  Today, I continue this series by creating box plots in R and showing different variations and extensions that can be added; be sure to examine this post’s R code carefully for some valuable details.  I learned many of these tricks from Robert Kabacoff’s “R in Action” (2011).  Robert also has a nice blog called Quick-R that I consult often.

Recall that I extracted a vector called “ozone” from the data set “airquality”.

ozone = airquality$Ozone

#### Box Plots – What They Represent

The simplest box plot can be obtained by using the basic settings in the boxplot() command.  As usual, I use png() and dev.off() to print the image to a local folder on my computer.

png('INSERT YOUR DIRECTORY HERE/box plot ozone.png')
boxplot(ozone, ylab = 'Ozone (ppb)', main = 'Box Plot of Ozone in New York')
dev.off()
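As one example of the kind of variation the post covers (this particular combination is my own illustration; the post’s variations may differ), a notched box plot with the raw data overlaid can be produced as follows.

png('INSERT YOUR DIRECTORY HERE/box plot ozone notched.png')

# Notched box plot; notches give a rough visual guide for comparing medians between groups
boxplot(ozone, notch = TRUE, ylab = 'Ozone (ppb)',
     main = 'Notched Box Plot of Ozone in New York')

# Overlay the individual observations as jittered points
stripchart(ozone, vertical = TRUE, method = 'jitter', pch = 19, add = TRUE)

dev.off()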

## Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

#### Introduction

This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.
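The core of that calculation is small enough to sketch here; the na.rm = TRUE option is the key detail for this data set (the full post discusses it at length).

ozone = airquality$Ozone

sum(is.na(ozone))               # how many values are missing
mean(ozone, na.rm = TRUE)       # mean, ignoring the missing values
sd(ozone, na.rm = TRUE)         # standard deviation, ignoring the missing values
quantile(ozone, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)   # quartiles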

## Spatial Statistics Seminar in Toronto – Tuesday, May 21, 2013 @ SAS Canada Headquarters

I volunteer with the Southern Ontario Regional Association (SORA) of the Statistical Society of Canada (SSC) to organize a seminar series on business analytics here in Toronto.  The final seminar of the 2012-2013 series will be held on Tuesday, May 21 at SAS Canada Headquarters.  If you’re interested in attending, please email seminar.sora@gmail.com with the following subject to get a confirmation: Registration: Seminar by BBM Canada

Speakers: Derrick Gray and Ricardo Gomez-Insausti – BBM Canada

Title: The Power of the Latitude and Longitude – An Application of Spatial Techniques to Audience Measurement Data

Date: Tuesday, May 21, 2013

Location:

Suite 500

280 King Street East

Toronto

Networking: 2:00 – 2:30 pm

Introductory Remarks: 2:30 – 2:45 pm

Seminar Time: 2:45 – 3:45 pm

Discussion and Networking: 3:45 – 5:00 pm

Read the entire post to see the abstract and the speakers’ biographies.

## Webinar – Advanced Predictive Modelling for Manufacturing

The company that I work for, Predictum, is about to begin a free webinar series on statistics and analytics, and I will present the first one on Tuesday, May 14, at 2 pm EDT.  This first webinar will focus on how partial least squares regression can be used as a predictive modelling technique; the data sets are written in the context of manufacturing, but the material is definitely applicable to all industries that need techniques beyond basic statistical tools like linear regression for predictive modelling.  JMP, a software package that Predictum uses extensively, will be used to illustrate how partial least squares regression can be implemented.  This presentation will not be heavy in mathematical detail, so it will be accessible to a wide audience, including statisticians, analysts, managers, and executives.

Attend my company’s free webinar to listen to me talking about advanced predictive modelling and partial least squares regression!

To register for this free webinar, visit the webinar’s registration page on Webex.

## Presentation Slides – Overcoming Multicollinearity and Overfitting with Partial Least Squares Regression in JMP and SAS

My slides on partial least squares regression at the Toronto Area SAS Society (TASS) meeting on September 14, 2012, can be found here.

#### My Presentation on Partial Least Squares Regression

My first presentation to Toronto Area SAS Society (TASS) was delivered on September 14, 2012.  I introduced a supervised learning/predictive modelling technique called partial least squares (PLS) regression; I showed how normal linear least squares regression is often problematic when used with big data because of multicollinearity and overfitting, explained how partial least squares regression overcomes these limitations, and illustrated how to implement it in SAS and JMP.  I also highlighted the variable importance for projection (VIP) score that can be used to conduct variable selection with PLS regression; in particular, I documented its effectiveness as a technique for variable selection by comparing some key journal articles on this issue in academic literature.

The green line is an overfitted classifier.  Not only does it model the underlying trend, but it also models the noise (the random variation) at the boundary.  It separates the blue and the red dots perfectly for this data set, but it will classify very poorly on a new data set from the same population.

## How to Calculate a Partial Correlation Coefficient in R: An Example with Oxidizing Ammonia to Make Nitric Acid

#### Introduction

Today, I will talk about the math behind calculating partial correlation and illustrate the computation in R with an example involving the oxidation of ammonia to make nitric acid using a built-in data set in R called stackloss.  In a separate post, I will also share an R function that I wrote to estimate partial correlation.  In a later post, I will discuss the interpretation of the partial correlation coefficient at length.

I read Pages 234-237 in Section 6.6 of “Discovering Statistics Using R” by Andy Field, Jeremy Miles, and Zoe Field to learn about partial correlation.  They used a data set called “Exam Anxiety.dat” available from their companion web site (look under “6 Correlation”) to illustrate this concept; they calculated the partial correlation coefficient between exam anxiety and revision time while controlling for exam score.  As I discuss further below, a plot of the two sets of residuals from this calculation helps to illustrate the partial correlation coefficient.  This plot makes intuitive sense; if you take more time to study for an exam, you tend to have less exam anxiety, so there is a negative correlation between revision time and exam anxiety.

They used a function called pcor() in a package called “ggm”; however, I suspect that this package is no longer working properly, because it depends on a deprecated package called “RBGL” (i.e. “RBGL” is no longer available in CRAN).  See this discussion thread for further information.  Thus, I wrote my own R function to illustrate partial correlation.

Partial correlation is the correlation between 2 random variables while holding other variables constant.  To calculate the partial correlation between X and Y while holding Z constant (or controlling for the effect of Z, or averaging out Z), regress X on Z, regress Y on Z, and then calculate the correlation between the two sets of residuals.
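Here is a small sketch of that residual-based calculation using the stackloss data; the choice of variables is mine for illustration and is not necessarily the one used in the full post or in my pcor-style function.

data(stackloss)

# Partial correlation between stack.loss and Air.Flow, controlling for Water.Temp:
# regress each variable on Water.Temp, then correlate the residuals
res.x = residuals(lm(stack.loss ~ Water.Temp, data = stackloss))
res.y = residuals(lm(Air.Flow ~ Water.Temp, data = stackloss))

cor(res.x, res.y)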

## Using the Golden Section Search Method to Minimize the Sum of Absolute Deviations

#### Introduction

Recently, I introduced the golden section search method – a special way to save computation time by modifying the bisection method with the golden ratio – and I illustrated how to minimize a cusped function with this script.  I also wrote an R function to implement this method and an R script to apply this method with an example.  Today, I will apply this method to a statistical topic: minimizing the sum of absolute deviations with the median.

While reading Page 148 (Section 6.3) in Michael Trosset’s “An Introduction to Statistical Inference and Its Applications”, I learned 2 basic, simple, yet interesting theorems.

If X is a random variable with a population mean $\mu$ and a population median $q_2$, then

a) $\mu$ minimizes the function $f(c) = E[(X - c)^2]$

b) $q_2$ minimizes the function $h(c) = E(|X - c|)$

I won’t prove these theorems in this blog post (perhaps later), but I want to use the golden section search method to show a result similar to b):

c) The sample median, $\tilde{m}$, minimizes the function

$g(c) = \sum_{i=1}^{n} |X_i - c|$.

This is not surprising, of course, since

- $|X - c|$ is just a function of the random variable $X$

- by the law of large numbers,

$\lim_{n\to \infty}\frac{1}{n}\sum_{i=1}^{n} |X_i - c| = E(|X - c|)$

Thus, if the median minimizes $E(|X - c|)$, then, intuitively, the sample median minimizes $\frac{1}{n}\sum_{i=1}^{n} |X_i - c|$ – and hence $g(c)$ – for large $n$.  Let’s show this with the golden section search method, and let’s explore any differences that may arise between odd-numbered and even-numbered data sets.
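The sketch below is one way to carry this out; it is a compact re-implementation rather than the exact function and script from my earlier posts, and the simulated data sets are arbitrary.

# Golden section search for the minimum of a unimodal function f on [lower, upper]
golden.section.search = function(f, lower, upper, tol = 1e-7) {
     ratio = (sqrt(5) - 1) / 2        # about 0.618, the golden ratio minus 1
     x1 = upper - ratio * (upper - lower)
     x2 = lower + ratio * (upper - lower)
     while (upper - lower > tol) {
          if (f(x1) < f(x2)) {
               # the minimum lies in [lower, x2]
               upper = x2; x2 = x1; x1 = upper - ratio * (upper - lower)
          } else {
               # the minimum lies in [x1, upper]
               lower = x1; x1 = x2; x2 = lower + ratio * (upper - lower)
          }
     }
     (lower + upper) / 2
}

# Sum of absolute deviations of a data set x about a point c
sum.abs.dev = function(c, x) sum(abs(x - c))

set.seed(1)
x.odd  = rnorm(11)   # odd-numbered sample size
x.even = rnorm(10)   # even-numbered sample size

# The minimizer should agree with the sample median; for an even sample size, any point
# between the two middle values minimizes the sum, so small differences can occur there
golden.section.search(function(c) sum.abs.dev(c, x.odd), min(x.odd), max(x.odd))
median(x.odd)
golden.section.search(function(c) sum.abs.dev(c, x.even), min(x.even), max(x.even))
median(x.even)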