Applied Statistics | The Chemical Statistician

Video Tutorial: Naive Bayes Classifiers

October 3, 2018

Naive Bayes classifiers are simple but powerful tools for classification in statistics and machine learning. In this video tutorial, I use a simulated data set and illustrate the mathematical details of how this technique works.

In my recent episode of The Central Equilibrium about word embeddings and text classification, Mandy Gu used naive Bayes classifiers to determine if a sentence is toxic or non-toxic – a very common objective when moderating discussions in online forums. If you are not familiar with naive Bayes classifiers, then I encourage you to watch this video before watching Mandy’s episode on The Central Equilibrium.

Filed under Applied Mathematics, Applied Statistics, Data Mining, Machine Learning, Mathematics, Predictive Modelling, Statistics, Tutorials, Video Tagged with machine learning, naive bayes classifier, naive bayes classifiers, statistics, tutorial, video

Mandy Gu on Word Embeddings and Text Classification – The Central Equilibrium – Episode 9

September 26, 2018

I am so grateful to Mandy Gu for being a guest on The Central Equilibrium to talk about word embeddings and text classification. She began by showing how data from text can be encoded in vectors and matrices, and then she used a naive Bayes classifier to classify sentences as toxic or non-toxic – a very common problem for moderating discussions in online forums. I learned a lot from her in this episode, and you can learn more from Mandy on her Medium blog.

If you are not familiar with naive Bayes classifiers, then I encourage you to watch my video tutorial about this topic first.

Filed under Applied Mathematics, Applied Statistics, Machine Learning, Mathematics, Statistics, The Central Equilibrium, Tutorials, Video Tagged with machine learning, mandy gu, math, mathematics, naive bayes classifier, statistics, text classification, text mining, The Central Equilibrium, video, word embedding, word embeddings

Arnab Chakraborty on The Monty Hall Problem and Bayes’ Theorem – The Central Equilibrium – Episode 6

July 21, 2018

I am pleased to welcome Arnab Chakraborty back to my talk show, “The Central Equilibrium”, to talk about the Monty Hall Problem and Bayes’ theorem. In this episode, he shows 2 solutions to this classic puzzle in probability, and invokes Bayes’ theorem for the second solution.

If you have not watched Arnab’s first episode on Bayes’ theorem, then I encourage you to do that first.

Marilyn vos Savant provided a solution to this problem in PARADE Magazine in 1990-1991. Thousands of readers disagreed with her solution and criticized her vehemently for what they believed, incorrectly, to be an error. Some of these critics were mathematicians! She included some of those replies and provided alternative perspectives that led to the same conclusion. Although I am dismayed by the disrespect that some people showed in their letters to her, I am glad that a magazine column on probability was able to attract so much readership and interest. Arnab and I referred to one of her solutions in our episode. Thank you, Marilyn!
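If you would like to check the answer to this puzzle numerically (switching doors wins about 2/3 of the time, while staying wins about 1/3), a quick simulation is one way to do it. The sketch below is written in Python purely for illustration; the function names, the number of trials, and the random seed are my own choices and do not come from the episode.

```python
import random

def play(switch, rng):
    """Play one round of the Monty Hall game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # door hiding the car
    pick = rng.choice(doors)   # contestant's first pick
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is neither the first pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning under a fixed strategy."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"Stay:   {stay_rate:.3f}")
print(f"Switch: {switch_rate:.3f}")
```

With enough trials, the estimated win rate for switching should settle near 2/3 and the rate for staying near 1/3.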

Enjoy this episode of “The Central Equilibrium”!

Filed under Applied Statistics, Mathematics, Probability, Statistics, The Central Equilibrium, Video Tagged with Bayes' Theorem, math, mathematics, monty hall problem, probability, statistics, The Central Equilibrium

Layne Newhouse on representing neural networks – The Central Equilibrium – Episode 4

June 28, 2018

I am excited to present the first of a multi-episode series on neural networks on my talk show, “The Central Equilibrium”. My guest in this series is Layne Newhouse, and he talked about how to represent neural networks. We talked about the biological motivations behind neural networks, how to represent them in diagrams and mathematical equations, and a few of the common activation functions for neural networks.

Check it out!

Filed under Applied Statistics, Data Mining, Machine Learning, Mathematics, Statistics, The Central Equilibrium, Video Tagged with activation function, activation functions, hyperbolic tangent function, layne newhouse, logistic function, machine learning, math, mathematics, neural network, neural networks, rectifier linear unit, relu, statistics, The Central Equilibrium, video

A macro to execute PROC TTEST for multiple binary grouping variables in SAS (and sorting t-test statistics by their absolute values)

May 4, 2018

In SAS, you can perform PROC TTEST for multiple numeric variables in the same procedure. Here is an example using the built-in data set SASHELP.BASEBALL; I will compare the number of at-bats and number of walks between the American League and the National League.

proc ttest
     data = sashelp.baseball;
     class League;
     var nAtBat nBB; 
     ods select ttests;
run;

Here are the resulting tables.

Method	Variances	DF	t Value	Pr > |t|
Pooled	Equal	320	2.05	0.0410
Satterthwaite	Unequal	313.66	2.06	0.04

Method	Variances	DF	t Value	Pr > |t|
Pooled	Equal	320	0.85	0.3940
Satterthwaite	Unequal	319.53	0.86	0.3884

What if you want to perform PROC TTEST for multiple grouping (a.k.a. classification) variables? You cannot put more than one variable in the CLASS statement, so you would have to run PROC TTEST separately for each binary grouping variable. If you do put LEAGUE and DIVISION in the same CLASS statement, here is the resulting log.

1303 proc ttest
1304 data = sashelp.baseball;
1305 class league division;
 --------
 22
 202
ERROR 22-322: Expecting ;.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
1306 var natbat;
1307 ods select ttests;
1308 run;

There is no syntax in PROC TTEST to use multiple grouping variables at the same time, so this tutorial provides a macro to do so. There are several nice features about my macro:

It allows you to use multiple grouping variables at the same time.
It sorts the t-test statistics by their absolute values within each grouping variable.
It shows the name of each continuous variable in the output table, unlike the above output.

Here is its basic skeleton.

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, SAS Programming, Statistics, Tutorials Tagged with applied statistics, data analysis, do loop, macro, proc ttest, SAS, sas macro, sashelp.baseball, statistics, Student's t-test, t-test

A macro to automate the creation of indicator variables in SAS

April 25, 2018

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS. This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding. I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line. If you have not read my previous post on this topic, then I strongly encourage you to do that first. Don’t use this macro blindly.

Here is the macro. The key steps are

Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
Merge the original data with the newly created indicator variables
Delete the “INDICATORS” data set, which was created in an intermediate step

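%macro create_indicators(input_data, target, covariates, output_data);

/* Step 1: "pretend" to fit a logistic regression so that PROC LOGISTIC */
/* writes its design matrix (with the indicator variables) to INDICATORS */
proc logistic
     data = &input_data
          noprint
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;
run;

/* Step 2: one-to-one merge (no BY statement) of the original data with */
/* the indicators; this assumes the same rows in the same order         */
data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);
run;

/* Step 3: delete the intermediate INDICATORS data set */
proc datasets
     library = work
          noprint;
     delete indicators;
run;
quit;

%mend create_indicators;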

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro. As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables. I will do that here for the variables TYPE, MAKE, and ORIGIN.

Read more of this post

Filed under Applied Statistics, Biostatistics, Categorical Data Analysis, Data Analysis, SAS Programming, Statistics, Tutorials Tagged with categorical data, Categorical Data Analysis, categorical variable, data analysis, dummy coding, dummy variables, indicator, indicator variable, indicator variables, indicators, SAS, sas programming, statistics

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

April 24, 2018

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

They can be useful predictor variables in statistical models.
They can reduce the amount of memory required to store the data set.
They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process. This is especially true if there are many categorical variables, or if a categorical variable has many categories. In this tutorial, I will show an easy and efficient way to create indicator variables in SAS. I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.
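To make the idea of an indicator variable concrete, here is a minimal sketch written in Python rather than SAS, purely for illustration. The function name `make_indicators` and the product values are made up for this example; it is not the SAS technique that this tutorial goes on to show.

```python
def make_indicators(values):
    """Return one dict of 0/1 indicators per observation, with one key per category."""
    levels = sorted(set(values))  # the distinct categories, in a stable order
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in values
    ]

# Hypothetical categorical variable (made-up values, not read from any SAS data set)
product = ["SOFA", "BED", "SOFA", "TABLE"]

indicator_rows = make_indicators(product)
for value, row in zip(product, indicator_rows):
    print(value, row)
```

Each observation gets exactly one indicator equal to 1, namely the indicator of its own category. Writing this by hand for every category of every variable is the tedious and error-prone part.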

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library. Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately. (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY	STATE	ACTUAL	PREDICT
U.S.A.	California	$987.36	$692.24
U.S.A.	California	$1,782.96	$568.48
U.S.A.	California	$32.64	$16.32
U.S.A.	California	$1,825.12	$756.16
U.S.A.	California	$750.72	$723.52

PRODTYPE	PRODUCT	YEAR	QUARTER	MONTH	MONYR
FURNITURE	SOFA	1995	1	Jan	JAN95
FURNITURE	SOFA	1995	1	Feb	FEB95
FURNITURE	SOFA	1995	1	Mar	MAR95
FURNITURE	SOFA	1995	2	Apr	APR95
FURNITURE	SOFA	1995	2	May	MAY95

Read more of this post

Arnab Chakraborty on Bayes’ Theorem – The Central Equilibrium – Episode 3

December 18, 2017

Arnab Chakraborty kindly came to my new talk show, “The Central Equilibrium”, to talk about Bayes’ theorem. He introduced the concept of conditional probability, stated Bayes’ theorem in its simple and general forms, and showed an example of how to use it in a calculation.

Check it out!

Filed under Applied Statistics, Mathematical Statistics, Probability, Statistics, The Central Equilibrium, Video Tagged with Bayes' Theorem, math, mathematical statistics, mathematics, probability, statistics, The Central Equilibrium

Christopher Salahub on Markov Chains – The Central Equilibrium – Episode 2

September 11, 2017 1 Comment

It was a great pleasure to talk to Christopher Salahub about Markov chains in the second episode of my new talk show, The Central Equilibrium! Chris graduated from the University of Waterloo with a Bachelor of Mathematics degree in statistics. He just finished an internship in data development at Environics Analytics, and he is starting a Master’s program in statistics at ETH Zurich in Switzerland.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics. It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams. Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please watch the video on this blog. You can also watch it directly on YouTube.

Filed under Applied Mathematics, Applied Statistics, Mathematical Statistics, Mathematics, Probability, Statistics, The Central Equilibrium, Video Tagged with chris salahub, christopher salahub, markov, markov chains, math, mathematics, probability, statistics, The Central Equilibrium

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

September 8, 2017

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting. I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems. This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data set. I will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements. To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data = sashelp.baseball
          noprint
     out = bvars (keep = name type);
run;

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, SAS Programming, Statistics Tagged with applied statistics, correlation, correlation coefficient, data analysis, data manipulation, into statement, macro, macro variable, PROC SQL, programming, SAS, sas programming, SQL, statistics

Sorting correlation coefficients by their magnitudes in a SAS macro

March 21, 2017

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables. For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations. Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data. The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations. Since the common factor of $\frac{1}{n - 1}$ appears in both the numerator and the denominator, it cancels out, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model. You can do this by

computing the correlation coefficients
obtaining their absolute values
sorting them by their absolute values.
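The three steps above can be sketched in a few lines. This example is in Python rather than SAS, purely to illustrate the logic; the function name `pearson_r`, the covariate names, and the data values are all made up for this illustration and are not part of the SAS macro in this post.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, using the simplified formula for r_P."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical response and covariates (made-up numbers)
response = [1.0, 2.0, 3.0, 4.0, 5.0]
covariates = {
    "x1": [2.0, 4.0, 6.0, 8.0, 10.0],
    "x2": [5.0, 3.0, 4.0, 1.0, 2.0],
    "x3": [3.0, 1.0, 2.0, 5.0, 4.0],
}

# Step 1: compute the correlation coefficients
correlations = {name: pearson_r(x, response) for name, x in covariates.items()}

# Steps 2 and 3: sort by absolute value, largest first
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

Sorting on the absolute value keeps the sign of each coefficient in the output, so a strong negative correlation (such as the one for `x2` here) still ranks near the top.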

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, Mathematical Statistics, Predictive Modelling, SAS Programming, Statistics, Tutorials Tagged with correlation, macro, pearson correlation, pearson correlation coefficient, predictive modelling, PROC CORR, regression, regression modelling, SAS, sas macro

Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

November 17, 2015

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips. I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance. This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry”.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out! I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series! It is a pleasure to be a guest blogger for JMP!


Filed under Analytical Chemistry, Applied Statistics, Basic Chemistry, Chemistry, Data Analysis, Data Visualization, JMP, Practical Applications of Chemistry, Scientific Applications of Chemistry, Statistics, Tutorials Tagged with analysis of variance, analytical chemistry, ANOVA, chemistry, chips, JMP, potato chips, sample preparation, statistics, sum of squares

Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

November 4, 2015

I am very excited to write again for the official JMP blog as a guest blogger! Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP”. This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into JMP and prepare them for ANOVA. Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.

Stay tuned for “Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry”!


Filed under Analytical Chemistry, Applied Statistics, Basic Chemistry, Chemistry, Data Analysis, JMP, Scientific Applications of Chemistry, Statistics, Statistics in Industry and Practice, Tutorials Tagged with aliquot, aliquots, analysis of variance, ANOVA, chemistry, chips, erlenmeyer flask, JMP, potato chips, sampling, statistics, transpose, transposing data, uncertainty, volumetric flask

Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

August 12, 2015 8 Comments

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history. CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event. The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.” (Emphases added)

[Screenshot: odds of all 15 home teams winning on the same day. Captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.]

Out of curiosity, I wanted to reproduce this result. This event is the intersection of 15 independent events, each of which is a Bernoulli trial in which the home team wins with probability 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{i = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

odds is only applicable to Bernoulli random variables (i.e. binary events)
odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768. The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated. Instead, 0.00003051757 is the probability of all 15 home teams winning. Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.
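The two numbers are close, but they are not the same quantity, and exact arithmetic makes the difference easy to see. The following is a small sketch in Python; the function name `odds_from_probability` is my own and is used here only for illustration.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Odds = probability of success divided by probability of failure."""
    return p / (1 - p)

# 15 independent games, each with probability 1/2 that the home team wins
p_sweep = Fraction(1, 2) ** 15
odds_sweep = odds_from_probability(p_sweep)

print("Probability:", p_sweep, "=", float(p_sweep))
print("Odds:       ", odds_sweep, "=", float(odds_sweep))
```

The probability is exactly 1/32768 (the "1 in 32,768" quoted in the article), whereas the odds are exactly 1/32767, which is usually read as "1 to 32,767".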

This is an example of a common confusion between probability and odds that the media and the general public often make. Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences. Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Mathematical Statistics, Mathematics, Probability, Statistics, Statistics in Industry and Practice, Tutorials Tagged with baseball, math, median, mlb, odds, probability, statistics, statistics communication

Mathematics and Applied Statistics Lesson of the Day – Contrasts

June 19, 2015 3 Comments

A contrast is a linear combination of a set of variables such that the sum of the coefficients is equal to zero. Notationally, consider a set of variables

$\mu_1, \mu_2, ..., \mu_n$.

Then the linear combination

$c_1 \mu_1 + c_2 \mu_2 + ... + c_n \mu_n$

is a contrast if

$c_1 + c_2 + ... + c_n = 0$.

There is a reason why I chose to use $\mu$ as the symbol for the variables in the above notation – in statistics, contrasts provide a very useful framework for comparing multiple population means in hypothesis testing. In a later Statistics Lesson of the Day, I will illustrate some examples of contrasts, especially in the context of experimental design.
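As a small numerical illustration of the definition, the sketch below (in Python) checks whether a set of coefficients forms a contrast and evaluates the linear combination. The function names, the coefficients, and the three group means are hypothetical values chosen only for this example.

```python
def is_contrast(coefficients, tol=1e-12):
    """True if the coefficients sum to zero (within a small numerical tolerance)."""
    return abs(sum(coefficients)) <= tol

def linear_combination(coefficients, means):
    """Evaluate c_1 * mu_1 + c_2 * mu_2 + ... + c_n * mu_n."""
    return sum(c * m for c, m in zip(coefficients, means))

# Compare the first mean against the average of the other two
coefficients = [1.0, -0.5, -0.5]
means = [10.0, 12.0, 14.0]  # hypothetical group means

print(is_contrast(coefficients))                # the coefficients sum to zero
print(linear_combination(coefficients, means))  # 10 - 6 - 7
print(is_contrast([1.0, 1.0, -1.0]))            # sums to 1, so not a contrast
```

Here the contrast compares the first mean with the average of the second and third means; a value of zero would indicate no difference between them.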

Filed under Applied Statistics, Experimental Design, Mathematics, Statistics, Statistics Lesson of the Day Tagged with coefficient, comparison, contrast, experimental design, hypothesis testing, linear combination, treatment mean

The advantages of using count() to get N-way frequency tables as data frames in R

February 12, 2015 5 Comments

Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R. Several commenters provided alternative ways of doing so, and they are all appreciated. Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation. Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
          gear
 cyl      3     4     5
 4        1     8     2
 6        2     4     1
 8        12    0     2
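For comparison, the list format stores one row per combination of the 2 variables together with its frequency. The sketch below shows that layout in Python rather than R, purely to illustrate the idea; the (cyl, gear) pairs are made-up values and are not the real mtcars data.

```python
from collections import Counter

# Hypothetical (cyl, gear) pairs; made-up values, not the real mtcars data
cars = [(4, 4), (4, 4), (6, 4), (8, 3), (8, 3), (8, 5)]

counts = Counter(cars)

# List format: one row per observed (cyl, gear) combination with its frequency
list_format = [(cyl, gear, freq) for (cyl, gear), freq in sorted(counts.items())]

print("cyl gear freq")
for cyl, gear, freq in list_format:
    print(cyl, gear, freq)
```

In this layout, each row is a record that can be filtered, sorted, or merged like any other data, and combinations that never occur in the data simply do not appear.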

Read more of this post

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with count, cross-tabulation, data analysis, frequency table, R, R programming, statistics, table(), xtabs()

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

February 3, 2015 32 Comments

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

Read more of this post

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with categorical variable, class(), count, data frame, factor, frequency table, install.packages(), mtcars, names(), plyr, R, R programming, subset, table()

Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

December 11, 2014

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples. Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis! Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

Filed under Applied Statistics, Data Analysis, Data Visualization, Descriptive Statistics, R programming, Statistics Tagged with 5-number summary, applied statistics, box plot, data analysis, Data Visualization, ecdf(), empirical cumulative distribution function, exploratory data analysis, five-number summary, frequency table, histogram, kernel density estimation, kernel density plot, quantile, quantile-quantile plot, R, R programming, violin plot

Performing Logistic Regression in R and SAS

November 24, 2014 3 Comments

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression. Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far. Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression. I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day. In this post, I will show how to perform logistic regression in both R and SAS. I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression. I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

In R, I imported it using the “XLConnect” package.
In SAS, I imported it using PROC IMPORT.

This data set has 3 variables (I have renamed them for convenience in my R programming).

ha2 – Whether or not a patient had a second heart attack. If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack. This is the response variable.
treatment – Whether or not the patient completed an anger control treatment program.
anxiety – A continuous variable that scores the patient’s anxiety level. A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

Read more of this post

Filed under Applied Statistics, Biostatistics, Categorical Data Analysis, R programming, Statistics, Tutorials Tagged with applied statistics, binary classification, deviance residual, deviance residuals, Excel, fisher scoring, fisher scoring algorithm, logistic regression, null deviance, ods graphics, ods pdf, proc import, proc logistic, R, R programming, regression, residual deviance, SAS, sas programming, statistics

Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

September 5, 2014

In 2 recent Statistics Lessons of the Day, I

introduced Markov’s inequality.
explained the motivation and intuition behind Markov’s inequality.

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq \frac{1}{k^2}$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$. Remarkably, it quantifies just how often $X$ is far away from $\mu$. Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$. It roughly says that the probability of $X$ being at least $k$ standard deviations away from $\mu$ is at most $k^{-2}$. Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

Chebyshev’s inequality applies to any random variable $X$, as long as $E(X)$ and $V(X)$ are finite. (Markov’s inequality requires $X$ to be non-negative and $E(X)$ to be finite.) This is quite a marvelous result!
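To see the bound at work, the sketch below (in Python) compares the exact tail probability with Chebyshev's bound for one simple distribution, a fair six-sided die. The choice of distribution, the function names, and the values of $k$ are my own assumptions for illustration; the inequality itself holds far more generally.

```python
from fractions import Fraction
from math import sqrt

# A fair six-sided die: a simple distribution with finite mean and variance
outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # probability of each face

mu = sum(x * p for x in outcomes)                    # expected value, 7/2
variance = sum((x - mu) ** 2 * p for x in outcomes)  # variance, 35/12
sigma = sqrt(variance)                               # standard deviation

def tail_probability(k):
    """Exact P(|X - mu| >= k * sigma) for the die."""
    return float(sum(p for x in outcomes if abs(x - mu) >= k * sigma))

def chebyshev_bound(k):
    """Upper bound 1 / k^2 from Chebyshev's inequality."""
    return 1 / k ** 2

for k in (1.0, 1.2, 1.5, 2.0):
    print(f"k = {k}: tail = {tail_probability(k):.4f}, bound = {chebyshev_bound(k):.4f}")
```

For every $k$ in this example, the exact tail probability sits below the bound, often by a wide margin. Chebyshev's inequality trades tightness for generality.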

Filed under Applied Statistics, Mathematical Statistics, Mathematics, Probability, Statistics, Statistics Lesson of the Day Tagged with applied statistics, Chebyshev's inequality, expectation, expected value, finite expected value, finite variance, Markov's inequality, math, mathematical statistics, probability, standard deviation, statistics, variance
