Categorical Data Analysis | The Chemical Statistician

A macro to automate the creation of indicator variables in SAS

April 25, 2018

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS. This method nominally runs a logistic regression, but its real purpose is to have PROC LOGISTIC output the design matrix based on dummy-variable coding. I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line. If you have not read my previous post on this topic, then I strongly encourage you to do that first. Don’t use this macro blindly.

Here is the macro. The key steps are:

Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
Merge the original data with the newly created indicator variables
Delete the “INDICATORS” data set, which was created in an intermediate step

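%macro create_indicators(input_data, target, covariates, output_data);

/* Step 1: use PROC LOGISTIC only to write the design matrix to INDICATORS */
proc logistic
     data = &input_data
          noprint
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;
run;


/* Step 2: one-to-one merge (no BY statement) of the original data
   with the indicator variables, matched by row position */
data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);
run;


/* Step 3: delete the intermediate INDICATORS data set */
proc datasets
     library = work
          noprint;
     delete indicators;
run;
quit;

%mend create_indicators;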

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro. As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables. I will do that here for the variables TYPE, MAKE, and ORIGIN.


An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

April 24, 2018

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

They can be useful predictor variables in statistical models.
They can reduce the amount of memory required to store the data set.
They allow a categorical covariate to enter a regression model as numeric (0/1) covariates, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process. This is especially true if there are many categorical variables, or if a categorical variable has many categories. In this tutorial, I will show an easy and efficient way to create indicator variables in SAS. I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.
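For readers who also work in R, here is a minimal sketch of the same idea: letting the software build the design matrix instead of coding each indicator by hand. It is not the SAS technique from the usage note; the data frame `toy` and its values are hypothetical and exist only for illustration.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  product = factor(c("SOFA", "BED", "TABLE", "SOFA", "BED"))
)

# "~ 0 + product" drops the intercept, so every level of the factor
# gets its own 0/1 indicator column in the design matrix.
indicators <- model.matrix(~ 0 + product, data = toy)

# Attach the indicator columns to the original data.
toy_with_indicators <- cbind(toy, indicators)
print(toy_with_indicators)
```

Each row has exactly one indicator equal to 1, which is a quick sanity check worth running on any automatically generated set of indicators.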

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library. Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately. (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY	STATE	ACTUAL	PREDICT
U.S.A.	California	$987.36	$692.24
U.S.A.	California	$1,782.96	$568.48
U.S.A.	California	$32.64	$16.32
U.S.A.	California	$1,825.12	$756.16
U.S.A.	California	$750.72	$723.52

PRODTYPE	PRODUCT	YEAR	QUARTER	MONTH	MONYR
FURNITURE	SOFA	1995	1	Jan	JAN95
FURNITURE	SOFA	1995	1	Feb	FEB95
FURNITURE	SOFA	1995	1	Mar	MAR95
FURNITURE	SOFA	1995	2	Apr	APR95
FURNITURE	SOFA	1995	2	May	MAY95


Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

August 16, 2017

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected. Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
     set sashelp.class;
 
     * define a new character variable to classify someone as tall or short;
     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';
run;


* print the results for the first 5 rows;
proc print
     data = c1 (obs = 5);
run;

Here is the result:

Obs	Name	Sex	Age	Height	Weight	height_class
1	Alfred	M	14	69.0	112.5	Tall
2	Alice	F	13	56.5	84.0	Shor
3	Barbara	F	13	65.3	98.0	Tall
4	Carol	F	14	62.8	102.5	Tall
5	Henry	M	14	63.5	102.5	Tall

What happened? Why does the word “Short” render as “Shor”?
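For the comparison to R that the title promises, here is a minimal sketch of the same classification in R. It uses the first three heights from the output above; the variable names are my own. R character vectors do not have a fixed width, so no truncation should occur.

```r
# The first three heights from the SASHELP.CLASS output shown above.
height <- c(69.0, 56.5, 65.3)

# Classify each height as tall or short.
height_class <- ifelse(height > 60, "Tall", "Short")

print(height_class)

# Number of characters stored for each value.
print(nchar(height_class))
```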


Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

August 12, 2015

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history. CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event. The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.” (Emphases added)

[Screenshot of the article: odds of all 15 home teams winning on same day. Captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.]

Out of curiosity, I wanted to reproduce this result. This event is the intersection of 15 independent events, each of which is the success of a Bernoulli random variable with the probability of the home team winning being 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{i = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

odds is only applicable to Bernoulli random variables (i.e. binary events)
odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768. The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated. Instead, 0.00003051757 is the probability of all 15 home teams winning. Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.
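The arithmetic above can be checked with a short R sketch; the variable names are my own. It assumes, as the article does, that each game is an independent 50-50 proposition.

```r
# Probability that all 15 home teams win, assuming independent 50-50 games.
p_all_home_win <- 0.5^15

# Odds = probability of success divided by probability of failure.
odds_all_home_win <- p_all_home_win / (1 - p_all_home_win)

print(format(p_all_home_win, digits = 10))
print(format(odds_all_home_win, digits = 10))

# Reciprocals: "1 in ..." for the probability vs. "1 to ..." for the odds.
print(1 / p_all_home_win)
print(1 / odds_all_home_win)
```

The reciprocal of the probability is 32,768, while the reciprocal of the odds is 32,767, so "1 in 32,768" describes the probability.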

This is an example of a common confusion between probability and odds that the media and the general public often make. Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences. Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.


The advantages of using count() to get N-way frequency tables as data frames in R

February 12, 2015

Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R. Several commenters provided alternative ways of doing so, and they are all appreciated. Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a table of the counts of the observations in a data set, cross-classified by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation. Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
          gear
 cyl      3     4     5
 4        1     8     2
 6        2     4     1
 8        12    0     2
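To make the contrast with the list format concrete, here is a minimal sketch that reshapes the cross-tabulation above into one row per combination of "cyl" and "gear". It uses as.data.frame() from Base R purely as an illustration; it is not the count() approach that this post advocates.

```r
# Cross-tabulated format, as above.
y <- xtabs(~ cyl + gear, data = mtcars)

# List format: one row per (cyl, gear) combination, with the count in Freq.
y_list <- as.data.frame(y)
print(y_list)
```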


How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

February 3, 2015

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3
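As a quick check of the problem described in the introduction, the sketch below inspects what table() returns for "gear". The conversion with as.data.frame() at the end is only one possible Base R workaround, shown under my own variable names; it is not the count() method that this post goes on to demonstrate.

```r
# Frequency table of the number of forward gears.
gear_table <- table(mtcars$gear)

# The result is an object of class "table", not a data frame.
print(class(gear_table))
print(is.data.frame(gear_table))

# One possible Base R conversion (an alternative, not the plyr approach).
gear_df <- as.data.frame(gear_table, stringsAsFactors = FALSE)
names(gear_df) <- c("gear", "freq")
print(gear_df)
```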


Performing Logistic Regression in R and SAS

November 24, 2014

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression. Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far. Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression. I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day. In this post, I will show how to perform logistic regression in both R and SAS. I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression. I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

In R, I imported it using the “XLConnect” package.
In SAS, I imported it using PROC IMPORT.

This data set has 3 variables (I have renamed them for convenience in my R programming).

ha2 – Whether or not a patient had a second heart attack. If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack. This is the response variable.
treatment – Whether or not the patient completed an anger control treatment program.
anxiety – A continuous variable that scores the patient’s anxiety level. A higher score denotes higher anxiety.
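To show how these 3 variables fit together in a model call, here is a minimal sketch in R. The data are simulated, not the heart attack data from the spreadsheet; the sample size, the coefficients used in the simulation, and the data frame name `heart` are all assumptions made only so that the example runs on its own.

```r
# SIMULATED data for illustration only; this is NOT the real data set.
set.seed(1)
n <- 100
heart <- data.frame(
  treatment = rbinom(n, 1, 0.5),                  # 1 = completed the program
  anxiety   = round(rnorm(n, mean = 60, sd = 10)) # higher = more anxious
)

# Simulate the response from an assumed logistic model.
log_odds <- -6 - 1 * heart$treatment + 0.1 * heart$anxiety
heart$ha2 <- rbinom(n, 1, plogis(log_odds))

# Logistic regression of ha2 on treatment and anxiety.
fit <- glm(ha2 ~ treatment + anxiety, family = binomial, data = heart)
print(coef(fit))
```

The estimated coefficients are on the log-odds scale; because the data are simulated, the numbers themselves carry no meaning beyond illustrating the syntax.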

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!


The Chi-Squared Test of Independence – An Example in Both R and SAS

August 25, 2014

Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given 2 categorical random variables, $X$ and $Y$, the chi-squared test of independence assesses whether or not there is a statistical dependence between them. Formally, it is a hypothesis test with the following null and alternative hypotheses:

$H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y$

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities. For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.
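As a rough preview of what such a test looks like in R, here is a minimal sketch using chisq.test() from Base R. The 2-by-2 table of counts is hypothetical and is not the example from the videos; the continuity correction is switched off so that the statistic matches the textbook formula.

```r
# Hypothetical observed counts for 2 categorical variables, X and Y.
counts <- matrix(c(30, 20,
                   10, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))

# Chi-squared test of independence without Yates' continuity correction.
test <- chisq.test(counts, correct = FALSE)

print(test$expected)   # expected counts under the null hypothesis
print(test$statistic)  # chi-squared test statistic
print(test$parameter)  # number of degrees of freedom
print(test$p.value)
```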


Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

August 7, 2014

For a binary classifier,

its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$
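The 2 formulas above can be sketched as a small R function. The function name `predictive_values` and the counts passed to it are hypothetical and serve only as an illustration.

```r
# Compute PPV and NPV from the 4 cells of a confusion matrix.
predictive_values <- function(tp, fp, tn, fn) {
  c(PPV = tp / (tp + fp),   # true positives among all positive calls
    NPV = tn / (tn + fn))   # true negatives among all negative calls
}

# Hypothetical counts, invented for illustration.
result <- predictive_values(tp = 40, fp = 10, tn = 45, fn = 5)
print(result)
```

With these counts, 40 of the 50 positively classified cases are truly positive and 45 of the 50 negatively classified cases are truly negative.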

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier. Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)


Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

August 4, 2014

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals. (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.) Today, I want to share a second video of calculating expected counts – this time, using joint probabilities. This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table. Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multiplied by the grand total of the entire sample. This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.

I encourage you to watch both of my videos on expected counts in my YouTube channel to gain a deeper understanding of how and why they can be calculated. Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.
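For readers who want to try both orders of calculation, here is a minimal sketch in R. The table of counts is hypothetical and is not the example from the videos. Without intermediate rounding, the 2 methods should agree up to floating-point error.

```r
# Hypothetical 2-by-3 contingency table of observed counts.
counts <- matrix(c(12, 18, 30,
                    8, 22, 10),
                 nrow = 2, byrow = TRUE)
n <- sum(counts)

# Method 1: estimate the joint probabilities under independence as
# products of marginal proportions, then multiply by the grand total.
row_prop <- rowSums(counts) / n
col_prop <- colSums(counts) / n
joint_prob <- outer(row_prop, col_prop)
expected_joint <- joint_prob * n

# Method 2: row total times column total, divided by the grand total.
expected_marginal <- outer(rowSums(counts), colSums(counts)) / n

print(expected_joint)
print(all.equal(expected_joint, expected_marginal))
```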


Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

July 7, 2014

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables. The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily. In this video tutorial in my YouTube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals. In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.
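As a written companion to the video, the calculation can be summarized by the standard formula below. The notation is my own shorthand (row $i$, column $j$) and is assumed rather than taken from the video.

```latex
$\text{Expected Count}_{ij} = \text{Row Total}_i \ \times \ (\text{Column Total}_j \ \div \ \text{Grand Total})$
```

Here, $\text{Column Total}_j \div \text{Grand Total}$ is the marginal proportion of column $j$, so the expected count is that proportion applied to the marginal total of row $i$.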

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

