## A macro to automate the creation of indicator variables in SAS

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS.  This method ostensibly runs a logistic regression, but it really uses PROC LOGISTIC just to obtain the design matrix based on dummy-variable coding.  I shared SAS code for doing so, step by step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line.  If you have not read my previous post on this topic, then I strongly encourage you to do that first.  Don’t use this macro blindly.

Here is the macro.  The key steps are

1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
2. Merge the original data with the newly created indicator variables
3. Delete the “INDICATORS” data set, which was created in an intermediate step
```sas
%macro create_indicators(input_data, target, covariates, output_data);

   /* Step 1: Run PROC LOGISTIC to get the design matrix,
      which contains the indicator variables. */
   proc logistic
        data = &input_data
        noprint
        outdesign = indicators;
        class &covariates / param = glm;
        model &target = &covariates;
   run;

   /* Step 2: Merge the original data with the newly created
      indicator variables. */
   data &output_data;
        merge &input_data
              indicators (drop = Intercept &target);
   run;

   /* Step 3: Delete the intermediate INDICATORS data set. */
   proc datasets
        library = work
        nolist;
        delete indicators;
   run;
   quit;

%mend create_indicators;
```

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro.  As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables.  I will do that here for the variables TYPE, MAKE, and ORIGIN.
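For readers outside SAS, the effect of dummy-variable coding can be cross-checked with a short Python sketch.  This is my own illustration, not part of the original post: it uses `pandas.get_dummies` on a small hand-made data frame (SASHELP.CARS itself is not available outside SAS), producing one indicator column per level of each categorical variable, analogous to what OUTDESIGN= writes out.

```python
import pandas as pd

# A small hand-made stand-in for a data set with categorical columns.
cars = pd.DataFrame({
    "Type":   ["Sedan", "SUV", "Sedan", "Sports"],
    "Origin": ["Asia", "USA", "Europe", "USA"],
    "MSRP":   [21000, 35000, 42000, 55000],
})

# One indicator (dummy) column per level of each categorical variable,
# analogous to the design matrix produced by dummy-variable coding.
indicators = pd.get_dummies(cars, columns=["Type", "Origin"], dtype=int)
print(indicators.columns.tolist())
```

Each original row keeps its non-categorical columns (here, MSRP) and gains a 0/1 column per category level, which is exactly the structure the macro's output data set has.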

## An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

#### Introduction

In statistics and biostatistics, the creation of binary indicator variables is a very useful practice.

• They can be useful predictor variables in statistical models.
• They can reduce the amount of memory required to store the data set.
• They allow a categorical covariate to be treated as a set of numerical covariates in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.  In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

#### The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to width constraints, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS yourself.)

| COUNTRY | STATE | COUNTY | ACTUAL | PREDICT |
|---------|-------|--------|--------|---------|
| U.S.A. | California | | $987.36 | $692.24 |
| U.S.A. | California | | $1,782.96 | $568.48 |
| U.S.A. | California | | $32.64 | $16.32 |
| U.S.A. | California | | $1,825.12 | $756.16 |
| U.S.A. | California | | $750.72 | $723.52 |

| PRODTYPE | PRODUCT | YEAR | QUARTER | MONTH | MONYR |
|----------|---------|------|---------|-------|-------|
| FURNITURE | SOFA | 1995 | 1 | Jan | JAN95 |
| FURNITURE | SOFA | 1995 | 1 | Feb | FEB95 |
| FURNITURE | SOFA | 1995 | 1 | Mar | MAR95 |
| FURNITURE | SOFA | 1995 | 2 | Apr | APR95 |
| FURNITURE | SOFA | 1995 | 2 | May | MAY95 |

## Performing Logistic Regression in R and SAS

#### Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

#### The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!
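In the meantime, here is a minimal sketch of fitting such a model by Newton-Raphson maximum likelihood.  Everything in it is my own assumption for illustration: the data are simulated (not the post's actual heart attack.xlsx data), and the hand-rolled Newton-Raphson loop is one standard way to fit logistic regression, not necessarily what the post's R and SAS scripts do.

```python
import numpy as np

# Simulated stand-in data (NOT the post's actual heart attack.xlsx data):
# ha2 = second heart attack (0/1), treatment = program completion (0/1),
# anxiety = continuous anxiety score.
rng = np.random.default_rng(0)
n = 200
treatment = rng.integers(0, 2, size=n)
anxiety = rng.normal(50, 10, size=n)
true_logit = -4.0 - 1.0 * treatment + 0.1 * anxiety   # assumed true model
ha2 = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), treatment, anxiety])
beta = np.zeros(3)

# Newton-Raphson maximum likelihood for logistic regression
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))            # fitted probabilities
    score = X.T @ (ha2 - mu)                      # gradient of the log-likelihood
    info = X.T @ (X * (mu * (1 - mu))[:, None])   # Fisher information
    beta += np.linalg.solve(info, score)

print(beta)  # estimates of (intercept, treatment, anxiety) coefficients
```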

## Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

• its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

• its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$
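In code, these two definitions are just ratios of confusion-matrix counts.  Here is a small Python sketch with hypothetical counts (not from any particular classifier):

```python
def ppv(tp, fp):
    """Positive predictive value: fraction of positive calls that are truly positive."""
    return tp / (tp + fp)

def npv(tn, fn):
    """Negative predictive value: fraction of negative calls that are truly negative."""
    return tn / (tn + fn)

# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, tn, fn = 40, 10, 80, 20
print(ppv(tp, fp))  # 40 / 50 = 0.8
print(npv(tn, fn))  # 80 / 100 = 0.8
```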

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

## Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals.  (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.)  Today, I want to share a second video on calculating expected counts – this time, using joint probabilities.  This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table.  Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multiplied by the grand total of the entire sample.  This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.
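As a small numerical sketch of this method (with a hypothetical 2×2 table of my own, not the table from the video): under independence, the estimated joint probability of cell (i, j) is the product of the estimated marginal probabilities, and the expected count is that joint probability times the grand total.

```python
# Hypothetical 2x2 contingency table of observed counts
observed = [[30, 20],
            [10, 40]]

grand_total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(2)) for j in range(2)]

expected = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        # Estimated joint probability under independence:
        # product of the estimated marginal probabilities.
        p_hat = (row_totals[i] / grand_total) * (col_totals[j] / grand_total)
        # Expected count = joint probability * grand total.
        expected[i][j] = p_hat * grand_total

print(expected)  # expected counts: roughly 20, 30 / 20, 30
```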

I encourage you to watch both of my videos on expected counts in my YouTube channel to gain a deeper understanding of how and why they can be calculated.  Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.

## Video Tutorial – Allelic Frequencies Remain Constant From Generation to Generation Under the Hardy-Weinberg Equilibrium

The Hardy-Weinberg law is a fundamental principle in statistical genetics.  If its 7 assumptions are fulfilled, then it predicts that the allelic frequency of a genetic trait will remain constant from generation to generation.  In this new video tutorial in my YouTube channel, I explain the math behind the Hardy-Weinberg theorem.  In particular, I clarify the origin of the connection between allelic frequencies and genotypic frequencies in the second generation – I have not found a single textbook or web site on this topic that explains this calculation, so I hope that my explanation is helpful to you.
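A one-generation check of this prediction can be sketched numerically.  This is my own illustration with an assumed allele frequency of 0.7: under random mating, the offspring genotype frequencies are $p^2$, $2pq$, and $q^2$, and the allele frequency recovered from them equals the parental one.

```python
p = 0.7          # frequency of allele A in the parental generation (assumed)
q = 1 - p        # frequency of allele a

# Genotype frequencies in the offspring under random mating
AA = p ** 2
Aa = 2 * p * q
aa = q ** 2

# Allele A frequency in the offspring: all of AA's alleles plus half of Aa's.
# Algebraically, p^2 + pq = p(p + q) = p, so the frequency is unchanged.
p_next = AA + 0.5 * Aa
print(p_next)
```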

## Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial in my YouTube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

## Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves

A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the $\text{Sensitivity}$ (the true positive rate) versus $1 - \text{Specificity}$ (1 minus the true negative rate) of a binary classifier while varying its discrimination threshold.  In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy of the logistic regression model.

To illustrate with an example, let’s consider the Bernoulli response variable $Y$ and the covariates $X_1, X_2, ..., X_p$.  A logistic regression model takes the covariates as inputs and returns $P(Y = 1)$.  You, as the user of the model, must decide the value of $P(Y = 1)$ above which you will predict that $Y = 1$; this value is the discrimination threshold.  A common threshold is $P(Y = 1) = 0.5$.

Once you finish fitting the model with a training set, you can construct an ROC curve by following these steps below:

1. Set a discrimination threshold.
2. Use the covariates to predict $Y$ for each observation in a validation set.
3. Since you have the actual response values in the validation set, you can then calculate the sensitivity and specificity for your logistic regression model at that threshold.
4. Repeat Steps 1-3 with a new threshold.
5. Plot the values of $\text{Sensitivity}$ versus $1 - \text{Specificity}$ for all thresholds.  The result is your ROC curve.

The use of a validation set to assess the predictive accuracy of a model is called validation, and it is a good practice for supervised learning.  If you have another fresh data set, it is also good practice to use that as a test set to assess the predictive accuracy of your model.

Note that you can perform Steps 2-5 for the training set, too – this is often done in statistics when you don’t have many data to work with, and the best that you can do is to assess the predictive accuracy of your model on the data set that you used to fit the model.
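The steps above can be sketched in Python.  This is my own illustration: the fitted probabilities and labels are hypothetical values standing in for a logistic regression model's output on a validation set.

```python
# Hypothetical fitted probabilities and true labels from a validation set
probs  = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   0,   1,   1]

roc_points = []
for threshold in [0.0, 0.25, 0.5, 0.75, 1.01]:           # Steps 1 and 4: vary the threshold
    preds = [1 if p >= threshold else 0 for p in probs]  # Step 2: predict Y
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    sensitivity = tp / (tp + fn)                         # Step 3
    specificity = tn / (tn + fp)
    roc_points.append((1 - specificity, sensitivity))    # Step 5: one ROC point

print(roc_points)  # plot these (x, y) pairs to draw the ROC curve
```

Note that the extreme thresholds give the curve's two anchor points: predicting everything positive yields (1, 1), and predicting everything negative yields (0, 0).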

## Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)

I first started my video tutorial series on survival analysis by defining the hazard function.  I then explained how this definition leads to the elegant relationship of

$h(t) = f(t) \div S(t)$.

In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation.  Each relationship allows one quantity to be written as a function of the other.
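For reference, three of these standard relationships (writing $f(t)$ for the density and $S(t) = 1 - F(t)$ for the survival function) are

$f(t) = -\frac{dS(t)}{dt}$

$h(t) = -\frac{d}{dt} \ln S(t)$

$S(t) = \exp \left( -\int_0^t h(u) \, du \right)$

Combining the first two recovers the relationship $h(t) = f(t) \div S(t)$ above.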

I am excited to continue adding to my YouTube channel’s collection of video tutorials.  Please stay tuned for more!

## Presentation on Statistical Genetics at Vancouver SAS User Group – Wednesday, May 28, 2014

I am excited and delighted to be invited to present at the Vancouver SAS User Group’s next meeting.  I will provide an introduction to statistical genetics; specifically, I will

• define basic terminology in genetics
• explain the Hardy-Weinberg equilibrium in detail
• illustrate how Pearson’s chi-squared goodness-of-fit test can be used in PROC FREQ in SAS to check the Hardy-Weinberg equilibrium
• illustrate how the Newton-Raphson algorithm can be used for maximum likelihood estimation in PROC IML in SAS

You can register for this meeting here.  The meeting’s coordinates are

9:00am – 3:00pm

Wednesday, May 28th, 2014

BC Cancer Agency Research Centre

675 West 10th Avenue.

Vancouver, BC

If you will attend this meeting, please feel free to come up and say “Hello!”.  I look forward to meeting you!

## Machine Learning and Applied Statistics Lesson of the Day – Sensitivity and Specificity

To evaluate the predictive accuracy of a binary classifier, two useful (but imperfect) criteria are sensitivity and specificity.

Sensitivity is the proportion of truly positive cases that were classified as positive; thus, it is a measure of how well your classifier identifies positive cases.  It is also known as the true positive rate.  Formally,

$\text{Sensitivity} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives + Number of False Negatives)}$

Specificity is the proportion of truly negative cases that were classified as negative; thus, it is a measure of how well your classifier identifies negative cases.  It is also known as the true negative rate.  Formally,

$\text{Specificity} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives + Number of False Positives)}$
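Written as code, the two definitions are simple ratios of confusion-matrix counts.  Here is a minimal Python sketch with hypothetical counts of my own choosing:

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of truly positive cases classified as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of truly negative cases classified as negative."""
    return tn / (tn + fp)

# Hypothetical counts: 45 of 50 positives caught, 90 of 100 negatives cleared
print(sensitivity(45, 5))    # 0.9
print(specificity(90, 10))   # 0.9
```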

## Video Tutorial: Breaking Down the Definition of the Hazard Function

The hazard function is a fundamental quantity in survival analysis.  For an event occurring at some time on a continuous time scale, the hazard function, $h(t)$, for that event is defined as

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$,

where

• $t$ is the time,
• $X$ is the time of the occurrence of the event.

However, what does this actually mean?  In this YouTube video, I break down the mathematics of this definition into its individual components and explain the intuition behind each component.
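One way to build intuition for this limit is to evaluate the conditional-probability quotient numerically for a distribution whose hazard is known.  For an exponential lifetime with rate $\lambda$, $S(t) = e^{-\lambda t}$ and the hazard is the constant $\lambda$ at every time $t$.  The following sketch is my own numerical check (not from the video):

```python
import math

# Exponential lifetime with rate lam: S(t) = exp(-lam * t),
# so the hazard should work out to the constant lam at any time t.
lam = 0.5
t = 2.0

def surv(x):
    return math.exp(-lam * x)

# Approximate the limit in the definition with a small time increment dt:
# P(t < X <= t + dt | X > t) / dt = (S(t) - S(t + dt)) / (S(t) * dt)
dt = 1e-6
hazard_approx = (surv(t) - surv(t + dt)) / (surv(t) * dt)
print(hazard_approx)  # approximately 0.5
```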

I am very excited about the release of this first video in my new Youtube channel!  This is yet another mode of expansion of The Chemical Statistician since the beginning of 2014.  As always, your comments are most appreciated!