Video Tutorial: Naive Bayes Classifiers

Naive Bayes classifiers are simple but powerful tools for classification in statistics and machine learning.  In this video tutorial, I use a simulated data set and illustrate the mathematical details of how this technique works.

In my recent episode on The Central Equilibrium about word embeddings and text classification, Mandy Gu used naive Bayes classifiers to determine if a sentence is toxic or non-toxic – a very common objective when moderating discussions in online forums.  If you are not familiar with naive Bayes classifiers, then I encourage you to watch this video before watching Mandy’s episode on The Central Equilibrium.

Mandy Gu on Word Embeddings and Text Classification – The Central Equilibrium – Episode 9

I am so grateful to Mandy Gu for being a guest on The Central Equilibrium to talk about word embeddings and text classification.  She began by showing how data from text can be encoded in vectors and matrices, and then she used a naive Bayes classifier to classify sentences as toxic or non-toxic – a very common problem for moderating discussions in online forums.  I learned a lot from her in this episode, and you can learn more from Mandy on her Medium blog.

If you are not familiar with naive Bayes classifiers, then I encourage you to watch my video tutorial about this topic first.

Some SAS procedures (like PROC REG, GLM, ANOVA, SQL, and IML) end with “QUIT;”, not “RUN;”

Most SAS procedures require the

RUN;

statement to signal their termination.  However, there are some notable exceptions to this.

I have written about PROC SQL many times on my blog, and this procedure requires the

QUIT;

statement instead.

It turns out that there is another set of statistical procedures that require the QUIT statement, and some of them are very common.  They are called interactive procedures, and they include PROC REG, PROC GLM, and PROC ANOVA.  If you end them with RUN rather than QUIT, then you will run into problems with displaying further output.  For example, if you try to output a data set from one such PROC and end it with the RUN statement, then you will get this error message:

ERROR: You cannot open WORK.MYDATA.DATA for input access with record-level
control because WORK.MYDATA.DATA is in use by you in resource environment
REG.

WORK.MYDATA cannot be opened.

You will also notice that the Program Editor says “PROC … running” in its banner when you end such a PROC with RUN rather than QUIT.

I don’t like this exception, but, alas, it does exist.  You can find out more about these interactive procedures in SAS Usage Note #37105.  As this note says, the ANOVA, ARIMA, CATMOD, FACTEX, GLM, MODEL, OPTEX, PLAN, and REG procedures are interactive procedures, and they all require the QUIT statement for termination.

PROC IML is not mentioned in that usage note, but this procedure also requires the QUIT statement.  Rick Wicklin has written an article about this on his blog, The DO Loop.
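
As a minimal sketch of the point above, here is an interactive procedure that is ended with both RUN and QUIT.  The choice of SASHELP.CLASS, the model, and the output data set name REG_OUT are assumptions made only for illustration; they do not come from the usage note.

```sas
* fit a simple regression and save the predicted values;
* (REG_OUT and PRED_WEIGHT are hypothetical names for this sketch);
proc reg data = sashelp.class;
   model Weight = Height;
   output out = reg_out predicted = pred_weight;
run;
quit;

* PROC REG has now terminated, so REG_OUT can be read by other steps;
proc print data = reg_out (obs = 5);
run;
```

The RUN statement executes the statements submitted so far, while QUIT is what actually ends the procedure.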

Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observations.  However, there is a feature of PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
input Name $ Age Height;
datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

proc sort
data = heights
nouniquekey;
by Name;
run;

proc print
data = heights;
run;

Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT statement.  If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

Highlighting cells to quickly view Average, Count, and Sum in Excel

I recently needed to check my answers after some data analysis in Alteryx.  I computed many averages using a formula in Alteryx, and I wanted to check those results by calculating the average for a few randomly selected rows.  I did this by invoking a helpful tool in Microsoft Excel.

I will illustrate this functionality with some random data.  In Column E, I used a formula to calculate the average of the 3 populations in Columns B, C, and D.  To manually check that the formula is correct, I highlighted the 3 columns for ID #125.  On the bottom right, Excel calculates the average; it’s difficult to see in the picture below, but Excel confirms that the average is 707,154.

Whenever you highlight a range of cells containing numeric data, Excel will provide the average, count, and sum of the selected cells.  I did not know about this functionality when I first began working as a statistician, and I am very glad that I did learn it eventually – it is very useful for checking answers by analyzing a few randomly selected rows in Excel!

A macro to execute PROC TTEST for multiple binary grouping variables in SAS (and sorting t-test statistics by their absolute values)

In SAS, you can perform PROC TTEST for multiple numeric variables in the same procedure.
Here is an example using the built-in data set SASHELP.BASEBALL; I will compare the number of at-bats and number of walks between the American League and the National League.

proc ttest
data = sashelp.baseball;
class League;
var nAtBat nBB;
ods select ttests;
run;

Here are the resulting tables.

Method         Variances  DF      t Value  Pr > |t|
Pooled         Equal      320     2.05     0.0410
Satterthwaite  Unequal    313.66  2.06     0.04

Method         Variances  DF      t Value  Pr > |t|
Pooled         Equal      320     0.85     0.3940
Satterthwaite  Unequal    319.53  0.86     0.3884

What if you want to perform PROC TTEST for multiple grouping (a.k.a. classification) variables?  You cannot put more than one variable in the CLASS statement, so you would have to run PROC TTEST separately for each binary grouping variable.  If you do put LEAGUE and DIVISION in the same CLASS statement, here is the resulting log.

1303 proc ttest
1304 data = sashelp.baseball;
1305 class league division;
                  --------
                  22
                  202
ERROR 22-322: Expecting ;.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
1306 var natbat;
1307 ods select ttests;
1308 run;

There is no syntax in PROC TTEST to use multiple grouping variables at the same time, so this tutorial provides a macro to do so.  There are several nice features about my macro:

1. It allows you to use multiple grouping variables at the same time.
2. It sorts the t-test statistics by their absolute values within each grouping variable.
3. It shows the name of each continuous variable in the output table, unlike the above output.

Here is its basic skeleton.

Use ODS EXCLUDE ALL to suppress printing output in SAS while producing output data sets

I regularly produce output data sets from a SAS procedure, such as getting the variable names from a data set in PROC CONTENTS.  In these instances, I often wish to suppress any printing of the output in HTML or TXT.  Such printing of the results is often unnecessary, and it can cost a lot of time and memory.

Some SAS procedures have the NOPRINT option that suppresses the printing of output, but this is limiting in several ways:

1. Some SAS procedures do NOT have the NOPRINT option.  PROC TTEST is a prominent example.  I checked the high-performance procedures like PROC HPFOREST (random forest) and PROC HPSVM (support vector machine), and I could not find the NOPRINT option for these procedures.
2. I cannot use ODS OUTPUT to produce output data sets while invoking the NOPRINT option.  Here is an example.

A macro to automate the creation of indicator variables in SAS

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS.  This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding.  I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line.  If you have not read my previous post on this topic, then I strongly encourage you to do that first.  Don’t use this macro blindly.

Here is the macro.  The key steps are

1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
2. Merge the original data with the newly created indicator variables
3. Delete the “INDICATORS” data set, which was created in an intermediate step

%macro create_indicators(input_data, target, covariates, output_data);

proc logistic
data = &input_data
noprint
outdesign = indicators;
class &covariates / param = glm;
model &target = &covariates;
run;

data &output_data;
merge &input_data
indicators (drop = Intercept &target);
run;

proc datasets
library = work
noprint;
delete indicators;
run;

%mend;

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro.  As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables.  I will do that here for the variables TYPE, MAKE, and ORIGIN.

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

• They can be useful predictor variables in statistical models.
• They can reduce the amount of memory required to store the data set.
• They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.

In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY  STATE       COUNTY  ACTUAL     PREDICT
U.S.A.   California          $987.36    $692.24
U.S.A.   California          $1,782.96  $568.48
U.S.A.   California          $32.64     $16.32
U.S.A.   California          $1,825.12  $756.16
U.S.A.   California          $750.72    $723.52

PRODTYPE   PRODUCT  YEAR  QUARTER  MONTH  MONYR
FURNITURE  SOFA     1995  1        Jan    JAN95
FURNITURE  SOFA     1995  1        Feb    FEB95
FURNITURE  SOFA     1995  1        Mar    MAR95
FURNITURE  SOFA     1995  2        Apr    APR95
FURNITURE  SOFA     1995  2        May    MAY95

Sort a data set by ascending or descending variables using PROC SORT in SAS

Consider the built-in data set SASHELP.CLASS in SAS.  Here are the first 5 observations from PROC PRINT.

Obs Name Sex Age Height Weight
1 Joyce F 11 51.3 50.5
2 Thomas M 11 57.5 85.0
3 James M 12 57.3 83.0
4 Jane F 12 59.8 84.5
5 John M 12 59.0 99.5

As you can clearly see, they are NOT sorted by weight.
Here is how you can sort the data set by weight using PROC SORT.

Video Tutorial – Obtaining the Expected Value of the Exponential Distribution Using the Moment Generating Function

In this video tutorial on YouTube, I use the exponential distribution’s moment generating function (MGF) to obtain the expected value of this distribution.  Visit my YouTube channel to watch more video tutorials!

Remove leading blanks when creating macro variables using PROC SQL in SAS

I regularly use PROC SQL to create macro variables in SAS, and I recently noticed a strange phenomenon when resolving a macro variable within double quotation marks in the title of a plot.  Thankfully, I was able to replicate this problem using the SASHELP.BASEBALL data set, which is publicly available.  I was then able to send the code and the strange result to SAS Technical Support for their examination.

proc sql;
select count(name)
into :hitters_100plusHR
from sashelp.baseball
where CrHome > 100;
quit;

proc sgplot
data = sashelp.baseball;
histogram Salary;
title1 'Distribution of salaries';
title2 "Restricted to the &hitters_100plusHR hitters with more than 100 career home runs";
run;

Here is the resulting plot.  Notice the extra spaces before “72” in the title of the plot.

SAS Technical Support informed me that

• this problem is commonly known.
• there is no way of predicting when it will occur.
• for now, the best way to deal with it is to remove the leading blanks using one of several ways.

Use unique() instead of levels() to find the possible values of a factor in R

*In a previous version of this blog post, I incorrectly wrote that “Species” is a character variable.  Instead, it is a factor.  I thank the readers who corrected me in the comments.

When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable.  However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its factor “Species”.  Here are the possible values of “Species”, as shown by the levels() function.

> levels(iris$Species)

[1] "setosa" "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”.  I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

> iris2 = subset(iris, Species != 'setosa')
> table(iris2$Species)

setosa versicolor virginica
0 50 50

> levels(iris2$Species)

[1] "setosa" "versicolor" "virginica"

Video Tutorial – The Moment Generating Function of the Exponential Distribution

In this video tutorial on YouTube, I derive the moment generating function (MGF) of the exponential distribution.  Visit my YouTube channel to watch more video tutorials!

A SAS macro to automatically label variables using another data set

Introduction

When I write SAS programs, I usually export the analytical results into an output that a client will read.  I often cannot show the original variable names in these outputs; there are 2 reasons for this:

• The maximal length of a SAS variable’s name is 32 characters, whereas the description of the variable can be much longer.  This is the case for my current job in marketing analytics.
• Only letters, numbers, and underscores are allowed in a SAS variable’s name.  Spaces and special characters are not allowed.  Thus, if a variable’s name is quite long and complicated to describe, then the original variable name would not be suitable for presentation, or it would be awkward to read.  It may be so abbreviated that it is devoid of practical meaning.

This is why labelling variables can be a good idea.  However, I usually label variables manually in a DATA step or within PROC SQL, which can be very slow and prone to errors.  I recently worked on a data set with 193 variables, most of which require long descriptions to understand what they mean.  Labelling them individually and manually was not a realistic method, so I sought an automated or programmatic way to do so.

Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
set sashelp.class;

* define a new character variable to classify someone as tall or short;
if height > 60
then height_class = 'Tall';
else height_class = 'Short';
run;

* print the results for the first 5 rows;
proc print
data = c1 (obs = 5);
run;

Here is the result:

Obs Name Sex Age Height Weight height_class
1 Alfred M 14 69.0 112.5 Tall
2 Alice F 13 56.5 84.0 Shor
3 Barbara F 13 65.3 98.0 Tall
4 Carol F 14 62.8 102.5 Tall
5 Henry M 14 63.5 102.5 Tall

What happened?  Why does the word “Short” render as “Shor”?
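
The short answer is that SAS fixes the length of a new character variable when it first meets the variable in the DATA step code.  Here, the first value assigned is 'Tall', so HEIGHT_CLASS gets a length of 4, and 'Short' is truncated to 4 characters.  A minimal sketch of the fix named in the title, assuming a new data set called C2 (a hypothetical name) and a length of 5, is to declare the length before the first assignment:

```sas
data c2;
   set sashelp.class;

   * declare the length before the variable is first assigned;
   * 5 characters is enough to hold 'Short';
   length height_class $ 5;

   if height > 60
      then height_class = 'Tall';
      else height_class = 'Short';
run;

* print the results for the first 5 rows;
proc print
   data = c2 (obs = 5);
run;
```

Choosing a length equal to the longest value you plan to store avoids both truncation and wasted storage.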

Getting the names, types, formats, lengths, and labels of variables in a SAS data set

After reading my blog post on getting the variable names of a SAS data set, a reader named Robin asked how to get the formats as well.  I asked SAS Technical Support for help, and a consultant named Jerry Leonard provided a beautiful solution using PROC SQL.  Besides the names and formats of the variables, it also gives the types, lengths, and labels.  Here is an example of how to do so with the CLASS data set in the built-in SASHELP library.

* add formats and labels to 3 of the variables in the CLASS data set;
data class;
set sashelp.class;
format
age 8.
weight height 8.2
name $15.;
label
age = 'Age'
weight = 'Weight'
height = 'Height';
run;

* extract the variable information using PROC SQL;
proc sql noprint;
create table class_info as
select libname as library,
memname as data_set,
name as variable_name,
type,
length,
format,
label
from dictionary.columns
where libname = 'WORK' and memname = 'CLASS';
/* libname and memname values must be upper case */
quit;

* print the resulting table;
proc print
data = class_info;
run;

Here is the result of that PROC PRINT step in the Results Viewer.  Notice that it also has the type, length, format, and label of each variable.

Obs  library  data_set  variable_name  type  length  format  label
1    WORK     CLASS     Name           char  8       $15.
2    WORK     CLASS     Sex            char  1
3    WORK     CLASS     Age            num   8       8.      Age
4    WORK     CLASS     Height         num   8       8.2     Height
5    WORK     CLASS     Weight         num   8       8.2     Weight

Thank you, Jerry, for sharing your tip!

Sorting correlation coefficients by their magnitudes in a SAS macro

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.
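
The 3 steps above can be sketched as follows.  This is only an illustration under stated assumptions, not the macro from the original post: the data set SASHELP.CARS, the response MSRP, the covariates, and the data set names CORRS, CORRS_ABS, and CORRS_SORTED are all chosen for the example.  It also assumes that the PearsonCorr ODS table stores each correlation in a column named after the VAR variable (here, MSRP); you can confirm the column names with PROC CONTENTS.

```sas
* step 1: compute the correlation of each covariate with the response;
proc corr data = sashelp.cars;
   var MSRP;
   with Horsepower Weight Length MPG_City;
   ods output PearsonCorr = corrs;
run;

* step 2: obtain the absolute values of the correlation coefficients;
data corrs_abs;
   set corrs;
   abs_corr = abs(MSRP);
run;

* step 3: sort by the absolute values, from largest to smallest;
* OUT= writes to a new data set instead of replacing CORRS_ABS;
proc sort data = corrs_abs out = corrs_sorted;
   by descending abs_corr;
run;

proc print data = corrs_sorted;
   var Variable MSRP abs_corr;
run;
```

Keeping the signed coefficient next to its absolute value lets you see the direction of each relationship after sorting.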

Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry”.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!

Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

I am very excited to write again for the official JMP blog as a guest blogger!  Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP”.  This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into JMP and prepare them for ANOVA.  Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.