## Video Tutorial – Obtaining the Expected Value of the Exponential Distribution Using the Moment Generating Function

In this video tutorial on YouTube, I use the exponential distribution’s moment generating function (MGF) to obtain the expected value of this distribution.  Visit my YouTube channel to watch more video tutorials!
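
As a brief sketch of the calculation, suppose that $X$ follows an exponential distribution with rate parameter $\lambda > 0$.  (The rate parameterization and the symbol $\lambda$ are my assumptions for this sketch; the result changes accordingly if the distribution is parameterized by its scale.)  The MGF is

$M_X(t) = E[e^{tX}] = \frac{\lambda}{\lambda - t}, \quad t < \lambda$.

Differentiating once with respect to $t$ and evaluating the derivative at $t = 0$ gives the expected value:

$E[X] = M_X'(0) = \left. \frac{\lambda}{(\lambda - t)^2} \right|_{t = 0} = \frac{1}{\lambda}$.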

Image courtesy of rawpixel.com on Pexels.

## Remove leading blanks when creating macro variables using PROC SQL in SAS

I regularly use PROC SQL to create macro variables in SAS, and I recently noticed a strange phenomenon when resolving a macro variable within double quotation marks in the title of a plot.  Thankfully, I was able to replicate this problem using the SASHELP.BASEBALL data set, which is publicly available.  I was then able to send the code and the strange result to SAS Technical Support for their examination.

proc sql;
select count(name)
into   :hitters_100plusHR
from   sashelp.baseball
where  CrHome > 100;
quit;

proc sgplot
data = sashelp.baseball;
histogram Salary;
title1 'Distribution of salaries';
title2 "Restricted to the &hitters_100plusHR hitters with more than 100 career home runs";
run;

Here is the resulting plot.  Notice the extra spaces before “72” in the title of the plot.

SAS Technical Support informed me that

• this problem is well known;
• there is no way of predicting when it will occur;
• for now, the best way to deal with it is to remove the leading blanks in one of several ways.
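
Here is a minimal sketch of 2 common ways to remove the leading blanks.  These are my own illustrations rather than the specific suggestions from SAS Technical Support, and the first one assumes SAS 9.3 or later, where the TRIMMED keyword is available in the INTO clause.

```sas
/* Option 1: ask PROC SQL to trim the value as it is stored (TRIMMED requires SAS 9.3 or later) */
proc sql noprint;
select count(name)
into   :hitters_100plusHR trimmed
from   sashelp.baseball
where  CrHome > 100;
quit;

/* Option 2: re-assign the existing macro variable to itself; the %LET statement discards leading and trailing blanks */
%let hitters_100plusHR = &hitters_100plusHR;

/* write the value to the log between brackets to check for stray blanks */
%put [&hitters_100plusHR];
```

Either option can be used on its own.  Option 2 is handy when the macro variable has already been created by code that you do not want to change.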

## Use unique() instead of levels() to find the possible values of a factor in R

*In a previous version of this blog post, I incorrectly wrote that "Species" is a character variable.  Instead, it is a factor.  I thank the readers who corrected me in the comments.

When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable.  However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its factor “Species”.  Here are the possible values of “Species”, as shown by the levels() function.

> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”.  I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

> iris2 = subset(iris, Species != 'setosa')
> table(iris2$Species)

    setosa versicolor  virginica
         0         50         50

> levels(iris2$Species)
[1] "setosa"     "versicolor" "virginica"

## Video Tutorial – The Moment Generating Function of the Exponential Distribution

In this video tutorial on YouTube, I derive the moment generating function (MGF) of the exponential distribution.  Visit my YouTube channel to watch more video tutorials!

## A SAS macro to automatically label variables using another data set

#### Introduction

When I write SAS programs, I usually export the analytical results into an output that a client will read.  I often cannot show the original variable names in these outputs; there are 2 reasons for this:

• The maximal length of a SAS variable’s name is 32 characters, whereas the description of the variable can be much longer.  This is the case for my current job in marketing analytics.
• Only letters, numbers, and underscores are allowed in a SAS variable’s name.  Spaces and special characters are not allowed.

Chris recommends “Introduction to Probability Models” by Sheldon Ross to learn more about probability theory and Markov chains.

The Central Equilibrium is my new talk show about math, science, and economics.  It focuses on technical topics that involve explanations with formulas, equations, graphs, and diagrams.  Stay tuned for more episodes in the coming weeks!

You can watch all of my videos on my YouTube channel!

Please watch the video on this blog.  You can also watch it directly on YouTube.

## Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting.  I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems.  This technique uses the INTO statement in PROC SQL.
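
As a minimal, self-contained sketch of the technique, the following code stores the names of the numeric variables in SASHELP.CLASS (Age, Height, and Weight) in a single macro variable, with the names separated by spaces.  The macro variable name num_vars and the choice of SASHELP.CLASS are my assumptions for this illustration only.

```sas
/* collect the names of all numeric variables in SASHELP.CLASS into one macro variable */
proc sql noprint;
select name
into   :num_vars separated by ' '
from   dictionary.columns
where  libname = 'SASHELP' and memname = 'CLASS' and type = 'num';
quit;

/* write the resolved list of names to the log */
%put num_vars = &num_vars;
```

The macro variable can then stand in for the list of names wherever a space-delimited variable list is accepted, such as a VAR statement.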
To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data set.  I will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements.  To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
data = sashelp.baseball
noprint
out = bvars (keep = name type);
run;

## Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected.  Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
set sashelp.class;

* define a new character variable to classify someone as tall or short;
if height > 60
then height_class = 'Tall';
else height_class = 'Short';
run;

* print the results for the first 5 rows;
proc print
data = c1 (obs = 5);
run;

Here is the result:

Obs Name Sex Age Height Weight height_class
1 Alfred M 14 69.0 112.5 Tall
2 Alice F 13 56.5 84.0 Shor
3 Barbara F 13 65.3 98.0 Tall
4 Carol F 14 62.8 102.5 Tall
5 Henry M 14 63.5 102.5 Tall

What happened?  Why does the word “Short” render as “Shor”?

## Neil Seoni on the Fourier Transform and the Sampling Theorem – The Central Equilibrium – Episode 1

I am very excited to publish the very first episode of my new talk show, The Central Equilibrium!  My guest is Neil Seoni, an undergraduate student in electrical and computer engineering at Rice University in Houston, Texas.  He has studied data science in his spare time, most notably taking a course on machine learning by Andrew Ng on Coursera.
I spoke at the 2016 Canadian Statistics Student Conference on career advice for students and new graduates in statistics.  Image courtesy of Peter Macdonald on Flickr.

In fact, most of these tips apply to public speaking in general.

## Maximizing Your Learning Potential at Professional Conferences – A Detailed Guide

### Introduction

Last summer, I attended the 2016 Annual Meeting of the Statistical Society of Canada (SSC).  I spoke on the career-advice panel at the 2016 Canadian Statistics Student Conference (CSSC), and I met some colleagues and professors to share ideas about our mutual interests in statistics, statistical education, and the use of social media to promote statistics to the general public.  From observing and talking to many students at this conference, I realized that most of them did not use it effectively to maximize their learning potential.

Image courtesy of Rufino from Wikimedia Commons.

To build valuable relationships in your professional network

Unfortunately, based on my anecdotal observations, many students in statistics, math and science don’t seem to grasp Objectives #3-4.  These students tend to be passive in their attendance and shy in their participation.  When they do try to pursue Objectives #3-4, they are often unprepared and do not take advantage of all of the learning opportunities that are available to them.

The first step in maximizing your learning potential at a professional conference is recognizing that it takes preparation and hard work.  To do it well, you need to take all 4 objectives seriously and practice them frequently.  Attending a professional conference is a skill, and developing this skill requires thought and effort.  It involves much more than just showing up, talking at your turn, and listening at all other times.
Hopefully, the rest of this article will help you to develop this skill in an intelligent way, but you must realize that there is no substitute for hard work.

## Getting the names, types, formats, lengths, and labels of variables in a SAS data set

After reading my blog post on getting the variable names of a SAS data set, a reader named Robin asked how to get the formats as well.  I asked SAS Technical Support for help, and a consultant named Jerry Leonard provided a beautiful solution using PROC SQL.  Besides the names and formats of the variables, it also gives the types, lengths, and labels.

Here is an example of how to do so with the CLASS data set in the built-in SASHELP library.

* add formats and labels to 3 of the variables in the CLASS data set;
data class;
set sashelp.class;
format age 8. weight height 8.2 name $15.;
label
age = 'Age'
weight = 'Weight'
height = 'Height';
run;

* extract the variable information using PROC SQL;
proc sql
noprint;
create table class_info as
select libname as library,
memname as data_set,
name as variable_name,
type,
length,
format,
label
from dictionary.columns
where libname = 'WORK' and memname = 'CLASS';
/* libname and memname values must be upper case  */
quit;

* print the resulting table;
proc print
data = class_info;
run;

Here is the result of that PROC PRINT step in the Results Viewer.  Notice that it also has the type, length, format, and label of each variable.

Obs library data_set variable_name type length format label
1 WORK CLASS Name char 8 \$15.
2 WORK CLASS Sex char 1
3 WORK CLASS Age num 8 8. Age
4 WORK CLASS Height num 8 8.2 Height
5 WORK CLASS Weight num 8 8.2 Weight

Thank you, Jerry, for sharing your tip!

## Sorting correlation coefficients by their magnitudes in a SAS macro

#### Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in both the numerator and the denominator, it cancels out, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.
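
The 3 steps above can be sketched as follows.  This is only an illustration and not the macro itself: I assume SASHELP.BASEBALL as the data set, Salary as the response variable, and 4 arbitrarily chosen covariates; the data set names corr_out and corr_long and the variable names r and abs_r are hypothetical.

```sas
/* Step 1: compute the Pearson correlations between Salary and each covariate */
proc corr
data = sashelp.baseball
noprint
outp = corr_out;
var  nHits nHome nRuns nRBI;
with Salary;
run;

/* keep only the row of correlations, and turn it into one row per covariate */
proc transpose
data = corr_out (where = (_type_ = 'CORR'))
out  = corr_long (rename = (col1 = r))
name = variable;
var nHits nHome nRuns nRBI;
run;

/* Step 2: obtain the absolute value of each correlation coefficient */
data corr_long;
set corr_long;
abs_r = abs(r);
run;

/* Step 3: sort the covariates from the strongest to the weakest correlation */
proc sort
data = corr_long;
by descending abs_r;
run;
```

Sorting by the absolute value keeps strongly negative correlations near the top of the list, where sorting by the raw coefficient would push them to the bottom.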

## Analytical Chemistry Lesson of the Day – Accuracy in Method Validation and Quality Assurance

In pharmaceutical chemistry, one of the requirements for method validation is accuracy, the ability of an analytical method to obtain a value of a measurement that is close to the true value. There are several ways of assessing an analytical method for accuracy.

1. Compare the value from your analytical method with an established or reference method.
2. Use your analytical method to obtain a measurement from a sample with a known quantity (i.e. a reference material), and compare the measured value with the true value.
3. If you don’t have a reference material for the second way, you can make your own by spiking a blank matrix with a measured quantity of the analyte.
4. If your matrix may interfere with the analytical signal, then you cannot spike a blank matrix as described in the third way.  Instead, spike your sample with a known quantity of the standard.  I elaborate on this in a separate tutorial on standard addition, a common technique in analytical chemistry for determining the quantity of a substance when matrix interference exists.  Standard addition is an example of the second way of assessing accuracy as I mentioned above.  You can view the original post of this tutorial on the official JMP blog.
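
For the spiking approaches in the third and fourth ways, accuracy is commonly summarized as a percent recovery.  As a sketch, using symbols that I am introducing here only for illustration, let $C_{spiked}$ be the measured concentration in the spiked sample, $C_{unspiked}$ be the measured concentration in the unspiked sample (or blank matrix), and $C_{added}$ be the known concentration that was added.  Then

$\text{Recovery} = \frac{C_{spiked} - C_{unspiked}}{C_{added}} \times 100\%$.

A recovery close to 100% suggests that the method measures the added analyte accurately in that matrix.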

