My new job as the Digital Marketing Analyst at Environics Analytics

As I approach my second anniversary of working at Environics Analytics, I am excited to accept a job offer to become our Digital Marketing Analyst.  In this new role, I am developing strategies to establish my company’s brand and promote our products and services on social media.  I am also using statistics to assess the effectiveness of our marketing efforts, both online and offline.

As The Chemical Statistician, I have written extensively on this blog, produced video tutorials on my YouTube channel, hosted a talk show (The Central Equilibrium), and shared my interests on Twitter (@chemstateric).  Mirroring these efforts in my new job, I will write articles on our company’s blog, produce YouTube videos, interview our staff, and engage with clients on Twitter (@EricCaiEA) and LinkedIn.

eric-cai-in-front-of-ea-logo

I am grateful to work with some wonderful colleagues who are friendly, helpful, and dedicated in their work.  It has been a pleasure to contribute to such a collaborative and joyful atmosphere, and I look forward to making a big impact with my new responsibilities!

Advertisements

My Silver Medal from the Canadian Society for Chemistry – Reflections After 10 Years

In June, 2008, I received an email from Dr. Ken MacFarlane, then the Undergraduate Advisor in the Department of Chemistry at Simon Fraser University (SFU).  He wrote to inform me that I had won the Canadian Society for Chemistry‘s Silver Medal, given to the top undergraduate student in chemistry entering their final year of study at each Canadian university.

I won the Canadian Society for Chemistry’s Silver Medal for being the top fourth-year student in the Department of Chemistry at Simon Fraser University in 2008.

Later in November of that year, I received this medal at a dinner banquet, which honoured all of the award winners from the universities and colleges in the Vancouver Section of the Chemical Institute of Canada (CIC).  (Awards were given to the top students in their second year, third year, and fourth year of study.)  Here is a photo of me receiving my medal from Dr. Daniel Leznoff; he was then the Chair of the Vancouver Section of the CIC and a professor specializing in inorganic chemistry at SFU.

Eric getting medal from Dr. Leznoff

I received the Canadian Society for Chemistry’s Silver Medal from Dr. Daniel Leznoff at a dinner banquet in November, 2008.

The CIC publishes a magazine called Canadian Chemical News, and it covered the above award banquet in January, 2009.  You can find a photo of the award winners from that night on Page 29.

Dr. Cameron Forde succeeded Dr. MacFarlane as our Undergraduate Advisor in 2009.  In an email to me in October, 2009, Dr. Forde wrote that 100-120 students were eligible for the CSC’s Silver Medal in our department in 2008.

This is one of the greatest achievements of my life.  I am even more excited about it today than I was at that banquet, because I now have 10 years of perspective about how this medal has benefited my career.  In this retrospective article, I write to share my reflections about the impact that this medal has had on my professional trajectory – which has been unusual, to say the least.

Read more of this post

Some SAS procedures (like PROC REG, GLM, ANOVA, SQL, and IML) end with “QUIT;”, not “RUN;”

Most SAS procedures require the

RUN;

statement to signal their termination.  However, there are some notable exceptions to this.

I have written about PROC SQL many times on my blog, and this procedure requires the

QUIT;

statement instead.

It turns out that there is another set of statistical procedures that require the QUIT statement, and some of them are very common.  They are called interactive procedures, and they include PROC REG, PROC GLM, and PROC ANOVAIf you end them with RUN rather than QUIT, then you will run into problems with displaying further output.  For example, if you try to output a data set from one such PROC and end it with the RUN statement, then you will get this error message:

ERROR: You cannot open WORK.MYDATA.DATA for input access with record-level 
control because WORK.MYDATA.DATA is in use by you in resource environment 
REG.

WORK.MYDATA cannot be opened.

You will also notice that the Program Editor says “PROC … running” in its banner when you end such a PROC with RUN rather than QUIT.

I don’t like this exception, but, alas, it does exist.  You can find out more about these interactive procedures in SAS Usage Note #37105.  As this note says, the ANOVA, ARIMA, CATMOD, FACTEX, GLM, MODEL, OPTEX, PLAN, and REG procedures are interactive procedures, and they all require the QUIT statement for termination.

PROC IML is not mentioned in that usage note, but this procedure also requires the QUIT statement.  Rick Wicklin has written an article about this on his blog, The DO Loop.

Arnab Chakraborty on The Monty Hall Problem and Bayes’ Theorem – The Central Equilibrium – Episode 6

I am pleased to welcome Arnab Chakraborty back to my talk show, “The Central Equilibrium“, to talk about the Monty Hall Problem and Bayes’ theorem.  In this episode, he shows 2 solutions to this classic puzzle in probability, and invokes Bayes’ Theorem for the second solution.

If you have not watched Arnab’s first episode on Bayes’ theorem, then I encourage you to do that first.

Marilyn Vos Savant provided a solution to this problem in PARADE Magazine in 1990-1991.  Thousands of readers disagreed with her solution and criticized her vehemently (and incorrectly) for her error.  Some of these critics were mathematicians!  She included some of those replies and provided alternative perspectives that led to the same conclusion.  Although I am dismayed by the disrespect that some people showed in their letters to her, I am glad that a magazine column on probability was able to attract so much readership and interest.  Arnab and I referred to one of her solutions in our episode.  Thank you, Marilyn!

Enjoy this episode of “The Central Equilibrium“!

Write a personal message when inviting people to connect on LinkedIn

Strangers send requests to join my network on LinkedIn every week, sometimes every day.  When I get such a request, the enclosing message is usually

“Hi Eric, I’d like to join your LinkedIn network.”

This is the default message, which means that the sender did not take the time to write a personalized invitation.  This is very disappointing, especially because LinkedIn suggests you to write a personal note before sending every request.

When you don’t write a personal message, it shows a lack of effort to engage with that person and develop a rapport in this new connection.  In this age of social media, it is easy and common to add new contacts just for the sake of increasing the size of one’s network, whether it’s “Friends” on Facebook, “Followers” on Twitter, or “Connections” on LinkedIn.  Although social networking is virtual, connecting with people is still a human endeavour, and your effort level in that endeavour will reap proportional returns in the long term.

In your personal note, here are possible things to mention:

  • how you met that person
  • what you valued in your past professional encounter(s) with that person
  • what you hope to learn from that person

 

If you accept a thoughtful invitation from someone on LinkedIn, then write a personal message in return to thank them.  Either way, read their profiles carefully, and ask insightful questions based on what you learn from their profiles.  Your new connections will recognize your efforts in noticing their work/education and trying to learn from them, and they will likely appreciate your initiative.

Benjamin Garden on Simple vs. Compound Interest in Finance – The Central Equilibrium – Episode 5

I am so pleased to publish this new episode of “The Central Equilibrium“, featuring Benjamin Garden.  He talked about simple and compound interest in the context of finance and investment, highlighting the power of compound interest to grow your money and to enlarge debt from credit cards.  We compared the formulas for calculating the accrued amounts under simple and compound interest, and we derived the formula for the Rule of 72, a short-cut to estimate the length of time needed to double your investment under compound interest.

Check out Ben’s blog, Twitter account (@GardenBenjamin), and Instagram account (@ben.garden) to get more advice about managing your money!

Part 1:

 

Part 2:

A tip about the word “college” to my American neighbo[u]rs who wish to work in Canada

Canadian English and American English are very similar, allowing Anglophones in both countries to work and live with ease when crossing the border.  However, there is a subtle difference in our vocabularies that can have big consequences for job searches and professional development.  To my American neighbours (or neighbors, as it is spelled in the United States of America), I offer this tip to avoid any confusion.  It concerns our different usages of the words “college” and “university”.

Peace_Arch_Monument,_Canada_-_USA_border

The Peace Arch is a monument situated between Blaine, Washington and Surrey, British Columbia. Near this monument is a major border crossing between the USA and Canada.

Image courtesy of RGB2 from Wikimedia.

Read more of this post

Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observationsHowever, there is a feature about PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
     input Name $ 
           Age 
           Height;
     datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

proc sort
     data = heights
          nouniquekey;
     by Name;
run;

proc print
     data = heights;
run;
Obs Name Age Height
1 Amy 15 174
2 Amy 16 177
3 Cam 13 163
4 Cam 17 181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT statement.

If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

Read more of this post

Layne Newhouse on representing neural networks – The Central Equilibrium – Episode 4

I am excited to present the first of a multi-episode series on neural networks on my talk show, “The Central Equilibrium”.  My guest in this series in Layne Newhouse, and he talked about how to represent neural networks. We talked about the biological motivations behind neural networks, how to represent them in diagrams and mathematical equations, and a few of the common activation functions for neural networks.

Check it out!

Include a professional photo of yourself in business attire in your LinkedIn profile

One of the easiest ways to polish your LinkedIn profile is posting a photo of yourself in business attire.  I strongly encourage every LinkedIn user to spend several hours to take at least 100 such photos of yourself.  Ask a friend or family member to take these photographs, if they would be so kind and willing to do so.  Alternatively, you can hire a professional photographer.  After this session, you will have a large stock of photos that you can use for various purposes.

Position yourself with many backgrounds, and take those photos from many angles.  You should always look straight into the camera, smile, and maintain an upright posture.  Here is what my LinkedIn profile looks like.

When you build a professional network and an online brand, people need to know who you are and what you look like.  When they meet you in person, your photo allows them to visually connect you with the online profile that they saw on LinkedIn.

This is especially crucial for people with common names; showing your photo allows others to easily distinguish you from others who share your name.  It turns out that there is another person named Eric Cai who works as a data scientist!  Not only do we share the same name, but we also have the same profession.  Without photographs, it would be quite difficult to distinguish between us in a professional setting.

 

Common Mistakes

I recently spoke at the Canadian Statistics Student Conference and at the University of Toronto’s Biostatistics Research Day, and I talked about this with students at both events.  Here are the common mistakes that I see in LinkedIn profile photos, and I urge you to avoid all of them.

  • Not having a profile photo
  • Not wearing professional attire
  • Not smiling
  • Covering your eyes with sunglasses
  • Looking away from the camera

Remember: This photo is for your professional branding, and your future employers or clients will look at it.  It is not for Facebook, Tinder, Grindr, or other social networks that are personal in nature.  Do not try to be cute, funny, sexy, or controversial – be professional.

Highlighting cells to quickly view Average, Count, and Sum in Excel

I recently needed to check my answers after some data analysis in Alteryx.  I computed many averages using a formula in Alteryx, and I wanted to check those results by calculating the average for a few randomly selected rows.  I did this by invoking a helpful tool in Microsoft Excel.  I will illustrate this functionality with some random data.

In Column E, I used a formula to calculate the average of the 3 populations in Columns B, C, and D.  To manually check that the formula is correct, I highlighted the 3 columns for ID #125.  On the bottom right, Excel calculates the average; it’s difficult to see in the picture below, but Excel confirms that the average is 707,154.

Excel average

Whenever you highlight a range of cells containing numeric data, Excel will provide the average, count, and sum of the selected cells.  I did not know about this functionality when I first began working as a statistician, and I am very glad that I did learn it eventually – it is very useful for checking answers by analyzing a few randomly selected rows in Excel!

Career-Advice Panel at Biostatistics Research Day – SORA-TABA Workshop – Dalla Lana School of Public Health – University of Toronto – Friday, June 15, 2018

I will speak on the career-advice panel at Biostatistics Research Day, an event jointly hosted by

  • the Dalla Lana School of Public Health at the University of Toronto
  • the Southern Ontario Regional Association (SORA) of the Statistical Society of Canada (SSC)
  • the Toronto Applied Biostatistics Association (TABA)

This one-day conference will be held on Friday, June 15, from 8:00 am to 5:00 pm.  The location is HS610 in the Dalla Lana School of Public Health (155 College Street in Toronto, Ontario).  This room is a large auditorium on the 6th floor.

The career panel will be held from 12:15 pm to 2:00 pm.  If you will be at this conference, please feel free to come and say “Hello”!

A macro to execute PROC TTEST for multiple binary grouping variables in SAS (and sorting t-test statistics by their absolute values)

In SAS, you can perform PROC TTEST for multiple numeric variables in the same procedure.  Here is an example using the built-in data set SASHELP.BASEBALL; I will compare the number of at-bats and number of walks between the American League and the National League.

proc ttest
     data = sashelp.baseball;
     class League;
     var nAtBat nBB; 
     ods select ttests;
run;

Here are the resulting tables.

Method Variances DF t Value Pr > |t|
Pooled Equal 320 2.05 0.0410
Satterthwaite Unequal 313.66 2.06 0.04

Method Variances DF t Value Pr > |t|
Pooled Equal 320 0.85 0.3940
Satterthwaite Unequal 319.53 0.86 0.3884

 

What if you want to perform PROC TTEST for multiple grouping (a.k.a. classification) variables?  You cannot put more than one variable in the CLASS statement, so you would have to run PROC TTEST separately for each binary grouping variable.  If you do put LEAGUE and DIVISION in the same CLASS statement, here is the resulting log.

1303 proc ttest
1304 data = sashelp.baseball;
1305 class league division;
 --------
 22
 202
ERROR 22-322: Expecting ;.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
1306 var natbat;
1307 ods select ttests;
1308 run;

 

There is no syntax in PROC TTEST to use multiple grouping variables at the same time, so this tutorial provides a macro to do so.  There are several nice features about my macro:

  1. It allows you to use multiple grouping variables at the same time.
  2. It sorts the t-test statistics by their absolute values within each grouping variable.
  3. It shows the name of each continuous variable in the output table, unlike the above output.

Here is its basic skeleton.

Read more of this post

Use ODS EXCLUDE ALL to suppress printing output in SAS while producing output data sets

I regularly produce output data sets from a SAS procedure, such as getting the variable names from a data set in PROC CONTENTS.  In these instances, I often wish to suppress any printing of the output in HTML or TXT.  Such printing of the results is often unnecessary, and it can cost a lot of time and memory.

Some SAS procedures have the NOPRINT option that suppresses the printing of output, but this is limiting in several ways:

  1. Some SAS procedures do NOT have the NOPRINT option.  PROC TTEST is a prominent example.  I checked the high-performance procedures like PROC HPFOREST (random forest) and PROC HPSVM (support vector machine), and I could not find the NOPRINT option for these procedures.
  2. I cannot use ODS OUTPUT to produce output data sets while invoking the NOPRINT option.  Here is an example.

Read more of this post

Career Panel at the 2018 Canadian Statistics Student Conference – McGill University, Montreal, Quebec

I will speak on the career-advice panel at the 2018 Canadian Statistics Student Conference.  It will be held on Saturday, June 2, at McGill University.

cropped-eric-cai-head-shot-7.png

If you will attend this conference or the subsequent Annual Meeting of the Statistical Society of Canada, then I strongly recommend students to read my following advice articles in advance.

Convert multiple variables between character and numeric formats in SAS

Introduction

I often get data that are coded as character, but are actually meant to be numeric.  Thus, converting them into the correct variable types is a common task, and SAS Note #24590 shows how to do so.  However, I recently needed to do hundreds of these conversions, so I wanted some code to accomplish this quickly and accurately.  This tutorial shows how to do so.

Let’s consider this small data set in SAS as an example.  They are hypothetical statistics of 3 players from a basketball game.

data basketball1;
     input jersey points $ rebounds $ assists $;
     datalines;
21 10 14 1
4  11 3  12
23 29 4  5
;
run;

The 3 performance metrics (points, rebounds, and assists) are clearly numeric, but they are currently coded as character.  (You can use PROC CONTENTS to confirm this if needed.)

The jersey number is really a character variable, because its magnitude has no real-life meaning.  The National Basketball Association (NBA) allows “00” as a possible jersey number.  (Robert Parish wore this jersey number; he won 4 NBA championships and reached the Naismith Basketball Hall of Fame.)  If you code “00” as a numeric variable, then it will render as “0”.  Thus, for NBA jersey numbers, it is best to save it as a character variable.

I can convert these variables into the correct types using the following code.  Note that I chose “2.” for the length of “JERSEY”, because I know that jersey numbers in the NBA have, at most, 2 digits.

data basketball2;
     set basketball1;
 
     jersey2 = put(jersey, 2.);
     drop jersey;
     rename jersey2 = jersey;

     points2 = input(points, 8.);
     drop points;
     rename points2 = points;

     rebounds2 = input(rebounds, 8.);
     drop rebounds;
     rename rebounds2 = rebounds;

     assists2 = input(assists, 8.);
     drop assists;
     rename assists2 = assists;
run;

 

Despite this success, the above code can be very cumbersome when I need to do this for many variables, and this situation arose in my job recently.  In this tutorial, I will show a fast way of doing these conversions for many variables at once.  I will use this BASKETBALL1 data set as an example, and I will convert POINTS, REBOUNDS, and ASSISTS from character to numeric simultaneously.

Read more of this post

A macro to automate the creation of indicator variables in SAS

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS.  This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding.  I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line.  If you have not read my previous post on this topic, then I strongly encourage you to do that first.  Don’t use this macro blindly.

Here is the macro.  The key steps are

  1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
  2. Merge the original data with the newly created indicator variables
  3. Delete the “INDICATORS” data set, which was created in an intermediate step
%macro create_indicators(input_data, target, covariates, output_data);

proc logistic
     data = &input_data
          noprint
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;
run;


data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);
run;


proc datasets 
     library = work
          noprint;
     delete indicators;
run;

%mend;

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro.  As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables.  I will do that here for the variables TYPE, MAKE, and ORIGIN.

Read more of this post

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

  • They can be useful predictor variables in statistical models.
  • They can reduce the amount of memory required to store the data set.
  • They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process.  This is especially true if there are many categorical variables, or if a categorical variable has many categories.  In this tutorial, I will show an easy and efficient way to create indicator variables in SAS.  I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library.  Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately.  (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY STATE COUNTY ACTUAL PREDICT
U.S.A. California $987.36 $692.24
U.S.A. California $1,782.96 $568.48
U.S.A. California $32.64 $16.32
U.S.A. California $1,825.12 $756.16
U.S.A. California $750.72 $723.52

 

PRODTYPE PRODUCT YEAR QUARTER MONTH MONYR
FURNITURE SOFA 1995 1 Jan JAN95
FURNITURE SOFA 1995 1 Feb FEB95
FURNITURE SOFA 1995 1 Mar MAR95
FURNITURE SOFA 1995 2 Apr APR95
FURNITURE SOFA 1995 2 May MAY95

Read more of this post

Communication Tip – Write the message of the email BEFORE the subject and the recipients’ email addresses

In every email service that I have used so far,

1) the address fields are on the top

2) the subject field is in the middle

3) and then the text editor for the message is at the end.

However, when I write most emails, I usually write these 3 things in reverse.  This has several important advantages.

Email on laptop

Image courtesy of Pixabay on Pexels.

Read more of this post

Sort a data set by ascending or descending variables using PROC SORT in SAS

Consider the built-in data set SASHELP.CLASS in SAS.  Here are the first 5 observations from PROC PRINT.

Obs Name Sex Age Height Weight
1 Joyce F 11 51.3 50.5
2 Thomas M 11 57.5 85.0
3 James M 12 57.3 83.0
4 Jane F 12 59.8 84.5
5 John M 12 59.0 99.5

As you can clearly see, they are NOT sorted by weight.  Here is how you can sort the data set by weight using PROC SORT.

Read more of this post