Video Tutorial: Naive Bayes Classifiers

Naive Bayes classifiers are simple but powerful tools for classification in statistics and machine learning.  In this video tutorial, I use a simulated data set and illustrate the mathematical details of how this technique works.

In my recent episode on The Central Equilibrium about word embeddings and text classification, Mandy Gu used naive Bayes classifiers to determine if a sentence is toxic or non-toxic – a very common objective when moderating discussions in online forums.  If you are not familiar with naive Bayes classifiers, then I encourage you to watch this video before watching Mandy’s episode on The Central Equilibrium.

Mandy Gu on Word Embeddings and Text Classification – The Central Equilibrium – Episode 9

I am so grateful to Mandy Gu for being a guest on The Central Equilibrium to talk about word embeddings and text classification.  She began by showing how data from text can be encoded in vectors and matrices, and then she used a naive Bayes classifier to classify sentences as toxic or non-toxic – a very common problem for moderating discussions in online forums.  I learned a lot from her in this episode, and you can learn more from Mandy on her Medium blog.

If you are not familiar with naive Bayes classifiers, then I encourage you to watch my video tutorial about this topic first.

My Silver Medal from the Canadian Society for Chemistry – Reflections After 10 Years

In June, 2008, I received an email from Dr. Ken MacFarlane, then the Undergraduate Advisor in the Department of Chemistry at Simon Fraser University (SFU).  He wrote to inform me that I had won the Canadian Society for Chemistry’s Silver Medal, given to the top undergraduate student in chemistry entering their final year of study at each Canadian university.

I won the Canadian Society for Chemistry’s Silver Medal for being the top fourth-year student in the Department of Chemistry at Simon Fraser University in 2008.

Later in November of that year, I received this medal at a dinner banquet, which honoured all of the award winners from the universities and colleges in the Vancouver Section of the Chemical Institute of Canada (CIC).  (Awards were given to the top students in their second year, third year, and fourth year of study.)  Here is a photo of me receiving my medal from Dr. Daniel Leznoff; he was then the Chair of the Vancouver Section of the CIC and a professor specializing in inorganic chemistry at SFU.

Eric getting medal from Dr. Leznoff

I received the Canadian Society for Chemistry’s Silver Medal from Dr. Daniel Leznoff at a dinner banquet in November, 2008.

The CIC published a magazine called Canadian Chemical News, and it covered the above award banquet in January, 2009.  You can find a photo of the award winners from that night on Page 29.

Dr. Cameron Forde succeeded Dr. MacFarlane as our Undergraduate Advisor in 2009.  In an email to me in October, 2009, Dr. Forde wrote that 100-120 students were eligible for the CSC’s Silver Medal in our department in 2008.

This is one of the greatest achievements of my life.  I am even more excited about it today than I was at that banquet, because I now have 10 years of perspective about how this medal has benefited my career.  In this retrospective article, I write to share my reflections about the impact that this medal has had on my professional trajectory – which has been unusual, to say the least.

Communication Tip: Don’t say “next Friday”. Say “Friday of next week”.

Today is Monday, September 24, 2018.  Suppose that my co-worker Jessica asks me, “Can we meet on next Friday to talk about our report?”.  Does she mean

  • Friday, September 28, 2018?
  • Friday, October 5, 2018?

Image courtesy of rawpixel.com via Pexels

The word “next” is tricky to interpret in this situation.  By definition, “next” denotes the instance immediately after the present.  Thus, “next Friday” should mean Friday, September 28, 2018.

However, most Anglophones would interpret this phrase to mean the Friday of the next week, which is Friday, October 5, 2018.  This is not logical, but it is the dominant interpretation.

To prevent this confusion, I avoid saying “next Friday”.  Instead, I say

  • “the upcoming Friday” to denote Friday, September 28, 2018
  • “Friday of next week” to denote Friday, October 5, 2018

These approaches are clear and unequivocal, and they eliminate any chance for confusion.

If this is communicated in an email, then I suggest confirming the correct “Friday” by adding the calendar date.  Thus, I would write, “Let’s meet on Friday of next week, October 5, 2018”.  This method helps the reader to know if we will meet during this week or next week, and it adds another way to confirm the date.
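
As a rough illustration, the two readings of “next Friday” can be computed in Python.  This is only a sketch: the function names are hypothetical, and it assumes that a week runs from Monday to Sunday.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def upcoming_weekday(today: date, weekday: int) -> date:
    """Return the first occurrence of `weekday` strictly after `today`."""
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def weekday_of_next_week(today: date, weekday: int) -> date:
    """Return `weekday` in the Monday-to-Sunday week that follows this one."""
    monday_this_week = today - timedelta(days=today.weekday())
    return monday_this_week + timedelta(days=7 + weekday)


def describe(d: date) -> str:
    """Format a date with both the day of the week and the calendar date."""
    return f"{d:%A, %B} {d.day}, {d.year}"


today = date(2018, 9, 24)  # Monday, September 24, 2018
print("the upcoming Friday:", describe(upcoming_weekday(today, FRIDAY)))
print("Friday of next week:", describe(weekday_of_next_week(today, FRIDAY)))
```

For Monday, September 24, 2018, the first function returns September 28 and the second returns October 5, which matches the 2 interpretations above.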

FutureMakers Mega Meetup on Wednesday, September 26, 2018

I will attend the FutureMakers Mega Meetup on Wednesday, September 26, 2018.  It will be hosted by RBC’s Tech Community Team.  The registration is free but required.  Here are the details.

Wednesday, September 26, 2018
4:30 PM – 8:00 PM
Metro Toronto Convention Centre, South Building
222 Bremner Boulevard
Toronto, ON
M5V 3L9

If you will attend this event, then please come and say “Hello”!

Communication Tip: Write both the day of the week and the calendar date when organizing meetings or planning events

Image courtesy of rawpixel.com via Pexels

When proposing a meeting or planning an event in writing, I strongly suggest stating both the day of the week and the calendar date.  For example, I would email my co-worker Mark, “Shall we visit our client on next Tuesday, September 25?”.

Note the contrast between my proposed approach and the following 2 alternatives:

  • “Shall we visit our client on next Tuesday?”
  • “Shall we visit our client on September 25?”

Some careful comparisons will reveal 3 advantages:

  • It forces me to check that I wrote the correct pairing of the day of the week and the calendar date.  This is an extra layer of quality control.
  • If I simply write “Shall we visit our client on September 25?”, then I implicitly force Mark to check what day of the week that is.  If I send that email to 10 people, then I’m multiplying this hassle by 10.  I can save all parties a lot of headache by taking the initiative to write “Tuesday, September 25”.
  • Knowing both is very helpful, but often for different reasons.
    • Knowing the specific calendar date eliminates any source of ambiguity about which day it is.  Instead of relying on words/phrases like “tomorrow”, “next Tuesday”, or “the day after”, stating “September 25” is perfectly clear to Mark.
    • If I propose a meeting on a Wednesday afternoon, Mark may immediately know that it is a bad time, because he needs to coach his daughter’s basketball team on Wednesday afternoons.  This illustrates how the day of the week is helpful for coordinating one-time events with events that recur weekly.

In the above example, I have omitted the year, because the working context between me and Mark would imply that meeting in September of next year would be rather strange and unrealistic.  However, stating the year may be helpful or even necessary for certain situations, especially if legal formality is involved.
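
The first advantage, checking that the day of the week matches the calendar date, can also be automated.  Here is a minimal Python sketch; the function name and the list of day names are my own illustrative choices, not part of any calendar tool.

```python
from datetime import date

# English day names, indexed to match date.weekday() (Monday is 0)
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def pairing_is_correct(day_name: str, d: date) -> bool:
    """Return True if `day_name` is the actual day of the week of `d`."""
    return DAY_NAMES[d.weekday()].lower() == day_name.strip().lower()


proposed = date(2018, 9, 25)
print(pairing_is_correct("Tuesday", proposed))    # correct pairing
print(pairing_is_correct("Wednesday", proposed))  # mismatched pairing
```

The first call returns True and the second returns False, because September 25, 2018 fell on a Tuesday.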

RBC FutureMakers Talks – “Management Lessons from Leading Tech Teams” – Tuesday, September 18, 2018

I will attend the RBC FutureMakers Talks on Tuesday, September 18.  The title of this event is “Management Lessons from Leading Tech Teams”.  Here are the details.

Tuesday, September 18, 2018
5:30 PM – 8:30 PM EDT
RBC WaterPark Place – Auditorium
88 Queens Quay West
Toronto, ON
M5J 0B8

If you will attend this event, please come and say “Hello”!

Mitchell Boggs on Game Theory in Behavioural Ecology – The Central Equilibrium – Episode 8

Mitchell Boggs kindly talked about game theory in behavioural ecology on my talk show, “The Central Equilibrium”!  He talked about 2 key examples:

  • when animals choose to share or fight for food
  • when parents choose to care for their offspring or seek new mates to produce more offspring

These examples illustrate why seemingly disadvantageous behaviours can persist or even dominate in the animal kingdom.

Mitch recommends a book called “Are We Smart Enough to Know How Smart Animals Are?” by Frans de Waal.

Thanks for being such a great guest, Mitchell!

SORA Business Analytics Seminar – The role of geography in data integration and predictive analytics by Tony Lea – Friday, September 14, 2018

I will attend an upcoming seminar by Tony Lea, the Chief Methodologist at Environics Analytics.  The title of his seminar is “The role of geography in data integration and predictive analytics”.  Registration is free but required.

This seminar is organized by the Southern Ontario Regional Association (SORA) of the Statistical Society of Canada (SSC).  Here are the time, date, and location.

Friday, September 14, 2018
8:30 AM – 10:00 AM
Auditorium
RBC WaterPark Place
88 Queens Quay West
Toronto, ON
M5J 0B8

RBC WaterPark Place is conveniently located in downtown Toronto, and it is easily accessible from Union subway station.

If you will attend this seminar, then please feel free to come and say “Hello”!

Full disclosure: I work as a Digital Marketing Analyst at Environics Analytics.  Tony is my co-worker.

Communication and Email Tip: Propose meeting times in both time zones

When I arrange a phone call with someone in a different time zone, I propose the time in both my time zone and their time zone.

Photo courtesy of Pixabay via Pexels.

This has 2 benefits:

1) I save the recipient the time and headache of determining the correct time in their time zone.

2) The recipient can check if my conversion is correct.

On at least 2 occasions, this practice has helped me to identify a mistake in the proposed time of a meeting.

Arranging a teleconference via an online calendar invitation solves this problem, because the online calendar will automatically do the conversion. However, not all meetings are arranged this way, so this is still a good practice to adopt.
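
When the conversion has to be done by hand, a short script can serve as a second check.  The following Python sketch uses the standard-library zoneinfo module (Python 3.9 or later).  The meeting time, the 2 cities, and the function name are hypothetical choices for illustration only.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def describe_time(dt: datetime) -> str:
    """Format a time-zone-aware datetime as, for example, '2:00 PM EDT'."""
    return f"{dt.hour % 12 or 12}:{dt:%M %p %Z}"


# Hypothetical proposal: 2:00 PM in Toronto on Tuesday, September 25, 2018
mine = datetime(2018, 9, 25, 14, 0, tzinfo=ZoneInfo("America/Toronto"))

# The same instant, expressed in the recipient's time zone
theirs = mine.astimezone(ZoneInfo("America/Vancouver"))

print(f"Shall we talk at {describe_time(mine)} / {describe_time(theirs)}?")

# The calendar dates can differ when the zones are far apart,
# so it is worth checking that, too.
print("Same calendar date:", mine.date() == theirs.date())
```

Because zoneinfo reads the time-zone database, it accounts for daylight saving time on the date in question, which is a common source of conversion mistakes.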

Update to “A Story About Perseverance – Inspiration From My Old Professor”

Names and details in this blog post have been altered to protect the privacy of its subjects.

In 2014, I wrote about a former professor, Dr. Baker, who suffered from a chronic liver disorder and endured complications from her liver transplant.

I recently heard from Dr. Perez and Dr. Baker about some wonderful news: Dr. Baker just earned tenure in her job as a professor.  This required her to get letters of recommendation from researchers in her field.  Trusted sources revealed that those letters contained glowing appraisals of Dr. Baker’s work.  I was very glad to learn of both this endorsement and the eventual attainment of a treasured milestone for Dr. Baker.

Besides this professional achievement, Dr. Baker has also improved her health significantly through disciplined self-care, especially exercise.  I was delighted to learn of this progress.

Congratulations, Dr. Baker.  Your example shows that perseverance can bring great rewards.  I hope that you and Dr. Perez enjoyed your celebratory lunch together.

David Veitch on Rational vs. Irrational Numbers and Countability – The Central Equilibrium – Episode 7

I am so grateful that David Veitch appeared on my talk show, “The Central Equilibrium”, to talk about rational vs. irrational numbers.  While defining irrational numbers, he proved that \sqrt{2} is an irrational number.  He then talked about the concept of bijections while defining countability, and he showed that rational numbers are countable.
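
For readers who want the idea before watching, here is a sketch of the standard proof by contradiction that \sqrt{2} is irrational.  The argument presented in the episode may differ in its details.

```latex
Suppose that $\sqrt{2}$ is rational.  Then we can write
\[
  \sqrt{2} = \frac{p}{q}, \qquad p, q \in \mathbb{Z}, \quad q \neq 0, \quad \gcd(p, q) = 1 .
\]
Squaring both sides gives
\[
  p^2 = 2 q^2 ,
\]
so $p^2$ is even, and therefore $p$ is even.  Write $p = 2k$ for some integer $k$.  Then
\[
  (2k)^2 = 2 q^2 \quad \Longrightarrow \quad q^2 = 2 k^2 ,
\]
so $q$ is also even.  Now $p$ and $q$ share the factor $2$, which contradicts
$\gcd(p, q) = 1$.  Hence, $\sqrt{2}$ is irrational.
```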

David used to work as a bond trader for Bank of America.  He writes a personal blog, and you can follow him on Twitter (@daveveitch).  He recently earned admission into the Master of Science program in statistics at the University of Toronto, and he will begin that program soon.  Congratulations, David!  Thanks for being a guest on my show!

Part 1

Part 2

My new job as the Digital Marketing Analyst at Environics Analytics

As I approach my second anniversary of working at Environics Analytics, I am excited to accept a job offer to become our Digital Marketing Analyst.  In this new role, I am developing strategies to establish my company’s brand and promote our products and services on social media.  I am also using statistics to assess the effectiveness of our marketing efforts, both online and offline.

As The Chemical Statistician, I have written extensively on this blog, produced video tutorials on my YouTube channel, hosted a talk show (The Central Equilibrium), and shared my interests on Twitter (@chemstateric).  Mirroring these efforts in my new job, I will write articles on our company’s blog, produce YouTube videos, interview our staff, and engage with clients on Twitter (@EricCaiEA) and LinkedIn.

Eric sitting under EA logo

I am grateful to work with some wonderful colleagues who are friendly, helpful, and dedicated in their work.  It has been a pleasure to contribute to such a collaborative and joyful atmosphere, and I look forward to making a big impact with my new responsibilities!

Some SAS procedures (like PROC REG, GLM, ANOVA, SQL, and IML) end with “QUIT;”, not “RUN;”

Most SAS procedures require the

RUN;

statement to signal their termination.  However, there are some notable exceptions to this.

I have written about PROC SQL many times on my blog, and this procedure requires the

QUIT;

statement instead.

It turns out that there is another set of statistical procedures that require the QUIT statement, and some of them are very common.  They are called interactive procedures, and they include PROC REG, PROC GLM, and PROC ANOVA.  If you end them with RUN rather than QUIT, then you will run into problems with displaying further output.  For example, if you try to output a data set from one such PROC and end it with the RUN statement, then you will get this error message:

ERROR: You cannot open WORK.MYDATA.DATA for input access with record-level 
control because WORK.MYDATA.DATA is in use by you in resource environment 
REG.

WORK.MYDATA cannot be opened.

You will also notice that the Program Editor says “PROC … running” in its banner when you end such a PROC with RUN rather than QUIT.

I don’t like this exception, but, alas, it does exist.  You can find out more about these interactive procedures in SAS Usage Note #37105.  As this note says, the ANOVA, ARIMA, CATMOD, FACTEX, GLM, MODEL, OPTEX, PLAN, and REG procedures are interactive procedures, and they all require the QUIT statement for termination.

PROC IML is not mentioned in that usage note, but this procedure also requires the QUIT statement.  Rick Wicklin has written an article about this on his blog, The DO Loop.

Arnab Chakraborty on The Monty Hall Problem and Bayes’ Theorem – The Central Equilibrium – Episode 6

I am pleased to welcome Arnab Chakraborty back to my talk show, “The Central Equilibrium”, to talk about the Monty Hall Problem and Bayes’ theorem.  In this episode, he shows 2 solutions to this classic puzzle in probability, and he invokes Bayes’ theorem for the second solution.

If you have not watched Arnab’s first episode on Bayes’ theorem, then I encourage you to do that first.

Marilyn vos Savant provided a solution to this problem in PARADE Magazine in 1990-1991.  Thousands of readers disagreed with her solution and criticized her vehemently (and incorrectly) for her supposed error.  Some of these critics were mathematicians!  She included some of those replies and provided alternative perspectives that led to the same conclusion.  Although I am dismayed by the disrespect that some people showed in their letters to her, I am glad that a magazine column on probability was able to attract so much readership and interest.  Arnab and I referred to one of her solutions in our episode.  Thank you, Marilyn!
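
The Bayes’ theorem calculation behind the puzzle can be sketched in a few lines of Python.  This is an illustration under the usual assumptions, not a transcription of the episode: the player picks Door 1, and the host, who knows where the car is, always opens a different door that hides a goat, choosing at random when 2 such doors are available.  Here, the host opens Door 3.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind each of the 3 doors
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihood: P(host opens Door 3 | car is behind door d, player picked Door 1)
likelihood = {
    1: Fraction(1, 2),  # host picks at random between Doors 2 and 3
    2: Fraction(1),     # host cannot open Door 1 (the pick) or Door 2 (the car)
    3: Fraction(0),     # host never reveals the car
}

# Bayes' theorem: posterior = prior * likelihood / evidence
evidence = sum(prior[d] * likelihood[d] for d in prior)
posterior = {d: prior[d] * likelihood[d] / evidence for d in prior}

print("P(win by staying with Door 1)  =", posterior[1])
print("P(win by switching to Door 2) =", posterior[2])
```

Staying wins with probability 1/3 and switching wins with probability 2/3, which agrees with the solution that Marilyn vos Savant defended.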

Enjoy this episode of “The Central Equilibrium”!

Write a personal message when inviting people to connect on LinkedIn

Strangers send requests to join my network on LinkedIn every week, sometimes every day.  When I get such a request, the enclosing message is usually

“Hi Eric, I’d like to join your LinkedIn network.”

This is the default message, which means that the sender did not take the time to write a personalized invitation.  This is very disappointing, especially because LinkedIn suggests that you write a personal note before sending every request.

When you don’t write a personal message, it shows a lack of effort to engage with that person and develop a rapport in this new connection.  In this age of social media, it is easy and common to add new contacts just for the sake of increasing the size of one’s network, whether it’s “Friends” on Facebook, “Followers” on Twitter, or “Connections” on LinkedIn.  Although social networking is virtual, connecting with people is still a human endeavour, and your effort level in that endeavour will reap proportional returns in the long term.

In your personal note, here are possible things to mention:

  • how you met that person
  • what you valued in your past professional encounter(s) with that person
  • what you hope to learn from that person

If you accept a thoughtful invitation from someone on LinkedIn, then write a personal message in return to thank them.  Either way, read their profiles carefully, and ask insightful questions based on what you learn from their profiles.  Your new connections will recognize your efforts in noticing their work/education and trying to learn from them, and they will likely appreciate your initiative.

Benjamin Garden on Simple vs. Compound Interest in Finance – The Central Equilibrium – Episode 5

I am so pleased to publish this new episode of “The Central Equilibrium”, featuring Benjamin Garden.  He talked about simple and compound interest in the context of finance and investment, highlighting the power of compound interest to grow your money and to enlarge debt from credit cards.  We compared the formulas for calculating the accrued amounts under simple and compound interest, and we derived the formula for the Rule of 72, a short-cut to estimate the length of time needed to double your investment under compound interest.
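
As a rough numerical companion to these formulas, here is a Python sketch.  It assumes annual compounding and a rate expressed as a decimal fraction, and the function names and dollar amounts are illustrative rather than taken from the episode.

```python
import math


def simple_amount(principal: float, rate: float, years: float) -> float:
    """Accrued amount under simple interest: P * (1 + r * t)."""
    return principal * (1 + rate * years)


def compound_amount(principal: float, rate: float, years: float) -> float:
    """Accrued amount under annual compounding: P * (1 + r) ** t."""
    return principal * (1 + rate) ** years


def exact_doubling_time(rate: float) -> float:
    """Solve (1 + r) ** t = 2 for t, which gives t = ln(2) / ln(1 + r)."""
    return math.log(2) / math.log(1 + rate)


def rule_of_72(rate: float) -> float:
    """Approximate doubling time: 72 divided by the rate in percent.

    For small r, ln(1 + r) is close to r, so t is roughly 0.693 / r.
    The number 72 replaces 69.3 because it divides evenly by many
    common rates and offsets the approximation error at typical rates.
    """
    return 72 / (rate * 100)


print(f"Simple,   $1000 at 5% for 10 years: {simple_amount(1000, 0.05, 10):.2f}")
print(f"Compound, $1000 at 5% for 10 years: {compound_amount(1000, 0.05, 10):.2f}")
print(f"Exact doubling time at 8%: {exact_doubling_time(0.08):.2f} years")
print(f"Rule of 72 at 8%:          {rule_of_72(0.08):.2f} years")
```

At an annual rate of 8%, the exact doubling time and the Rule of 72 both come out to about 9 years, which shows how well the short-cut works at typical rates.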

Check out Ben’s blog, Twitter account (@GardenBenjamin), and Instagram account (@ben.garden) to get more advice about managing your money!

Part 1:

Part 2:

A tip about the word “college” to my American neighbo[u]rs who wish to work in Canada

Canadian English and American English are very similar, allowing Anglophones in both countries to work and live with ease when crossing the border.  However, there is a subtle difference in our vocabularies that can have big consequences for job searches and professional development.  To my American neighbours (or neighbors, as it is spelled in the United States of America), I offer this tip to avoid any confusion.  It concerns our different usages of the words “college” and “university”.

The Peace Arch is a monument situated between Blaine, Washington and Surrey, British Columbia. Near this monument is a major border crossing between the USA and Canada.

Image courtesy of RGB2 from Wikimedia.

Beware of accidental replacement of data sets with PROC SORT in SAS

PROC SORT is a very useful procedure in SAS.  Not only can you sort a data set on one or more variables with it, but you can sort each variable in ascending or descending order, and you can use it to obtain unique observations or duplicated observations.  However, there is a feature about PROC SORT that can be dangerous and deserves emphasis: If you are not careful, you can accidentally replace an existing, valuable data set.

Suppose that you wish to use PROC SORT to get only the duplicated records of a data set.  Here is an example of how to do it.

data heights;
     input Name $ 
           Age 
           Height;
     datalines;
Amy 15 174
Amy 16 177
Bob 14 172
Cam 13 163
Cam 17 181
;
run;

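/* NOUNIQUEKEY keeps only the observations whose BY value occurs more than once. */
/* No OUT= option is given, so the result overwrites HEIGHTS.                    */
proc sort
     data = heights
          nouniquekey;
     by Name;
run;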

proc print
     data = heights;
run;

Obs    Name    Age    Height
1      Amy     15     174
2      Amy     16     177
3      Cam     13     163
4      Cam     17     181

Note that the record for “Bob” is gone from HEIGHTS, because it was a unique observation and, thus, removed in the above PROC SORT step.  Because no OUT= data set was specified, PROC SORT wrote its result back to HEIGHTS itself.

If the original data set is valuable, then this loss can be very damaging, especially if it took a lot of work and time to obtain the original data set.  This shows the danger of accidental replacement of a data set in SAS when using PROC SORT.

Layne Newhouse on representing neural networks – The Central Equilibrium – Episode 4

I am excited to present the first of a multi-episode series on neural networks on my talk show, “The Central Equilibrium”.  My guest in this series is Layne Newhouse, and he talked about how to represent neural networks. We talked about the biological motivations behind neural networks, how to represent them in diagrams and mathematical equations, and a few of the common activation functions for neural networks.

Check it out!