## A Story About Perseverance – Inspiration From My Old Professor

Names and details in this blog post have been altered to protect the privacy of its subjects.

I met my old professor, Dr. Perez, for lunch recently.  We have kept in touch for many years since she taught me during my undergraduate studies, and she has been a good friend and mentor.  We had not seen each other for a few years, but we have been in regular contact over phone and email, exchanging stories, updates, photos of her grandchildren, frustrations, thrills, and perspectives.  It was nice to see her again.

I told her about the accomplishments and the struggles in my early career as a statistician so far.  I am generally satisfied with how I have performed since my entry into the statistics profession, but there are many skills that I lack or need to improve upon.  I want to learn distributed computing and become better at programming in Python, SQL, and Hadoop – skills that are in high demand in my industry but were not taught during my statistics education.  I want to be better at communicating about statistics to non-statisticians – not only helping them to understand difficult concepts, but persuading them to follow my guidance when I know that I am right.  I sometimes even struggle with seemingly basic questions that require much thinking and research on my part to answer.  While all of these are likely common weaknesses that many young statisticians understandably have, they contribute to my occasional feeling of incompetence – and it is not pleasant to perform below my own, my colleagues’, or my industry’s expectations.

Dr. Perez listened and provided helpful observations and advice.  While I am working hard and focusing on my specific problems at the moment, she gave me a broader, more long-term perspective about how best to overcome these struggles, and I really appreciated it.  Beyond this, however, she told me a story about a professor of our mutual acquaintance that stunned and saddened me, yet motivated me to continue to work harder.

## Physical Chemistry Lesson of the Day – Standard Heats of Formation

The standard heat of formation, ΔHfº, of a chemical is the amount of heat absorbed or released from the formation of 1 mole of that chemical at 25 degrees Celsius and 1 bar from its elements in their standard states.  An element is in its standard state if it is in its most stable form and physical state (solid, liquid or gas) at 25 degrees Celsius and 1 bar.

For example, the standard heat of formation for carbon dioxide involves oxygen and carbon as the reactants.  Oxygen is most stable as O2 gas molecules, whereas carbon is most stable as solid graphite.  (Graphite is more stable than diamond under standard conditions.)
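
Written as a formation reaction (using the commonly tabulated textbook value, which I am quoting from memory, so treat the exact number as approximate):

$C(s, \text{graphite}) + O_2(g) \rightarrow CO_2(g), \qquad \Delta H_f^\circ = -393.5 \text{ kJ/mol}$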

To phrase the definition in another way, the standard heat of formation is a special type of standard heat of reaction; the reaction is the formation of 1 mole of a chemical from its elements in their standard states under standard conditions.  The standard heat of formation is also called the standard enthalpy of formation (even though it really is a change in enthalpy).

By definition, the formation of an element from itself would yield no change in enthalpy, so the standard heat of formation of any element in its standard state is zero.

## Machine Learning Lesson of the Day – Introduction to Linear Basis Function Models

Given a supervised learning problem of using $p$ inputs ($x_1, x_2, ..., x_p$) to predict a continuous target $Y$, the simplest model to use would be linear regression.  However, what if we know that the relationship between the inputs and the target is non-linear, but we are unsure of exactly what form this relationship has?

One way to overcome this problem is to use linear basis function models.  These models assume that the target is a linear combination of a set of $p + 1$ basis functions (counting the constant function $\phi_0 = 1$ that accompanies the intercept $w_0$).

$Y = w_0 + w_1 \phi_1(x_1) + w_2 \phi_2(x_2) + ... + w_p \phi_p(x_p)$

This is a generalization of linear regression that essentially replaces each input with a function of the input.  (A linear basis function model that uses the identity function is just linear regression.)
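
For a concrete sketch in R (my own simulated example, with a cubic-polynomial basis chosen purely for illustration), note that the model stays linear in the weights, so ordinary least squares still applies:

```r
# A linear basis function model with a cubic-polynomial basis: non-linear in
# the input x, but still linear in the weights, so lm() can fit it directly.
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)   # simulated, non-linear relationship

fit <- lm(y ~ x + I(x^2) + I(x^3))   # basis functions: x, x^2, x^3
summary(fit)$coefficients            # the estimated weights w_0, w_1, w_2, w_3
```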

The type of basis functions (i.e. the type of function given by $\phi$) is chosen to suitably model the non-linearity in the relationship between the inputs and the target.  It also needs to be chosen so that the computation is efficient.  I will discuss variations of linear basis function models in a later Machine Learning Lesson of the Day.

## Applied Statistics Lesson of the Day – Additive Models vs. Interaction Models in 2-Factor Experimental Designs

In a recent “Machine Learning Lesson of the Day”, I discussed the difference between a supervised learning model in machine learning and a regression model in statistics.  In that lesson, I mentioned that a statistical regression model usually consists of a systematic component and a random component.  Today’s lesson strictly concerns the systematic component.

An additive model is a statistical regression model in which the systematic component is the arithmetic sum of the individual effects of the predictors.  Consider the simple case of an experiment with 2 factors.  If $Y$ is the response and $X_1$ and $X_2$ are the 2 predictors, then an additive linear model for the relationship between the response and the predictors is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

In other words, the effect of $X_1$ on $Y$ does not depend on the value of $X_2$, and the effect of $X_2$ on $Y$ does not depend on the value of $X_1$.

In contrast, an interaction model is a statistical regression model in which the systematic component is not the arithmetic sum of the individual effects of the predictors.  In other words, the effect of $X_1$ on $Y$ depends on the value of $X_2$, or the effect of $X_2$ on $Y$ depends on the value of $X_1$.  Thus, such a regression model would have 3 effects on the response:

1. $X_1$
2. $X_2$
3. the interaction effect of $X_1$ and $X_2$

For example, a full factorial design with 2 factors uses the 2-factor ANOVA model and assumes a linear relationship between the response and the above 3 effects.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
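
As a minimal sketch in R (with simulated data and hypothetical effect sizes), the additive and interaction models can be fit and compared directly:

```r
# Simulate a 2-factor experiment in which the 2 factors interact.
set.seed(1)
X1 <- factor(rep(c("low", "high"), each = 20))
X2 <- factor(rep(c("low", "high"), times = 20))
Y  <- 3 + 2 * (X1 == "high") + 1 * (X2 == "high") +
      1.5 * (X1 == "high") * (X2 == "high") + rnorm(40)

additive.model    <- lm(Y ~ X1 + X2)   # systematic component: X1 and X2 only
interaction.model <- lm(Y ~ X1 * X2)   # X1 + X2 + the X1:X2 interaction

# A significant X1:X2 term in this comparison favours the interaction model.
anova(additive.model, interaction.model)
```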

Note that additive models and interaction models are not confined to experimental design; I have merely used experimental design to provide examples for these 2 types of models.

## Physical Chemistry Lesson of the Day – Hess’s Law

Hess’s law states that the change in enthalpy of a multi-stage chemical reaction is just the sum of the changes of enthalpy of the individual stages.  Thus, if a chemical reaction can be written as a sum of multiple intermediate reactions, then its change in enthalpy can be easily calculated.  This is especially helpful for a reaction whose change in enthalpy is difficult to measure experimentally.

Hess’s law is a consequence of the fact that enthalpy is a state function; the path between the reactants and the products is irrelevant to the change in enthalpy – only the initial and final values matter.  Thus, if there is a path for which the intermediate values of $\Delta H$ are easy to obtain experimentally, then their sum equals the $\Delta H$ for the overall reaction.
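
As a classic illustration (using commonly tabulated textbook values that I am quoting from memory), the heat of formation of carbon monoxide is difficult to measure directly because some carbon dioxide always forms as well, but Hess’s law gives it from 2 combustion reactions that are easy to measure:

$C(s) + O_2(g) \rightarrow CO_2(g), \qquad \Delta H^\circ = -393.5 \text{ kJ}$

$CO(g) + \tfrac{1}{2} O_2(g) \rightarrow CO_2(g), \qquad \Delta H^\circ = -283.0 \text{ kJ}$

Subtracting the second reaction from the first gives

$C(s) + \tfrac{1}{2} O_2(g) \rightarrow CO(g), \qquad \Delta H^\circ = -393.5 - (-283.0) = -110.5 \text{ kJ}$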

## Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it makes fewer assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and for which no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.

## Applied Statistics Lesson of the Day – The Full Factorial Design

An experimenter may seek to determine the causal relationships between $G$ factors and the response, where $G > 1$.  On first instinct, you may be tempted to conduct $G$ separate experiments, each using the completely randomized design with 1 factor.  Often, however, it is possible to conduct 1 experiment with $G$ factors at the same time.  This is better than the first approach because

• it is faster
• it uses fewer resources to answer the same questions
• the interactions between the $G$ factors can be examined

Such an experiment requires the full factorial design.  After controlling for confounding variables and choosing the appropriate range and number of levels for each factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels.  Such an experiment would result in $2 \times 2 = 4$ treatments, each being a combination of 1 level from the first factor and 1 level from the second factor.  Since this is a full factorial design, experimental units are independently assigned to all treatments.  The 2-factor ANOVA model is commonly used to analyze data from such designs.
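
As a minimal sketch in R (with hypothetical factor names and 20 hypothetical experimental units), the treatment combinations and a random assignment can be generated as follows:

```r
# The 4 treatments of a 2 x 2 full factorial design: every combination of the
# 2 levels of factor A with the 2 levels of factor B.
treatments <- expand.grid(factor.A = c("low", "high"),
                          factor.B = c("low", "high"))
treatments

# Independently and randomly assign 20 experimental units (5 per treatment) to the 4 treatments.
set.seed(1)
assignment <- sample(rep(1:4, times = 5))
assignment   # the treatment (row of "treatments") received by each unit
```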

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.

## Physical Chemistry Lesson of the Day – The Perpetual Motion Machine

A thermochemical equation is a chemical equation that also shows the standard heat of reaction.  Recall that the value given by ΔHº is only true when the coefficients of the reactants and the products represent the number of moles of the corresponding substances.

The law of conservation of energy ensures that the standard heat of reaction for the reverse reaction of a thermochemical equation is just the forward reaction’s ΔHº multiplied by -1.  Let’s consider a thought experiment to show why this must be the case.

Imagine that a forward reaction is exothermic with ΔHº = -150 kJ, while its endothermic reverse reaction has ΔHº = +100 kJ.  Then, by carrying out the exothermic forward reaction, 150 kJ is released by the reaction.  Out of that released heat, 100 kJ can be used to fuel the reverse reaction, and 50 kJ can be saved as a “profit” for doing something else, such as moving a machine.  This could be done perpetually, and energy could be created forever – of course, this has never been observed to happen, and the law of conservation of energy prevents such a perpetual motion machine from being made.  Thus, the standard heats of reaction for the forward and reverse reactions of the same thermochemical equation have the same magnitudes but opposite signs.

Regardless of how hard the reverse reaction may be to carry out, its ΔHº can still be written.

## Machine Learning Lesson of the Day – K-Nearest Neighbours Regression

I recently introduced the K-nearest neighbours classifier.  Some slight adjustments to the same algorithm can make it into a regression technique.

Given a training set and a new input $X$, we can predict the target of the new input by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$ by Euclidean distance
2. predicting the target of $X$ as a weighted average of the targets of these K neighbours (see the sketch below)
• the weight of each neighbour is the reciprocal of its distance to $X$
• the weights are normalized to sum to 1 before averaging
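
Here is a minimal sketch of this algorithm in base R, using simulated 1-dimensional data and reciprocal-distance weights (all names and numbers below are purely illustrative):

```r
# K-nearest neighbours regression: predict the target of a new input as a
# weighted average of the targets of its K nearest neighbours.
set.seed(1)
x.train <- runif(50, 0, 10)
y.train <- sin(x.train) + rnorm(50, sd = 0.1)

knn.predict <- function(x.new, x.train, y.train, k = 5) {
  distances  <- abs(x.train - x.new)                 # Euclidean distance in 1 dimension
  neighbours <- order(distances)[1:k]                # indices of the K closest points
  weights    <- 1 / (distances[neighbours] + 1e-08)  # reciprocal-distance weights
  sum(weights * y.train[neighbours]) / sum(weights)  # normalized weighted average
}

knn.predict(2.5, x.train, y.train, k = 5)
```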

Validation or cross-validation can be used to determine the best value of K.

## Useful Functions in R for Manipulating Text Data

#### Introduction

In my current job, I study HIV at the genetic and biochemical levels.  Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text.  (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.)  In this post, I describe some common functions in R that I often use for text processing.

#### Obtaining Basic Information about Character Variables

In R, I often work with text data in the form of character variables.  To check if a variable is a character variable, use the is.character() function.

> year = 2014
> is.character(year)
[1] FALSE

If a variable is not a character variable, you can convert it to a character variable using the as.character() function.

> year.char = as.character(year)
> is.character(year.char)
[1] TRUE

A basic piece of information about a character variable is the number of characters that exist in this string.  Use the nchar() function to obtain this information.

> nchar(year.char)
[1] 4


## Applied Statistics Lesson of the Day – The Matched-Pair (or Paired) t-Test

My last lesson introduced the matched pairs experimental design, which is a special type of the randomized blocked design.  Let’s now talk about how to analyze the data from such a design.

Since the experimental units are organized in pairs, the responses of the 2 treatment groups are not independent – each pair (block) contributes one response to each group.  (The units within each pair are independently assigned – returning to the glove example, one hand is randomly chosen to wear the nitrile glove, while the other wears the latex glove.)  Because of this lack of independence, the independent 2-sample t-test is not applicable.  Instead, use the matched-pair t-test (also called the paired or paired-difference t-test).  This is really a 1-sample t-test on the within-pair differences between the responses of the experimental and the control groups.
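
Here is a minimal sketch of the test in R, using hypothetical durability scores for the nitrile and latex gloves worn by the same 30 students:

```r
# Hypothetical durability scores; each student is one pair (block).
set.seed(1)
nitrile <- rnorm(30, mean = 7.5, sd = 1)
latex   <- rnorm(30, mean = 7.0, sd = 1)

# The paired t-test is equivalent to a 1-sample t-test on the within-pair differences.
t.test(nitrile, latex, paired = TRUE)
t.test(nitrile - latex)
```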

## Physical Chemistry Lesson of the Day – State Functions vs. Path Functions

Today’s lesson may seem mundane, but despite its subtlety, it is actually quite important.  I needed to spend some time to learn it and digest it, and it was time well spent – these concepts are essential for understanding much of thermodynamics.  For brevity, I have not delved into the detailed mathematics of exact differentials, though I highly recommend that you learn it and review the necessary calculus.

Some thermodynamic properties of a system can be described by state variables, while others can be described by path variables.

A state variable is a variable that depends only on the final and initial states of a system and not on the path connecting these states.  Internal energy and enthalpy are examples of state functions.  For example, in a previous post on the First Law of Thermodynamics, I defined the change in internal energy, $\Delta U$, as

$\Delta U = \int_{i}^{f} dU = U_f - U_i$.

State variables can be calculated by exact differentials.

A path variable is a variable that depends on the sequence of steps that takes the system from the initial state to the final state.  This sequence of steps is called the path.  Heat and work are examples of path variables.  Path variables cannot be calculated by exact differentials.  In fact, the following quantities may seem to have plausible interpretations, but they actually do not exist:

• change in heat ($\Delta q$)
• initial heat ($q_i$)
• final heat ($q_f$)
• change in work ($\Delta w$)
• initial work ($w_i$)
• final work ($w_f$)

There is no such thing as heat or work being possessed by a system.  Heat and work can be transferred between the system and the surroundings, but the end result is an increase or decrease in internal energy; neither the system nor the surroundings possesses heat or work.

A state/path variable is also often called a state/path function or a state/path quantity.

## Simon Fraser University Outstanding Alumni Awards – Wednesday, February 26, 2014 @ Four Seasons Hotel in Vancouver

I am delighted to have been invited to attend the SFU Outstanding Alumni Awards Dinner on this coming Wednesday, February 26!  I am grateful to the SFU Library for inviting me to this wonderful event to celebrate some remarkable graduates from my alma mater.  (I volunteered as a Learning and Writing Skills Peer Educator in the SFU Library’s Student Learning Commons for 7 years.  This was one of the most valuable experiences of my undergraduate education, and it is a pleasure to participate in this event with the Library.)

If you will attend this event, please do come to the Library’s table and say “Hello!”

Event Date:
Wednesday, February 26, 2014

Time:
Reception: 6:00pm
Dinner + Awards: 6:45pm

Location:
Four Seasons Hotel
791 West Georgia Street
Vancouver, BC

## Machine Learning Lesson of the Day: The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input $X$ by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$
2. counting the number of “neighbours” that belong to each class of the target variable
3. classifying $X$ by the most common class to which its neighbours belong

K is usually chosen to be an odd number to avoid ties in the vote (an odd K guarantees no ties when there are 2 classes).

The proximity of the neighbours to $X$ is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best value of K.
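
Here is a minimal sketch of the KNN classifier in R, using the knn() function from the “class” package (a recommended package that ships with R) and the built-in iris data; the 100-row training split and K = 5 are arbitrary choices:

```r
library(class)

set.seed(1)
train.rows <- sample(1:nrow(iris), 100)
train.x <- iris[train.rows, 1:4]      # the 4 numeric inputs
test.x  <- iris[-train.rows, 1:4]
train.y <- iris$Species[train.rows]   # the target classes

# Classify each test flower by the most common class among its 5 nearest neighbours.
predicted <- knn(train = train.x, test = test.x, cl = train.y, k = 5)
table(predicted, actual = iris$Species[-train.rows])   # confusion matrix
```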

## Applied Statistics Lesson of the Day – The Matched Pairs Experimental Design

The matched pairs design is a special type of the randomized blocked design in experimental design.  It has only 2 treatment levels (i.e. there is 1 factor, and this factor is binary), and a blocking variable divides the $n$ experimental units into $n/2$ pairs.  Within each pair (i.e. each block), the experimental units are randomly assigned to the 2 treatment groups (e.g. by a coin flip).  The experimental units are divided into pairs such that homogeneity is maximized within each pair.

For example, a lab safety officer wants to compare the durability of nitrile and latex gloves for chemical experiments.  She wants to conduct an experiment with 30 nitrile gloves and 30 latex gloves to test her hypothesis.  She does her best to draw a random sample of 30 students in her university for her experiment, and they all perform the same organic synthesis using the same procedures to see which type of gloves lasts longer.

She could use a completely randomized design so that a random sample of 30 hands get the 30 nitrile gloves, and the other 30 hands get the 30 latex gloves.  However, since lab habits are unique to each person, this poses a confounding variable – durability can be affected by both the material and a student’s lab habits, and the lab safety officer only wants to study the effect of the material.  Thus, a randomized block design should be used instead so that each student acts as a blocking variable – 1 hand gets a nitrile glove, and 1 hand gets a latex glove.  Once the gloves have been given to the student, the type of glove is randomly assigned to each hand; some may get the nitrile glove on their left hand, and some may get it on their right hand.  Since this design involves one binary factor and blocks that divide the experimental units into pairs, this is a matched pairs design.
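
Here is a minimal sketch in R of the within-pair randomization (the layout is purely illustrative): for each of the 30 students, a coin flip decides which hand wears the nitrile glove.

```r
# For each student (pair/block), randomly assign the nitrile glove to one hand;
# the latex glove goes on the other hand.
set.seed(1)
nitrile.hand <- sample(c("left", "right"), size = 30, replace = TRUE)
latex.hand   <- ifelse(nitrile.hand == "left", "right", "left")
head(data.frame(student = 1:30, nitrile.hand, latex.hand))
```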

## Physical Chemistry Lesson of the Day – Standard Heats of Reaction

The change in enthalpy of a chemical reaction indicates how much heat is absorbed or released by the system.  This is valuable information in chemistry, because the exchange of heat affects the reaction conditions and the surroundings, and that needs to be managed and taken into account – in theory, in the laboratory, in industry, or in nature in general.

Chemists often want to compare the changes in enthalpy between different reactions.  Since changes in enthalpy depend on both temperature and pressure, we need to control for these 2 confounding variables by using a reference set of temperature and pressure.  This set of conditions is called the standard conditions, and it sets the standard temperature at 298.15 K (25 degrees Celsius) and the standard pressure at 1 bar.  (IUPAC changed the definition of standard pressure from 1 atmosphere to 1 bar in 1982.  The actual difference in pressure between these 2 definitions is very small.)

The standard enthalpy of reaction (or standard heat of reaction) is the change in enthalpy of a chemical reaction under standard conditions; the actual number of moles are specified by the coefficients of the balanced chemical equation.  (Since enthalpy is an extensive property, the same reaction under standard conditions could have different changes in enthalpy with different amounts of the reactants and products.  Thus, the number of moles of the reaction must be standardized somehow when defining the standard enthalpy of reaction.)  The standard enthalpy of reaction has the symbol ΔHº; the º symbol indicates the standard conditions.
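
As a concrete example (a commonly tabulated textbook value that I am quoting from memory, so treat the exact number as approximate), the complete combustion of 1 mole of methane under standard conditions releases roughly 890 kJ of heat:

$CH_4(g) + 2 O_2(g) \rightarrow CO_2(g) + 2 H_2O(l), \qquad \Delta H^\circ \approx -890 \text{ kJ}$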

## The Chemical Statistician Celebrates its 1st Anniversary!

Dear Readers of The Chemical Statistician:

Yesterday was the 1st anniversary of The Chemical Statistician, and I am so glad that you have taken the time to visit my blog and allow me to share my passion about statistics, chemistry and machine learning with you.  I started this blog as a way to learn new things, document what I know, and understand concepts on a deeper level.  Little did I know that viewers from 146 countries would pay more than 34,000 visits to my blog in its first year!  (This does not include the thousands of views on my syndicated blog posts on AnalyticBridge!)

Since the beginning of 2014, I have significantly expanded the scope of The Chemical Statistician.  In addition to my classic and popular tutorials that often come with in-depth explanations and fully commented programming scripts, it now features

## A Pause from Blogging and a Call for Ideas – Improving The Chemical Statistician for its 2nd Year and Beyond.

Dear Readers of The Chemical Statistician,

Thank you all for your loyal viewership of my blog!  As The Chemical Statistician approaches its 1-year anniversary (February 17), I would like to take 2 weeks off from writing and review the direction in which my blog is going.  I am very proud of what I have accomplished so far, especially with this year’s expansion into daily lessons on weekdays, advice columns, and a new YouTube video channel.  However, there is a lot more that I want to accomplish with this blog, and I want to hear from you in this regard.  What do you like or dislike?  What topics would you like me to cover?  What format or delivery would you like me to add to the existing presentation of my blog?

Your comments are welcomed and most appreciated.  The wonderful feedback that I have received from you via LinkedIn, Twitter, AnalyticBridge, and this blog’s comments has been delightfully overwhelming.  Your positive encouragement provides a lot of motivation for me to continue and expand this blog, so thank you all for your generous support!

Eric

## Physical Chemistry Lesson of the Day – Intensive vs. Extensive Properties

An extensive property is a property that depends on the size of the system.  Examples include mass, volume, the number of moles, internal energy, and enthalpy.

An intensive property is a property that does not depend on the size of the system.  Examples include temperature, pressure, density, and molar enthalpy.

As you can see, some intensive properties can be derived from extensive properties by dividing an extensive property by the mass, volume, or number of moles of the system.

## Machine Learning Lesson of the Day – Overfitting

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set.  That underlying trend cannot be precisely captured because of the random variation in the data around that trend.  A model must have enough complexity to capture that trend, but not too much complexity to capture the random variation.  An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

• A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target.  If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation underlying the trend in the target has been captured already.  Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise.  Thus, the new model with these added irrelevant predictors describes the trend and the noise.  It predicts the targets in the training set extremely well, but very poorly for targets in any new, fresh data set – the model captures the noise that is unique to the training set.

(This above explanation used a parametric model for illustration, but overfitting can also occur for non-parametric models.)
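
As a minimal sketch in R (with simulated data and an arbitrarily chosen degree-10 polynomial standing in for the irrelevant predictors), the overfitted model wins on the training set but loses on a fresh test set:

```r
# The true relationship is a straight line plus noise.
set.seed(1)
x.train <- runif(30, 0, 10);  y.train <- 2 + 0.5 * x.train + rnorm(30)
x.test  <- runif(30, 0, 10);  y.test  <- 2 + 0.5 * x.test  + rnorm(30)

simple.fit  <- lm(y.train ~ x.train)            # about the right complexity
complex.fit <- lm(y.train ~ poly(x.train, 10))  # overly complex: captures the noise

# Training mean squared error: the complex model looks better.
mean(residuals(simple.fit)^2)
mean(residuals(complex.fit)^2)

# Test mean squared error: the complex model does much worse on fresh data.
mean((y.test - predict(simple.fit,  data.frame(x.train = x.test)))^2)
mean((y.test - predict(complex.fit, data.frame(x.train = x.test)))^2)
```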

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the predictors would result in vastly different predictions for the targets.

Overfitting differs from multicollinearity, which I will explain in a later post.  Overfitting involves irrelevant predictors, whereas multicollinearity involves redundant predictors.