## Physical Chemistry Lesson of the Day – Intensive vs. Extensive Properties

An extensive property is a property that depends on the size of the system.  Examples include

An intensive property is a property that does not depend on the size of the system.  Examples include

As you can see, some intensive properties can be derived from extensive properties by dividing an extensive property by the mass, volume, or number of moles of the system.
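As a minimal sketch of that last point, the Python snippet below divides one extensive property by another to get an intensive one. Mass, volume and density are used here as assumed examples, and the numbers are made up for illustration.

```python
def density(mass_g, volume_ml):
    """Intensive property obtained by dividing one extensive property by another."""
    return mass_g / volume_ml

# One sample of water, and the same sample doubled in size (illustrative numbers).
small = {"mass_g": 50.0, "volume_ml": 50.0}
large = {key: 2 * value for key, value in small.items()}

# Mass and volume double with the size of the system, but the density does not change.
print(small, density(**small))
print(large, density(**large))
```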

## Machine Learning Lesson of the Day – Overfitting

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set.  That underlying trend cannot be precisely captured because of the random variation in the data around that trend.  A model must have enough complexity to capture that trend, but not so much complexity that it also captures the random variation.  An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

• A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target.  If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation underlying the trend in the target has been captured already.  Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise.  Thus, the new model with these added irrelevant predictors describes the trend and the noise.  It predicts the targets in the training set extremely well, but very poorly for targets in any new, fresh data set – the model captures the noise that is unique to the training set.

(The explanation above used a parametric model for illustration, but overfitting can also occur for non-parametric models.)

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the predictors would result in vastly different predictions for the targets.
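The linear regression example can be sketched in plain Python. Everything in this snippet is an illustrative assumption: the simulated data, the seed, the 12 irrelevant predictors, and the helper names `solve`, `fit_ols`, `mse` and `make_data`. The only result that is guaranteed is that the larger model fits the training set at least as well; its error on fresh data is typically worse, but the exact numbers depend on the simulated sample.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def mse(rows, ys, coef):
    """Mean squared error of the fitted linear model on the given data."""
    return sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, y in zip(rows, ys)) / len(ys)

def make_data(n, n_noise, rng):
    """Target depends on x only; the remaining columns are pure noise."""
    rows, ys = [], []
    for _ in range(n):
        x = rng.uniform(0, 1)
        junk = [rng.gauss(0, 1) for _ in range(n_noise)]
        rows.append([1.0, x] + junk)  # intercept, relevant predictor, irrelevant predictors
        ys.append(1.0 + 2.0 * x + rng.gauss(0, 0.5))
    return rows, ys

rng = random.Random(42)
train_rows, train_y = make_data(20, 12, rng)
test_rows, test_y = make_data(200, 12, rng)

results = {}
for label, width in [("relevant only", 2), ("with irrelevant predictors", 14)]:
    train_subset = [r[:width] for r in train_rows]
    test_subset = [r[:width] for r in test_rows]
    coef = fit_ols(train_subset, train_y)
    results[label] = (mse(train_subset, train_y, coef), mse(test_subset, test_y, coef))
    print(label, "train MSE, test MSE:", results[label])
```

Comparing the two printed pairs shows the pattern described above: the extra predictors can only lower the training error, because the only variation left for them to explain is noise.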

Overfitting differs from multicollinearity, which I will explain in a later post.  Overfitting has irrelevant predictors, whereas multicollinearity has redundant predictors.

## Applied Statistics Lesson of the Day – Blocking and the Randomized Complete Blocked Design (RCBD)

A completely randomized design works well for a homogeneous population – one that does not have major differences between any sub-populations.  However, what if a population is heterogeneous?

Consider an example that commonly occurs in medical studies.  An experiment seeks to determine the effectiveness of a drug on curing a disease, and 100 patients are recruited for this double-blinded study – 50 are men, and 50 are women.  An abundance of biological knowledge tells us that men and women have significantly different physiologies, and this is a heterogeneous population with respect to gender.  If a completely randomized design is used for this study, gender could be a confounding variable; this is especially true if the experimental group has a much higher proportion of one gender, and the control group has a much higher proportion of the other gender.  (For instance, purely due to the randomness, 45 males may be assigned to the experimental group, and 45 females may be assigned to the control group.)  If a statistically significant difference in the patients’ survival from the disease is observed between such a pair of experimental and control groups, this effect could be attributed to the drug or to gender, and that would ruin the goal of determining the cause-and-effect relationship between the drug and survival from the disease.

To overcome this heterogeneity and control for the effect of gender, a randomized blocked design could be used.  Blocking is the division of the experimental units into homogeneous sub-populations before assigning treatments to them.  A randomized blocked design for our above example would divide the males and females into 2 separate sub-populations, and then each of these 2 groups is split into the experimental and control group.  Thus, the experiment actually has 4 groups:

1. 25 men take the drug (experimental)
2. 25 men take a placebo (control)
3. 25 women take the drug (experimental)
4. 25 women take a placebo (control)

Essentially, the population is divided into blocks of homogeneous sub-populations, and a completely randomized design is applied to each block.  This minimizes the effect of gender on the response and increases the precision of the estimate of the effect of the drug.
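The blocked randomization for the 100 patients above can be sketched in a few lines of Python. The function name `randomize_within_blocks`, the patient labels and the seed are hypothetical choices for illustration, not part of any particular study protocol.

```python
import random

def randomize_within_blocks(units, block_of, treatments, seed=0):
    """Group units into blocks, then randomly assign treatments within each block."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # a completely randomized design inside the block
        for position, unit in enumerate(shuffled):
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# 50 men and 50 women, labelled by gender and an ID number.
patients = [("M", i) for i in range(50)] + [("F", i) for i in range(50)]
assignment = randomize_within_blocks(patients, lambda p: p[0], ["drug", "placebo"])

# Count how many patients of each gender ended up in each arm.
counts = {}
for (gender, _), arm in assignment.items():
    counts[(gender, arm)] = counts.get((gender, arm), 0) + 1
print(counts)
```

Because the shuffling happens separately inside each block, every gender is split evenly between the drug and the placebo, which a single randomization over all 100 patients does not guarantee.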

## Physical Chemistry Lesson of the Day – The Effect of Temperature on Changes in Internal Energy and Enthalpy

When the temperature of a system increases, the kinetic and potential energies of the atoms and molecules in the system increase.  Thus, the internal energy of the system increases, which means that the enthalpy of the system increases – this is true under constant pressure or constant volume.

Recall that the heat capacity of a system is the amount of energy that is required to raise the system’s temperature by 1 kelvin.  Under constant pressure, the heat absorbed by the system in a thermodynamic process is the increase in enthalpy of the system, so the heat capacity is just the change in enthalpy divided by the change in temperature.

$C = \Delta H \div \Delta T$.

## Video Tutorial: Breaking Down the Definition of the Hazard Function

The hazard function is a fundamental quantity in survival analysis.  For an event occurring at some time on a continuous time scale, the hazard function, $h(t)$, for that event is defined as

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$,

where

• $t$ is the time,
• $X$ is the time of the occurrence of the event.

However, what does this actually mean?  In this YouTube video, I break down the mathematics of this definition into its individual components and explain the intuition behind each component.

I am very excited about the release of this first video in my new YouTube channel!  This is yet another mode of expansion of The Chemical Statistician since the beginning of 2014.  As always, your comments are most appreciated!
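To make the definition of $h(t)$ concrete, here is a small numerical sketch in Python. It assumes, purely for illustration, that $X$ follows an exponential distribution with rate 0.5, whose survival function is $P(X > t) = e^{-0.5t}$; the function names are hypothetical. The conditional probability in the definition is computed for a small $\Delta t$ and divided by $\Delta t$.

```python
import math

def survival(t, rate):
    """P(X > t) for an exponential event time with the given rate (an assumed example)."""
    return math.exp(-rate * t)

def approx_hazard(t, rate, dt=1e-6):
    """Approximate h(t) by P(t < X <= t + dt | X > t) / dt for a small dt."""
    conditional = (survival(t, rate) - survival(t + dt, rate)) / survival(t, rate)
    return conditional / dt

rate = 0.5
for t in (0.1, 1.0, 5.0):
    print(t, approx_hazard(t, rate))
```

For this particular distribution, the approximation stays close to the rate at every $t$, which reflects the constant hazard of the exponential distribution.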

## Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand.  These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations.  This implies that a model that explains a certain situation well may fail in another situation.  In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem.  The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.
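The last point, that one model can be trained by more than one algorithm, can be sketched in Python for simple linear regression with a single predictor. The data, the learning rate and the number of steps are illustrative assumptions; the closed-form solution below is what the normal equations reduce to when there is one predictor and an intercept.

```python
def fit_closed_form(xs, ys):
    """Solve the normal equations for one predictor plus an intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def fit_gradient_descent(xs, ys, learning_rate=0.5, steps=5000):
    """Minimize half the mean squared error by batch gradient descent."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = sum(residuals) / n
        grad_slope = sum(r * x for r, x in zip(residuals, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Predictor on [0, 1]; target is a line plus a small deterministic wiggle.
xs = [i / 20 for i in range(21)]
ys = [1.0 + 2.0 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]

exact = fit_closed_form(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print(exact, iterative)
```

Both algorithms arrive at essentially the same intercept and slope here; the trade-off between them is in speed and scalability, not in the model being fitted.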

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.

## Statistics and Chemistry Lesson of the Day – Illustrating Basic Concepts in Experimental Design with the Synthesis of Ammonia

To summarize what we have learned about experimental design in the past few Applied Statistics Lessons of the Day, let’s use an example from physical chemistry to illustrate these basic principles.

Ammonia (NH3) is widely used as a fertilizer in industry.  It is commonly synthesized by the Haber process, which involves a reaction between hydrogen gas and nitrogen gas.

N2 + 3 H2 → 2 NH3   (ΔH = −92.4 kJ·mol−1)

Recall that ΔH is the change in enthalpy.  Under constant pressure (which is the case for most chemical reactions), ΔH is the heat absorbed or released by the system.

## Physical Chemistry Lesson of the Day – The Difference Between Changes in Enthalpy and Changes in Internal Energy

Let’s examine the difference between a change in enthalpy and a change in internal energy.  It helps to think of the following 2 scenarios.

• If the chemical reaction releases a gas but occurs at constant volume, then there is no pressure-volume work.  The only way for energy to be transferred between the system and the surroundings is through heat.  An example of a system under constant volume is a bomb calorimeter.  In this case,

$\Delta H = \Delta U + P \Delta V = \Delta U + 0 = q - w + 0 = q - 0 + 0 = q$

This heat is denoted as $q_v$ to indicate that this is heat transferred under constant volume.  In this case, the change in enthalpy is the same as the change in internal energy.

• If the chemical reaction releases a gas and occurs at constant pressure, then energy can be transferred between the system and the surroundings through heat and/or work.  Thus, with $w = P \Delta V$ denoting the work done by the system on its surroundings,

$\Delta H = \Delta U + P \Delta V = q - w + P \Delta V = q$

This heat is denoted as $q_p$ to indicate that this is heat transferred under constant pressure.  Thus, as the gas forms inside the cylinder, the piston pushes against the constant pressure that the atmosphere exerts on it.  The total energy released by the chemical reaction allows some energy to be used for the pressure-volume work, with the remaining energy being released via heat.  (Recall that these are the 2 ways for internal energy to be changed according to the First Law of Thermodynamics.)  Thus, the difference between enthalpy and internal energy arises under constant pressure – the difference is the pressure-volume work.
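A quick numerical sketch of the constant-pressure case, in Python with made-up numbers: a hypothetical reaction releases 100 kJ of heat while the gas that it produces expands by 2.0 L against atmospheric pressure (101.325 kPa). The function name is an assumption for illustration.

```python
def internal_energy_change(delta_h_j, pressure_kpa, delta_v_l):
    """Rearrange Delta H = Delta U + P * Delta V to get Delta U, in joules."""
    pv_work_j = pressure_kpa * delta_v_l  # 1 kPa x 1 L = 1 J
    return delta_h_j - pv_work_j

delta_h = -100_000.0  # heat released at constant pressure, q_p, in joules
delta_u = internal_energy_change(delta_h, pressure_kpa=101.325, delta_v_l=2.0)

print("Delta H (J):", delta_h)
print("Delta U (J):", delta_u)
print("Difference, P * Delta V (J):", delta_h - delta_u)
```

The internal energy falls by slightly more than the heat released, and the gap between the 2 quantities is the pressure-volume work.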

Reactions under constant pressure are often illustrated by a reaction that releases a gas in a cylinder with a movable piston, but they are actually quite common.  In fact, in chemistry, reactions under constant pressure are much more common than reactions under constant volume.  Chemical reactions often happen in beakers, flasks or any container open to the constant pressure of the atmosphere.

## Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

#### Introduction

Continuing on the recently born series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the $Beta(2, 5)$ distribution actually integrates to 1 over its support set.  This post follows from my previous post on trapezoidal integration.

Image courtesy of Qef from

#### Conceptual Background of Rectangular Integration (a.k.a. The Midpoint Rule)

Rectangular integration is a numerical integration technique that approximates the integral of a function by the total area of one or more rectangles under the curve.  Here are its features:

• The rectangle’s width is determined by the interval of integration.
• One rectangle could span the width of the interval of integration and approximate the entire integral.
• Alternatively, the interval of integration could be sub-divided into $n$ smaller intervals of equal lengths, and $n$ rectangles would be used to approximate the integral; each smaller rectangle has the width of the smaller interval.
• The rectangle’s height is the function’s value at the midpoint of its base.
• Within a fixed interval of integration, the approximation becomes more accurate as more rectangles are used; each rectangle becomes narrower, and the height of the rectangle better captures the values of the function within that interval.
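The features above translate almost directly into code. The following is a minimal Python sketch, not the R function from the full post; the names `midpoint_rule` and `beta_pdf` are assumptions for illustration. It applies the rule to the $Beta(2, 5)$ density over its support set, $[0, 1]$.

```python
import math

def midpoint_rule(f, a, b, n):
    """Approximate the integral of f over [a, b] with n rectangles of equal width."""
    width = (b - a) / n
    # Each rectangle's height is the function's value at the midpoint of its base.
    return width * sum(f(a + (i + 0.5) * width) for i in range(n))

def beta_pdf(x, alpha=2.0, beta=5.0):
    """Density of the Beta(alpha, beta) distribution on [0, 1]."""
    constant = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return constant * x ** (alpha - 1) * (1 - x) ** (beta - 1)

for n in (1, 10, 100, 1000):
    print(n, midpoint_rule(beta_pdf, 0.0, 1.0, n))
```

The printed values approach 1 as the number of rectangles grows, in line with the last bullet point above.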

## Machine Learning Lesson of the Day – Cross-Validation

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well.  However, what if the data set is not very large, and the small amount of data for training results in high sampling error?  A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

1. Divide the data set into K partitions.
2. Designate the first partition as the validation set and designate the other partitions as the training set.
3. Use the training set to train the algorithm.
4. Use the validation set to assess the predictive accuracy of the algorithm; the common measure of predictive accuracy is mean squared error.
5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition.  (Essentially, rotate the designation of validation set through every partition.)
6. Calculate the average of the mean squared error from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model.  Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.
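Steps 1 to 6 and the final comparison can be sketched in plain Python as follows. The 2 candidate models (a mean-only model and a straight line), the simulated data, the seeds and all function names are illustrative assumptions; the separate test set is left out to keep the sketch short.

```python
import random

def k_fold_partitions(n, k, seed=0):
    """Step 1: shuffle the indices 0..n-1 and deal them into k partitions."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Candidate model A: ignore x and always predict the training mean."""
    y_bar = sum(ys) / len(ys)
    return lambda x: y_bar

def fit_line(xs, ys):
    """Candidate model B: simple linear regression by least squares."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def cross_validated_mse(fit, xs, ys, k=5, seed=0):
    """Average the validation mean squared error over all k partitions."""
    fold_errors = []
    for validation in k_fold_partitions(len(xs), k, seed):  # Steps 2 and 5
        held_out = set(validation)
        train = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])  # Step 3
        squared_errors = [(ys[i] - predict(xs[i])) ** 2 for i in validation]
        fold_errors.append(sum(squared_errors) / len(validation))  # Step 4
    return sum(fold_errors) / k  # Step 6

# Simulated data with a linear trend plus noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

scores = {
    "mean only": cross_validated_mse(fit_mean, xs, ys),
    "straight line": cross_validated_mse(fit_line, xs, ys),
}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Using the same seed for both models means that they are validated on identical partitions, so the comparison of their average mean squared errors is like-for-like.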

If there are N observations in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).

There are some trade-offs between a large and a small K.  The estimator for the prediction error from a larger K results in

• less bias because of more data being used for training
• higher variance because of the higher similarity and lower diversity between the training sets
• slower computation because of more data being used for training

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Pages 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

## Applied Statistics Lesson of the Day – The Completely Randomized Design with 1 Factor

The simplest experimental design is the completely randomized design with 1 factor.  In this design, each experimental unit is randomly assigned to a factor level.  This design is most useful for a homogeneous population (one that does not have major differences between any sub-populations).  It is appealing because of its simplicity and flexibility – it can be used for a factor with any number of levels, and different treatments can have different sample sizes.  After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.  The means of the response variable in the different groups are compared; if there are significant differences, then there is evidence to suggest that the factor and the response have a causal relationship.  The single-factor analysis of variance (ANOVA) model is most commonly used to analyze the data in such an experiment, but it does assume that the data in each group have a normal distribution, and that all groups have equal variance.  The Kruskal-Wallis test is a non-parametric alternative to ANOVA in analyzing data from single-factor completely randomized experiments.

If the factor has 2 levels, you may think that an independent 2-sample t-test with equal variance can also be used to analyze the data.  This is true, but the square of the t-test statistic in this case is just the F-test statistic in a single-factor ANOVA with 2 groups.  Thus, the results of these 2 tests are the same.  ANOVA generalizes the independent 2-sample t-test with equal variance to more than 2 groups.
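The equivalence between the 2 test statistics is easy to check numerically. Here is a minimal Python sketch with 2 small made-up groups of unequal size; the function names are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def pooled_t_statistic(a, b):
    """Independent 2-sample t-statistic with a pooled (equal) variance."""
    na, nb = len(a), len(b)
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_variance = (ss_a + ss_b) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_variance * (1 / na + 1 / nb)) ** 0.5

def anova_f_statistic(groups):
    """F-statistic of a single-factor ANOVA: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

group_a = [5.1, 4.9, 6.0, 5.5, 5.8]
group_b = [6.2, 6.8, 7.1, 6.5]

t = pooled_t_statistic(group_a, group_b)
f = anova_f_statistic([group_a, group_b])
print(t ** 2, f)
```

The 2 printed numbers agree up to floating-point rounding, which is the identity $t^2 = F$ for 2 groups.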

Some textbooks state that “random assignment” means random assignment of experimental units to treatments, whereas other textbooks state that it means random assignment of treatments to experimental units.  I don’t think that there is any difference between these 2 definitions, but I welcome your thoughts in the comments.

## Physical Chemistry Lesson of the Day – Heat Capacity

The heat capacity of a system is the amount of heat required to increase the temperature of the system by 1 degree.  Heat is measured in joules (J) in the SI system, and heat capacity depends on both the substance and the amount of it.  To make heat capacities comparable between substances, molar heat capacity or specific heat capacity is often used.

• Molar heat capacity is the amount of heat required to increase the temperature of 1 mole of a substance by 1 degree.
• Specific heat capacity is the amount of heat required to increase the temperature of 1 gram of a substance by 1 degree.

For example, over the range 0 to 100 degrees Celsius (or 273.15 to 373.15 K), 4.18 J of heat on average is required to increase the temperature of 1 gram of water by 1 K.  Thus, the average specific heat capacity of water in that temperature range is 4.18 J/(g·K).
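As a short worked example in Python, the snippet below uses the average specific heat capacity of water quoted above to compute the heat needed to warm a sample, and converts it to a molar heat capacity. The 100 g sample and the 10 K temperature rise are made-up inputs, the molar mass of water is taken as 18.015 g/mol, and the variable and function names are assumptions for illustration.

```python
SPECIFIC_HEAT_WATER = 4.18   # J/(g*K), average over 0 to 100 degrees Celsius
MOLAR_MASS_WATER = 18.015    # g/mol

def heat_required(mass_g, specific_heat, delta_t):
    """Heat in joules needed to change the temperature of a sample by delta_t."""
    return mass_g * specific_heat * delta_t

# Warm 100 g of water by 10 K (for example, from 20 to 30 degrees Celsius).
q = heat_required(mass_g=100.0, specific_heat=SPECIFIC_HEAT_WATER, delta_t=10.0)

# Molar heat capacity = specific heat capacity x molar mass.
molar_heat_capacity = SPECIFIC_HEAT_WATER * MOLAR_MASS_WATER

print("Heat required (J):", q)
print("Molar heat capacity (J/(mol*K)):", molar_heat_capacity)
```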

## Machine Learning Lesson of the Day – Parametric vs. Non-Parametric Models

A machine learning algorithm can be classified as either parametric or non-parametric.

A parametric algorithm has a fixed number of parameters.  A parametric algorithm is computationally faster, but makes stronger assumptions about the data; the algorithm may work well if the assumptions turn out to be correct, but it may perform badly if the assumptions are wrong.  A common example of a parametric algorithm is linear regression.

In contrast, a non-parametric algorithm uses a flexible number of parameters, and the number of parameters often grows as it learns from more data.  A non-parametric algorithm is computationally slower, but makes fewer assumptions about the data.  A common example of a non-parametric algorithm is K-nearest neighbour.

To summarize, the trade-offs between parametric and non-parametric algorithms are in computational cost and accuracy.
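The contrast can be sketched in Python with one example of each kind. The data are made up, with a deliberately curved relationship so that the linearity assumption is wrong; the function names and the choice of K = 3 are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Parametric: the fitted model is always just 2 numbers (intercept, slope)."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def knn_predict(train_x, train_y, x, k=3):
    """Non-parametric: the whole training set is kept and searched at prediction time."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

train_x = [float(i) for i in range(11)]
train_y = [x ** 2 for x in train_x]  # curved truth, so a straight line is the wrong assumption

intercept, slope = fit_line(train_x, train_y)
line_prediction = intercept + slope * 5.0
knn_prediction = knn_predict(train_x, train_y, 5.0)

print("True value at x = 5:", 25.0)
print("Linear regression:", line_prediction)
print("3-nearest neighbours:", knn_prediction)
```

The line is cheap to store and to evaluate, but it misses the curvature; K-nearest neighbour adapts to the curvature at the cost of keeping and searching every training point for each prediction.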

## Applied Statistics Lesson of the Day – Positive Control in Experimental Design

In my recent lesson on controlling for confounders in experimental design, the control group was described as one that received a neutral or standard treatment, and the standard treatment may simply be nothing.  This is a negative control group.  Not all experiments require a negative control group; some experiments instead have a positive control group.

A positive control group is a group of experimental units that receive a treatment that is known to cause an effect on the response.  Such a causal relationship would have been previously established, and its inclusion in the experiment allows a new treatment to be compared to this existing treatment.  Again, both the positive control group and the experimental group experience the same experimental procedures and conditions except for the treatment.  The existing treatment with the known effect on the response is applied to the positive control group, and the new treatment with the unknown effect on the response is applied to the experimental group.  If the new treatment has a causal relationship with the response, both the positive control group and the experimental group should have the same responses.  (This assumes, of course, that the response can only be changed in 1 direction.  If the response can increase or decrease in value (or, more generally, change in more than 1 way), then it is possible for the positive control group and the experimental group to have different responses.)

In short, in an experiment with a positive control group, an existing treatment is known to “work”, and the new treatment is being tested to see if it can “work” just as well or even better.  Experiments to test for the effectiveness of new medical therapies or a disease detector often have positive controls; there are existing therapies or detectors that work well, and the new therapy or detector is being evaluated for its effectiveness.

Experiments with positive controls are useful for ensuring that the experimental procedures and conditions proceed as planned.  If the positive control does not show the expected response, then something is wrong with the experimental procedures or conditions, and any “good” result from the new treatment should be considered with skepticism.

## How to Find a Job in Statistics – Advice for Students and Recent Graduates

#### Introduction

Most of this post focuses on soft skills that are needed to find any job; I dive specifically into advice for statisticians in the last section.  Although the soft skills are general and not specific to statisticians, many employers, veteran statisticians, and professors have told me that students and recent graduates would benefit from the focus on soft skills.  Thus, I discuss them first and leave the statistics-specific advice till the end.

## Physical Chemistry Lesson of the Day – Enthalpy

The enthalpy of a system is the system’s internal energy plus the product of the pressure and the volume of the system.

$H = U + PV$.

Just like internal energy, the enthalpy of a system cannot be measured, but a change in enthalpy can be measured.  Suppose that the only type of work that can be performed on the system is pressure-volume work; this is a realistic assumption in many chemical reactions that occur in a beaker, a flask, or any container that is open to the constant pressure of the atmosphere.  Then, the change in enthalpy of a system is the change in internal energy plus the pressure-volume work done by the system.

$\Delta H = \Delta U + P\Delta V$.

## Machine Learning Lesson of the Day – Babies and Non-Statisticians Practice Unsupervised Learning All the Time!

My recent lesson on unsupervised learning may make it seem like a rather esoteric field, with attempts to categorize it using words like “clustering”, “density estimation”, or “dimensionality reduction”.  However, unsupervised learning is actually how we as human beings often learn about the world that we live in – whether you are a baby learning what to eat or someone reading this blog.

• Babies use their mouths and their sense of taste to explore the world, and they can probably determine what satisfies their hunger and what doesn’t pretty quickly.  As they expose themselves to different objects – a formula bottle, a pacifier, a mother’s breast, their own fingers – their taste and digestive system are recognizing these inputs and detecting patterns of what satisfies their hunger and what doesn’t.  This all happens before they even fully understand what “food” or “hunger” means.  This will probably happen before someone says “This is food” to them and they have the language capacity to know what those 3 words mean.
• When a baby finally realizes what hunger feels like and develops the initiative to find something to eat, then that becomes a supervised learning problem: What attributes about an object will help me to determine if it’s food or not?
• I recently wrote a page called “About this Blog” to categorize the different types of posts that I have written on this blog so far.  I did not aim to predict anything about any blog post; I simply wanted to organize the 50-plus blog posts into a few categories and make it easier for you to find them.  I ultimately clustered my blog posts into 4 mutually exclusive categories (now with some overlaps).  You can think of each blog post as a vector-valued input, and I chose 2 elements – the length and the topic – of each vector to find a way to group them into classes that are very similar in length and topic within each class and very different in length and topic between the classes.  (I used those 2 elements – or features – to maximize the similarities within each category and minimize the similarities between the 4 categories.)  There were other features that I could have used – whether it had an image (binary feature), the number of colours of the fonts (integer-valued feature), the time of publication of the post (continuous feature) – but length and topic were sufficient for me to arrive at the 4 categories of “Tutorials”, “Lessons”, “Advice”, and “Notifications about Presentations and Appearances at Upcoming Events”.

## Applied Statistics Lesson of the Day – Choosing the Range of Levels for Quantitative Factors in Experimental Design

In addition to choosing the number of levels for a quantitative factor in designing an experiment, the experimenter must also choose the range of the levels of the factor.

• If the levels are too close together, then there may not be a noticeable difference in the corresponding responses.
• If the levels are too far apart, then an important trend in the causal relationship could be missed.

Consider the following example of making sourdough bread from Gänzle et al. (1998).  The experimenters sought to determine the relationship between temperature and the growth rates of 2 strains of bacteria and 1 strain of yeast, and they used mathematical models and experimental data to study this relationship.  The plots below show the results for Lactobacillus sanfranciscensis LTH2581 (Panel A) and LTH1729 (Panel B), and Candida milleri LTH H198 (Panel C).  The figures contain the predicted curves (solid and dashed lines) and the actual data (circles).  Notice that, for all 3 organisms,

• the relationship is relatively “flat” in the beginning, so choosing temperatures that are too close together at low temperatures (e.g. 1 and 2 degrees Celsius) would not yield noticeably different growth rates
• the overall relationship between growth rate and temperature is rather complicated, and choosing temperatures that are too far apart might miss important trends.

Once again, the experimenter’s prior knowledge and hypothesis can be very useful in making this decision.  In this case, the experimenters had the benefit of their mathematical models in guiding their hypothesis and choosing the range of temperatures for collecting the data on the growth rates.

#### Reference:

Gänzle, Michael G., Michaela Ehmann, and Walter P. Hammes. “Modeling of growth of Lactobacillus sanfranciscensis and Candida milleri in response to process parameters of sourdough fermentation.” Applied and environmental microbiology 64.7 (1998): 2616-2623.

## Physical Chemistry Lesson of the Day: Pressure-Volume Work

In chemistry, a common type of work is the expansion or compression of a gas under constant pressure.  Recall from physics that pressure is defined as force applied per unit of area.

$P = F \div A$

$P \times A = F$

Consider a chemical reaction that releases a gas as its product inside a sealed cylinder with a movable piston.

Image from Dpumroy via Wikimedia.

As the gas expands inside the cylinder, it pushes against the piston, and work is done by the system against the surroundings.  The atmospheric pressure on the cylinder remains constant while the cylinder expands, and the volume of the cylinder increases as a result.  The volume of the cylinder at any given point is the area of the piston times the length of the cylinder.  The change in volume is equal to the area of the piston times the distance along which the piston was pushed by the expanding gas.

$w = -P \times \Delta V$

$w = -P \times A \times \Delta L$

$w = -F \times \Delta L$

Note that this last line is just the definition of work under constant force in the same direction as the displacement, multiplied by the negative sign to follow the sign convention in chemistry.
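The 3 expressions for $w$ can be checked against each other with a few made-up numbers. In this Python sketch, the piston area and the distance travelled are hypothetical values, and the pressure is 1 standard atmosphere (101,325 Pa).

```python
pressure_pa = 101_325.0   # constant atmospheric pressure, in pascals
piston_area_m2 = 0.01     # hypothetical area of the piston
distance_m = 0.1          # hypothetical distance that the piston is pushed

delta_v_m3 = piston_area_m2 * distance_m   # change in volume
force_n = pressure_pa * piston_area_m2     # F = P x A

work_from_volume = -pressure_pa * delta_v_m3   # w = -P x Delta V
work_from_force = -force_n * distance_m        # w = -F x Delta L

print("w from pressure and volume (J):", work_from_volume)
print("w from force and distance (J):", work_from_force)
```

Both routes give the same work of roughly -101.3 J; the negative sign indicates that the expanding gas does work on the surroundings, so the system loses that energy.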

## Applied Statistics Lesson of the Day – Choosing the Number of Levels for Factors in Experimental Design

The experimenter needs to decide the number of levels for each factor in an experiment.

• For a qualitative (categorical) factor, the number of levels may simply be the number of categories for that factor.  However, because of cost constraints, an experimenter may choose to drop a certain category.  Based on the experimenter’s prior knowledge or hypothesis, the category with the least potential for showing a cause-and-effect relationship between the factor and the response should be dropped.
• For a quantitative (numeric) factor, the number of levels should reflect the cause-and-effect relationship between the factor and the response.  Again, the experimenter’s prior knowledge or hypothesis is valuable in making this decision.
• If the relationship in the chosen range of the factor is hypothesized to be roughly linear, then 2 levels (perhaps the minimum and the maximum) should be sufficient.
• If the relationship in the chosen range of the factor is hypothesized to be roughly quadratic, then 3 levels would be useful.  Often, 3 levels are enough.
• If the relationship in the chosen range of the factor is hypothesized to be more complicated than a quadratic relationship, consider using 4 or more levels.
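Here is a small Python sketch of why 2 levels are not enough for a curved relationship. The response function is hypothetical: a quadratic whose minimum sits in the middle of the chosen range of the factor.

```python
def response(x):
    """Hypothetical quadratic response with its minimum in the middle of the range."""
    return (x - 5.0) ** 2

two_levels = [0.0, 10.0]
three_levels = [0.0, 5.0, 10.0]

# With 2 levels, all that can be estimated is a slope between the 2 end points.
slope = ((response(two_levels[1]) - response(two_levels[0]))
         / (two_levels[1] - two_levels[0]))

# With 3 equally spaced levels, the second difference estimates the curvature.
low, mid, high = (response(x) for x in three_levels)
spacing = three_levels[1] - three_levels[0]
curvature = (low - 2 * mid + high) / spacing ** 2

print("Slope seen with 2 levels:", slope)
print("Curvature seen with 3 levels:", curvature)
```

With only the minimum and the maximum as levels, the estimated slope is 0 and the factor appears to have no effect; adding a third level in the middle reveals the curvature.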