Machine Learning Lesson of the Day – Estimating Coefficients in Linear Gaussian Basis Function Models

Recently, I introduced linear Gaussian basis function models as a suitable modelling technique for supervised learning problems that involve non-linear relationships between the target and the predictors.  Recall that linear basis function models are generalizations of linear regression that regress the target on functions of the predictors, rather than the predictors themselves.  In linear regression, the coefficients are estimated by the method of least squares.  Thus, it is natural that the estimation of the coefficients in linear Gaussian basis function models is an extension of the method of least squares.

The linear Gaussian basis function model is

Y = \Phi \beta + \varepsilon,

where \Phi_{ij} = \phi_j (x_i).  In other words, \Phi is the design matrix, and the element in row i and column j of this design matrix is the j\text{th} basis function evaluated at the i\text{th} value of the predictor.  (In this case, there is 1 predictor per datum.)

Applying the method of least squares, the coefficient vector, \beta, can be estimated by

\hat{\beta} = (\Phi^{T} \Phi)^{-1} \Phi^{T} Y.

Note that this looks like the least-squares estimator for the coefficient vector in linear regression, except that the design matrix is not X, but \Phi.

If you are not familiar with how \hat{\beta} was obtained, I encourage you to review least-squares estimation and the derivation of the estimator of the coefficient vector in linear regression.
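
To see this estimator in action, here is a minimal sketch in R; the simulated data, the number of Gaussian basis functions, and the values of \mu_j and \sigma are all arbitrary choices for illustration, not prescriptions.

set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)

mu <- seq(0, 10, length.out = 9)   # centres of the Gaussian basis functions
sigma <- 1                         # common width parameter

# Design matrix Phi: the element in row i and column j is phi_j evaluated at x_i
Phi <- sapply(mu, function(m) exp(-(x - m)^2 / (2 * sigma^2)))
Phi <- cbind(1, Phi)               # add an intercept column

# Least-squares estimate: beta-hat = (Phi' Phi)^(-1) Phi' y
beta_hat <- solve(crossprod(Phi), crossprod(Phi, y))
fitted_y <- Phi %*% beta_hat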

Applied Statistics Lesson of the Day – Notation for Fractional Factorial Designs

Fractional factorial designs use the L^{F-p} notation; unfortunately, this notation is not clearly explained in most textbooks or web sites about experimental design.  I hope that my explanation below is useful.

  • L is the number of levels in each factor; note that the L^{F-p} notation assumes that all factors have the same number of levels.
    • If a factor has 2 levels, then the levels are usually coded as +1 and -1.
    • If a factor has 3 levels, then the levels are usually coded as +1, 0, and -1.
  • F is the number of factors in the experiment
  • p is the number of times that the full factorial design is fractionated by L.  This number is badly explained by most textbooks and web sites that I have seen, because they simply say that p is the fraction – this is confusing, because a fraction has a numerator and a denominator, and p is just 1 number.  To clarify,
    • the fraction is L^{-p}
    • the number of treatments in the fractional factorial design is L^{-p} multiplied by the total possible number of treatments in the full factorial design, which is L^F.

If all L^F possible treatments are used in the experiment, then a full factorial design is used.  If a fractional factorial design is used instead, then L^{-p} denotes the fraction of the L^F treatments that is used.
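
To make this arithmetic concrete, here is a small sketch in R for a hypothetical 2^{5-2} design; the numbers of levels and factors are chosen purely for illustration.

n_levels  <- 2   # L: the number of levels per factor
n_factors <- 5   # F: the number of factors
p         <- 2   # the number of times that the full factorial design is fractionated by L

n_levels^n_factors         # number of treatments in the full factorial design: 32
n_levels^(-p)              # fraction of treatments that is used: 1/4
n_levels^(n_factors - p)   # number of treatments in the fractional factorial design: 8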

Most factorial experiments use binary factors (i.e. factors with 2 levels, L = 2).  Thus,

  • if p = 1, then the fraction of treatments that is used is 2^{-1} = 1/2.
  • if p = 2, then the fraction of treatments that is used is 2^{-2} = 1/4.

This is why

  • a 2^{F-1} design is often called a half-fraction design.
  • a 2^{F-2} design is often called a quarter-fraction design.

However, most sources that I have read do not bother to mention that L can be greater than 2; experiments with 3-level factors are less frequent but still common.  Thus, the terms half-fraction design and quarter-fraction design apply only to experiments with binary factors.  If L = 3, then

  • a 3^{F-1} design uses one-third of all possible treatments.
  • a 3^{F-2} design uses one-ninth of all possible treatments.

Inorganic Chemistry Lesson of the Day – Coordination Complexes

A coordination complex is a compound that consists of Lewis bases bonded to a Lewis acid in its centre.  The charge of the complex can be neutral, positive, or negative; if the complex has a positive or a negative charge, then it is called a complex ion.  The Lewis acid is almost always a metal atom or a metal ion.  The Lewis bases are called ligands, and they are often covalently bonded to the Lewis acid.  Common ligands include carbon monoxide, water, and ammonia; what unifies them is the existence of at least one lone pair of electrons in their outermost energy level, and this lone pair of electrons is donated to the Lewis acid.

Some key terminology:

  • The donor atom is the atom within the ligand that is attached to the Lewis acid centre.
  • The coordination number is the number of donor atoms in the coordination complex.
  • The denticity of a ligand is the number of bonds that it forms with the Lewis acid centre.
    • If a ligand forms 1 bond with the Lewis acid centre, then it is monodentate (sometimes called unidentate).
    • If a ligand forms multiple bonds with the Lewis acid centre, then it is polydentate.  For example, a bidentate ligand forms 2 bonds with the Lewis acid centre.

In later Inorganic Chemistry Lessons of the Day, I will only refer to coordination complexes with metal atoms or metal ions as the Lewis acid centres.

Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

According to the central limit theorem, if

  • n random variables, X_1, ..., X_n, are independent and identically distributed,
  • n is sufficiently large,

then the distribution of their sample mean, \bar{X}_n, is approximately normal, and this approximation becomes better as n increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of X_1, ..., X_n that has a finite variance.  In my new Youtube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice.  This video focuses on the intuition rather than the mathematics of the CLT.  In a later video, I will discuss the technical details of the CLT and how it applies to this example.
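
If you want to try the thought experiment yourself before watching the video, here is a minimal simulation sketch in R; the numbers of dice and of simulated rolls are arbitrary choices.

set.seed(1)
n_sims <- 10000

mean_of_2_dice  <- replicate(n_sims, mean(sample(1:6, 2,  replace = TRUE)))
mean_of_30_dice <- replicate(n_sims, mean(sample(1:6, 30, replace = TRUE)))

par(mfrow = c(1, 2))
hist(mean_of_2_dice,  main = "Mean of 2 dice",  xlab = "Sample mean")
hist(mean_of_30_dice, main = "Mean of 30 dice", xlab = "Sample mean")

The histogram for the mean of 30 dice is noticeably closer to a bell shape than the histogram for the mean of 2 dice, even though each individual die has a flat (discrete uniform) distribution.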

You can also watch the video below the fold!


Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

[Image: triple box plots with patterns]
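
Before the full walkthrough, here is a minimal sketch of the stacking step with melt(); the column names (Group, Method1, Method2) are hypothetical and not the ones used later in this post.

library(reshape2)

set.seed(1)
wide <- data.frame(Group   = rep(c("A", "B", "C"), each = 10),
                   Method1 = rnorm(30, mean = 5),
                   Method2 = rnorm(30, mean = 7))

long <- melt(wide, id.vars = "Group",
             variable.name = "Method", value.name = "Measurement")

The resulting data frame has one row per observation, with its Group and Method labels preserved, which is the stacked format needed for grouped, side-by-side box plots.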

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!


Machine Learning Lesson of the Day – Linear Gaussian Basis Function Models

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target.  A common type of basis function for such models is the Gaussian basis function.  This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as the basis function.

\phi_j(x) = \exp[-(x - \mu_j)^2 \div (2\sigma^2)]

The \sigma in this basis function determines the width of each basis function, and hence how much the different basis functions that combine to form the model overlap with each other.

Notice that this is just the normal PDF without the scaling factor of 1/\sqrt{2\pi \sigma^2}; the scaling factor ensures that the normal PDF integrates to 1 over its support set.  In a linear basis function model, the regression coefficients are the weights for the basis functions, and these weights will scale Gaussian basis functions to fit the data that are local to \mu_j.  Thus, there is no need to include that scaling factor of 1/\sqrt{2\pi \sigma^2}, because the scaling is already being handled by the regression coefficients.
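
The locality of these basis functions is easy to see by plotting a few of them; here is a minimal sketch in R in which the centres and the common width are arbitrary choices.

x  <- seq(0, 10, length.out = 500)
mu <- c(2, 4, 6, 8)
sigma <- 0.5

plot(x, exp(-(x - mu[1])^2 / (2 * sigma^2)), type = "l", ylab = "phi_j(x)")
for (m in mu[-1]) {
  lines(x, exp(-(x - m)^2 / (2 * sigma^2)))
}

Each curve is effectively non-zero only near its centre \mu_j; a larger \sigma widens each curve and increases the overlap between neighbouring basis functions.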

The Gaussian basis function model is useful because

  • it can model many non-linear relationships between the predictor and the target surprisingly well,
  • each basis function is effectively non-zero only over a small interval around its centre and is approximately zero everywhere else.  These local basis functions result in a design matrix whose entries are mostly near zero, and this structure can be exploited for much faster computation.

Applied Statistics Lesson of the Day – Fractional Factorial Design and the Sparsity-of-Effects Principle

Consider again an experiment that seeks to determine the causal relationships between G factors and the response, where G > 1.  Ideally, the sample size is large enough for a full factorial design to be used.  However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead.  Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.  

Which, then, are the significant causal relationships?  According to the sparsity-of-effects principle, complex, higher-order effects are unlikely to exist; the most important effects are likely to be the lower-order effects.  Thus, assign the experimental units so that the main (1st-order) effects and the 2nd-order interaction effects can be investigated.  This may neglect the discovery of a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the available sample size is small and the number of possible treatments is large.

Apologies for Illness-Induced Absence

Dear Readers of The Chemical Statistician,

I apologize for the lack of posts in the past while.  I have been struck by a minor but persistent flu, and I have not fully recovered.  Thank you for your continued readership and viewership, and I will return to blogging as soon as I am completely healthy again.

Eric

Physical Chemistry Lesson of the Day – Effective Nuclear Charge

Much of chemistry concerns the interactions of the outermost electrons between different chemical species, whether they are atoms or molecules.  The properties of these outermost electrons depend in large part on the charge that the protons in the nucleus exert on them.  Generally speaking, an atom with more protons exerts a larger positive charge.  However, with the exception of hydrogen, this positive charge is always less than the full nuclear charge.  This is due to the negative charge of the electrons in the inner shells, which partially offsets the positive charge from the nucleus.  Thus, the net charge that the nucleus exerts on the outermost electrons – the effective nuclear charge – is less than the charge that the nucleus would exert if there were no inner electrons between them.

Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.
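
Here is a minimal sketch in R of that advice: a simple validation split is used to compare models of increasing complexity, with the simulated data and the candidate polynomial degrees being arbitrary choices for illustration.

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
d <- data.frame(x = x, y = y)

train <- sample(200, 150)   # indices of the training set; the rest form the validation set

validation_mse <- sapply(1:10, function(degree) {
  fit  <- lm(y ~ poly(x, degree), data = d[train, ])
  pred <- predict(fit, newdata = d[-train, ])
  mean((d$y[-train] - pred)^2)
})

which.min(validation_mse)   # the degree with the best predictive accuracy on held-out data

Very low degrees tend to underfit (high bias), very high degrees tend to overfit (high variance), and the validation error points to a compromise between the two.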

Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if X_1, X_2, ..., X_n is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.
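
A minimal simulation sketch in R makes this distinction clear; the exponential distribution and the sample size of 50 are arbitrary choices.

set.seed(1)

raw_data     <- rexp(10000, rate = 1)                        # the data themselves: still right-skewed
sample_means <- replicate(10000, mean(rexp(50, rate = 1)))   # the sample means: approximately normal

par(mfrow = c(1, 2))
hist(raw_data,     main = "Exponential data",           xlab = "x")
hist(sample_means, main = "Means of samples of n = 50", xlab = "Sample mean")

No matter how many observations are drawn, the histogram of the raw data keeps its exponential shape; it is the histogram of the sample means that looks approximately normal.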

Video Tutorial – The Hazard Function is the Probability Density Function Divided by the Survival Function

In an earlier video, I introduced the definition of the hazard function and broke it down into its mathematical components.  Recall that the definition of the hazard function for events defined on a continuous time scale is

h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t].

Did you know that the hazard function can be expressed as the probability density function (PDF) divided by the survival function?

h(t) = f(t) \div S(t)

In my new Youtube video, I prove how this relationship can be obtained from the definition of the hazard function!  I am very excited to post this second video in my new Youtube channel.  You can also view the video below the fold!
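
If you would like to check this relationship numerically before watching the proof, here is a minimal sketch in R using a Weibull distribution; the shape and scale parameters are arbitrary choices.

t_grid <- seq(0.1, 5, by = 0.1)
shape  <- 1.5
scale  <- 2

f <- dweibull(t_grid, shape, scale)       # probability density function
S <- 1 - pweibull(t_grid, shape, scale)   # survival function
h <- f / S                                # hazard function, computed as f(t) / S(t)

# The Weibull hazard has the closed form (shape / scale) * (t / scale)^(shape - 1);
# the two calculations agree up to numerical rounding error.
max(abs(h - (shape / scale) * (t_grid / scale)^(shape - 1)))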


Applied Statistics Lesson of the Day – The Independent 2-Sample t-Test with Unequal Variances (Welch’s t-Test)

A common problem in statistics is determining whether or not the means of 2 populations are equal.  The independent 2-sample t-test is a popular parametric method to answer this question.  (In an earlier Statistics Lesson of the Day, I discussed how data collected from a completely randomized design with 1 binary factor can be analyzed by an independent 2-sample t-test.  I also discussed its possible use in the discovery of argon.)  I have learned 2 versions of the independent 2-sample t-test, and they differ in their assumptions about the variances of the 2 populations.  The 2 possibilities are

  • equal variances
  • unequal variances

Most statistics textbooks that I have read elaborate at length about the independent 2-sample t-test with equal variances (also called Student’s t-test).  However, the assumption of equal variances needs to be checked with a preliminary test for equality of variances before proceeding with Student’s t-test, yet this check does not seem to be universally done in practice.  Furthermore, conducting one test based on the results of another can inflate the Type 1 error rate (Ruxton, 2006).

Some books give due attention to the independent 2-sample t-test with unequal variances (also called Welch’s t-test), but some barely mention its value, and others do not even mention it at all.  I find this to be puzzling, because the assumption of equal variances is often violated in practice, and Welch’s t-test provides an easy solution to this problem.  There is a seemingly intimidating but straightforward calculation to approximate the number of degrees of freedom for Welch’s t-test, and this calculation is automatically incorporated in most software, including R and SAS.  Finally, Welch’s t-test removes the need to check for equal variances, and it is almost as powerful as Student’s t-test when the variances are equal (Ruxton, 2006).

For all of these reasons, I recommend Welch’s t-test when using the parametric approach to comparing the means of 2 populations.
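
In R, this recommendation is easy to follow; here is a minimal sketch with simulated samples whose sizes, means, and standard deviations are arbitrary choices.

set.seed(1)
group1 <- rnorm(30, mean = 10, sd = 1)
group2 <- rnorm(40, mean = 11, sd = 3)

t.test(group1, group2)                     # var.equal = FALSE is the default, i.e. Welch's t-test
t.test(group1, group2, var.equal = TRUE)   # Student's t-test, which assumes equal variances

Note that Welch’s t-test is the default behaviour of t.test() in R, so no extra argument is needed to use it.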

 

Reference

Ruxton, G. D. (2006).  “The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test”.  Behavioral Ecology 17(4): 688-690.

Physical Chemistry Lesson of the Day – Standard Heats of Formation

The standard heat of formation, ΔHfº, of a chemical is the amount of heat absorbed or released from the formation of 1 mole of that chemical at 25 degrees Celsius and 1 bar from its elements in their standard states.  An element is in its standard state if it is in its most stable form and physical state (solid, liquid or gas) at 25 degrees Celsius and 1 bar.

For example, the standard heat of formation for carbon dioxide involves oxygen and carbon as the reactants.  Oxygen is most stable as O2 gas molecules, whereas carbon is most stable as solid graphite.  (Graphite is more stable than diamond under standard conditions.)

To phrase the definition in another way, the standard heat of formation is a special type of standard heat of reaction; the reaction is the formation of 1 mole of a chemical from its elements in their standard states under standard conditions.  The standard heat of formation is also called the standard enthalpy of formation (even though it really is a change in enthalpy).

By definition, the formation of an element from itself would yield no change in enthalpy, so the standard heat of formation of every element in its standard state is zero.

 

Machine Learning Lesson of the Day – Introduction to Linear Basis Function Models

Given a supervised learning problem of using p inputs (x_1, x_2, ..., x_p) to predict a continuous target Y, the simplest model to use would be linear regression.  However, what if we know that the relationship between the inputs and the target is non-linear, but we are unsure of exactly what form this relationship has?

One way to overcome this problem is to use linear basis function models.  These models assume that the target is a linear combination of a set of p+1 basis functions.

Y = w_0 + w_1 \phi_1(x_1) + w_2 \phi_2(x_2) + ... + w_p \phi_p(x_p)

This is a generalization of linear regression that essentially replaces each input with a function of the input.  (A linear basis function model that uses the identity function is just linear regression.)

The type of basis functions (i.e. the type of function given by \phi) is chosen to suitably model the non-linearity in the relationship between the inputs and the target.  It also needs to be chosen so that the computation is efficient.  I will discuss variations of linear basis function models in a later Machine Learning Lesson of the Day.
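
Here is a minimal sketch in R of this idea: each input is replaced with a function of that input inside an ordinary lm() call.  The particular basis functions (the logarithm and the square root) and the simulated data are arbitrary choices for illustration.

set.seed(1)
x1 <- runif(100, 1, 10)
x2 <- runif(100, 1, 10)
y  <- 2 + 3 * log(x1) + 0.5 * sqrt(x2) + rnorm(100, sd = 0.2)

basis_model  <- lm(y ~ log(x1) + sqrt(x2))   # phi_1 = log, phi_2 = square root
linear_model <- lm(y ~ x1 + x2)              # identity basis functions: ordinary linear regression

summary(basis_model)$r.squared
summary(linear_model)$r.squared              # the basis function model fits this non-linear data better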

Applied Statistics Lesson of the Day – Additive Models vs. Interaction Models in 2-Factor Experimental Designs

In a recent “Machine Learning Lesson of the Day“, I discussed the difference between a supervised learning model in machine learning and a regression model in statistics.  In that lesson, I mentioned that a statistical regression model usually consists of a systematic component and a random component.  Today’s lesson strictly concerns the systematic component.

An additive model is a statistical regression model in which the systematic component is the arithmetic sum of the individual effects of the predictors.  Consider the simple case of an experiment with 2 factors.  If Y is the response and X_1 and X_2 are the 2 predictors, then an additive linear model for the relationship between the response and the predictors is

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon

In other words, the effect of X_1 on Y does not depend on the value of X_2, and the effect of X_2 on Y does not depend on the value of X_1.

In contrast, an interaction model is a statistical regression model in which the systematic component is not the arithmetic sum of the individual effects of the predictors.  In other words, the effect of X_1 on Y depends on the value of X_2, or the effect of X_2 on Y depends on the value of X_1.  Thus, such a regression model would have 3 effects on the response:

  1. X_1
  2. X_2
  3. the interaction effect of X_1 and X_2

For example, a full factorial design with 2 factors uses the 2-factor ANOVA model and assumes a linear relationship between the response and the above 3 effects.

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon

Note that additive models and interaction models are not confined to experimental design; I have merely used experimental design to provide examples for these 2 types of models.
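
Here is a minimal sketch in R of fitting both models to data from a hypothetical 2-factor experiment; the simulated effect sizes are arbitrary choices.

set.seed(1)
X1 <- factor(rep(c("low", "high"), each = 20))
X2 <- factor(rep(c("low", "high"), times = 20))
Y  <- 5 + 2 * (X1 == "high") + 3 * (X2 == "high") +
      1.5 * (X1 == "high") * (X2 == "high") + rnorm(40)

additive_model    <- lm(Y ~ X1 + X2)        # systematic component: sum of the individual effects
interaction_model <- lm(Y ~ X1 * X2)        # adds the X1:X2 interaction effect

anova(additive_model, interaction_model)    # tests whether the interaction effect is needed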

A Story About Perseverance – Inspiration From My Old Professor

Names and details in this blog post have been altered to protect the privacy of its subjects.

I met my old professor, Dr. Perez, for lunch recently.  We have kept in touch for many years since she taught me during my undergraduate studies, and she has been a good friend and mentor.  We had not seen each other for a few years, but we have been in regular contact over phone and email, exchanging stories, updates, photos of her grandchildren, frustrations, thrills, and perspectives.  It was nice to see her again.

I told her about the accomplishments and the struggles in my early career as a statistician so far.  I am generally satisfied with how I have performed since my entry into the statistics profession, but there are many skills that I don’t have or need to improve upon.  I want to learn distributed computing and become better at programming in Python, SQL, and Hadoop – skills that are highly in demand in my industry but not taught during my statistics education.  I want to be better at communicating about statistics to non-statisticians – not only helping them to understand difficult concepts, but persuading them to follow my guidance when I know that I am right.  I sometimes even struggle with seemingly basic questions that require much thinking and research on my part to answer.  While all of these are likely common weaknesses that many young statisticians understandably have, they contribute to my feeling of incompetence on occasion  – and it’s not pleasant to perform below my, my colleagues’, or my industry’s expectations for myself.

Dr. Perez listened and provided helpful observations and advice.  While I am working hard and focusing on my specific problems at the moment, she gave me a broader, more long-term perspective about how best to overcome these struggles, and I really appreciated it.  Beyond this, however, she told me a story about a professor of our mutual acquaintance that stunned and saddened me, yet motivated me to continue to work harder.


Physical Chemistry Lesson of the Day – Hess’s Law

Hess’s law states that the change in enthalpy of a multi-stage chemical reaction is just the sum of the changes of enthalpy of the individual stages.  Thus, if a chemical reaction can be written as a sum of multiple intermediate reactions, then its change in enthalpy can be easily calculated.  This is especially helpful for a reaction whose change in enthalpy is difficult to measure experimentally.

Hess’s law is a consequence of the fact that enthalpy is a state function; the path between the reactants and the products is irrelevant to the change in enthalpy – only the initial and final values matter.  Thus, if there is a path for which the intermediate values of \Delta H are easy to obtain experimentally, then their sum equals the \Delta H for the overall reaction.
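
As a quick illustration with commonly tabulated values (rounded to 1 decimal place), consider the formation of carbon monoxide, whose \Delta H is difficult to measure directly because burning carbon always produces some carbon dioxide as well.

C (graphite) + O_2 (g) \rightarrow CO_2 (g), \Delta H = -393.5 \text{ kJ/mol}

CO (g) + \tfrac{1}{2} O_2 (g) \rightarrow CO_2 (g), \Delta H = -283.0 \text{ kJ/mol}

Subtracting the second reaction from the first gives C (graphite) + \tfrac{1}{2} O_2 (g) \rightarrow CO (g), so its change in enthalpy is \Delta H = -393.5 - (-283.0) = -110.5 \text{ kJ/mol}.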

 

Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it makes fewer assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.
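
Here is a minimal sketch in R of the first of these examples, using the knn() function from the “class” package and the built-in iris data set; the choice of k = 5 and the train/test split are arbitrary.

library(class)

set.seed(1)
train_rows <- sample(nrow(iris), 100)

train_x <- iris[train_rows, 1:4]
test_x  <- iris[-train_rows, 1:4]
train_y <- iris$Species[train_rows]
test_y  <- iris$Species[-train_rows]

predictions <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
mean(predictions == test_y)   # proportion of test observations classified correctly

Note that the entire training set (train_x and train_y) must be kept and passed to knn() at prediction time, which is exactly the storage cost described above.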

Applied Statistics Lesson of the Day – The Full Factorial Design

An experimenter may seek to determine the causal relationships between G factors and the response, where G > 1.  On first instinct, you may be tempted to conduct G separate experiments, each using the completely randomized design with 1 factor.  Often, however, it is possible to conduct 1 experiment with G factors at the same time.  This is better than the first approach because

  • it is faster
  • it uses fewer resources to answer the same questions
  • the interactions between the G factors can be examined

Such an experiment requires the full factorial design.  After controlling for confounding variables and choosing the appropriate range and number of levels for each factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels.  Such an experiment would result in 2 \times 2 = 4 treatments, each being a combination of 1 level from the first factor and 1 level from the second factor.  Since this is a full factorial design, experimental units are independently assigned to all treatments.  The 2-factor ANOVA model is commonly used to analyze data from such designs.
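
Here is a minimal sketch in R of this simplest case; the factor names, the 5 experimental units per treatment, and the simulated responses are arbitrary choices for illustration.

treatments <- expand.grid(factor1 = c("low", "high"),
                          factor2 = c("low", "high"))
treatments   # 4 rows, one for each combination of 1 level from each factor

set.seed(1)
design <- treatments[rep(1:4, each = 5), ]   # 5 experimental units per treatment
design$response <- rnorm(20, mean = 10)      # simulated responses, for illustration only

summary(aov(response ~ factor1 * factor2, data = design))   # the 2-factor ANOVA model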

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.
