Machine Learning Lesson of the Day – Using Validation to Assess Predictive Accuracy in Supervised Learning

Supervised learning puts a lot of emphasis on building a model that has high predictive accuracy.  Validation is a good method for assessing a model’s predictive accuracy.

Validation is the use of one part of your data set to build your model and another part of your data set to assess the model’s predictive accuracy.  Specifically,

  1. split your data set into 2 sets: a training set and a validation set
  2. use the training set to fit your model (e.g. LASSO regression)
  3. use the predictors in the validation set to predict the targets
  4. use some error measure (e.g. mean squared error) to assess the differences between the predicted targets and the actual targets.

A good rule of thumb is to use 70% of your data for training and 30% for validation.

You should do this for several models (e.g. several different values of the penalty parameter in LASSO regression).  The model with the lowest mean squared error can be judged as the best model.

I highly encourage you to assess your models’ predictive accuracies on a separate data set – called a test set – drawn from the same population or probability distribution.  This is a good way to check for any overfitting in your models.
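Here is a minimal sketch of this workflow in Python with scikit-learn (my choice of library and simulated data purely for illustration – the lesson itself does not prescribe any software).  It splits the data into training, validation and test sets, fits LASSO regression for several values of the penalty parameter, chooses the model with the lowest validation mean squared error, and then checks that model on the test set.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data for illustration: 500 observations, 10 predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Hold out a separate test set first, then split the rest 70% / 30%
# into a training set and a validation set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.3, random_state=1)

# Fit LASSO regression for several values of the penalty parameter,
# and keep the model with the lowest mean squared error on the validation set.
best_model, best_mse = None, np.inf
for penalty in [0.01, 0.1, 0.5, 1.0]:
    model = Lasso(alpha=penalty).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_model, best_mse = model, mse

# Finally, check for overfitting by assessing the chosen model on the test set.
test_mse = mean_squared_error(y_test, best_model.predict(X_test))
print(best_mse, test_mse)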

Machine Learning Lesson of the Day: Clustering, Density Estimation and Dimensionality Reduction

I struggle to categorize unsupervised learning.  It is not an easily defined field, and it is hard to find a categorization of its techniques that is exhaustive and mutually exclusive.

Nonetheless, here are some categories of unsupervised learning that cover many of its commonly used techniques.  I learned this categorization from Mathematical Monk, who posted a great set of videos on machine learning on YouTube.

  • Clustering: Categorize the observed variables X_1, X_2, ..., X_p into groups that maximize some similarity criterion, or, equivalently, minimize some dissimilarity criterion.
  • Density Estimation: Use statistical models to find an underlying probability distribution that gives rise to the observed variables.
  • Dimensionality Reduction: Find a smaller set of variables that captures the essential variations or patterns of the observed variables.  This smaller set of variables may be just a subset of the observed variables, or it may be a set of new variables that better capture the underlying variation of the observed variables.
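To make these categories concrete, here is a minimal sketch in Python with scikit-learn (my choice of library, applied to simulated data – the lesson itself does not prescribe any software), showing one commonly used technique from each category.

import numpy as np
from sklearn.cluster import KMeans                 # clustering
from sklearn.neighbors import KernelDensity        # density estimation
from sklearn.decomposition import PCA              # dimensionality reduction

# Simulated data for illustration: 300 observations of 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

# Clustering: partition the data into 3 groups (k-means).
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density estimation: estimate the underlying probability distribution (kernel density estimation).
log_densities = KernelDensity(bandwidth=0.5).fit(X).score_samples(X)

# Dimensionality reduction: find 2 new variables that capture most of the variation (principal components analysis).
X_reduced = PCA(n_components=2).fit_transform(X)

print(cluster_labels[:5], log_densities[:5], X_reduced[:2])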

Are there any other categories that you can think of?  How would you categorize hidden Markov models?  Your input is welcomed and appreciated in the comments!

Applied Statistics Lesson of the Day: Sample Size and Replication in Experimental Design

The goal of an experiment is to determine

  1. whether or not there is a cause-and-effect relationship between the factor and the response
  2. the strength of the causal relationship, should such a relationship exist.

To answer these questions, the response variable is measured in both the control group and the experimental group.  If there is a difference between the 2 responses, then there is evidence to suggest that the causal relationship exists, and the difference can be measured and quantified.

However, in most* experiments, there is random variation in the response.  Random variation exists in the natural sciences, and there is even more of it in the social sciences.  Thus, an observed difference between the control and experimental groups could be mistakenly attributed to a cause-and-effect relationship when the source of the difference is really just random variation.  In short, the difference may simply be due to the noise rather than the signal.  

To detect an actual difference beyond random variation (i.e. to obtain a higher signal-to-noise ratio), it is important to use replication to obtain a sufficiently large sample size in the experiment.  Replication is the repeated application of the treatments to multiple independently assigned experimental units.  (Recall that randomization is an important part of controlling for confounding variables in an experiment.  Randomization ensures that the experimental units are independently assigned to the different treatments.)  The number of independently assigned experimental units that receive the same treatment is the sample size.

*Deterministic computer experiments are unlike most experiments; they do not have random variation in the responses.
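A small simulation can illustrate why replication matters.  The sketch below is in Python, with numbers chosen purely for illustration: the true difference between the control and experimental groups is 1 unit, and the signal-to-noise ratio of the observed difference improves as the sample size per group grows.

import numpy as np

rng = np.random.default_rng(42)
true_difference = 1.0   # the signal: the causal effect of the treatment
noise_sd = 3.0          # the noise: random variation in the response

# For each sample size, simulate one experiment and compute the
# signal-to-noise ratio of the observed difference between the 2 groups.
for n in [5, 20, 100, 500]:
    control = rng.normal(loc=0.0, scale=noise_sd, size=n)
    treatment = rng.normal(loc=true_difference, scale=noise_sd, size=n)
    observed_difference = treatment.mean() - control.mean()
    # Standard error of the difference between 2 independent sample means.
    standard_error = np.sqrt(control.var(ddof=1) / n + treatment.var(ddof=1) / n)
    print(n, round(observed_difference, 2), round(observed_difference / standard_error, 2))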

Physical Chemistry Lesson of the Day – The First Law of Thermodynamics

The change in internal energy of a system is defined as the internal energy of the system in its final state minus the internal energy of the system in its initial state.

\Delta U = U_{final} - U_{initial}.

However, since we cannot measure the internal energy of a system directly at any point in time, how can we calculate the change in internal energy?

The First Law of Thermodynamics states that any change in the internal energy of a system is equal to the heat absorbed by the system plus any work done on the system.  Mathematically,

\Delta U = q + w.

Recall that I am using the sign convention in chemistry.

The values of q and w can each be positive or negative.

  • A negative q denotes heat released by the system.
  • A negative w denotes work done by the system.
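As a quick worked example (with numbers chosen purely for illustration): if a system absorbs 100 J of heat from its surroundings and does 30 J of work on its surroundings, then q = +100 J and w = -30 J, so

\Delta U = q + w = (+100 \text{ J}) + (-30 \text{ J}) = +70 \text{ J}.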

Physical Chemistry Lesson of the Day – Isolated, Closed and Open Systems

A thermodynamic system can be one of three types:

  • An isolated system cannot exchange energy or matter with its surroundings.  Other than the universe itself, an isolated system does not exist in practice.  However, a very well insulated, bounded system with negligible heat loss is approximately an isolated system, especially when considered over a very short period of time.  In mathematical and physical modelling, a hydrogen atom’s proton, a hydrogen atom’s electron and a planet are often each treated as an isolated system.
  • A closed system can exchange energy (heat or work) but not matter with its surroundings.
    • A system that can exchange work but not heat with its surroundings is an adiabatically isolated system.
    • A system that can exchange heat but not work with its surroundings is a mechanically isolated system.
  • An open system can exchange both energy and matter with its surroundings.

Applied Statistics Lesson of the Day – Basic Terminology in Experimental Design #2: Controlling for Confounders

A well-designed experiment must have good control, which is the reduction of the effects of confounding variables.  There are several ways to achieve this:

  • Include a control group.  This group will receive a neutral treatment or a standard treatment.  (This treatment may simply be nothing.)  The experimental group will receive the new treatment or treatment of interest.  The response in the experimental group will be compared to the response in the control group to assess the effect of the new treatment or treatment of interest.  Any effect from confounding variables will affect both the control group and the experimental group equally, so the only difference between the 2 groups should be due to the new treatment or treatment of interest.
  • In medical studies with patients as the experimental units, it is common to include a placebo group.  Patients in the placebo group get a treatment that is known to have no effect.  This accounts for the placebo effect.
    • For example, in a drug study, a patient in the placebo group may get a sugar pill.
  • In experiments with human or animal subjects, participants and/or the experimenters are often blinded.  This means that they do not know which treatment each participant received.  This ensures that knowledge of receiving a particular treatment – for either the participant or the experimenters – is not a confounding variable.  An experiment that blinds both the participants and the experimenters is called a double-blind experiment.
  • For confounding variables that are difficult or impossible to control for, the experimental units should be assigned to the control group and the experimental group by randomization.  This can be done with random number tables, flipping a coin, or random number generators from computers.  This ensures that confounding effects affect both the control group and the experimental group roughly equally.
    • For example, an experimenter wants to determine if the HPV vaccine will make new students immune to HPV.  There will be 2 groups: the control group will not receive the vaccine, and the experimental group will receive the vaccine.  If the experimenter can choose students from 2 schools for her study, then the students should be randomly assigned into the 2 groups, so that each group will have roughly the same number of students from each school.  This would minimize the confounding effect of the schools.
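As a minimal sketch of randomization by computer (in Python, with hypothetical student labels – the lesson itself does not prescribe any software), here is one way to randomly assign experimental units to the control group and the experimental group:

import random

# Hypothetical experimental units: 8 students, 4 from each of 2 schools (A and B).
students = ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4']

random.seed(1)          # the seed is fixed only so that the example is reproducible
random.shuffle(students)

# Assign half of the randomly ordered units to each group.
control_group = students[:len(students) // 2]
experimental_group = students[len(students) // 2:]

print('Control:', control_group)
print('Experimental (receives the vaccine):', experimental_group)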

Machine Learning Lesson of the Day – Supervised Learning: Classification and Regression

Supervised learning has 2 categories:

  • In classification, the target variable is categorical.
  • In regression, the target variable is continuous.

Thus, regression in statistics is different from regression in supervised learning.

In statistics,

  • regression is used to model relationships between predictors and targets, and the targets could be continuous or categorical.  
  • a regression model usually includes 2 components to describe such relationships:
    • a systematic component
    • a random component.  The random component of this relationship is mathematically described by some probability distribution.  
  • most regression models in statistics also have assumptions about the statistical independence or dependence between the predictors and/or between the observations.  
  • many statistical models also aim to provide interpretable relationships between the predictors and targets.  
    • For example, in simple linear regression, the slope parameter, \beta_1, is the change in the expected value of the target, Y, for every unit increase in the predictor, X.

In supervised learning,

  • target variables in regression must be continuous
    • categorical target variables are modelled in classification
  • regression places less or even no emphasis on using probability to describe the random variation between the predictors and the target
    • Random forests are powerful tools for both classification and regression, but they do not use probability to describe the relationship between the predictors and the target.
  • regression places less or even no emphasis on providing interpretable relationships between the predictors and targets.
    • Neural networks are powerful tools for both classification and regression, but they do not provide interpretable relationships between the predictors and the target.

***The last 2 points are applicable to classification, too.

In general, supervised learning puts much more emphasis on accurate prediction than statistics does.

Since regression in supervised learning includes only continuous targets, some confusing terminology arises between the 2 fields.  For example, logistic regression is a commonly used technique in both statistics and supervised learning.  However, despite its name, it is a classification technique in supervised learning, because the response variable in logistic regression is categorical.
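To illustrate this terminology, here is a minimal sketch in Python with scikit-learn (my choice of library, applied to simulated data – the lesson itself does not prescribe any software).  Logistic regression estimates probabilities, as a regression model does in the statistical sense, but in supervised learning it is used to classify, because its predictions are categorical class labels.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data for illustration: 1 continuous predictor, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The model estimates class probabilities (a regression-like output in statistics) ...
print(model.predict_proba([[0.3]]))
# ... but in supervised learning it is used to predict a categorical class label.
print(model.predict([[0.3]]))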

Machine Learning Lesson of the Day – Supervised and Unsupervised Learning

The 2 most commonly used and studied categories of machine learning are supervised learning and unsupervised learning.

  • In supervised learning, there is a target variable, Y, and a set of predictor variables, X_1, X_2, ..., X_p.  The goal is to use X_1, X_2, ..., X_p to predict Y.  Supervised learning is synonymous with predictive modelling, but the latter term does not connote learning from data to improve performance in future prediction.  Nonetheless, when I explain supervised learning to people who have some background in statistics or analytics, they usually understand what I mean when I tell them that it is just predictive modelling.
  • In unsupervised learning, there are only predictor variables and no target variable.  The goal is to find interesting patterns in X_1, X_2, ..., X_p.  This is a much less concretely defined problem than supervised learning.  Unsupervised learning is sometimes called pattern discovery, pattern recognition, or knowledge discovery, though these are not commonly agreed upon synonyms.

Physical Chemistry Lesson of the Day – Basic Terminology in Thermodynamics

A system is the part of the universe of interest, and the surroundings are everything else in the universe.

The internal energy of a system is the sum of the kinetic and potential energies of all of the particles (atoms and molecules) in the system.  This cannot be measured, but changes in internal energy can be measured.

There are 2 ways in which the internal energy of a system can change: heat and work.

  • Heat is the transfer of energy between 2 objects due to a temperature difference.  In chemistry, heat is commonly observed when a chemical reaction absorbs or releases energy.
  • Work is force acting over a distance.  In chemistry, a common type of work is the expansion or compression of a gas.

In chemistry, it is conventional to take the system’s point of view in deciding the sign of heat and work.  Thus, if heat is entering the system or if work is done on the system, then the sign is positive.  If heat is exiting the system or if work is done by the system, then the sign is negative.

Applied Statistics Lesson of the Day – Basic Terminology in Experimental Design #1

The word “experiment” can mean many different things in various contexts.  In science and statistics, it has a very particular and subtle definition, one that is not immediately familiar to many people who work outside of the field of experimental design. This is the first of a series of blog posts to clarify what an experiment is, how it is conducted, and why it is so central to science and statistics.

Experiment: A procedure to determine the causal relationship between 2 variables – an explanatory variable and a response variable.  The value of the explanatory variable is changed, and the value of the response variable is observed for each value of the explanatory variable.

  • An experiment can have 2 or more explanatory variables and 2 or more response variables.
  • In my experience, I find that most experiments have 1 response variable, but many experiments have 2 or more explanatory variables.  The interactions between the multiple explanatory variables are often of interest.
  • All other variables are held constant in this process to avoid confounding.

Explanatory Variable or Factor: The variable whose values are set by the experimenter.  This variable is the cause in the hypothesis.  (*Many people call this the independent variable.  I discourage this usage, because “independent” means something very different in statistics.)

Response Variable: The variable whose values are observed by the experimenter as the explanatory variable’s value is changed.  This variable is the effect in the hypothesis.  (*Many people call this the dependent variable.  Further to my previous point about “independent variables”, dependence means something very different in statistics, and I discourage this usage as well.)

Factor Level: Each possible value of the factor (explanatory variable).  A factor must have at least 2 levels.

Treatment: Each possible combination of factor levels.

  • If the experiment has only 1 explanatory variable, then each treatment is simply each factor level.
  • If the experiment has 2 explanatory variables, X and Y, then each treatment is a combination of 1 factor level from X and 1 factor level from Y.  Such combining of factor levels generalizes to experiments with more than 2 explanatory variables.
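As a small illustration (with hypothetical factor names and levels), enumerating every combination of factor levels gives the treatments; a factor with 2 levels combined with a factor with 3 levels yields 2 x 3 = 6 treatments.

from itertools import product

# Hypothetical factors and their levels.
temperature_levels = ['low', 'high']            # factor X with 2 levels
dosage_levels = ['10 mg', '20 mg', '30 mg']     # factor Y with 3 levels

# Each treatment is a combination of 1 level from each factor: 2 x 3 = 6 treatments.
treatments = list(product(temperature_levels, dosage_levels))
for treatment in treatments:
    print(treatment)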

Experimental Unit: The object on which a treatment is applied.  This can be anything – person, group of people, animal, plant, chemical, guitar, baseball, etc.