## Physical Chemistry Lesson of the Day – Effective Nuclear Charge

Much of chemistry concerns the interactions of the outermost electrons between different chemical species, whether they are atoms or molecules.  The properties of these outermost electrons depend in large part on the attraction that the protons in the nucleus exert on them.  Generally speaking, a nucleus with more protons carries a larger positive charge.  However, with the exception of hydrogen, the positive charge felt by the outermost electrons is always less than the full nuclear charge.  This is due to the negative charge of the electrons in the inner shells, which partially offsets the positive charge from the nucleus.  Thus, the net charge that the nucleus exerts on the outermost electrons – the effective nuclear charge – is less than the charge that the nucleus would exert if there were no inner electrons between them.
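
A common first approximation, which is not part of the lesson above and is offered only as a sketch, writes the effective nuclear charge as the atomic number $Z$ minus a shielding constant $S$ that accounts for the inner electrons:

$Z_{\text{eff}} = Z - S$

For sodium ($Z = 11$), treating each of the 10 inner electrons as cancelling exactly 1 proton gives $S = 10$ and $Z_{\text{eff}} \approx 11 - 10 = +1$ for the lone outermost electron.  Real shielding is imperfect, so more careful estimates of $S$ are smaller than 10 and give a somewhat larger $Z_{\text{eff}}$.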

## Machine Learning Lesson of the Day – Overfitting and Underfitting

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.

Both overfitting and underfitting lead to poor predictions on new data sets.

In my experience with statistics and machine learning, I don’t encounter underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.
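
To make the idea concrete, here is a minimal sketch that compares polynomial models of increasing complexity on a held-out validation set.  Everything in it is an assumption chosen for illustration: the simulated data (a sine curve plus noise), the degrees 1, 3 and 9, and the helper names `fit_polynomial`, `predict` and `mse`.  It uses only the Python standard library, so the least-squares fit is solved by hand through the normal equations.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit, solved through the normal equations."""
    size = degree + 1
    # Normal equations A w = b with A = X^T X and b = X^T y,
    # where column j of X holds x ** j.
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    w = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(a[i][j] * w[j] for j in range(i + 1, size))
        w[i] = (b[i] - tail) / a[i][i]
    return w

def predict(w, x):
    return sum(coef * x ** power for power, coef in enumerate(w))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n):
    """Simulated data (an assumption): y = sin(3x) + Gaussian noise."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + random.gauss(0, 0.2) for x in xs]
    return xs, ys

random.seed(0)
train_x, train_y = make_data(20)
valid_x, valid_y = make_data(40)

train_mse, valid_mse = {}, {}
for degree in (1, 3, 9):
    w = fit_polynomial(train_x, train_y, degree)
    train_mse[degree] = mse(w, train_x, train_y)
    valid_mse[degree] = mse(w, valid_x, valid_y)
    print(f"degree {degree}: training MSE = {train_mse[degree]:.3f}, "
          f"validation MSE = {valid_mse[degree]:.3f}")
```

The training error can only fall as the degree rises, so it cannot reveal overfitting on its own.  The validation error is the quantity to compare: the straight line is too simple for the curved trend, while the degree-9 polynomial has enough freedom to chase the noise in the 20 training points.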

## Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution.  This is not true.  If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, ..., X_n$ is an independent and identically distributed sample of random variables with finite mean and variance, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.
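
A small simulation can illustrate the distinction.  The sketch below is only an illustration under assumed settings: it draws from an exponential distribution with rate 1 (a skewed, non-normal choice made here), and it uses sample skewness as a rough yardstick for "how far from normal".  The function names are hypothetical.

```python
import random

import statistics

def skewness(values):
    """Sample skewness: about 0 for a normal distribution, 2 for an exponential one."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_mean(n):
    """Mean of n independent draws from an exponential distribution with rate 1."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

random.seed(1)

# Skewness of the raw data for a small and a large sample.
raw_skewness = {
    n: skewness([random.expovariate(1.0) for _ in range(n)]) for n in (500, 5000)
}

# Skewness of 2000 simulated sample means for several sample sizes.
mean_skewness = {
    n: skewness([sample_mean(n) for _ in range(2000)]) for n in (2, 10, 50)
}

for n, value in raw_skewness.items():
    print(f"raw data, n = {n}: skewness = {value:.2f}")
for n, value in mean_skewness.items():
    print(f"sample means, n = {n}: skewness = {value:.2f}")
```

Collecting more raw data does not make the data look normal; their skewness stays near the exponential distribution's value.  The skewness of the sample means, in contrast, shrinks toward 0 as the sample size grows.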

## Video Tutorial – The Hazard Function is the Probability Density Function Divided by the Survival Function

In an earlier video, I introduced the definition of the hazard function and broke it down into its mathematical components.  Recall that the definition of the hazard function for events defined on a continuous time scale is

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$.

Did you know that the hazard function can be expressed as the probability density function (PDF) divided by the survival function?

$h(t) = f(t) \div S(t)$

In my new YouTube video, I show how this relationship can be obtained from the definition of the hazard function!  I am very excited to post this second video on my new YouTube channel.
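
For readers who prefer the argument on paper, here is a brief sketch of one standard derivation (not a transcript of the video).  It assumes that $F(t) = P(X \leq t)$ is the cumulative distribution function, $f(t) = F'(t)$ is the PDF, and $S(t) = P(X > t) = 1 - F(t)$ is the survival function, with $S(t) > 0$.

By the definition of conditional probability, and because the event $t < X \leq t + \Delta t$ already implies $X > t$,

$P(t < X \leq t + \Delta t \ | \ X > t) = P(t < X \leq t + \Delta t) \div P(X > t) = [F(t + \Delta t) - F(t)] \div S(t)$.

Dividing by $\Delta t$ and taking the limit then gives

$h(t) = \lim_{\Delta t \rightarrow 0} \{ [F(t + \Delta t) - F(t)] \div \Delta t \} \div S(t) = F'(t) \div S(t) = f(t) \div S(t)$.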

## Applied Statistics Lesson of the Day – The Independent 2-Sample t-Test with Unequal Variances (Welch’s t-Test)

A common problem in statistics is determining whether or not the means of 2 populations are equal.  The independent 2-sample t-test is a popular parametric method to answer this question.  (In an earlier Statistics Lesson of the Day, I discussed how data collected from a completely randomized design with 1 binary factor can be analyzed by an independent 2-sample t-test.  I also discussed its possible use in the discovery of argon.)  I have learned 2 versions of the independent 2-sample t-test, and they differ on the variances of the 2 samples.  The 2 possibilities are

• equal variances
• unequal variances

Most statistics textbooks that I have read elaborate at length about the independent 2-sample t-test with equal variances (also called Student’s t-test).  However, the assumption of equal variances needs to be checked using a test such as the F-test before proceeding with Student’s t-test, yet this check does not seem to be universally done in practice.  Furthermore, conducting one test based on the results of another can inflate the probability of committing a Type 1 error (Ruxton, 2006).

Some books give due attention to the independent 2-sample t-test with unequal variances (also called Welch’s t-test), but some barely mention its value, and others do not even mention it at all.  I find this to be puzzling, because the assumption of equal variances is often violated in practice, and Welch’s t-test provides an easy solution to this problem.  There is a seemingly intimidating but straightforward calculation to approximate the number of degrees of freedom for Welch’s t-test, and this calculation is automatically incorporated in most software, including R and SAS.  Finally, Welch’s t-test removes the need to check for equal variances, and it is almost as powerful as Student’s t-test when the variances are equal (Ruxton, 2006).

For all of these reasons, I recommend Welch’s t-test when using the parametric approach to compare the means of 2 populations.
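
The calculation behind Welch’s t-test is short enough to write out.  The sketch below computes the test statistic and the approximate degrees of freedom (the Welch–Satterthwaite formula) with the Python standard library only; the function name `welch_t_test` and the two small samples are made up for illustration, and the p-value is left out because it needs a t-distribution that the standard library does not provide.

```python
import math
import statistics

def welch_t_test(sample_1, sample_2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = statistics.fmean(sample_1)
    mean_2 = statistics.fmean(sample_2)
    # Sample variances (n - 1 in the denominator), each divided by its sample size.
    v1 = statistics.variance(sample_1) / n1
    v2 = statistics.variance(sample_2) / n2
    # The standard error does not pool the 2 variances.
    t_statistic = (mean_1 - mean_2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_statistic, df

# Hypothetical data with visibly different spreads.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_statistic, df = welch_t_test(group_a, group_b)
print(f"t = {t_statistic:.3f}, approximate df = {df:.2f}")
```

The approximate degrees of freedom are usually not a whole number, and they always fall between the smaller sample size minus 1 and the pooled value $n_1 + n_2 - 2$.  In R, `t.test()` performs Welch’s version unless `var.equal = TRUE` is specified.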

### Reference

Graeme D. Ruxton.  “The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test”.  Behavioral Ecology (July/August 2006) 17 (4): 688–690.  First published online May 17, 2006.

## Physical Chemistry Lesson of the Day – Standard Heats of Formation

The standard heat of formation, ΔHfº, of a chemical is the amount of heat absorbed or released from the formation of 1 mole of that chemical at 25 degrees Celsius and 1 bar from its elements in their standard states.  An element is in its standard state if it is in its most stable form and physical state (solid, liquid or gas) at 25 degrees Celsius and 1 bar.

For example, the standard heat of formation for carbon dioxide involves oxygen and carbon as the reactants.  Oxygen is most stable as O2 gas molecules, whereas carbon is most stable as solid graphite.  (Graphite is more stable than diamond under standard conditions.)
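
Written as a thermochemical equation, with the commonly tabulated value quoted here for illustration, the formation reaction for carbon dioxide is

$\mathrm{C}(s, \text{graphite}) + \mathrm{O_2}(g) \rightarrow \mathrm{CO_2}(g), \quad \Delta H_f^\circ = -393.5 \ \text{kJ/mol}$

The negative sign indicates that heat is released when 1 mole of carbon dioxide forms from its elements in their standard states.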

To phrase the definition in another way, the standard heat of formation is a special type of standard heat of reaction; the reaction is the formation of 1 mole of a chemical from its elements in their standard states under standard conditions.  The standard heat of formation is also called the standard enthalpy of formation (even though it really is a change in enthalpy).

By definition, the formation of an element in its standard state from itself would yield no change in enthalpy, so the standard heat of formation for all elements in their standard states is zero.

## Machine Learning Lesson of the Day – Introduction to Linear Basis Function Models

Given a supervised learning problem of using $p$ inputs ($x_1, x_2, ..., x_p$) to predict a continuous target $Y$, the simplest model to use would be linear regression.  However, what if we know that the relationship between the inputs and the target is non-linear, but we are unsure of exactly what form this relationship has?

One way to overcome this problem is to use linear basis function models.  These models assume that the target is a linear combination of a set of $p+1$ basis functions, where the constant function $\phi_0 = 1$ accompanies the intercept $w_0$.

$Y = w_0 + w_1 \phi_1(x_1) + w_2 \phi_2(x_2) + ... + w_p \phi_p(x_p)$

This is a generalization of linear regression that essentially replaces each input with a function of the input.  (A linear basis function model that uses the identity function is just linear regression.)

The type of basis functions (i.e. the type of function given by $\phi$) is chosen to suitably model the non-linearity in the relationship between the inputs and the target.  It also needs to be chosen so that the computation is efficient.  I will discuss variations of linear basis function models in a later Machine Learning Lesson of the Day.
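
As a minimal sketch of the model equation above, the function below evaluates a linear basis function model for one observation.  The weights, the inputs and the choice of basis functions (squares and cubes) are hypothetical values picked for illustration, and the weights are taken as given rather than estimated from data.

```python
def predict(x, weights, basis_functions):
    """Evaluate w_0 + sum_j w_j * phi_j(x_j) for one observation x."""
    if not (len(weights) == len(basis_functions) + 1 == len(x) + 1):
        raise ValueError("need one basis function per input and one extra weight for w_0")
    intercept, slopes = weights[0], weights[1:]
    return intercept + sum(
        w * phi(x_j) for w, phi, x_j in zip(slopes, basis_functions, x)
    )

def identity(v):
    return v

x = (2.0, 3.0)                # one observation with p = 2 inputs
weights = [1.0, 0.5, -2.0]    # hypothetical w_0, w_1, w_2

# With identity basis functions, the model is ordinary linear regression.
linear_prediction = predict(x, weights, [identity, identity])

# Swapping in non-linear basis functions changes the shape of the relationship,
# but the model remains linear in the weights.
nonlinear_prediction = predict(x, weights, [lambda v: v ** 2, lambda v: v ** 3])

print(linear_prediction, nonlinear_prediction)
```

Because the model stays linear in the weights, the same least-squares machinery used for linear regression can estimate them once the basis functions are fixed.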

## Applied Statistics Lesson of the Day – Additive Models vs. Interaction Models in 2-Factor Experimental Designs

In a recent “Machine Learning Lesson of the Day”, I discussed the difference between a supervised learning model in machine learning and a regression model in statistics.  In that lesson, I mentioned that a statistical regression model usually consists of a systematic component and a random component.  Today’s lesson strictly concerns the systematic component.

An additive model is a statistical regression model in which the systematic component is the arithmetic sum of the individual effects of the predictors.  Consider the simple case of an experiment with 2 factors.  If $Y$ is the response and $X_1$ and $X_2$ are the 2 predictors, then an additive linear model for the relationship between the response and the predictors is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

In other words, the effect of $X_1$ on $Y$ does not depend on the value of $X_2$, and the effect of $X_2$ on $Y$ does not depend on the value of $X_1$.

In contrast, an interaction model is a statistical regression model in which the systematic component is not the arithmetic sum of the individual effects of the predictors.  In other words, the effect of $X_1$ on $Y$ depends on the value of $X_2$, or the effect of $X_2$ on $Y$ depends on the value of $X_1$.  Thus, such a regression model would have 3 effects on the response:

1. $X_1$
2. $X_2$
3. the interaction effect of $X_1$ and $X_2$

A full factorial design with 2 factors uses the 2-factor ANOVA model, which is an example of an interaction model.  It assumes a linear relationship between the response and the above 3 effects.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
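
The difference between the 2 models can be checked numerically.  This sketch evaluates only the systematic components (the random component $\varepsilon$ is left out), and the coefficient values are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def additive_model(x1, x2, b0=1.0, b1=2.0, b2=3.0):
    """Systematic component of the additive model."""
    return b0 + b1 * x1 + b2 * x2

def interaction_model(x1, x2, b0=1.0, b1=2.0, b2=3.0, b3=4.0):
    """Systematic component of the interaction model."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(model, x2):
    """Change in the systematic component when X_1 goes from 0 to 1 at a fixed X_2."""
    return model(1, x2) - model(0, x2)

for name, model in (("additive", additive_model), ("interaction", interaction_model)):
    effects = [effect_of_x1(model, x2) for x2 in (0, 1)]
    print(f"{name}: effect of X_1 at X_2 = 0 is {effects[0]}, at X_2 = 1 is {effects[1]}")
```

In the additive model, the effect of $X_1$ is $\beta_1$ regardless of $X_2$.  In the interaction model, it is $\beta_1 + \beta_3 X_2$, so it changes with the value of $X_2$.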

Note that additive models and interaction models are not confined to experimental design; I have merely used experimental design to provide examples for these 2 types of models.

## A Story About Perseverance – Inspiration From My Old Professor

Names and details in this blog post have been altered to protect the privacy of its subjects.

I met my old professor, Dr. Perez, for lunch recently.  We have kept in touch for many years since she taught me during my undergraduate studies, and she has been a good friend and mentor.  We had not seen each other for a few years, but we have been in regular contact over phone and email, exchanging stories, updates, photos of her grandchildren, frustrations, thrills, and perspectives.  It was nice to see her again.

I told her about the accomplishments and the struggles in my early career as a statistician so far.  I am generally satisfied with how I have performed since my entry into the statistics profession, but there are many skills that I don’t have or need to improve upon.  I want to learn distributed computing and become better at programming in Python, SQL, and Hadoop – skills that are highly in demand in my industry but were not taught during my statistics education.  I want to be better at communicating about statistics to non-statisticians – not only helping them to understand difficult concepts, but persuading them to follow my guidance when I know that I am right.  I sometimes even struggle with seemingly basic questions that require much thinking and research on my part to answer.  While all of these are likely common weaknesses that many young statisticians understandably have, they contribute to my feeling of incompetence on occasion – and it’s not pleasant to perform below my own, my colleagues’, or my industry’s expectations of me.

Dr. Perez listened and provided helpful observations and advice.  While I am working hard and focusing on my specific problems at the moment, she gave me a broader, more long-term perspective about how best to overcome these struggles, and I really appreciated it.  Beyond this, however, she told me a story about a professor of our mutual acquaintance that stunned and saddened me, yet motivated me to continue to work harder.

## Physical Chemistry Lesson of the Day – Hess’s Law

Hess’s law states that the change in enthalpy of a multi-stage chemical reaction is just the sum of the changes of enthalpy of the individual stages.  Thus, if a chemical reaction can be written as a sum of multiple intermediate reactions, then its change in enthalpy can be easily calculated.  This is especially helpful for a reaction whose change in enthalpy is difficult to measure experimentally.

Hess’s law is a consequence of the fact that enthalpy is a state function; the path between the reactants and the products is irrelevant to the change in enthalpy – only the initial and final values matter.  Thus, if there is a path for which the intermediate values of $\Delta H$ are easy to obtain experimentally, then their sum equals the $\Delta H$ for the overall reaction.
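
As a worked illustration (the reactions and the commonly tabulated values are chosen here, not taken from the lesson), consider the formation of carbon monoxide from graphite and oxygen, which is hard to measure directly because some carbon dioxide tends to form as well.  Two reactions that are easier to measure are the combustion of graphite and the combustion of carbon monoxide; writing the second one in reverse flips the sign of its $\Delta H^\circ$:

$\mathrm{C}(s, \text{graphite}) + \mathrm{O_2}(g) \rightarrow \mathrm{CO_2}(g), \quad \Delta H^\circ = -393.5 \ \text{kJ}$

$\mathrm{CO_2}(g) \rightarrow \mathrm{CO}(g) + \tfrac{1}{2} \mathrm{O_2}(g), \quad \Delta H^\circ = +283.0 \ \text{kJ}$

Adding the 2 stages cancels the carbon dioxide and half of the oxygen, leaving

$\mathrm{C}(s, \text{graphite}) + \tfrac{1}{2} \mathrm{O_2}(g) \rightarrow \mathrm{CO}(g), \quad \Delta H^\circ = -393.5 \ \text{kJ} + 283.0 \ \text{kJ} = -110.5 \ \text{kJ}$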

## Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it uses fewer assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.
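
Both algorithms fit in a few lines, which makes the "memory-based" idea easy to see: there is no fitting step, and every prediction searches the stored training set.  The sketch below is a bare-bones illustration that assumes Euclidean distance and unweighted neighbours; the function names and the toy data are made up for this example.

```python
import math
from collections import Counter

def k_nearest(train_x, train_y, query, k):
    """Return the targets of the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train_x, train_y, query, k):
    """Predict the most common class among the k nearest neighbours."""
    return Counter(k_nearest(train_x, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_x, train_y, query, k):
    """Predict the average target of the k nearest neighbours."""
    neighbours = k_nearest(train_x, train_y, query, k)
    return sum(neighbours) / len(neighbours)

# Toy classification data: 2 well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(points, labels, (0.5, 0.5), k=3))

# Toy regression data with a single input.
inputs = [(1.0,), (2.0,), (3.0,), (10.0,)]
targets = [1.0, 2.0, 3.0, 10.0]
print(knn_regress(inputs, targets, (2.1,), k=3))
```

Note that the whole training set is passed to every prediction call, which is exactly the storage and computation cost described above.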

## Applied Statistics Lesson of the Day – The Full Factorial Design

An experimenter may seek to determine the causal relationships between $G$ factors and the response, where $G > 1$.  On first instinct, you may be tempted to conduct $G$ separate experiments, each using the completely randomized design with 1 factor.  Often, however, it is possible to conduct 1 experiment with $G$ factors at the same time.  This is better than the first approach because

• it is faster
• it uses fewer resources to answer the same questions
• the interactions between the $G$ factors can be examined

Such an experiment requires the full factorial design; in this design, the treatments are all possible combinations of all levels of all factors.  After controlling for confounding variables and choosing the appropriate range and number of levels of each factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels.  Such an experiment would result in $2 \times 2 = 4$ treatments, each being a combination of 1 level from the first factor and 1 level from the second factor.  Since this is a full factorial design, experimental units are independently assigned to all treatments.  The 2-factor ANOVA model is commonly used to analyze data from such designs.
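
Enumerating the treatments of a full factorial design is a Cartesian product of the factors' levels.  Here is a short sketch using `itertools.product` from the Python standard library; the factor names and levels are hypothetical.

```python
from itertools import product

def full_factorial(factors):
    """List every treatment: one combination of levels across all factors."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

# Hypothetical experiment: 2 factors, each with 2 levels.
factors = {
    "temperature": ["low", "high"],
    "catalyst": ["absent", "present"],
}

treatments = full_factorial(factors)
for treatment in treatments:
    print(treatment)
print(len(treatments), "treatments")
```

The number of treatments is the product of the numbers of levels, so it grows quickly as factors or levels are added.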

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.

## Physical Chemistry Lesson of the Day – The Perpetual Motion Machine

A thermochemical equation is a chemical equation that also shows the standard heat of reaction.  Recall that the value given by ΔHº is only true when the coefficients of the reactants and the products represent the number of moles of the corresponding substances.

The law of conservation of energy ensures that the standard heat of reaction for the reverse reaction of a thermochemical equation is just the forward reaction’s ΔHº multiplied by -1.  Let’s consider a thought experiment to show why this must be the case.

Imagine that a forward reaction is exothermic with ΔHº = -150 kJ, and that its endothermic reverse reaction has ΔHº = 100 kJ.  Then carrying out the exothermic forward reaction releases 150 kJ.  Out of that released heat, 100 kJ can be used to fuel the reverse reaction, and 50 kJ can be saved as a “profit” for doing something else, such as moving a machine.  This can be done perpetually, and energy can be created forever – of course, this has never been observed to happen, and the law of conservation of energy prevents such a perpetual motion machine from being made.  Thus, the standard heats of reaction for the forward and reverse reactions of the same thermochemical equation have the same magnitudes but opposite signs.
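
The bookkeeping in this thought experiment can be written out explicitly.  One pass through the hypothetical cycle (forward reaction, then reverse reaction) returns the chemicals to their starting state, yet the enthalpy changes would sum to

$\Delta H^\circ_{\text{cycle}} = \Delta H^\circ_{\text{forward}} + \Delta H^\circ_{\text{reverse}} = (-150 \ \text{kJ}) + (+100 \ \text{kJ}) = -50 \ \text{kJ}$

Conservation of energy requires this sum to be zero, which forces $\Delta H^\circ_{\text{reverse}} = -\Delta H^\circ_{\text{forward}} = +150 \ \text{kJ}$.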

Regardless of how hard the reverse reaction may be to carry out, its ΔHº can still be written.