## Machine Learning Lesson of the Day – K-Nearest Neighbours Regression

I recently introduced the K-nearest neighbours classifier.  Some slight adjustments to the same algorithm can make it into a regression technique.

Given a training set and a new input $X$, we can predict the target of the new input by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$ by Euclidean distance
2. building a weighted linear model to predict the target for $X$
• the targets of the K neighbours are the predictors
• the reciprocals of the neighbours’ distances to $X$, normalized to sum to 1, are their respective coefficients (the “weights”), so the prediction is a distance-weighted average of the neighbours’ targets

Validation or cross-validation can be used to determine the best value of K.
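
The steps above can be sketched in base R.  This is only a minimal illustration, not a definitive implementation: the function name `knn_regress`, its arguments, and the toy data are all hypothetical, and the weights are normalized to sum to 1 so that the prediction is a weighted average of the neighbours' targets.  An input that coincides exactly with a training point would have an infinite weight, so the sketch simply returns that point's target in that case (an assumption of this sketch).

```r
# Minimal sketch of inverse-distance-weighted KNN regression (base R only).
# knn_regress, train_x, train_y and the toy data are hypothetical names.
knn_regress <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from new_x to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # An exact match would give an infinite weight, so return its target directly
  if (any(dists == 0)) return(mean(train_y[dists == 0]))
  nn <- order(dists)[seq_len(k)]   # indices of the K nearest neighbours
  w <- 1 / dists[nn]               # reciprocal distances as weights
  sum(w * train_y[nn]) / sum(w)    # weighted average of the neighbours' targets
}

# Toy training set with 2 features and a numeric target
train_x <- cbind(x1 = c(1, 2, 3, 6, 7), x2 = c(1, 2, 3, 6, 7))
train_y <- c(1.0, 2.0, 3.0, 6.0, 7.0)

# The new input is equally far from the 2nd and 3rd training points,
# so with k = 2 the prediction is the plain average of their targets
pred <- knn_regress(train_x, train_y, new_x = c(2.5, 2.5), k = 2)
pred
```

In practice, the features should usually be put on comparable scales before computing Euclidean distances; that step is omitted here for brevity.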

## Useful Functions in R for Manipulating Text Data

#### Introduction

In my current job, I study HIV at the genetic and biochemical levels.  Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text.  (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.)  In this post, I describe some common functions in R that I often use for text processing.

#### Obtaining Basic Information about Character Variables

In R, I often work with text data in the form of character variables.  To check if a variable is a character variable, use the is.character() function.

> year = 2014
> is.character(year)
[1] FALSE

If a variable is not a character variable, you can convert it to a character variable using the as.character() function.

> year.char = as.character(year)
> is.character(year.char)
[1] TRUE

A basic piece of information about a character variable is the number of characters in the string.  Use the nchar() function to obtain this information.

> nchar(year.char)
[1] 4
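
The same functions work on a whole vector of strings at once.  The short script below is a sketch with made-up nucleotide sequences (the sample names and sequences are hypothetical); it also shows that nchar() counts the characters in each string, while length() counts the number of strings in the vector.

```r
# Hypothetical nucleotide sequences, for illustration only
seqs <- c(sample1 = "ATGGCA", sample2 = "ATGGCATTC", sample3 = "")

# Check that the vector is a character vector
is.char <- is.character(seqs)

# Number of characters in each string (one value per element)
seq.lengths <- nchar(seqs)

# Number of strings in the vector (not the number of characters)
n.seqs <- length(seqs)

is.char
seq.lengths
n.seqs
```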


## Applied Statistics Lesson of the Day – The Matched-Pair (or Paired) t-Test

My last lesson introduced the matched pairs experimental design, which is a special type of the randomized block design.  Let’s now talk about how to analyze the data from such a design.

Since the experimental units are organized in pairs, the 2 treatment groups are not independent samples; each response in one group is matched with the response from the same pair (block) in the other group, and the 2 responses within a pair tend to be correlated.  (Within each pair, the treatments are still randomly assigned – returning to the glove example, one hand is randomly chosen to wear the nitrile glove, while the other wears the latex glove.)  Because of this dependence between the 2 groups, the independent 2-sample t-test is not applicable.  Instead, use the matched pair t-test (also called the paired or the paired difference t-test).  This is really a 1-sample t-test on the within-pair differences between the responses of the experimental and the control groups; it tests whether the mean of these differences is zero.
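
To see that the matched pair t-test is a 1-sample t-test on the differences, here is a short sketch in R using the built-in t.test() function.  The durability values are invented purely for illustration; element i of each vector is assumed to come from the same student.

```r
# Hypothetical durability measurements (hours until failure) for 6 students;
# element i of each vector comes from the same student (the same pair).
nitrile <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)
latex   <- c(4.6, 4.9, 5.2, 5.0, 4.4, 5.1)

# Matched pair t-test
paired.test <- t.test(nitrile, latex, paired = TRUE)

# Equivalent 1-sample t-test on the within-pair differences
diff.test <- t.test(nitrile - latex, mu = 0)

# The test statistics and the p-values agree
c(paired = unname(paired.test$statistic), one.sample = unname(diff.test$statistic))
c(paired = paired.test$p.value, one.sample = diff.test$p.value)
```

Both tests have 5 degrees of freedom here (the number of pairs minus 1), not the 10 that an independent 2-sample t-test with equal variances would use.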

## Physical Chemistry Lesson of the Day – State Functions vs. Path Functions

Today’s lesson may seem mundane, but it is subtle and actually quite important.  I needed to spend some time to learn it and digest it, and it was time well spent – these concepts are essential for understanding much of thermodynamics.  For brevity, I have not dived into the detailed mathematics of exact differentials, though I highly recommend that you learn it and review the necessary calculus.

Some thermodynamic properties of a system can be described by state variables, while others can be described by path variables.

A state variable is a variable whose value depends only on the current state of a system; consequently, its change depends only on the initial and final states and not on the path connecting these states.  Internal energy and enthalpy are examples of state variables.  For example, in a previous post on the First Law of Thermodynamics, I defined the change in internal energy, $\Delta U$, as

$\Delta U = \int_{i}^{f} dU = U_f - U_i$.

Changes in state variables can be calculated by integrating exact differentials.

A path variable is a variable that depends on the sequence of steps that takes the system from the initial state to the final state.  This sequence of steps is called the path.  Heat and work are examples of path variables.  The differentials of path variables are inexact, so path variables cannot be calculated from the initial and final states alone.  In fact, the following quantities may seem to have plausible interpretations, but they actually do not exist:

• change in heat ($\Delta q$)
• initial heat ($q_i$)
• final heat ($q_f$)
• change in work ($\Delta w$)
• initial work ($w_i$)
• final work ($w_f$)

There is no such thing as heat or work being possessed by a system.  Heat and work can be transferred between the system and the surroundings, but the end result is an increase or decrease in internal energy; neither the system nor the surroundings possesses heat or work.
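
A small numerical sketch in R can make the distinction concrete.  It is an illustration under stated assumptions, not part of the lesson itself: 1 mole of an ideal gas expands isothermally between the same initial and final states along 2 different paths, all of the numerical values are invented, and the sign convention $\Delta U = q + w$ (work done on the system is positive) is assumed.

```r
# Illustrative sketch: 1 mol of ideal gas expands isothermally from V1 to V2.
# All numerical values are assumed for illustration.
# Sign convention (assumed): w is the work done ON the system, so Delta U = q + w.
n    <- 1        # amount of gas (mol)
R    <- 8.314    # gas constant (J / (mol K))
temp <- 298      # temperature (K)
V1   <- 0.010    # initial volume (m^3)
V2   <- 0.020    # final volume (m^3)

# Path A: reversible isothermal expansion
w.rev <- -n * R * temp * log(V2 / V1)

# Path B: irreversible expansion against a constant external pressure
# equal to the final pressure of the gas
p.ext   <- n * R * temp / V2
w.irrev <- -p.ext * (V2 - V1)

# For an ideal gas at constant temperature, Delta U = 0 on any path, so q = -w
q.rev   <- -w.rev
q.irrev <- -w.irrev

# Same initial and final states: w and q differ by path, Delta U does not
data.frame(path    = c("reversible", "irreversible"),
           w       = c(w.rev, w.irrev),
           q       = c(q.rev, q.irrev),
           delta.U = c(q.rev + w.rev, q.irrev + w.irrev))
```

The 2 rows have different values of $w$ and $q$ but the same $\Delta U$, which is what it means for heat and work to be path variables and for internal energy to be a state variable.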

A state/path variable is also often called a state/path function or a state/path quantity.

## Simon Fraser University Outstanding Alumni Awards – Wednesday, February 26, 2014 @ Four Seasons Hotel in Vancouver

I am delighted to have been invited to attend the SFU Outstanding Alumni Awards Dinner on this coming Wednesday, February 26!  I am grateful to the SFU Library for inviting me to this wonderful event to celebrate some remarkable graduates from my alma mater.  (I volunteered as a Learning and Writing Skills Peer Educator in the SFU Library’s Student Learning Commons for 7 years.  This was one of the most valuable experiences of my undergraduate education, and it is a pleasure to participate in this event with the Library.)

If you are attending this event, please do come to the Library’s table and say “Hello!”

Event Date:
Wednesday, February 26, 2014

Time:
Reception: 6:00pm
Dinner + Awards: 6:45pm

Location:
Four Seasons Hotel
791 West Georgia Street
Vancouver, BC

## Machine Learning Lesson of the Day: The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input $X$ by

1. identifying the K data (the K “neighbours”) in the training set that are closest to $X$
2. counting the number of “neighbours” that belong to each class of the target variable
3. classifying $X$ by the most common class to which its neighbours belong

For a binary target variable, K is usually chosen to be an odd number to avoid ties.

The proximity of the neighbours to $X$ is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best value of K.
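
The 3 steps above can be sketched in base R.  This is a minimal illustration only: the function name `knn_classify`, its arguments, and the toy data are hypothetical, and a tie in the vote is broken by taking the first class in alphabetical order, which is an arbitrary choice made for this sketch.

```r
# Minimal sketch of a KNN classifier (base R only).
# knn_classify, train_x, train_class and the toy data are hypothetical names.
knn_classify <- function(train_x, train_class, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Step 1: find the K training points closest to new_x (Euclidean distance)
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nn <- order(dists)[seq_len(k)]
  # Step 2: count the neighbours that belong to each class
  votes <- table(train_class[nn])
  # Step 3: return the most common class (ties go to the first class listed)
  names(votes)[which.max(votes)]
}

# Toy training set: 2 features, 2 well-separated classes
train_x <- cbind(x1 = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.9),
                 x2 = c(1.1, 0.9, 1.0, 5.1, 4.8, 5.0))
train_class <- c("A", "A", "A", "B", "B", "B")

pred.low  <- knn_classify(train_x, train_class, new_x = c(1.1, 1.0), k = 3)
pred.high <- knn_classify(train_x, train_class, new_x = c(5.0, 5.0), k = 3)
c(pred.low, pred.high)
```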

## Applied Statistics Lesson of the Day – The Matched Pairs Experimental Design

The matched pairs design is a special type of the randomized block design in experimental design.  It has only 2 treatment levels (i.e. there is 1 factor, and this factor is binary), and a blocking variable divides the $n$ experimental units into $n/2$ pairs.  Within each pair (i.e. each block), the experimental units are randomly assigned to the 2 treatment groups (e.g. by a coin flip).  The experimental units are divided into pairs such that homogeneity is maximized within each pair.

For example, a lab safety officer wants to compare the durability of nitrile and latex gloves for chemical experiments.  She wants to conduct an experiment with 30 nitrile gloves and 30 latex gloves to test her hypothesis.  She does her best to draw a random sample of 30 students in her university for her experiment, and they all perform the same organic synthesis using the same procedures to see which type of gloves lasts longer.

She could use a completely randomized design so that a random sample of 30 hands get the 30 nitrile gloves, and the other 30 hands get the 30 latex gloves.  However, since lab habits are unique to each person, they are a confounding variable – durability can be affected by both the material and a student’s lab habits, and the lab safety officer only wants to study the effect of the material.  Thus, a randomized block design should be used instead so that each student acts as a block – 1 hand gets a nitrile glove, and 1 hand gets a latex glove.  Within each student, the type of glove is randomly assigned to each hand; some may get the nitrile glove on their left hand, and some may get it on their right hand.  Since this design involves one binary factor and blocks that divide the experimental units into pairs, this is a matched pairs design.
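
The random assignment in the glove example could be generated in R along these lines.  This is a sketch under assumptions: the seed, the variable names, and the layout of the `design` data frame are arbitrary choices for illustration.

```r
set.seed(26)   # arbitrary seed, only for reproducibility
n.students <- 30

# For each student (block), a "coin flip" decides which hand gets the nitrile glove
nitrile.hand <- sample(c("left", "right"), size = n.students, replace = TRUE)

# One row per experimental unit (hand): 30 students x 2 hands = 60 rows
design <- data.frame(
  student = rep(seq_len(n.students), each = 2),
  hand    = rep(c("left", "right"), times = n.students)
)

# The chosen hand gets nitrile; the student's other hand gets latex
design$glove <- ifelse(design$hand == rep(nitrile.hand, each = 2),
                       "nitrile", "latex")

head(design)
table(design$glove)
```

Every student appears exactly once in each treatment group, so differences in lab habits between students cannot be confused with the effect of the glove material.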

## Physical Chemistry Lesson of the Day – Standard Heats of Reaction

The change in enthalpy of a chemical reaction indicates how much heat is absorbed or released by the system at constant pressure.  This is valuable information in chemistry, because the exchange of heat affects the reaction conditions and the surroundings, and that needs to be managed and taken into account – in theory, in the laboratory, in industry or in nature in general.

Chemists often want to compare the changes in enthalpy between different reactions.  Since changes in enthalpy depend on both temperature and pressure, we need to control for these 2 confounding variables by using a reference set of temperature and pressure.  This set of conditions is called the standard conditions, and it conventionally sets the standard temperature at 298 K (more precisely, 298.15 K, i.e. 25 °C) and the standard pressure at 1 bar.  (IUPAC changed the definition of standard pressure from 1 atmosphere to 1 bar in 1982.  The actual difference in pressure between these 2 definitions is small: 1 atmosphere is 1.01325 bar, a difference of about 1.3%.)

The standard enthalpy of reaction (or standard heat of reaction) is the change in enthalpy of a chemical reaction under standard conditions; the actual number of moles is specified by the coefficients of the balanced chemical equation.  (Since enthalpy is an extensive property, the same reaction under standard conditions could have different changes in enthalpy with different amounts of the reactants and products.  Thus, the number of moles of the reaction must be standardized somehow when defining the standard enthalpy of reaction.)  The standard enthalpy of reaction has the symbol $\Delta H^{\circ}$; the $^{\circ}$ symbol indicates the standard conditions.
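
As a small worked illustration of why the coefficients matter, the R snippet below scales the standard enthalpy change for forming liquid water from hydrogen and oxygen.  The value of -285.83 kJ per mole of water is the commonly tabulated figure at 298.15 K; the variable names are hypothetical.

```r
# Standard enthalpy change for forming 1 mol of liquid water from H2 and O2
# at 298.15 K (kJ per mol of H2O); a commonly tabulated value
dH.per.mol.H2O <- -285.83

# Coefficient of H2O in 2 ways of writing the same balanced reaction
coef.H2O <- c("H2 + 1/2 O2 -> H2O(l)" = 1,
              "2 H2 + O2 -> 2 H2O(l)" = 2)

# Enthalpy is extensive, so the standard enthalpy of reaction scales
# with the number of moles specified by the balanced equation
dH.reaction <- coef.H2O * dH.per.mol.H2O
dH.reaction
```

The second equation involves twice as many moles of every species, so its standard enthalpy of reaction is twice as large in magnitude.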

## The Chemical Statistician Celebrates its 1st Anniversary!

Dear Readers of The Chemical Statistician:

Yesterday was the 1st anniversary of The Chemical Statistician, and I am so glad that you have taken the time to visit my blog and allow me to share my passion about statistics, chemistry and machine learning with you.  I started this blog as a way to learn new things, document what I know, and understand concepts on a deeper level.  Little did I know that viewers from 146 countries would pay more than 34,000 visits to my blog in its first year!  (This does not include the thousands of views on my syndicated blog posts on AnalyticBridge!)

Since the beginning of 2014, I have significantly expanded the scope of The Chemical Statistician.  In addition to my classic and popular tutorials that often come with in-depth explanations and fully commented programming scripts, it now features

## A Pause from Blogging and a Call for Ideas – Improving The Chemical Statistician for its 2nd Year and Beyond.

Dear Readers of The Chemical Statistician,

Thank you all for your loyal viewership of my blog!  As The Chemical Statistician approaches its 1-year anniversary (February 17), I would like to take 2 weeks off from writing and review the direction in which my blog is going.  I am very proud of what I have accomplished so far, especially with this year’s expansion into daily lessons on weekdays, advice columns, and a new YouTube video channel.  However, there is a lot more that I want to accomplish with this blog, and I want to hear from you in this regard.  What do you like or dislike?  What topics would you like me to cover?  What format or delivery would you like me to add to the existing presentation of my blog?

Your comments are welcome and most appreciated.  The wonderful feedback that I have received from you via LinkedIn, Twitter, AnalyticBridge, and this blog’s comments has been delightfully overwhelming.  Your positive encouragement provides a lot of motivation for me to continue and expand this blog, so thank you all for your generous support!

Eric