## How do Dew and Fog Form? Nature at Work with Temperature, Vapour Pressure, and Partial Pressure

In the early morning, especially here in Canada, I often see dew – water droplets formed by the condensation of water vapour on outside surfaces, like windows, car roofs, and leaves of trees.  I also sometimes see fog – water droplets or ice crystals that are suspended in air and often reduce visibility over long distances.  Have you ever wondered how they form?  It turns out that partial pressure, vapour pressure and temperature are the key concepts at work.

Dew (and Fog)

Source: Wikimedia

## Checking for Normality with Quantile Ranges and the Standard Deviation

#### Introduction

I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before.  This turns out to be a good way to check for normality in a data set.

In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), talk about the method that I learned from Trosset’s book, then build upon this method by possibly coming up with a new way to check for normality.  I have not fully established this idea, so I welcome your thoughts and ideas.
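The excerpt above does not state the fact itself.  One well-known relationship of this kind, which may or may not be exactly how Trosset's book presents it, is that the interquartile range of any normal distribution is about 1.35 times its standard deviation.  The R sketch below compares that constant with the sample ratio for simulated data; the function name `iqr.sd.ratio`, the sample sizes, and the two simulated distributions are assumptions chosen purely for illustration.

```r
# Theoretical IQR / SD ratio for any normal distribution
normal.ratio <- 2 * qnorm(0.75)

# Hypothetical helper: ratio of the sample IQR to the sample standard deviation
iqr.sd.ratio <- function(x) IQR(x) / sd(x)

set.seed(1)
normal.sample <- rnorm(10000, mean = 10, sd = 3)   # simulated normal data
skewed.sample <- rexp(10000, rate = 1)             # simulated right-skewed data

ratio.normal <- iqr.sd.ratio(normal.sample)
ratio.skewed <- iqr.sd.ratio(skewed.sample)

print(c(theoretical = normal.ratio, normal = ratio.normal, skewed = ratio.skewed))
```

A sample ratio far from the theoretical value suggests a departure from normality, although a ratio close to it does not by itself prove that the data are normal.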

## Estimating the Decay Rate and the Half-Life of DDT in Trout – Applying Simple Linear Regression with Logarithmic Transformation

This blog post uses a function and a script written in R that were displayed in an earlier blog post.

#### Introduction

This is the second of a series of blog posts about simple linear regression; the first was written recently on some conceptual nuances and subtleties about this model.  In this blog post, I will use simple linear regression to analyze a data set with a logarithmic transformation and discuss how to make inferences on the regression coefficients and the means of the target on the original scale.  The data document the decay of dichlorodiphenyltrichloroethane (DDT) in trout in Lake Michigan; I found them on page 49 of the book “Elements of Environmental Chemistry” by Ronald A. Hites.  Future posts will also be written on the chemical aspects of this topic, including the environmental chemistry of DDT and exponential decay in chemistry and, in particular, radiochemistry.

Dichlorodiphenyltrichloroethane (DDT)

Source: Wikimedia Commons

A serious student of statistics or a statistician re-learning the fundamentals like myself should always try to understand the math and the statistics behind a software package’s built-in function rather than treating it like a black box.  This is especially worthwhile for a basic yet powerful tool like simple linear regression.  Thus, instead of simply using the lm() function in R, I will reproduce the calculations done by lm() with my own function and script (posted earlier on my blog) to obtain inferential statistics on the regression coefficients.  However, I will not write or explain the math behind the calculations; they are shown in my own function with self-evident variable names, in case you are interested.  The calculations are arguably the most straightforward aspects of linear regression, and you can easily find the derivations and formulas on the web, in introductory or applied statistics textbooks, and in regression textbooks.
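To make the idea of the logarithmic transformation concrete, here is a minimal sketch in R.  It uses simulated concentrations rather than the DDT measurements from Hites's book, and it calls lm() for brevity instead of the hand-written function that this post relies on; all variable names and parameter values are assumptions for illustration.

```r
# Simulated (hypothetical) concentrations following exponential decay;
# these are NOT the DDT measurements from Hites's book
set.seed(42)
year <- 0:15
true.rate <- 0.15
concentration <- 20 * exp(-true.rate * year) * exp(rnorm(length(year), sd = 0.05))

# On the log scale, exponential decay is a straight line:
# log(C) = log(C0) - k * t
fit <- lm(log(concentration) ~ year)

# The decay rate is the negative of the slope; the half-life is log(2) / rate
decay.rate <- -unname(coef(fit)["year"])
half.life <- log(2) / decay.rate

# 95% confidence interval for the slope, converted to the half-life scale
# (valid here because the whole interval for the slope is negative)
slope.ci <- confint(fit, "year", level = 0.95)
half.life.ci <- sort(as.numeric(log(2) / -slope.ci))

print(c(decay.rate = decay.rate, half.life = half.life))
print(half.life.ci)
```

Because the half-life is a monotone function of the slope, the endpoints of the confidence interval for the slope can be transformed directly into an interval for the half-life.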

## My Own R Function and Script for Simple Linear Regression – An Illustration with Exponential Decay of DDT in Trout

Here is the function that I wrote for doing simple linear regression, as alluded to in my blog post about simple linear regression on log-transformed data on the decay of DDT concentration in trout in Lake Michigan.  My goal was to replicate the 4 columns of the output from applying summary() to the output of lm().

To use this file and this script,

1) I saved this file as “simple linear regression.r”.

2) In the same folder, I saved a script called “DDT trout regression.r” that used this function to implement simple linear regression on the log-transformed DDT data.

3) I used setwd() to change the working directory to the folder containing the function and the script.

4) I made sure “DDT trout regression.r” used the source() function to call my user-defined function for simple linear regression.

5) I ran “DDT trout regression.r”.
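The function itself is not reproduced in this excerpt.  As a rough idea of what such a function could look like, here is a hypothetical re-implementation (not my original code) that computes the 4 columns reported by summary() for lm(): the estimates, standard errors, t-values and p-values.  The function name, argument names and the simulated data are assumptions; in the workflow above, the function would live in its own file and be loaded with source().

```r
# Hypothetical re-implementation for illustration; NOT the original function
simple.linear.regression <- function(x, y) {
  n <- length(x)
  x.bar <- mean(x)
  y.bar <- mean(y)
  Sxx <- sum((x - x.bar)^2)
  Sxy <- sum((x - x.bar) * (y - y.bar))

  # Least-squares estimates of the slope and the intercept
  slope <- Sxy / Sxx
  intercept <- y.bar - slope * x.bar

  # Estimate of the error variance with n - 2 degrees of freedom
  resid.values <- y - intercept - slope * x
  mse <- sum(resid.values^2) / (n - 2)

  # Standard errors of the two estimates
  se.slope <- sqrt(mse / Sxx)
  se.intercept <- sqrt(mse * (1 / n + x.bar^2 / Sxx))

  estimates <- c(intercept, slope)
  std.errors <- c(se.intercept, se.slope)
  t.values <- estimates / std.errors
  p.values <- 2 * pt(abs(t.values), df = n - 2, lower.tail = FALSE)

  cbind("Estimate" = estimates, "Std. Error" = std.errors,
        "t value" = t.values, "Pr(>|t|)" = p.values)
}

# Simulated data (hypothetical) to compare against lm()
set.seed(10)
x <- 1:20
y <- 3 + 0.8 * x + rnorm(20)

my.table <- simple.linear.regression(x, y)
lm.table <- summary(lm(y ~ x))$coefficients

print(my.table)
print(all.equal(unname(my.table), unname(lm.table)))
```

Comparing against lm() on simulated data is a cheap way to check that each hand-computed column agrees with the built-in output.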

## Some Subtle and Nuanced Concepts about Simple Linear Regression

#### Introduction

This blog post will focus on some conceptual foundations of simple linear regression, a very common technique in statistics and a precursor for understanding multiple linear regression.  I will expose and clarify many nuances and subtleties that I did not fully absorb until my Master’s degree in statistics at the University of Toronto.

#### What is Simple Linear Regression?

Simple linear regression is a predictive model that uses a predictor variable (x) to predict a continuous target variable (Y).  It is a formal and rigorous way to express 2 fundamental components of a statistical predictive model.

1) For each value of x, there is a probability distribution of Y.

2) The means of these probability distributions of Y vary with x in a systematic way.

Mathematically, the first component is reflected in a random error variable, and the second component is reflected in the constant that expresses the linear relationship between x and Y.  These two components add together to give the following mathematical model.

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \ \ \ i = 1,...,n$

$\varepsilon_i \sim Normal(0, \sigma^2)$

$\varepsilon_i \perp \varepsilon_j, \ \ \ \ \ i \neq j$

The last mathematical expression states that two different error terms are statistically independent.

Essentially, this model captures the tendency for Y to vary systematically with x.  The systematic part is the constant term, $\beta_0 + \beta_1 x_i$.  The tendency (rather than a direct relation) is reflected in the probability distribution of the error component.

Note that I capitalized the target Y because it is a random variable.  (It is a linear combination of the random error, so it is also a random variable.)  I used lower-case for the predictor because it is a constant in the model.
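One way to internalize this model is to simulate from it.  The R sketch below treats x as a set of fixed constants, draws independent normal errors, and adds the two components together; the values chosen for the coefficients, the error standard deviation and the sample size are hypothetical.

```r
set.seed(123)

# Hypothetical parameter values chosen only for illustration
n <- 50
beta0 <- 2
beta1 <- 0.5
sigma <- 1

# The predictor is a fixed constant, not a random variable
x <- seq(1, 10, length.out = n)

# Independent random errors: Normal(0, sigma^2)
epsilon <- rnorm(n, mean = 0, sd = sigma)

# Systematic part plus random part
systematic <- beta0 + beta1 * x
Y <- systematic + epsilon

# The points scatter around the line; the line itself is the systematic part
plot(x, Y)
abline(a = beta0, b = beta1)
```

Re-running the simulation with a different seed changes Y but not x or the line, which mirrors the distinction between the random target and the constant predictor.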

#### What are the Assumptions of Simple Linear Regression?

1) The predictor variable is a fixed constant with no random variation.  If you want to model the predictor as a random variable, use the errors-in-variables model (a.k.a. measurement errors model).

2) The target variable is a linear combination of the regression coefficients and the predictor.

3) The variance of the random error component is constant.  This assumption is called homoscedasticity.

4) The random errors are independent of each other.

5) The regression coefficients are constants.  If you want to model the regression coefficients as random variables, use the random effects model.  If you want to include both fixed and random coefficients in your model, use the mixed effects model.  The documentation for PROC MIXED in SAS/STAT has a nice explanation of the mixed effects model.  I also recommend the documentation for PROC GLM for more about the random effects model.

***6) The random errors are normally distributed with an expected value of 0 and a variance of $\sigma^2$.  As Assumption #3 states, this variance is constant for all $\varepsilon_i, \ i = 1,...,n$.

***This last assumption is not needed for the least-squares estimation of the regression coefficients.  However, it is needed for conducting statistical inference for the regression coefficients, such as testing hypotheses and constructing confidence intervals.

#### Important Clarifications about the Terminology

Let me clarify some common confusion about the 2 key terms in the name “simple linear regression”.

– It is called “simple” because it uses only one predictor, whereas multiple linear regression uses multiple predictors.  While it is relatively simple to understand, and while it is a simple model compared to other predictive models, there are many concepts and nuances behind linear regression that still make it difficult to understand for many people.  (I hope that this blog post will make it easier to understand this model!)

– It is called “linear” because the target variable is linear with respect to the parameters $\beta_0$ and $\beta_1$ (the regression coefficients), not because it is linear with respect to the predictor; this is a very common misunderstanding, and I did not learn this until the second course in which I learned about linear regression.  This is more than just a naming custom; it implies that the regression coefficients can be estimated using linear algebra, which has many benefits that will be described in a later post.

Simple linear regression does assume that the target variable has a linear relationship with the predictor variable.  However, if it does not, the problem can often be resolved: the predictor and/or the target can often be transformed to make the relationship linear.  If, however, the target variable cannot be written as a linear combination of the parameters $\beta_0$ and $\beta_1$, then the model is no longer linear regression, even if the target is linear with respect to the predictor.
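Here is a small R sketch of this point, using simulated data with hypothetical coefficients.  The target is curved with respect to x, yet the model is still linear in $\beta_0$ and $\beta_1$ once the predictor is transformed to $x^2$, so lm() can fit it.

```r
set.seed(7)

# Simulated data (hypothetical): curved in x, but linear in the parameters
x <- seq(0, 5, length.out = 40)
y <- 1 + 2 * x^2 + rnorm(length(x), sd = 0.5)

# Straight line in the untransformed predictor
fit.raw <- lm(y ~ x)

# Same kind of model with the transformed predictor x^2;
# I() tells lm() to treat x^2 as arithmetic rather than formula syntax
fit.squared <- lm(y ~ I(x^2))

print(coef(fit.squared))
print(c(raw = summary(fit.raw)$r.squared,
        squared = summary(fit.squared)$r.squared))
```

By contrast, a model such as $Y_i = \beta_0 + x_i^{\beta_1} + \varepsilon_i$ cannot be written as a linear combination of its parameters, so it falls outside linear regression.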

#### How are the Regression Coefficients Estimated?

The regression coefficients are estimated by finding the values of $\beta_0$ and $\beta_1$ that minimize the sum of the squares of the deviations of the data from the regression line.  My first linear regression textbook, “Applied Linear Statistical Models” by Kutner, Nachtsheim, Neter, and Li, uses the letter “Q” to denote this quantity.  This is called the method of least squares.  The word “minimize” should prompt you to find the global minimizers using differential calculus.

$Q = \sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2$

Differentiate Q with respect to $\beta_0$ and $\beta_1$; set the 2 derivatives to zero to get the normal equations.  The estimates are obtained by solving this system of 2 equations.
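Written out, these two steps look like this.  (This is the standard derivation; the hats on the estimates and the bars for sample means are notation introduced here for convenience.)

$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)$

$\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i)$

Setting both derivatives to zero and rearranging gives the normal equations.

$\sum_{i=1}^n y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^n x_i$

$\sum_{i=1}^n x_i y_i = \hat{\beta}_0 \sum_{i=1}^n x_i + \hat{\beta}_1 \sum_{i=1}^n x_i^2$

Solving them yields the least-squares estimates.

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \ \ \ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$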

#### Why is the Least-Squares Method Used to Estimate the Regression Coefficients?

A natural question arises: Why minimize the sum of the squares of the errors?  Why not minimize some other measure of the distances from the regression line to the data, like the sum of the absolute values of the errors?

$Q' = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|$

The answer lies within the Gauss-Markov theorem, which guarantees some very attractive properties for the least-squares estimators of the regression coefficients:

– these estimators are unbiased

– among all linear unbiased estimators, these estimators have the smallest variance

Thus, the least-squares estimators are both accurate and very precise.

Note that the Gauss-Markov theorem does not require the normality part of Assumption #6 above; it only needs the errors to have an expected value of zero, a constant variance of $\sigma^2$, and no correlation with each other.
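A small simulation can illustrate the unbiasedness claim without relying on normal errors.  In the R sketch below, the errors are drawn from a uniform distribution (mean zero, constant variance, but not normal), and the least-squares estimates are averaged over many simulated data sets.  The coefficient values, the error distribution and the number of simulations are hypothetical choices, and a simulation like this only illustrates the theorem; it does not prove it.

```r
set.seed(2013)

# Hypothetical true coefficients and a fixed predictor
beta0 <- 1
beta1 <- 3
x <- seq(0, 10, length.out = 30)
n.sim <- 2000

# Simulate one data set with non-normal errors and return the two estimates
estimate.once <- function() {
  # Uniform errors on (-2, 2): mean zero, constant variance, not normal
  epsilon <- runif(length(x), min = -2, max = 2)
  y <- beta0 + beta1 * x + epsilon
  coef(lm(y ~ x))
}

# 2 x n.sim matrix: row 1 holds intercepts, row 2 holds slopes
estimates <- replicate(n.sim, estimate.once())

# Averages over the simulations should sit close to the true values
average.estimates <- rowMeans(estimates)
print(average.estimates)
```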

## Discovering Argon with the 2-Sample t-Test

I learned about Lord Rayleigh’s discovery of argon in my 2nd-year analytical chemistry class while reading “Quantitative Chemical Analysis” by Daniel Harris.  (William Ramsay was also responsible for this discovery.)  This is one of my favourite stories in chemistry; it illustrates how diligence in measurement can lead to an elegant and surprising discovery.  I find no evidence that Rayleigh and Ramsay used statistics to confirm their findings; their paper was published 13 years before Gosset published about the t-test.  Thus, I will use a 2-sample t-test in R to confirm their result.
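As a preview of the mechanics, here is a minimal sketch of a 2-sample t-test in R.  The masses below are simulated placeholders, not Rayleigh's published measurements, and the variable names, means and spreads are assumptions chosen only to show how t.test() is called.

```r
# Hypothetical masses (in grams) generated for illustration only;
# these are NOT Rayleigh's published measurements
set.seed(1894)
mass.from.air <- rnorm(8, mean = 2.310, sd = 0.0002)
mass.from.chemicals <- rnorm(8, mean = 2.300, sd = 0.0015)

# By default, t.test() runs Welch's version of the 2-sample t-test,
# which does not assume equal variances in the two groups
result <- t.test(mass.from.air, mass.from.chemicals)

print(result)
```

Welch's version is a reasonable default when the two groups may have different spreads; setting var.equal = TRUE would give the pooled-variance test instead.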

Photos of Lord Rayleigh

Source: Wikimedia Commons

## Why Does Diabetes Cause Excessive Urination and Thirst? A Lesson on Osmosis

#### A TABA Seminar on Diabetes

I have the pleasure of being an executive member of the Toronto Applied Biostatistics Association (TABA), a volunteer-run professional organization here in Toronto that organizes seminars on biostatistics.  This past Tuesday, Dr. Loren Grossman from the LMC Diabetes and Endocrinology Centre generously donated his time to deliver an introductory seminar on diabetes for biostatisticians.  The Institute for Clinical Evaluative Sciences (ICES) at Sunnybrook Hospital kindly hosted us and provided the venue for the seminar.  As a chemist and a former pre-medical student who studied physiology, I really enjoyed this intellectual treat, especially since Loren was clear, informative, and very knowledgeable about the subject.

The blue circle is a global symbol for diabetes.

Source: Wikimedia Commons

## Adding Labels to Points in a Scatter Plot in R

#### What’s the Scatter?

A scatter plot displays the values of 2 variables for a set of data, and it is a very useful way to visualize data during exploratory data analysis, especially (though not exclusively) when you are interested in the relationship between a predictor variable and a target variable.  Sometimes, such data come with categorical labels that have important meanings, and the visualization of the relationship can be enhanced when these labels are attached to the data.

It is common practice to use a legend to label data that belong to a group, as I illustrated in a previous post on bar charts and pie charts.  However, what if every datum has a unique label, and there are many data in the scatter plot?  A legend would add unnecessary clutter in such situations.  Instead, it would be useful to write the label of each datum near its point in the scatter plot. I will show how to do this in R, illustrating the code with a built-in data set called LifeCycleSavings.
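A minimal sketch of the idea uses plot() followed by text() from base R.  The choice of the pop15 and sr columns, the axis labels, and the label size and position are assumptions made for illustration; they are not necessarily the choices in the full post.

```r
# LifeCycleSavings ships with base R; its row names are country names
x <- LifeCycleSavings$pop15
y <- LifeCycleSavings$sr
country.labels <- rownames(LifeCycleSavings)

# Draw the scatter plot first
plot(x, y, pch = 19,
     xlab = "% of population under 15",
     ylab = "Savings ratio")

# Then write each label just above its point (pos = 3) in a smaller font
text(x, y, labels = country.labels, pos = 3, cex = 0.6)
```

The pos argument places the label below (1), to the left of (2), above (3) or to the right of (4) each point, and cex shrinks the text so that neighbouring labels overlap less.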