January | 2014 | The Chemical Statistician

Physical Chemistry Lesson of the Day – Intensive vs. Extensive Properties

January 30, 2014

An extensive property is a property that depends on the size of the system. Examples include

mass
volume
number of moles
energy
heat capacity

An intensive property is a property that does not depend on the size of the system. Examples include

pressure
temperature
concentration
specific heat capacity
molar heat capacity
energy per unit of volume or mass

As you can see, some intensive properties can be derived from extensive properties by dividing an extensive property by the mass, volume, or number of moles of the system.
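To make the distinction concrete, here is a minimal Python sketch. The sample (100 g of a liquid occupying 100 mL) and the helper names `combine` and `density` are hypothetical, chosen only for illustration. Combining two identical systems doubles the extensive properties but leaves their ratio unchanged.

```python
# Hypothetical sample: 100 g of a liquid occupying 100 mL (illustrative numbers only).

def combine(system_a, system_b):
    """Merge two systems: extensive properties (mass, volume) simply add."""
    return {
        "mass_g": system_a["mass_g"] + system_b["mass_g"],
        "volume_mL": system_a["volume_mL"] + system_b["volume_mL"],
    }

def density(system):
    """Intensive property obtained by dividing one extensive property by another."""
    return system["mass_g"] / system["volume_mL"]

sample = {"mass_g": 100.0, "volume_mL": 100.0}
doubled = combine(sample, sample)

print(doubled["mass_g"], doubled["volume_mL"])  # extensive: both are twice as large
print(density(sample), density(doubled))        # intensive: the same for both systems
```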

Filed under Basic Chemistry, Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with concentration, energy, extensive property, heat capacity, intensive property, mass, molar heat capacity, moles, pressure, specific heat capacity, system, temperature, volume

Machine Learning Lesson of the Day – Overfitting

January 29, 2014

Any model in statistics or machine learning aims to capture the underlying trend or systematic component in a data set. That underlying trend cannot be precisely captured because of the random variation in the data around that trend. A model must have enough complexity to capture that trend, but not so much complexity that it also captures the random variation. An overly complex model will describe the noise in the data in addition to capturing the underlying trend, and this phenomenon is known as overfitting.

Let’s illustrate overfitting with linear regression as an example.

A linear regression model with sufficient complexity has just the right number of predictors to capture the underlying trend in the target. If some new but irrelevant predictors are added to the model, then they “have nothing to do” – all the variation underlying the trend in the target has been captured already. Since they are now “stuck” in this model, they “start looking” for variation to capture or explain, but the only variation left over is the random noise. Thus, the new model with these added irrelevant predictors describes the trend and the noise. It predicts the targets in the training set extremely well, but very poorly for targets in any new, fresh data set – the model captures the noise that is unique to the training set.
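The following Python sketch illustrates this idea under simulated conditions. Everything in it is an assumption made for illustration: the true trend (y = 1 + 2x plus Gaussian noise), the sample sizes, the 10 irrelevant predictors, and the helper names. It fits least squares through the normal equations using only the standard library. Adding predictors can never increase the training error of a least-squares fit; the test error often gets worse, though that is not guaranteed for every random draw.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def mse(X, y, beta):
    """Mean squared error of the linear predictions X beta."""
    errors = [(yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
              for row, yi in zip(X, y)]
    return sum(errors) / len(errors)

def make_data(n, n_irrelevant, rng):
    """Rows are [1, x, junk_1, ..., junk_k]; only x drives the target."""
    X, y = [], []
    for _ in range(n):
        x = rng.uniform(0, 10)
        junk = [rng.gauss(0, 1) for _ in range(n_irrelevant)]
        X.append([1.0, x] + junk)
        y.append(1.0 + 2.0 * x + rng.gauss(0, 1))
    return X, y

def train_and_test_mse(n_columns):
    """Fit on the first n_columns of the training set; report both errors."""
    train = [row[:n_columns] for row in X_train]
    test = [row[:n_columns] for row in X_test]
    beta = fit_ols(train, y_train)
    return mse(train, y_train, beta), mse(test, y_test, beta)

rng = random.Random(0)
X_train, y_train = make_data(20, 10, rng)
X_test, y_test = make_data(200, 10, rng)

train_small, test_small = train_and_test_mse(2)   # intercept + the relevant predictor
train_big, test_big = train_and_test_mse(12)      # plus 10 irrelevant predictors

print("relevant predictor only :", train_small, test_small)
print("with irrelevant ones    :", train_big, test_big)
```

Comparing the two printed rows shows how the extra predictors affect the training error and the error on fresh data.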

(The above explanation used a parametric model for illustration, but overfitting can also occur for non-parametric models.)

To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the predictors would result in vastly different predictions for the targets.

Overfitting differs from multicollinearity, which I will explain in a later post. Overfitting has irrelevant predictors, whereas multicollinearity has redundant predictors.

Filed under Machine Learning, Machine Learning Lesson of the Day, Statistics Tagged with linear regression, machine learning, multicollinearity, overfitting, statistics

Applied Statistics Lesson of the Day – Blocking and the Randomized Complete Blocked Design (RCBD)

January 28, 2014

A completely randomized design works well for a homogeneous population – one that does not have major differences between any sub-populations. However, what if a population is heterogeneous?

Consider an example that commonly occurs in medical studies. An experiment seeks to determine the effectiveness of a drug in curing a disease, and 100 patients are recruited for this double-blinded study – 50 are men, and 50 are women. An abundance of biological knowledge tells us that men and women have significantly different physiologies, and this is a heterogeneous population with respect to gender. If a completely randomized design is used for this study, gender could be a confounding variable; this is especially true if the experimental group has a much higher proportion of one gender, and the control group has a much higher proportion of the other gender. (For instance, purely due to the randomness, 45 males may be assigned to the experimental group, and 45 females may be assigned to the control group.) If a statistically significant difference in the patients’ survival from the disease is observed between such a pair of experimental and control groups, this effect could be attributed to the drug or to gender, and that would ruin the goal of determining the cause-and-effect relationship between the drug and survival from the disease.

To overcome this heterogeneity and control for the effect of gender, a randomized blocked design could be used. Blocking is the division of the experimental units into homogeneous sub-populations before assigning treatments to them. A randomized blocked design for our above example would divide the males and females into 2 separate sub-populations, and then each of these 2 groups is split into the experimental and control group. Thus, the experiment actually has 4 groups:

25 men take the drug (experimental)
25 men take a placebo (control)
25 women take the drug (experimental)
25 women take a placebo (control)

Essentially, the population is divided into blocks of homogeneous sub-populations, and a completely randomized design is applied to each block. This minimizes the effect of gender on the response and increases the precision of the estimate of the effect of the drug.
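As a rough illustration of the assignment step, the Python sketch below blocks the 100 patients by gender and then randomizes within each block. The patient identifiers, the function name `randomized_block_assignment`, and the fixed seed are hypothetical; a real study would use its own randomization protocol.

```python
import random

def randomized_block_assignment(units, block_of, treatments, rng):
    """Group units into blocks, then randomly assign treatments within each block."""
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # a completely randomized design inside the block
        for position, unit in enumerate(shuffled):
            # Cycling through the treatments keeps the group sizes balanced.
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Hypothetical patients: (gender, id) pairs, 50 men and 50 women.
patients = [("M", i) for i in range(50)] + [("F", i) for i in range(50)]

assignment = randomized_block_assignment(
    patients,
    block_of=lambda patient: patient[0],  # block on gender
    treatments=("drug", "placebo"),
    rng=random.Random(2014),
)

for gender in ("M", "F"):
    for treatment in ("drug", "placebo"):
        count = sum(1 for (g, _), t in assignment.items()
                    if g == gender and t == treatment)
        print(gender, treatment, count)
```

Because the shuffle happens separately within each block, every run yields the 4 groups of 25 listed above; only the membership of each group changes.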

Filed under Experimental Design, Statistics, Statistics Lesson of the Day Tagged with blinding, blocking, completely randomized design, confounder, confounding, confounding variable, control, design of experiments, DOE, double blinded, experiment, experimental design, placebo, precision, Randomized Complete Blocked Design, RCBD, statistics

Physical Chemistry Lesson of the Day – The Effect of Temperature on Changes in Internal Energy and Enthalpy

January 27, 2014 2 Comments

When the temperature of a system increases, the kinetic and potential energies of the atoms and molecules in the system increase. Thus, the internal energy of the system increases, which means that the enthalpy of the system increases – this is true under constant pressure or constant volume.

Recall that the heat capacity of a system is the amount of energy that is required to raise the system’s temperature by 1 kelvin. Since the heat absorbed by the system in a thermodynamic process under constant pressure is the increase in enthalpy of the system, the heat capacity under constant pressure is just the change in enthalpy divided by the change in temperature.

$C = \Delta H \div \Delta T$ .

Filed under Basic Chemistry, Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with change in enthalpy, change in internal energy, enthalpy, heat capacity, internal energy, temperature

Video Tutorial: Breaking Down the Definition of the Hazard Function

January 27, 2014 8 Comments

The hazard function is a fundamental quantity in survival analysis. For an event occurring at some time on a continuous time scale, the hazard function, $h(t)$ , for that event is defined as

$h(t) = \lim_{\Delta t \rightarrow 0} [P(t < X \leq t + \Delta t \ | \ X > t) \ \div \ \Delta t]$ ,

where

$t$ is the time,
$X$ is the time of the occurrence of the event.
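One way to build intuition for this limit is to evaluate the quotient numerically for a distribution whose hazard is known. The Python sketch below assumes an exponentially distributed event time with rate 0.5 (an illustrative choice, not taken from the video), for which the hazard is constant and equal to the rate. It uses the identity $P(t < X \leq t + \Delta t \ | \ X > t) = [S(t) - S(t + \Delta t)] \div S(t)$, where $S(t) = P(X > t)$ is the survival function.

```python
import math

def survival_exponential(t, rate):
    """S(t) = P(X > t) for an exponential event time with the given rate."""
    return math.exp(-rate * t)

def approx_hazard(survival, t, dt):
    """Conditional probability of the event in (t, t + dt], divided by dt."""
    s_t = survival(t)
    p_event_in_window = (s_t - survival(t + dt)) / s_t  # P(t < X <= t + dt | X > t)
    return p_event_in_window / dt

rate = 0.5  # hypothetical rate parameter

def S(t):
    return survival_exponential(t, rate)

# As dt shrinks, the quotient should approach the constant hazard, i.e. the rate.
for dt in (1.0, 0.1, 0.001, 1e-6):
    print(dt, approx_hazard(S, 2.0, dt))
```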

However, what does this actually mean? In this YouTube video, I break down the mathematics of this definition into its individual components and explain the intuition behind each component.

I am very excited about the release of this first video in my new YouTube channel! This is yet another mode of expansion of The Chemical Statistician since the beginning of 2014. As always, your comments are most appreciated!

Filed under Biostatistics, Mathematics, Probability, Statistics, Survival Analysis, Tutorials, Video Tagged with biostatistics, hazard function, math, mathematics, probability, survival analysis

Machine Learning Lesson of the Day – The “No Free Lunch” Theorem

January 24, 2014 16 Comments

A model is a simplified representation of reality, and the simplifications are made to discard unnecessary detail and allow us to focus on the aspect of reality that we want to understand. These simplifications are grounded on assumptions; these assumptions may hold in some situations, but may not hold in other situations. This implies that a model that explains a certain situation well may fail in another situation. In both statistics and machine learning, we need to check our assumptions before relying on a model.

The “No Free Lunch” theorem states that there is no one model that works best for every problem. The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem. This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model. A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.
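To illustrate the last point, the Python sketch below trains the same simple linear regression model in 2 ways: with the closed-form solution of the normal equations, and with batch gradient descent on the mean squared error. The data, the learning rate, and the number of steps are assumptions chosen so that gradient descent converges on this small example; other data would need other settings.

```python
def fit_closed_form(x, y):
    """Simple linear regression from the normal equations (closed form)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def fit_gradient_descent(x, y, learning_rate=0.01, n_steps=5000):
    """The same model, trained by batch gradient descent on the mean squared error."""
    n = len(x)
    intercept, slope = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [intercept + slope * xi - yi for xi, yi in zip(x, y)]
        grad_intercept = 2.0 * sum(residuals) / n
        grad_slope = 2.0 * sum(r * xi for r, xi in zip(residuals, x)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Hypothetical data: a line with a small deterministic wiggle added.
x = [float(i) for i in range(10)]
y = [3.0 + 2.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]

print(fit_closed_form(x, y))
print(fit_gradient_descent(x, y))
```

Both algorithms arrive at essentially the same coefficients here. The closed form is exact in a single step, whereas gradient descent trades exactness for an iterative procedure that scales to problems where solving the normal equations is too expensive.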

Depending on the problem, it is important to assess the trade-offs between speed, accuracy, and complexity of different models and algorithms and find a model that works best for that particular problem.

Filed under Machine Learning, Machine Learning Lesson of the Day, Predictive Modelling, Statistics Tagged with cross-validation, gradient descent, least squares regression, linear regression, machine learning, model, supervised learning, validation

Statistics and Chemistry Lesson of the Day – Illustrating Basic Concepts in Experimental Design with the Synthesis of Ammonia

January 23, 2014

To summarize what we have learned about experimental design in the past few Applied Statistics Lessons of the Day, let’s use an example from physical chemistry to illustrate these basic principles.

Ammonia (NH₃) is widely used as a fertilizer in industry. It is commonly synthesized by the Haber process, which involves a reaction between hydrogen gas and nitrogen gas.

N₂ + 3 H₂ → 2 NH₃ (ΔH = −92.4 kJ·mol⁻¹)

Recall that ΔH is the change in enthalpy. Under constant pressure (which is the case for most chemical reactions), ΔH is the heat absorbed or released by the system.

Physical Chemistry Lesson of the Day – The Difference Between Changes in Enthalpy and Changes in Internal Energy

January 20, 2014

Let’s examine the difference between a change in enthalpy and a change in internal energy. It helps to think of the following 2 scenarios.

If the chemical reaction releases a gas but occurs at constant volume, then there is no pressure-volume work. The only way for energy to be transferred between the system and the surroundings is through heat. An example of a system under constant volume is a bomb calorimeter. In this case,

$\Delta H = \Delta U + P \Delta V = \Delta U + 0 = q + w + 0 = q + 0 + 0 = q$

Here, $w$ denotes the work done on the system (the sign convention in chemistry), so $\Delta U = q + w$. This heat is denoted as $q_v$ to indicate that this is heat transferred under constant volume. In this case, the change in enthalpy is the same as the change in internal energy.

If the chemical reaction releases a gas and occurs at constant pressure, then energy can be transferred between the system and the surroundings through heat and/or work. Thus,

$\Delta H = \Delta U + P \Delta V = q + w + P \Delta V = q - P \Delta V + P \Delta V = q$

This heat is denoted as $q_p$ to indicate that this is heat transferred under constant pressure. Thus, as the gas forms inside the cylinder, the piston pushes against the constant pressure that the atmosphere exerts on it. The total energy released by the chemical reaction allows some energy to be used for the pressure-volume work, with the remaining energy being released via heat. (Recall that these are the 2 ways for internal energy to be changed according to the First Law of Thermodynamics.) Thus, the difference between enthalpy and internal energy arises under constant pressure – the difference is the pressure-volume work.
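A short numerical sketch of the constant-pressure case may help. All of the numbers below are hypothetical: a reaction that releases 100 J of heat while the gas it produces expands by 0.1 L against 1 atm. To avoid any ambiguity about sign conventions, the code names the pressure-volume work explicitly as work done by the system.

```python
# Hypothetical constant-pressure process (illustrative numbers only).
pressure = 101325.0   # Pa, 1 atm
delta_V = 1.0e-4      # m^3, expansion of 0.1 L
q_p = -100.0          # J, heat absorbed by the system (negative: heat is released)

work_by_system = pressure * delta_V      # J, pressure-volume work done on the surroundings
delta_U = q_p - work_by_system           # First Law: heat in minus work done by the system
delta_H = delta_U + pressure * delta_V   # change in enthalpy at constant pressure

print("delta U:", delta_U)
print("delta H:", delta_H)
print("delta H - delta U:", delta_H - delta_U)
```

The change in enthalpy matches the heat transferred, and it differs from the change in internal energy by exactly the pressure-volume work.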

Reactions under constant pressure are often illustrated by a reaction that releases a gas in a cylinder with a movable piston, but they are actually quite common. In fact, in chemistry, reactions under constant pressure are much more common than reactions under constant volume. Chemical reactions often happen in beakers, flasks or any container open to the constant pressure of the atmosphere.

Filed under Basic Chemistry, Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with chemistry, enthalpy, internal energy, physical chemistry, pressure, pressure-volume work, thermodynamics, volume, work

Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

January 20, 2014 3 Comments

Introduction

Continuing on the recently born series on numerical integration, this post will introduce rectangular integration. I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the $Beta(2, 5)$ distribution actually integrates to 1 over its support set. This post follows from my previous post on trapezoidal integration.

Image courtesy of Qef from Wikimedia Commons.

Conceptual Background of Rectangular Integration (a.k.a. The Midpoint Rule)

Rectangular integration is a numerical integration technique that approximates the integral of a function with a rectangle. It uses rectangles to approximate the area under the curve. Here are its features:

The rectangle’s width is determined by the interval of integration.
- One rectangle could span the width of the interval of integration and approximate the entire integral.
- Alternatively, the interval of integration could be sub-divided into $n$ smaller intervals of equal lengths, and $n$ rectangles would be used to approximate the integral; each smaller rectangle has the width of the smaller interval.
The rectangle’s height is the function’s value at the midpoint of its base.
Within a fixed interval of integration, the approximation becomes more accurate as more rectangles are used; each rectangle becomes narrower, and the height of the rectangle better captures the values of the function within that interval.
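The features above can be sketched in a few lines. The original post implements the method in R; the Python version below is only an illustrative sketch, and the function names `midpoint_rule` and `beta_2_5_pdf` are hypothetical. It uses the fact that the $Beta(2, 5)$ density is $30x(1 - x)^4$ on $[0, 1]$, since the normalizing constant is $B(2, 5) = 1/30$.

```python
def midpoint_rule(f, a, b, n):
    """Approximate the integral of f over [a, b] with n equal-width rectangles.

    Each rectangle's height is f evaluated at the midpoint of its base.
    """
    width = (b - a) / n
    return width * sum(f(a + (i + 0.5) * width) for i in range(n))

def beta_2_5_pdf(x):
    """Density of Beta(2, 5): x^(2-1) * (1-x)^(5-1) / B(2, 5), with B(2, 5) = 1/30."""
    return 30.0 * x * (1.0 - x) ** 4

# One rectangle gives a crude answer; more rectangles approach the true area of 1.
for n in (1, 10, 100, 1000):
    print(n, midpoint_rule(beta_2_5_pdf, 0.0, 1.0, n))
```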

Machine Learning Lesson of the Day – Cross-Validation

January 17, 2014

Validation is a good way to assess the predictive accuracy of a supervised learning algorithm, and the rule of thumb of using 70% of the data for training and 30% of the data for validation generally works well. However, what if the data set is not very large, and the small amount of data for training results in high sampling error? A good way to overcome this problem is K-fold cross-validation.

Cross-validation is best defined by describing its steps:

For each model under consideration,

1. Divide the data set into K partitions.
2. Designate the first partition as the validation set and designate the other partitions as the training set.
3. Use the training set to train the algorithm.
4. Use the validation set to assess the predictive accuracy of the algorithm; the common measure of predictive accuracy is mean squared error.
5. Repeat Steps 2-4 for the second partition, third partition, … , the (K-1)th partition, and the Kth partition. (Essentially, rotate the designation of validation set through every partition.)
6. Calculate the average of the mean squared error from all K validations.

Compare the average mean squared errors of all models and pick the one with the smallest average mean squared error as the best model. Test all models on a separate data set (called the test set) to assess their predictive accuracies on new, fresh data.

If there are N data in the data set, and K = N, then this type of K-fold cross-validation has a special name: leave-one-out cross-validation (LOOCV).
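The procedure can be sketched in plain Python as follows. The 2 candidate models (predicting the training mean, and a simple linear regression), the data, and all function names are hypothetical and serve only to show the mechanics. The folds here are contiguous for simplicity; in practice the data are usually shuffled before partitioning.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validated_mse(fit, predict, x, y, k):
    """Average, over the k folds, of the mean squared error on the held-out fold."""
    fold_mses = []
    for validation in k_fold_indices(len(x), k):
        held_out = set(validation)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - predict(model, x[i])) ** 2 for i in validation]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k

# Candidate model 1: always predict the mean of the training targets.
def fit_mean(x, y):
    return sum(y) / len(y)

def predict_mean(model, x_new):
    return model

# Candidate model 2: simple linear regression.
def fit_line(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def predict_line(model, x_new):
    intercept, slope = model
    return intercept + slope * x_new

# Hypothetical data: a linear trend with a small deterministic wiggle.
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]

mse_mean = cross_validated_mse(fit_mean, predict_mean, x, y, k=5)
mse_line = cross_validated_mse(fit_line, predict_line, x, y, k=5)
mse_line_loocv = cross_validated_mse(fit_line, predict_line, x, y, k=len(x))  # K = N

print("mean model, 5-fold:", mse_mean)
print("line model, 5-fold:", mse_line)
print("line model, LOOCV :", mse_line_loocv)
```

The model with the smaller average mean squared error would be picked as the best of the 2 candidates.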

There are some trade-offs between a large and a small K. The estimator for the prediction error from a larger K results in

less bias because of more data being used for training
higher variance because of the higher similarity and lower diversity between the training sets
slower computation because of more data being used for training

In The Elements of Statistical Learning (2009 Edition, Chapter 7, Pages 241-243), Hastie, Tibshirani and Friedman recommend 5 or 10 for K.

Filed under Machine Learning, Machine Learning Lesson of the Day, Predictive Modelling, Statistics Tagged with bias, computational cost, computational speed, cross-validation, K-fold cross-validation, machine learning, mean squared error, predictive accuracy, supervised learning, test set, training set, validation, validation set, variance

Applied Statistics Lesson of the Day – The Completely Randomized Design with 1 Factor

January 16, 2014

The simplest experimental design is the completely randomized design with 1 factor. In this design, each experimental unit is randomly assigned to a factor level. This design is most useful for a homogeneous population (one that does not have major differences between any sub-populations). It is appealing because of its simplicity and flexibility – it can be used for a factor with any number of levels, and different treatments can have different sample sizes. After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected. The means of the response variable in the different groups are compared; if there are significant differences, then there is evidence to suggest that the factor and the response have a causal relationship. The single-factor analysis of variance (ANOVA) model is most commonly used to analyze the data in such an experiment, but it does assume that the data in each group have a normal distribution, and that all groups have equal variance. The Kruskal-Wallis test is a non-parametric alternative to ANOVA in analyzing data from single-factor completely randomized experiments.

If the factor has 2 levels, you may think that an independent 2-sample t-test with equal variance can also be used to analyze the data. This is true, but the square of the t-test statistic in this case is just the F-test statistic in a single-factor ANOVA with 2 groups. Thus, the results of these 2 tests are the same. ANOVA generalizes the independent 2-sample t-test with equal variance to more than 2 groups.
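The equality between the squared t-statistic and the F-statistic can be checked numerically. The Python sketch below computes both statistics from their textbook formulas for 2 small groups; the data values and function names are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def pooled_t_statistic(a, b):
    """Independent 2-sample t-statistic assuming equal variances."""
    pooled_var = (sum_sq_dev(a) + sum_sq_dev(b)) / (len(a) + len(b) - 2)
    std_error = (pooled_var * (1.0 / len(a) + 1.0 / len(b))) ** 0.5
    return (mean(a) - mean(b)) / std_error

def one_way_anova_f(groups):
    """F-statistic of single-factor ANOVA: mean square between / mean square within."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum_sq_dev(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical responses from a factor with 2 levels (unequal sample sizes).
control = [4.1, 5.0, 4.7, 5.3, 4.4]
treated = [5.9, 6.4, 5.5, 6.8, 6.1, 6.0]

t = pooled_t_statistic(control, treated)
F = one_way_anova_f([control, treated])
print("t^2 =", t ** 2)
print("F   =", F)
```

The 2 printed values agree up to floating-point rounding, which is why the 2 tests give the same result for 2 groups.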

Some textbooks state that “random assignment” means random assignment of experimental units to treatments, whereas other textbooks state that it means random assignment of treatments to experimental units. I don’t think that there is any difference between these 2 definitions, but I welcome your thoughts in the comments.

Filed under Applied Statistics, Experimental Design, Mathematical Statistics, Statistics, Statistics Lesson of the Day Tagged with ANOVA, completely randomized design, completely randomized experiment, design of experiment, design of experiments, DOE, experiment, experimental unit, F-distribution, F-test, factor, factor level, independent 2-sample t-test, Kruskal-Wallis test, non-parametric, non-parametric statistics, normal distribution, one-way ANOVA, single-factor ANOVA, statistics, t-distribution, t-test, treatment

Physical Chemistry Lesson of the Day – Heat Capacity

January 15, 2014

The heat capacity of a system is the amount of heat required to increase the temperature of the system by 1 degree. Heat is measured in joules (J) in the SI system, and heat capacity depends on the substance and on the amount of it. To make heat capacities comparable between substances, molar heat capacity or specific heat capacity is often used.

Molar heat capacity is the amount of heat required to increase the temperature of 1 mole of a substance by 1 degree.
Specific heat capacity is the amount of heat required to increase the temperature of 1 gram of a substance by 1 degree.

For example, over the range 0 to 100 degrees Celsius (or 273.15 to 373.15 K), an average of 4.18 J of heat is required to increase the temperature of 1 gram of water by 1 K. Thus, the average specific heat capacity of water in that temperature range is 4.18 J/(g·K).
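As a worked example, the Python sketch below uses this value to compute the heat needed to warm a hypothetical 250 g of water by 60 K, and converts the specific heat capacity into a molar heat capacity using the molar mass of water (about 18.015 g/mol). The function names and the sample size are assumptions made for illustration.

```python
SPECIFIC_HEAT_WATER = 4.18   # J/(g*K), the average over 0 to 100 degrees Celsius quoted above
MOLAR_MASS_WATER = 18.015    # g/mol

def heat_required(mass_g, specific_heat, delta_T):
    """Heat (J) needed to change the temperature of mass_g grams by delta_T kelvin."""
    return mass_g * specific_heat * delta_T

def molar_heat_capacity(specific_heat, molar_mass):
    """Convert J/(g*K) into J/(mol*K) by multiplying by the molar mass."""
    return specific_heat * molar_mass

# Hypothetical example: warm 250 g of water from 20 to 80 degrees Celsius (a change of 60 K).
q = heat_required(250.0, SPECIFIC_HEAT_WATER, 60.0)
print("heat required (J):", q)   # about 62,700 J

print("molar heat capacity, J/(mol*K):",
      molar_heat_capacity(SPECIFIC_HEAT_WATER, MOLAR_MASS_WATER))   # about 75.3
```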

Filed under Basic Chemistry, Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with calorimetry, chemistry, heat, heat capacity, molar heat capacity, physical chemistry, specific heat capacity, temperature

Machine Learning Lesson of the Day – Parametric vs. Non-Parametric Models

January 14, 2014 7 Comments

A machine learning algorithm can be classified as either parametric or non-parametric.

A parametric algorithm has a fixed number of parameters. A parametric algorithm is computationally faster, but makes stronger assumptions about the data; the algorithm may work well if the assumptions turn out to be correct, but it may perform badly if the assumptions are wrong. A common example of a parametric algorithm is linear regression.

In contrast, a non-parametric algorithm uses a flexible number of parameters, and the number of parameters often grows as it learns from more data. A non-parametric algorithm is computationally slower, but makes fewer assumptions about the data. A common example of a non-parametric algorithm is K-nearest neighbour.
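The difference in the number of stored quantities can be seen in a small sketch. In the Python code below (with hypothetical data and function names), a fitted simple linear regression always consists of 2 numbers, whereas a K-nearest-neighbour model has to keep every training point, so its size grows with the data.

```python
def fit_line(x, y):
    """Parametric: the fitted model is always just (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return (y_bar - slope * x_bar, slope)

def fit_knn(x, y):
    """Non-parametric: the 'model' is the stored training data itself."""
    return list(zip(x, y))

def predict_knn(stored, x_new, k=3):
    """Average the targets of the k stored points closest to x_new."""
    nearest = sorted(stored, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(target for _, target in nearest) / k

def model_sizes(n):
    """Number of stored quantities for each model when trained on n points."""
    x = [i / n for i in range(n)]
    y = [2.0 * xi + 1.0 for xi in x]   # hypothetical noiseless data
    return len(fit_line(x, y)), len(fit_knn(x, y))

for n in (10, 1000):
    print(n, model_sizes(n))
```

Prediction cost follows the same pattern: the line needs 1 multiplication and 1 addition, while `predict_knn` has to search through all of the stored points.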

To summarize, the trade-offs between parametric and non-parametric algorithms are in computational cost and accuracy.

Filed under Machine Learning, Machine Learning Lesson of the Day, Statistics Tagged with computational cost, computational speed, K-nearest neighbour, linear regression, machine learning, non-parametric, parametric

Applied Statistics Lesson of the Day – Positive Control in Experimental Design

January 13, 2014

In my recent lesson on controlling for confounders in experimental design, the control group was described as one that received a neutral or standard treatment, and the standard treatment may simply be nothing. This is a negative control group. Not all experiments require a negative control group; some experiments instead have a positive control group.

A positive control group is a group of experimental units that receive a treatment that is known to cause an effect on the response. Such a causal relationship would have been previously established, and its inclusion in the experiment allows a new treatment to be compared to this existing treatment. Again, both the positive control group and the experimental group experience the same experimental procedures and conditions except for the treatment. The existing treatment with the known effect on the response is applied to the positive control group, and the new treatment with the unknown effect on the response is applied to the experimental group. If the new treatment has a causal relationship with the response, both the positive control group and the experimental group should have the same responses. (This assumes, of course, that the response can only be changed in 1 direction. If the response can increase or decrease in value (or, more generally, change in more than 1 way), then it is possible for the positive control group and the experimental group to have different responses.)

In short, in an experiment with a positive control group, an existing treatment is known to “work”, and the new treatment is being tested to see if it can “work” just as well or even better. Experiments to test for the effectiveness of new medical therapies or a disease detector often have positive controls; there are existing therapies or detectors that work well, and the new therapy or detector is being evaluated for its effectiveness.

Experiments with positive controls are useful for ensuring that the experimental procedures and conditions proceed as planned. If the positive control does not show the expected response, then something is wrong with the experimental procedures or conditions, and any “good” result from the new treatment should be considered with skepticism.

Filed under Applied Statistics, Experimental Design, Statistics, Statistics Lesson of the Day Tagged with applied statistics, confounder, confounders, control group, design of experiment, design of experiments, DOE, experimental design, negative control, positive control, statistics

How to Find a Job in Statistics – Advice for Students and Recent Graduates

January 13, 2014 34 Comments

Introduction

A graduate student in statistics recently asked me for advice on how to find a job in our industry. I’m happy to share my advice about this, and I hope that my advice can help you to find a satisfying job and develop an enjoyable career. My perspectives would be most useful to students and recent graduates because of my similar but unique background; I graduated only 1.5 years ago from my Master’s degree in statistics at the University of Toronto, and I volunteered as a career advisor at Simon Fraser University during my Bachelor’s degree. My advice will reflect my experience in finding a job in Toronto, but you can probably find parallels in your own city.

Most of this post focuses on soft skills that are needed to find any job; I dive specifically into advice for statisticians in the last section. Although the soft skills are general and not specific to statisticians, many employers, veteran statisticians, and professors have told me that students and recent graduates would benefit from the focus on soft skills. Thus, I discuss them first and leave the statistics-specific advice till the end.

Physical Chemistry Lesson of the Day – Enthalpy

January 10, 2014

The enthalpy of a system is the system’s internal energy plus the product of the pressure and the volume of the system.

$H = U + PV$ .

Just like internal energy, the enthalpy of a system cannot be measured, but a change in enthalpy can be measured. Suppose that the only type of work that can be performed on the system is pressure-volume work; this is a realistic assumption in many chemical reactions that occur in a beaker, a flask, or any container that is open to the constant pressure of the atmosphere. Then, under constant pressure, the change in enthalpy of a system is the change in internal energy plus the pressure-volume work done by the system on its surroundings.

$\Delta H = \Delta U + P\Delta V$ .

Filed under Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with chemistry, constant pressure, constant volume, enthalpy, heat, internal energy, physical chemistry, pressure-volume work, thermodynamics, volume, work

Machine Learning Lesson of the Day – Babies and Non-Statisticians Practice Unsupervised Learning All the Time!

January 9, 2014 1 Comment

My recent lesson on unsupervised learning may make it seem like a rather esoteric field, with attempts to categorize it using words like “clustering”, “density estimation”, or “dimensionality reduction”. However, unsupervised learning is actually how we as human beings often learn about the world that we live in – whether you are a baby learning what to eat or someone reading this blog.

Babies use their mouths and their sense of taste to explore the world, and they can probably determine what satisfies their hunger and what doesn’t pretty quickly. As they expose themselves to different objects – a formula bottle, a pacifier, a mother’s breast, their own fingers – their taste and digestive system are recognizing these inputs and detecting patterns of what satisfies their hunger and what doesn’t. This all happens before they even fully understand what “food” or “hunger” means. This will probably happen before someone says “This is food” to them and they have the language capacity to know what those 3 words mean.
- When a baby finally realizes what hunger feels like and develops the initiative to find something to eat, then that becomes a supervised learning problem: What attributes about an object will help me to determine if it’s food or not?
I recently wrote a page called “About this Blog” to categorize the different types of posts that I have written on this blog so far. I did not aim to predict anything about any blog post; I simply wanted to organize the 50-plus blog posts into a few categories and make it easier for you to find them. I ultimately clustered my blog posts into 4 ~~mutually exclusive~~ categories (now with some overlaps). You can think of each blog post as a vector-valued input, and I chose 2 elements – the length and the topic – of each vector to find a way to group them into classes that are very similar in length and topic within each class and very different in length and topic between the classes. (I used those 2 elements – or features – to maximize the similarities within each category and maximize the dissimilarities between the 4 categories.) There were other features that I could have used – whether it had an image (binary feature), the number of colours of the fonts (integer-valued feature), the time of publication of the post (continuous feature) – but length and topic were sufficient for me to arrive at the 4 categories of “Tutorials”, “Lessons”, “Advice”, and “Notifications about Presentations and Appearances at Upcoming Events”.

Filed under Machine Learning, Machine Learning Lesson of the Day, Statistics Tagged with babies, baby, cluster, clustering, dissimilarity, feature, machine learning, pattern detection, pattern recognition, similarity, unsupervised learning

Applied Statistics Lesson of the Day – Choosing the Range of Levels for Quantitative Factors in Experimental Design

January 8, 2014

In addition to choosing the number of levels for a quantitative factor in designing an experiment, the experimenter must also choose the range of the levels of the factor.

If the levels are too close together, then there may not be a noticeable difference in the corresponding responses.
If the levels are too far apart, then an important trend in the causal relationship could be missed.

Consider the following example of making sourdough bread from Gänzle et al. (1998). The experimenters sought to determine the relationship between temperature and the growth rates of 2 strains of bacteria and 1 strain of yeast, and they used mathematical models and experimental data to study this relationship. The plots below show the results for Lactobacillus sanfranciscensis LTH2581 (Panel A) and LTH1729 (Panel B), and Candida milleri LTH H198 (Panel C). The figures contain the predicted curves (solid and dashed lines) and the actual data (circles). Notice that, for all 3 organisms,

the relationship is relatively “flat” in the beginning, so choosing temperatures that are too close together at low temperatures (e.g. 1 and 2 degrees Celsius) would not yield noticeably different growth rates
the overall relationship between growth rate and temperature is rather complicated, and choosing temperatures that are too far apart might miss important trends.

Once again, the experimenter’s prior knowledge and hypothesis can be very useful in making this decision. In this case, the experimenters had the benefit of their mathematical models in guiding their hypothesis and choosing the range of temperatures for collecting the data on the growth rates.

Reference:

Gänzle, Michael G., Michaela Ehmann, and Walter P. Hammes. “Modeling of growth of Lactobacillus sanfranciscensis and Candida milleri in response to process parameters of sourdough fermentation.” Applied and Environmental Microbiology 64.7 (1998): 2616-2623.

Filed under Applied Statistics, Experimental Design, Statistics, Statistics Lesson of the Day Tagged with design of experiments, DOE, experiment, experimental design, factor, factor level, factor levels, range of factor levels

Physical Chemistry Lesson of the Day: Pressure-Volume Work

January 7, 2014

In chemistry, a common type of work is the expansion or compression of a gas under constant pressure. Recall from physics that pressure is defined as force applied per unit of area.

$P = F \div A$

$P \times A = F$

Consider a chemical reaction that releases a gas as its product inside a sealed cylinder with a movable piston.

Image from Dpumroy via Wikimedia.

As the gas expands inside the cylinder, it pushes against the piston, and work is done by the system on the surroundings. The atmospheric pressure on the piston remains constant while the gas expands, and the volume of the cylinder increases as a result. The volume of the cylinder at any given point is the area of the piston ($A$) times the length of the cylinder ($L$), so the change in volume is equal to the area of the piston times the distance along which the piston was pushed by the expanding gas: $\Delta V = A \times \Delta L$.

$w = -P \times \Delta V$

$w = -P \times A \times \Delta L$

$w = -F \times \Delta L$

Note that this last line is just the definition of work under a constant force in the same direction as the displacement, multiplied by a negative sign to follow the sign convention in chemistry: work done by the system on the surroundings is negative.
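As a quick numerical check of the three expressions above, here is a minimal Python sketch. The pressure, piston area and displacement are hypothetical values chosen only for illustration, and the function names are my own; SI units are assumed throughout (pascals, square metres, metres), so the work comes out in joules.

```python
# Minimal sketch: pressure-volume work at constant external pressure.
# All numerical inputs below are hypothetical, chosen only for illustration.

ATM_IN_PASCALS = 101_325.0  # 1 standard atmosphere in pascals


def pv_work(pressure_pa, delta_volume_m3):
    """Return w = -P * delta_V in joules (chemistry sign convention)."""
    return -pressure_pa * delta_volume_m3


def pv_work_from_piston(pressure_pa, piston_area_m2, delta_length_m):
    """Return w = -F * delta_L in joules, where F = P * A."""
    force_n = pressure_pa * piston_area_m2
    return -force_n * delta_length_m


pressure = ATM_IN_PASCALS   # constant external pressure
area = 0.01                 # piston area in m^2 (assumed)
displacement = 0.10         # distance the piston moves in m (assumed)
delta_volume = area * displacement  # delta_V = A * delta_L

w_volume = pv_work(pressure, delta_volume)
w_force = pv_work_from_piston(pressure, area, displacement)

print(f"w from -P * dV: {w_volume:.3f} J")
print(f"w from -F * dL: {w_force:.3f} J")
```

Both routes give roughly -101.3 J for these inputs. The negative sign indicates that the expanding gas does work on the surroundings.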

Filed under Basic Chemistry, Chemistry, Chemistry Lesson of the Day, Physical Chemistry Tagged with area, chemistry, distance, force, gas, length(), physical chemistry, pressure, thermodynamics, work

Applied Statistics Lesson of the Day – Choosing the Number of Levels for Factors in Experimental Design

January 7, 2014

The experimenter needs to decide the number of levels for each factor in an experiment.

- For a qualitative (categorical) factor, the number of levels may simply be the number of categories for that factor. However, because of cost constraints, an experimenter may choose to drop a certain category. Based on the experimenter’s prior knowledge or hypothesis, the category with the least potential for showing a cause-and-effect relationship between the factor and the response should be dropped.
- For a quantitative (numeric) factor, the number of levels should reflect the cause-and-effect relationship between the factor and the response. Again, the experimenter’s prior knowledge or hypothesis is valuable in making this decision.
  - If the relationship in the chosen range of the factor is hypothesized to be roughly linear, then 2 levels (perhaps the minimum and the maximum) should be sufficient.
  - If the relationship in the chosen range of the factor is hypothesized to be roughly quadratic, then 3 levels would be useful. Often, 3 levels are enough.
  - If the relationship in the chosen range of the factor is hypothesized to be more complicated than a quadratic relationship, consider using 4 or more levels.
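To see why 2 levels can be too few when the relationship is curved, consider the following minimal Python sketch. The quadratic response function is made up purely for illustration (it is not taken from any experiment), and the helper names are hypothetical. With only the minimum and maximum as levels, the two responses happen to be equal, which would wrongly suggest that the factor has no effect; adding a middle level reveals the curvature.

```python
# Minimal sketch: 2 levels vs. 3 levels for a quantitative factor.
# The response function is hypothetical and used only for illustration.


def equally_spaced_levels(low, high, n_levels):
    """Return n_levels equally spaced factor levels from low to high."""
    if n_levels < 2:
        raise ValueError("need at least 2 levels")
    step = (high - low) / (n_levels - 1)
    return [low + i * step for i in range(n_levels)]


def hypothetical_response(x):
    """Made-up quadratic relationship between factor and response."""
    return 1.0 + 2.0 * x - 0.5 * x ** 2


def curvature_estimate(levels, response):
    """Second difference over 3 equally spaced levels.

    It is zero for a straight line and non-zero when curvature is present.
    """
    y_low, y_mid, y_high = (response(x) for x in levels)
    return y_low - 2.0 * y_mid + y_high


two_levels = equally_spaced_levels(0.0, 4.0, 2)
three_levels = equally_spaced_levels(0.0, 4.0, 3)

# With 2 levels, only the change between the endpoints can be estimated.
endpoint_change = (hypothetical_response(two_levels[1])
                   - hypothetical_response(two_levels[0]))

# With 3 levels, the curvature can be estimated as well.
curvature = curvature_estimate(three_levels, hypothetical_response)

print("2 levels:", two_levels, "endpoint change:", endpoint_change)
print("3 levels:", three_levels, "curvature estimate:", curvature)
```

A non-zero curvature estimate is a signal that a straight line through the 2 extreme levels would misrepresent the relationship within the chosen range.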

Filed under Applied Statistics, Experimental Design, Statistics, Statistics Lesson of the Day Tagged with design of experiments, experimental design, factor, factor level, factor levels, qualitative factor, quantitative factor
