sample mean | The Chemical Statistician

Mathematical Statistics Lesson of the Day – Sufficient Statistics

November 5, 2014

*Update on 2014-11-06: Thanks to Christian Robert’s comment, I have removed the sample median as an example of a sufficient statistic.

Suppose that you collected data

$\mathbf{X} = X_1, X_2, ..., X_n$

in order to estimate a parameter $\theta$. Let $f_\theta(\mathbf{X})$ be the joint probability density function (PDF)* of $X_1, X_2, ..., X_n$.

Let

$t = T(\mathbf{X})$

be a statistic based on $\mathbf{X}$. Let $g_\theta(t)$ be the PDF of $T(\mathbf{X})$.

If the conditional PDF of $\mathbf{X}$ given $T(\mathbf{X})$,

$h_\theta(\mathbf{X}) = f_\theta(\mathbf{X}) \div g_\theta[T(\mathbf{X})]$

does not depend on $\theta$, then $T(\mathbf{X})$ is a sufficient statistic for $\theta$. In other words,

$h_\theta(\mathbf{X}) = h(\mathbf{X})$ ,

and $\theta$ does not appear in $h(\mathbf{X})$.

Intuitively, this means that $T(\mathbf{X})$ contains all of the information in the data about $\theta$: once you know $T(\mathbf{X})$ (i.e. once you condition $f_\theta(\mathbf{X})$ on $T(\mathbf{X})$), the rest of the data tell you nothing more about $\theta$.
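
To make the definition concrete, here is a minimal sketch in Python for a case that is not discussed above: independent Bernoulli($\theta$) observations with $T(\mathbf{X}) = \sum X_i$, which follows a Binomial($n$, $\theta$) distribution. The sample `x`, the function names and the chosen values of $\theta$ are all assumptions made for illustration. The sketch computes the ratio $f_\theta(\mathbf{X}) \div g_\theta[T(\mathbf{X})]$ for several values of $\theta$ and checks that they agree.

```python
from math import comb, isclose


def joint_pmf(x, theta):
    """Joint PMF of independent Bernoulli(theta) observations x."""
    t = sum(x)
    return theta**t * (1 - theta) ** (len(x) - t)


def pmf_of_sum(t, n, theta):
    """PMF of T = X_1 + ... + X_n, which is Binomial(n, theta)."""
    return comb(n, t) * theta**t * (1 - theta) ** (n - t)


def conditional_pmf(x, theta):
    """h_theta(x) = f_theta(x) / g_theta(T(x)), with T(x) = sum(x)."""
    return joint_pmf(x, theta) / pmf_of_sum(sum(x), len(x), theta)


x = (1, 0, 1, 1)  # hypothetical sample, invented for illustration
thetas = (0.2, 0.5, 0.9)  # arbitrary parameter values to compare
ratios = [conditional_pmf(x, theta) for theta in thetas]

for theta, ratio in zip(thetas, ratios):
    print(f"theta = {theta}: h = {ratio}")

# The ratio should not change with theta if the sum is sufficient.
print(all(isclose(r, ratios[0]) for r in ratios))
```

Algebraically, the $\theta$ terms cancel and the ratio reduces to $1 \div \binom{n}{t}$, which is why every value of $\theta$ gives the same answer.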

Often, a sufficient statistic for $\theta$ is a summary statistic of $X_1, X_2, ..., X_n$, such as their

sample mean
~~sample median~~ – removed thanks to comment by Christian Robert (Xi’an)
sample minimum
sample maximum

If such a summary statistic is sufficient for $\theta$ , then knowing this one statistic is just as useful as knowing all $n$ data for estimating $\theta$ .

*The above definition holds for both discrete and continuous random variables; in the discrete case, replace each PDF with the corresponding probability mass function (PMF).


Applied Statistics Lesson of the Day – The Coefficient of Variation

August 12, 2014

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set. However, consider the following 2 hypothetical cases:

the standard deviation for the incomes of households in Canada is $2,000
the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks. Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets. One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.
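
The two hypothetical cases above can be sketched in Python as follows. The income figures are invented for illustration; they are chosen only so that both data sets have a sample standard deviation of 2,000, and the function name `coefficient_of_variation` is likewise an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample coefficient of variation: s divided by x-bar."""
    return stdev(data) / mean(data)


# Hypothetical incomes in dollars, invented for illustration only.
household_incomes = [58_000, 58_000, 60_000, 62_000, 62_000]
# Same spread, shifted to a far larger (equally hypothetical) scale.
bank_incomes = [x + 6_000_000_000 for x in household_incomes]

for label, data in (("households", household_incomes), ("banks", bank_incomes)):
    print(f"{label}: s = {stdev(data):.1f}, c_v = {coefficient_of_variation(data):.2e}")
```

Both data sets have the same standard deviation, but the coefficient of variation for the households is several orders of magnitude larger, which matches the intuition that 2,000 dollars matters far more to a household than to a bank.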

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes. I hope that all new statistics students get to learn this alternative measure of dispersion.


Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Can Apply to the Sum

June 9, 2014

The central limit theorem (CLT) is often stated in terms of the sample mean of independent and identically distributed random variables. An often unnoticed or forgotten aspect of the CLT is that it also applies to the sample sum of those variables. Since the sample size $n$ is just a constant, multiplying $\bar{X}$ by $n$ gives $\sum_{i = 1}^{n} X_i$. For a sufficiently large $n$, $\bar{X}$ is approximately $\text{Normal}(\mu, \sigma^2/n)$, so the sum is also approximately normal, just with a new expected value, $n\mu$, and a new variance, $n^2 \cdot \sigma^2/n = n\sigma^2$.

$\sum_{i = 1}^{n} X_i \overset{approx.}{\sim} \text{Normal} (n\mu, n\sigma^2)$
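
A quick simulation can make this tangible. The sketch below is an illustration under assumed settings, not part of the original lesson: it draws from an Exponential(1) parent distribution (so $\mu = 1$ and $\sigma^2 = 1$), with $n = 30$, 20,000 replicates and an arbitrary seed. It compares the simulated sums with $n\mu$ and $n\sigma^2$, and checks how often a sum falls within one standard deviation of $n\mu$, which is about 68% of the time for a normal distribution.

```python
import random
from math import sqrt
from statistics import fmean, variance

rng = random.Random(2014)  # fixed seed so that the run is reproducible

n = 30               # sample size (assumed for illustration)
mu, sigma2 = 1.0, 1.0  # mean and variance of Exponential(rate = 1)
replicates = 20_000

# Each entry is the sum of n independent Exponential(1) draws.
sums = [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(replicates)]

sum_mean = fmean(sums)
sum_var = variance(sums)

# Fraction of sums within one standard deviation of n * mu.
sd = sqrt(n * sigma2)
within_one_sd = sum(abs(s - n * mu) < sd for s in sums) / replicates

print(f"mean of sums:     {sum_mean:.2f} (theory: {n * mu:.2f})")
print(f"variance of sums: {sum_var:.2f} (theory: {n * sigma2:.2f})")
print(f"within one SD:    {within_one_sd:.3f} (normal: about 0.683)")
```

The mean and variance of the sum follow exactly from the properties of expectation and independence; the CLT is what justifies describing the shape of the distribution as approximately normal.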


Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem

April 15, 2014

According to the central limit theorem, if

$n$ random variables, $X_1, ..., X_n$ , are independent and identically distributed,
$n$ is sufficiently large,

then the distribution of their sample mean, $\bar{X}_n$, is approximately normal, and this approximation improves as $n$ increases.

One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of $X_1, ..., X_n$ that has a finite mean and variance. On my new YouTube channel, you will find a video tutorial that provides an intuitive explanation of why this is true by considering a thought experiment of rolling 2 dice. This video focuses on the intuition rather than the mathematics of the CLT. In a later video, I will discuss the technical details of the CLT and how it applies to this example.
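
For readers who prefer code to video, the 2-dice thought experiment can also be sketched by brute-force enumeration. This is an independent illustration and does not reproduce the video; the function name `sum_distribution` and the text-based bar chart are assumptions. The sketch lists every equally likely outcome for fair six-sided dice and tabulates the distribution of the sum.

```python
from collections import Counter
from itertools import product


def sum_distribution(num_dice, sides=6):
    """Exact distribution of the sum of fair dice, by enumerating all rolls."""
    faces = range(1, sides + 1)
    counts = Counter(sum(roll) for roll in product(faces, repeat=num_dice))
    total_outcomes = sides**num_dice
    return {s: c / total_outcomes for s, c in sorted(counts.items())}


one_die = sum_distribution(1)   # flat: every face is equally likely
two_dice = sum_distribution(2)  # peaked in the middle

# Crude text histogram: one '#' per outcome out of 36.
for total, prob in two_dice.items():
    print(f"{total:2d} {'#' * round(prob * 36)}")
```

A single die has a flat distribution, but the sum of 2 dice already peaks at 7, because more combinations of faces produce middle values than extreme ones. Calling `sum_distribution(5)` or higher shows the bell shape emerging further.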


Mathematical and Applied Statistics Lesson of the Day – The Central Limit Theorem Applies to the Sample Mean

March 19, 2014

Having taught and tutored introductory statistics numerous times, I often hear students misinterpret the Central Limit Theorem by saying that, as the sample size gets bigger, the distribution of the data approaches a normal distribution. This is not true. If your data come from a non-normal distribution, their distribution stays the same regardless of the sample size.

Remember: The Central Limit Theorem says that, if $X_1, X_2, ..., X_n$ is an independent and identically distributed sample of random variables, then the distribution of their sample mean is approximately normal, and this approximation gets better as the sample size gets bigger.
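
The difference between the two statements can be checked with a small simulation. The sketch below rests on assumed settings chosen for illustration: an Exponential(1) parent distribution (whose skewness is 2), arbitrary sample sizes and an arbitrary seed. It uses sample skewness as a rough measure of how far a distribution is from the symmetric shape of a normal distribution.

```python
import random
from statistics import fmean


def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2**1.5


rng = random.Random(319)  # fixed seed so that the run is reproducible


def draw_sample(n):
    """Draw n observations from an Exponential(rate = 1) distribution."""
    return [rng.expovariate(1.0) for _ in range(n)]


# The raw data stay skewed, however many observations are collected.
skew_small = skewness(draw_sample(1_000))
skew_large = skewness(draw_sample(100_000))

# The sample mean (n = 50 per sample) is far closer to symmetric.
means = [fmean(draw_sample(50)) for _ in range(5_000)]
skew_of_means = skewness(means)

print(f"skewness of 1,000 raw observations:   {skew_small:.2f}")
print(f"skewness of 100,000 raw observations: {skew_large:.2f}")
print(f"skewness of 5,000 sample means:       {skew_of_means:.2f}")
```

Collecting more data does not pull the skewness of the raw observations toward 0; only the distribution of the sample mean moves toward the symmetric shape of a normal distribution.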


Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

July 29, 2013

Introduction

This is a follow-up post to my recent introduction of histograms. Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R. Today, I will examine this distribution in more detail by overlaying the histogram with parametric density curves and non-parametric kernel density plots. I will finally answer the question that I have asked (and hinted at answering) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like those in the plot above!

This is another post in my continuing series on exploratory data analysis (EDA). Previous posts in this series on EDA include
