Highlighting cells to quickly view Average, Count, and Sum in Excel

I recently needed to check my answers after some data analysis in Alteryx.  I computed many averages using a formula in Alteryx, and I wanted to check those results by calculating the average for a few randomly selected rows.  I did this by invoking a helpful tool in Microsoft Excel.  I will illustrate this functionality with some random data.

In Column E, I used a formula to calculate the average of the 3 populations in Columns B, C, and D.  To manually check that the formula is correct, I highlighted the 3 columns for ID #125.  On the bottom right, Excel calculates the average; it’s difficult to see in the picture below, but Excel confirms that the average is 707,154.

Whenever you highlight a range of cells containing numeric data, Excel will provide the average, count, and sum of the selected cells.  I did not know about this functionality when I first began working as a statistician, and I am very glad that I did learn it eventually – it is very useful for checking answers by analyzing a few randomly selected rows in Excel!

Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

• 5% in the 1st year
• 10% in the 2nd year
• 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of $1 + r$, where $r$ is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, $G$.  The average factor of growth from the 3 years is

$G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n}$,

where $r_i$ is the rate of return in year $i$, $i = 1, 2, 3, ..., n$.  The average annual rate of return is $G - 1$.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

Returning to our example, our average factor of growth is

$G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242$.

Thus, our annual rate of return is $G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%$.

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

1. The first set, $S_1$, consists of your data $x_1, x_2, ..., x_n$, and this set has a sample size of $n$.
2. The second, $S_2$,  set also has a sample size of $n$, but all $n$ values are the same – let’s call this common value $y$.
• What number must $y$ be such that the sums in $S_1$ and $S_2$ are equal?  This value of $y$ is the arithmetic mean of the first set.
• What number must $y$ be such that the products in $S_1$ and $S_2$ are equal?  This value of $y$ is the geometric mean of the first set.

Note that the geometric means is only applicable to positive numbers.

Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

Introduction

This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.