## Highlighting cells to quickly view Average, Count, and Sum in Excel

I recently needed to check my answers after some data analysis in Alteryx.  I computed many averages using a formula in Alteryx, and I wanted to check those results by calculating the average for a few randomly selected rows.  I did this by invoking a helpful tool in Microsoft Excel.  I will illustrate this functionality with some random data.

In Column E, I used a formula to calculate the average of the 3 populations in Columns B, C, and D.  To manually check that the formula is correct, I highlighted the 3 columns for ID #125.  On the bottom right, Excel calculates the average; it’s difficult to see in the picture below, but Excel confirms that the average is 707,154.

Whenever you highlight a range of cells containing numeric data, Excel will provide the average, count, and sum of the selected cells.  I did not know about this functionality when I first began working as a statistician, and I am very glad that I did learn it eventually – it is very useful for checking answers by analyzing a few randomly selected rows in Excel!

## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

#### Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

## Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

The harmonic mean, H, for $n$ positive real numbers $x_1, x_2, ..., x_n$ is defined as

$H = n \div (1/x_1 + 1/x_2 + .. + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}$.

This type of mean is useful for measuring the average of rates.  For example, consider a car travelling for 240 kilometres at 2 different speeds:

1. 60 km/hr for 120 km
2. 40 km/hr for another 120 km

Then its average speed for this trip is

$S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}$

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds.  If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics.  The average speed is defined as

$S_{avg} = \Delta \text{distance} \div \Delta \text{time}$

We already have the elapsed distance – it’s 240 km.  Let’s find the time elapsed for this trip.

$\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})$

$\Delta \text{time} = 5 \text{ hours}$

Thus,

$S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}$

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!

## Statistics Lesson and Warning of the Day – Confusion Between the Median and the Average

Yesterday, I attended an interesting seminar called “Transforming Healthcare through Big Data” at the Providence Health Care Research Institute‘s 2014 Research Day.  The seminar was delivered by Martin Kohn from Jointly Health, and I enjoyed it overall.  However, I noticed a glaring error about basic statistics that needs correction.

Martin wanted to highlight the overconfidence that many doctors have about their abilities, and he quoted Vinod Kohsla, the co-founder of Sun Microsystems, who said, “50% of doctors are below average.”  Martin then presented a study showing an absurdly high percentage of doctors who think that they are “above average”.  A Twitter conversation between attendees of a TED conference in San Francisco and Vinod himself confirms this quotation.

The statement “50% of doctors are below average” is wrong in general.  By definition, 50% of any population is below the median, and the median is only equal to the average if the population is symmetric.  (Examples of symmetric probability distributions are the normal distribution and the Student’s t-distribution.)  Vinod meant to say that “50% of doctors are below the median”, and he confirmed this in the aforementioned Twitter conversation; I am disappointed that he justified this mistake by claiming that it would be less understood.  I think that a TED audience would know what “median” means, and those who don’t can easily search for its meaning online or in books on their own.

In communicating truth, let’s use the correct vocabulary.