## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

#### Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data set with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

## Statistics Lesson and Warning of the Day – Confusion Between the Median and the Average

Yesterday, I attended an interesting seminar called “Transforming Healthcare through Big Data” at the Providence Health Care Research Institute’s 2014 Research Day.  The seminar was delivered by Martin Kohn from Jointly Health, and I enjoyed it overall.  However, I noticed a glaring error about basic statistics that needs correction.

Martin wanted to highlight the overconfidence that many doctors have about their abilities, and he quoted Vinod Khosla, the co-founder of Sun Microsystems, who said, “50% of doctors are below average.”  Martin then presented a study showing an absurdly high percentage of doctors who think that they are “above average”.  A Twitter conversation between attendees of a TED conference in San Francisco and Vinod himself confirms this quotation.

The statement “50% of doctors are below average” is wrong in general.  By definition, 50% of any population is below the median, and the median is guaranteed to equal the average only if the population’s distribution is symmetric.  (Examples of symmetric probability distributions are the normal distribution and the Student’s t-distribution.)  Vinod meant to say that “50% of doctors are below the median”, and he confirmed this in the aforementioned Twitter conversation; I am disappointed that he justified this mistake by claiming that the word “median” would be less understood.  I think that a TED audience would know what “median” means, and those who don’t can easily search for its meaning online or in books on their own.
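To make the distinction concrete, here is a minimal sketch in Python using made-up numbers; the ten "scores" are purely hypothetical and are not taken from the seminar or from any study.  One large value skews the data to the right, which pulls the average well above the median.

```python
import statistics

# Hypothetical, right-skewed scores for ten doctors (illustrative values only).
scores = [1, 2, 2, 3, 3, 4, 4, 5, 6, 40]

mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle value of the sorted data

# Fraction of the group that falls strictly below each measure of centre.
below_mean = sum(s < mean for s in scores) / len(scores)
below_median = sum(s < median for s in scores) / len(scores)

print(f"mean = {mean}, median = {median}")
print(f"below the mean:   {below_mean:.0%}")
print(f"below the median: {below_median:.0%}")
```

In this toy data set, the average is 7 and the median is 3.5, so 90% of the values are below the average while exactly 50% are below the median.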

In communicating truth, let’s use the correct vocabulary.

## Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

#### Introduction

This is the first of a series of posts on exploratory data analysis (EDA).  This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R.  This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets.  In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.