Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable.  If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable.  Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

The Example Data

Let’s use the data set “class” in the “sashelp” library, which should already be installed and available for you when you open SAS.  If you run the code

proc print
          data = sashelp.class noobs;
run;

you will see this data table:

Name Sex Age Height Weight
Joyce F 11 51.3 50.5
Jane F 12 59.8 84.5
Louise F 12 56.3 77.0
Alice F 13 56.5 84.0
Barbara F 13 65.3 98.0
Carol F 14 62.8 102.5
Judy F 14 64.3 90.0
Janet F 15 62.5 112.5
Mary F 15 66.5 112.0
Thomas M 11 57.5 85.0
James M 12 57.3 83.0
John M 12 59.0 99.5
Robert M 12 64.8 128.0
Jeffrey M 13 62.5 84.0
Alfred M 14 69.0 112.5
Henry M 14 63.5 102.5
Ronald M 15 67.0 133.0
William M 15 66.5 112.0
Philip M 16 72.0 150.0

The SAS Code – 2 Methods Involving PROC MEANS

Let’s calculate the average height by gender.  Here is the code to do it using PROC MEANS.  Since we will categorize the means by gender using the “BY” statement in PROC MEANS, we first need to sort the data by the variable “sex”.

proc sort
          data = sashelp.class;
          by sex;
run;


proc means
          data = sashelp.class
                    noprint
                    mean;
          var height;
          by sex;
          output
          out = mean_of_heights_by_sex (
                                            drop = _TYPE_
                                                   _FREQ_
                                       )    
                            mean = Mean_Height;
run;

There are 2 variables that I don’t want in the new data set: _TYPE_ and _FREQ_, so I drop them while the output data set is created.  The “mean” option in the “output” statement specifies the name of variable of that contains the mean.  You can leave it blank, and it will just use the original variable’s name in the new data set – in our case, “height”.  I want to specify that this isn’t just any height, but the mean height, so I declare a new variable name: Mean_Height.

Update on Wednesday, October 29, 2014: Thanks to a helpful comment below from Guillaume, here is another code to get the same result using the “CLASS” statement.  Note that I have added the “NWAY” option in the “PROC MEANS” statement to ensure that the overall mean for the entire data set is NOT printed.  This option does not require sorting ahead of time, because I am not using the “BY” statement.

proc means
          data = sashelp.class
                    nway
                    noprint
                    mean;
          var height;
          class sex;
          output
          out = mean_of_heights_by_sex (
                                            drop = _TYPE_
                                                   _FREQ_
                                       )    
                            mean = Mean_Height;
run;

If you are interested in learning more about the differences between using the “BY” statement and using the “CLASS” statement,

Viewing the Result

Now, let’s print the resulting data set to see what we get.  Both of the above methods will yield the same output data set.

proc print
          data = mean_of_heights_by_sex 
                    noobs label;
          label mean_height = 'Average Height';
run;

Here is the output data set!

Sex Mean Height
F 60.5889
M 63.9100

3 Responses to Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

  1. Guillaume says:

    Hi Eric,

    One little detail is that if you use the “by” statement, your input dataset has to be sorted. Replacing “by” by “class” allows you to obtain the same result without having to sort your dataset beforehand😉. Apparently class uses more computational time than by but unless you have millions of raw I’m sure you won’t notice the difference.

  2. Reblogged this on Psychology & Statistics Tutor:Mentor and commented:
    I have not used SAS myself~

Your thoughtful comments are much appreciated!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: