Highlighting cells to quickly view Average, Count, and Sum in Excel

I recently needed to check my answers after some data analysis in Alteryx.  I computed many averages using a formula in Alteryx, and I wanted to check those results by calculating the average for a few randomly selected rows.  I did this by invoking a helpful tool in Microsoft Excel.  I will illustrate this functionality with some random data.

In Column E, I used a formula to calculate the average of the 3 populations in Columns B, C, and D.  To manually check that the formula is correct, I highlighted the 3 columns for ID #125.  On the bottom right, Excel calculates the average; it’s difficult to see in the picture below, but Excel confirms that the average is 707,154.

Excel average

Whenever you highlight a range of cells containing numeric data, Excel will provide the average, count, and sum of the selected cells.  I did not know about this functionality when I first began working as a statistician, and I am very glad that I did learn it eventually – it is very useful for checking answers by analyzing a few randomly selected rows in Excel!
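The same spot check can also be scripted. The following is a minimal Python sketch, not anything that Excel exposes: the function name `selection_summary` and the sample values are made up for illustration. It assumes that the count covers every non-empty cell, while the average and sum use only the numeric cells.

```python
from numbers import Number

def selection_summary(cells):
    """Summarize a selection of cells with an average, a count, and a sum.

    Assumption: the count covers all non-empty cells, while the average
    and the sum use only the numeric cells.
    """
    non_empty = [c for c in cells if c is not None and c != ""]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = [c for c in non_empty
               if isinstance(c, Number) and not isinstance(c, bool)]
    total = sum(numeric)
    average = total / len(numeric) if numeric else None
    return {"average": average, "count": len(non_empty), "sum": total}

# Hypothetical values for the three population columns of one row.
row = [120, 450, 330]
print(selection_summary(row))
```

Comparing the returned average against the value that the formula produced for the same row is the scripted equivalent of reading the status bar.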

Eric’s Enlightenment for Friday, May 29, 2015

  1. P2N3: An aromatic ion made of just phosphorus and nitrogen.  (Yes, aromaticity can be entirely inorganic!)
  2. Using 3-D printing and plastics to make prosthetics.
  3. David Beckworth and Scott Sumner talk at length about reforming monetary policy with NGDP targeting in this video interview/seminar.
  4. Anky Lai gives a nice introduction to PROC TABULATE (PDF document) – an alternative to PROC FREQ and PROC MEANS in SAS.  Check out her awesome code samples for generating nicely formatted tables and exporting them conveniently into spreadsheets in Excel!

Eric’s Enlightenment for Friday, May 8, 2015

  1. A nice set of tutorials on Microsoft Excel at OfficeTuts by Tomasz Decker.
  2. “We had proved that an assertion was indeed true in all of the difficult cases, but it turned out to be false in the simple case. We never bothered to check.”  Are mistakes in academic mathematics being effectively identified and corrected?  Vladimir Voevodsky (2002 Fields Medalist) published a major theorem in 1990, but Carlos Simpson found an error with the theorem in 1998.  It wasn’t until 2013 that Voevodsky finally became convinced that his theorem was wrong.  This motivated him to develop “proof assistants” – computer programs that help to prove mathematical theorems.
  3. Synthesizing artificial muscles from gold-plated onion skins
  4. Andrew Gelman debriefs his presentation to Princeton’s economics department about unbiasedness and econometrics.

Resources for Learning Data Manipulation in R, SAS and Microsoft Excel

I had the great pleasure of speaking to the Department of Statistics and Actuarial Science at Simon Fraser University last Friday to share my career advice with its students and professors.  I emphasized the importance of learning skills in data manipulation during my presentation, and I want to supplement my presentation by posting some useful resources for this skill.  If you are new to data manipulation, these are good guides for how to get started in R, SAS and Microsoft Excel.

For R, I recommend Winston Chang’s excellent web site, “Cookbook for R”.  It has a specific section on manipulating data; this is a comprehensive list of the basic skills that every data analyst and statistician should learn.

For SAS, I recommend the UCLA statistical computing web page that is adapted from Oliver Schabenberger’s web site.

For Excel, I recommend Excel Easy, a web site that was started at the University of Amsterdam in 2010.  It is a good resource for learning about Excel in general, and there is no background required.  I specifically recommend the “Functions” and “Data Analysis” sections.

A blog called teachr has a good list of Top 10 skills in Excel to learn.

On this blog, I document the tips and tricks for R and SAS that I use often, especially the ones that I struggled to find on the Internet.  I encourage you to check them out from time to time, especially in my “Data Analysis” category.

If you have any other favourite resources for learning data manipulation or data analysis, please share them in the comments!

Performing Logistic Regression in R and SAS

Introduction

My statistics education focused heavily on normal linear least-squares regression, and a professor in an introductory statistics class even told me that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.  In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

  1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
  2. treatment – Whether or not the patient completed an anger control treatment program.
  3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.
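To make the model concrete, here is a minimal Python sketch of a logistic regression of ha2 on treatment and anxiety.  It is not the R or SAS script from this post: the rows below are made up for illustration and are not the data in heart attack.xlsx, the helper names are hypothetical, and the fit uses plain gradient ascent on the log-likelihood rather than the routines that R or SAS would use.  The anxiety score is standardized first to keep the iterations stable.

```python
import math

# Hypothetical rows of (treatment, anxiety, ha2). These values are made up
# for illustration and are NOT the data from the spreadsheet in this post.
data = [
    (1, 70, 1), (1, 80, 1), (1, 50, 0), (1, 60, 0), (1, 40, 0), (1, 65, 0),
    (0, 70, 1), (0, 80, 1), (0, 50, 1), (0, 60, 1), (0, 40, 0), (0, 55, 0),
    (0, 75, 0),
]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def standardize(values):
    """Center the values to mean 0 and scale them to standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def predict(weights, x):
    """Predicted probability that ha2 = 1; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def log_likelihood(weights, rows, labels):
    """Bernoulli log-likelihood of the labels under the fitted model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(rows, labels, learning_rate=0.1, iterations=5000):
    """Estimate the coefficients by batch gradient ascent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - predict(weights, x)
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

anxiety_z = standardize([a for _, a, _ in data])
rows = [(t, z) for (t, _, _), z in zip(data, anxiety_z)]
labels = [y for _, _, y in data]

weights = fit_logistic(rows, labels)
print("intercept, treatment, anxiety (standardized):", weights)
```

Because anxiety is standardized here, its coefficient is on a per-standard-deviation scale and would not match a fit on the raw scores.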

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!


Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing

Introduction

In my job as a Biostatistical Analyst at the British Columbia (BC) Cancer Agency in Vancouver, I recently needed to get the postal codes for the hospitals in BC.  I found a data table of the hospitals with their addresses, but I needed to extract the postal codes from the addresses.  In this tutorial, I will show you some text processing techniques in SAS that I used to extract the postal codes from that raw data file.
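The core idea can be sketched with a regular expression.  The snippet below is a Python illustration rather than the SAS code from this post, and the addresses in it are made up.  It assumes that a Canadian postal code follows the letter-digit-letter, digit-letter-digit pattern, with an optional space or hyphen in the middle.

```python
import re

# Letter-digit-letter, optional space or hyphen, digit-letter-digit.
POSTAL_CODE = re.compile(r"\b([A-Z]\d[A-Z])[ -]?(\d[A-Z]\d)\b", re.IGNORECASE)

def extract_postal_code(address):
    """Return the first postal code found in the address as 'A1A 1A1', or None."""
    match = POSTAL_CODE.search(address)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}".upper()

# Made-up addresses for illustration; these are not real hospitals.
addresses = [
    "100 Example St, Sampletown, BC V0A 1B2",
    "200 Placeholder Ave, Testville, BC v9z9y8",
    "PO Box 5, Nowhere, BC",
]
for address in addresses:
    print(address, "->", extract_postal_code(address))
```

Normalizing every match to upper case with a single space makes the extracted codes easier to join against other tables later.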

* This blog post contains information licensed under the Open Government License – British Columbia.

Read the rest of this post to get the SAS code for extracting the postal codes and the final spreadsheet that contains the postal codes of the hospitals in British Columbia!
