categorical variable | The Chemical Statistician

Convert multiple variables between character and numeric formats in SAS

April 27, 2018

Introduction

I often get data that are coded as character, but are actually meant to be numeric. Thus, converting them into the correct variable types is a common task, and SAS Note #24590 shows how to do so. However, I recently needed to do hundreds of these conversions, so I wanted some code to accomplish this quickly and accurately. This tutorial shows how to do so.

Let’s consider this small data set in SAS as an example. They are hypothetical statistics of 3 players from a basketball game.

data basketball1;
     input jersey points $ rebounds $ assists $;
     datalines;
21 10 14 1
4  11 3  12
23 29 4  5
;
run;

The 3 performance metrics (points, rebounds, and assists) are clearly numeric, but they are currently coded as character. (You can use PROC CONTENTS to confirm this if needed.)
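As a quick check, a minimal call along these lines lists the type of each variable. It assumes that BASKETBALL1 has already been created in the WORK library by the DATA step above.

```sas
* list the variables in BASKETBALL1 along with their types and lengths;
proc contents
     data = basketball1;
run;
```

In the variable list that PROC CONTENTS prints, JERSEY should appear with type Num, while POINTS, REBOUNDS, and ASSISTS should appear with type Char.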

The jersey number is really a character variable, because its magnitude has no real-life meaning. The National Basketball Association (NBA) allows “00” as a possible jersey number. (Robert Parish wore this jersey number; he won 4 NBA championships and reached the Naismith Basketball Hall of Fame.) If you code “00” as a numeric variable, then it will render as “0”. Thus, for NBA jersey numbers, it is best to save it as a character variable.

I can convert these variables into the correct types using the following code. Note that I chose “2.” as the format width for “JERSEY”, because I know that jersey numbers in the NBA have, at most, 2 digits.

data basketball2;
     set basketball1;
 
     jersey2 = put(jersey, 2.);
     drop jersey;
     rename jersey2 = jersey;

     points2 = input(points, 8.);
     drop points;
     rename points2 = points;

     rebounds2 = input(rebounds, 8.);
     drop rebounds;
     rename rebounds2 = rebounds;

     assists2 = input(assists, 8.);
     drop assists;
     rename assists2 = assists;
run;

Despite this success, the above code can be very cumbersome when I need to do this for many variables, and this situation arose in my job recently. In this tutorial, I will show a fast way of doing these conversions for many variables at once. I will use this BASKETBALL1 data set as an example, and I will convert POINTS, REBOUNDS, and ASSISTS from character to numeric simultaneously.
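One common way to shorten code like this is to loop over a pair of arrays. The following is only a sketch of that general idea, and it is not necessarily the method developed in the rest of this post. The names BASKETBALL3, CHAR_VARS, NUM_VARS, and the "_n" suffix are placeholders chosen for illustration, and the code assumes that BASKETBALL1 exists as created above.

```sas
data basketball3;
     set basketball1;

     * group the existing character variables in one array;
     array char_vars {*} $ points rebounds assists;

     * create the numeric replacements in a second array of the same size;
     array num_vars {*} points_n rebounds_n assists_n;

     * convert each character value to its numeric counterpart;
     do i = 1 to dim(char_vars);
          num_vars{i} = input(char_vars{i}, 8.);
     end;

     * discard the originals and give the new variables the old names;
     drop i points rebounds assists;
     rename points_n   = points
            rebounds_n = rebounds
            assists_n  = assists;
run;
```

The conversion itself is written only once inside the loop, but the variable names still have to be typed in the ARRAY, DROP, and RENAME statements, so this sketch helps more with a handful of variables than with hundreds.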


Filed under Data Analysis, SAS Programming Tagged with categorical data, categorical variable, character data, character variable, continuous data, continuous variable, data analysis, numeric data, numeric variable, SAS, sas programming

A macro to automate the creation of indicator variables in SAS

April 25, 2018

In a recent blog post, I introduced an easy and efficient way to create indicator variables from categorical variables in SAS. This method pretends to run logistic regression, but it really is using PROC LOGISTIC to get the design matrix based on dummy-variable coding. I shared SAS code for how to do so, step-by-step.

I write this follow-up post to provide a macro that you can use to execute all of those steps in one line. If you have not read my previous post on this topic, then I strongly encourage you to do that first. Don’t use this macro blindly.

Here is the macro. The key steps are:

1. Run PROC LOGISTIC to get the design matrix (which has the indicator variables)
2. Merge the original data with the newly created indicator variables
3. Delete the “INDICATORS” data set, which was created in an intermediate step

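%macro create_indicators(input_data, target, covariates, output_data);

/* use PROC LOGISTIC only to write the design matrix, with
   GLM (dummy-variable) coding, to the data set INDICATORS */
proc logistic
     data = &input_data
          noprint
          outdesign = indicators;
     class &covariates / param = glm;
     model &target = &covariates;
run;


/* one-to-one merge (no BY statement) of the original data
   with the indicator variables */
data &output_data;
      merge    &input_data
               indicators (drop = Intercept &target);
run;


/* delete the intermediate INDICATORS data set; QUIT ends PROC DATASETS,
   which supports RUN-group processing and stays active after RUN */
proc datasets 
     library = work
          noprint;
     delete indicators;
run;
quit;

%mend;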

I will use the built-in data set SASHELP.CARS to illustrate the use of my macro. As you can see, my macro can accept multiple categorical variables as inputs for creating indicator variables. I will do that here for the variables TYPE, MAKE, and ORIGIN.


Filed under Applied Statistics, Biostatistics, Categorical Data Analysis, Data Analysis, SAS Programming, Statistics, Tutorials Tagged with categorical data, Categorical Data Analysis, categorical variable, data analysis, dummy coding, dummy variables, indicator, indicator variable, indicator variables, indicators, SAS, sas programming, statistics

An easy and efficient way to create indicator variables (a.k.a. dummy variables) from a categorical variable in SAS

April 24, 2018

Introduction

In statistics and biostatistics, the creation of binary indicators is a very useful practice.

- They can be useful predictor variables in statistical models.
- They can reduce the amount of memory required to store the data set.
- They can treat a categorical covariate as a continuous covariate in regression, which has certain mathematical conveniences.

However, the creation of indicator variables can be a long, tedious, and error-prone process. This is especially true if there are many categorical variables, or if a categorical variable has many categories. In this tutorial, I will show an easy and efficient way to create indicator variables in SAS. I learned this technique from SAS usage note #23217: Saving the coded design matrix of a model to a data set.

The Example Data Set

Let’s consider the PRDSAL2 data set that is built into the SASHELP library. Here are the first 5 observations; due to a width constraint, I will show the first 5 columns and the last 6 columns separately. (I encourage you to view this data set using PROC PRINT in SAS by yourself.)

COUNTRY	STATE	ACTUAL	PREDICT
U.S.A.	California	$987.36	$692.24
U.S.A.	California	$1,782.96	$568.48
U.S.A.	California	$32.64	$16.32
U.S.A.	California	$1,825.12	$756.16
U.S.A.	California	$750.72	$723.52

PRODTYPE	PRODUCT	YEAR	QUARTER	MONTH	MONYR
FURNITURE	SOFA	1995	1	Jan	JAN95
FURNITURE	SOFA	1995	1	Feb	FEB95
FURNITURE	SOFA	1995	1	Mar	MAR95
FURNITURE	SOFA	1995	2	Apr	APR95
FURNITURE	SOFA	1995	2	May	MAY95
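To see why the manual approach becomes tedious, here is a minimal sketch of it that uses only the category values visible in the rows above. The names PRDSAL2_MANUAL, IS_FURNITURE, and IS_SOFA are hypothetical and chosen for illustration.

```sas
data prdsal2_manual;
     set sashelp.prdsal2;

     * a logical comparison in SAS returns 1 if true and 0 if false;
     is_furniture = (prodtype = 'FURNITURE');
     is_sofa      = (product  = 'SOFA');
run;


* print the first 5 rows to check the new indicators;
proc print
     data = prdsal2_manual (obs = 5);
     var prodtype product is_furniture is_sofa;
run;
```

For the 5 observations shown above, both indicators should equal 1. Every additional category of every categorical variable needs its own line of this kind, and each category value must be typed exactly, which is where the tedium and the errors come from.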


Use the LENGTH statement to pre-set the lengths of character variables in SAS – with a comparison to R

August 16, 2017

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected. Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

data c1;
     set sashelp.class;
 
     * define a new character variable to classify someone as tall or short;
     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';
run;


* print the results for the first 5 rows;
proc print
     data = c1 (obs = 5);
run;

Here is the result:

Obs	Name	Sex	Age	Height	Weight	height_class
1	Alfred	M	14	69.0	112.5	Tall
2	Alice	F	13	56.5	84.0	Shor
3	Barbara	F	13	65.3	98.0	Tall
4	Carol	F	14	62.8	102.5	Tall
5	Henry	M	14	63.5	102.5	Tall

What happened? Why does the word “Short” render as “Shor”?
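In brief, SAS fixes the length of a new character variable when it first encounters that variable in the DATA step. Here, the first assignment is 'Tall', so HEIGHT_CLASS gets a length of 4, and 'Short' is truncated to fit. A minimal sketch of the fix named in the title of this post is to declare the length before the first assignment; the data set name C2 and the length of 5 are my choices for this example.

```sas
data c2;
     set sashelp.class;

     * pre-set the length so that the longest value, 'Short', fits;
     length height_class $ 5;

     if height > 60
     then height_class = 'Tall';
          else height_class = 'Short';
run;


* print the results for the first 5 rows;
proc print
     data = c2 (obs = 5);
run;
```

With the LENGTH statement placed before the IF-THEN/ELSE statements, the value for Alice should render as “Short” in full.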


Filed under Categorical Data Analysis, Data Analysis, R programming, SAS Programming, Statistics, Tutorials Tagged with categorical data, categorical variable, character data, character variable, length(), R, r programing, SAS, sas programming

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

February 3, 2015

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3


Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with categorical variable, class(), count, data frame, factor, frequency table, install.packages(0, mtcars, names(), plyr, R, R programming, subset, table()

Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

August 4, 2014

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals. (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.) Today, I want to share a second video of calculating expected counts – this time, using joint probabilities. This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table. Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multiplied by the grand total of the entire sample. This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.
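In symbols, the calculation can be sketched as follows. The notation is my own shorthand for this summary and is not taken from the video: n is the grand total, n_{i.} is the total of row i, n_{.j} is the total of column j, and hats denote estimates under the null hypothesis of independence.

```latex
\hat{p}_{ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j}
             = \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n},
\qquad
\hat{E}_{ij} = n\,\hat{p}_{ij}
             = \frac{n_{i\cdot}\,n_{\cdot j}}{n}
```

The last expression is the same as the formula based on marginal totals, so the 2 methods agree exactly in theory; any difference in practice comes from rounding the intermediate proportions.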

I encourage you to watch both of my videos on expected counts on my YouTube channel to gain a deeper understanding of how and why they can be calculated. Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.

Filed under Applied Statistics, Biostatistics, Categorical Data Analysis, Statistics, Tutorials, Video Tagged with Categorical Data Analysis, categorical random variable, categorical variable, contingency table, expected count, joint probability, math, probability, statistics; biostatistics

Displaying Isotopic Abundance Percentages with Bar Charts and Pie Charts

February 17, 2013

The Structure of an Atom

An atom consists of a nucleus at the centre and electrons moving around it. The nucleus contains a mixture of protons and neutrons. For most purposes in chemistry, the two most important properties of these 3 types of particles are their masses and charges. In terms of charge, protons are positive, electrons are negative, and neutrons are neutral. A proton’s mass is roughly the same as a neutron’s mass, but a proton is almost 2,000 times heavier than an electron.

[Image: a lithium atom, which has 3 electrons, 3 protons, and 4 neutrons. Source: Wikimedia Commons]


Filed under Basic Chemistry, Chemistry, Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with atomic mass number, atomic number, bar chart, barplot(), categorical variable, chemistry, data, Data Visualization, descriptive statistics, isotope, neutron, pie chart, pie(), plot, plots, plotting, PNG, proton, R, R programming, statistics
