Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing

Introduction

In my job as a Biostatistical Analyst at the British Columbia (BC) Cancer Agency in Vancouver, I recently needed to get the postal codes for the hospitals in BC.  I found a data table of the hospitals with their addresses, but I needed to extract the postal codes from the addresses.  In this tutorial, I will show you some text processing techniques in SAS that I used to extract the postal codes from that raw data file.

* This blog post contains information licensed under the Open Government License – British Columbia.

Read the rest of this post to get the SAS code for extracting the postal codes and the final spreadsheet that contains the postal codes of the hospitals in British Columbia!

Read more of this post

Useful Functions in R for Manipulating Text Data

Introduction

In my current job, I study HIV at the genetic and biochemical levels.  Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text.  (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.)  In this post, I describe some common functions in R that I often use for text processing.

Obtaining Basic Information about Character Variables

In R, I often work with text data in the form of character variables.  To check if a variable is a character variable, use the is.character() function.

> year = 2014
> is.character(year)
[1] FALSE

If a variable is not a character variable, you can convert it to a character variable using the as.character() function.

> year.char = as.character(year)
> is.character(year.char)
[1] TRUE

A basic piece of information about a character variable is the number of characters that exist in this string.  Use the nchar() function to obtain this information.

> nchar(year.char)
[1] 4

Read more of this post