Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing

Introduction

In my job as a Biostatistical Analyst at the British Columbia (BC) Cancer Agency in Vancouver, I recently needed to get the postal codes for the hospitals in BC.  I found a data table of the hospitals with their addresses, but I needed to extract the postal codes from the addresses.  In this tutorial, I will show you some text processing techniques in SAS that I used to extract the postal codes from that raw data file.

* This blog post contains information licensed under the Open Government License – British Columbia.

Read the rest of this post to get the SAS code for extracting the postal codes and the final spreadsheet that contains the postal codes of the hospitals in British Columbia!

The Raw Data File

A quick search on Google led me to a PDF file of the addresses of the hospitals in British Columbia.  Thanks to Daphne Guh from the Centre for Health Evaluation and Outcome Sciences (CHEOS), I obtained an Excel spreadsheet of the same table – click here to download this spreadsheet.  (I later learned that you can convert tables from PDF to Excel using Adobe Acrobat X Pro.).  This was my input for my SAS program for extracting the postal codes.  The hospitals are uniquely identified by their facility numbers.

The address information is very confusing.  The address of the hospital is in one column, but the name of the hospital is in another column.  Even more confusingly, the name of the hospital is stuck with the address of the health authority.  Thus, I have written long and explicit column names to distinguish between the 2 address columns.

The Extraction Strategy

Canadian postal codes are sets of 2 character triplets that are separated by a space:

  • The first triplet has a letter-number-letter permutation.
  • The second triplet has a number-letter-number permutation.

In British Columbia, all postal codes start with the letter “V” in the first triplet.  My strategy was to look for any string in the hospital address column that starts with a capital “V”, followed by a number.  Since there are only 10 possible numbers (0-9), typing 10 if/else statements was a quick and easy task.  Once I identified the position of that first “V”, I then extracted the 7-character text starting from that position in the hospital address column.

As always, I checked my results to ensure that the correct text strings were extracted.  Unfortunately, 2 postal codes have a capital V in the second triplet of characters.  To account for these 2 exceptions, I manually typed their postal codes in the if-else statements.  Thankfully, my strategy worked for the rest of the hospitals.

Astute programmers may wonder why I did not look for strings that started with a space, then a “V”, then a number.  I tried that, but the space before the first “V” is actually a generic symbol for an unrecognizable character – it likely denotes the indentation before the postal codes for most of the addresses; you can see this in the output window (or in the .lst file) when you print a few rows of the raw data file.

The SAS Script

Here is my script for implement my above strategy to extract the postal codes.  I renamed the columns in the spreadsheet to shorter names for convenience.

Note my use of the

  • find() function for finding the position along the “address” variable that contains a “V” followed by a number.
  • substr() function for extracting my string of interest from the “address” variable
  • some useful formatting options at the beginning of my script
dm 'cle log; cle out;';
ods html closeods html;

dm 'odsresults; clear';
ods listing close;
ods listing;

options 
     noovp
     linesize = 79
     formdlim = '-'
     pageno = min;

* import hospital address list from Ministry of Health;
proc import 
          datafile = "Your Directory Path Here\hospital names and codes.xlsx" 
                     dbms = xlsx
                     out = hospitals
                     replace;
run;

* extract postal codes from this file;
* look for a string starting with V and then a number;
* if such a string exists, then get the starting position of the postal code in the "name" column;

data postal_codes;
          set hospitals
                     (
                               rename = (City_and_Hospital_Address = address)
                               keep = City_and_Hospital_Address Facility_Number
                     );
          if find(address, 'V0') > 0
                     then start_position = find(address, 'V0');

          else if find(address, 'V1') > 0
                     then start_position = find(address, 'V1');

          else if find(address, 'V2') > 0
                     then start_position = find(address, 'V2');

          else if find(address, 'V3') > 0
                     then start_position = find(address, 'V3');

          else if find(address, 'V4') > 0
                     then start_position = find(address, 'V4');

          else if find(address, 'V5') > 0
                     then start_position = find(address, 'V5');

          else if find(address, 'V6') > 0
                     then start_position = find(address, 'V6');

          else if find(address, 'V7') > 0 
                     then start_position = find(address, 'V7');

          else if find(address, 'V8') > 0
                     then start_position = find(address, 'V8');

          else if find(address, 'V9') > 0
                     then start_position = find(address, 'V9');

          label Postal_Code = 'Postal Code';

          * 2 hospitals have postal codes with "V" in the 2nd triplet;
          * let's manually enter the postal codes for them to avoid errors;

          if facility_number = '204'
                     then Postal_Code = 'V8N 1V7';

          else if facility_number = '508'
                     then Postal_Code = 'V9W 3V1';

          * all other postal codes can be extracted using the substr() function from the starting position;
          else Postal_Code = substr(address, start_position, 7);

          * delete any row without a postal code (i.e. a starting position);
          if start_position = .
                     then delete;
run;

If you run

proc print
     data = postal_codes (obs = 5)
              noobs;
run;

you will see the first 5 rows of the data set “CODES”:

address Facility_Number start_position Postal_Code
ABBOTSFORD 32900 Marshall Road V2S 0C2 Ph: (604) 851-4700 609 34 V2S 0C2
ALERT BAY 49 School Rd V0N 1A0 Ph: (250) 974-5585 Fax: (250) 974-5422 507 26 V0N 1A0
ASHCROFT 700 Elm St Box 488 V0K 1A0 Ph: (250) 453-2211 Fax: (250) 453-1921 408 32 V0K 1A0
BELLA BELLA 88 Waglisla St Bella Bella V0T 1Z0 Ph: (250) 957-2314 Fax: (250) 957-2612 904 44 V0T 1Z0
BELLA COOLA (Mackay St) Box 220 V0T 1C0 Ph: (250) 799-5311 Fax: (250) 799-5635 906 35 V0T 1C0

You can download the full Excel spreadsheet of the output table here.

BC Open License Acknowledgment

I am grateful to the Government of British Columbia for allowing me to share the original PDF file, its equivalent Excel spreadsheet, and my output Excel spreadsheet with the extracted postal codes.  These documents contain information licensed under the Open Government License of British Columbia.

4 Responses to Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing

  1. Patrick Daniele says:

    Another approach to this problem is to use Prxparse, prxmatch, and prxposn. They will allow you to search for patterns within your string. I wrote a program that uses these functions to do the same processing.

    data Postal_Codes;
    set hospitals(rename = (City_and_Hospital_Address = address)
    keep = City_and_Hospital_Address Facility_Number);

    /*Define Pattern of form V#A A#A, where A is any character*/
    pattern = prxparse(‘/[V]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}/’);
    /*Determine if pattern is in string*/
    if prxmatch(pattern, address) then do;
    /*If so, extract*/
    postal_code = prxposn(pattern,0,address);
    end;

    /*Since there are headers and categories embedded in the dataset, remove them*/
    if missing(postal_code) then delete;
    run;

    These functions have the potential to save a lot of time and allow for complex patterns to be found.

  2. Maja Grubisic says:

    Just used the spreadsheet with hospital codes you posted here Eric – thank you!

Your thoughtful comments are much appreciated!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: