Extracting the Postal Codes from Addresses of Hospitals in British Columbia – An Exercise in SAS Text Processing
November 17, 2014 4 Comments
In my job as a Biostatistical Analyst at the British Columbia (BC) Cancer Agency in Vancouver, I recently needed to get the postal codes for the hospitals in BC. I found a data table of the hospitals with their addresses, but I needed to extract the postal codes from the addresses. In this tutorial, I will show you some text processing techniques in SAS that I used to extract the postal codes from that raw data file.
* This blog post contains information licensed under the Open Government License – British Columbia.
Read the rest of this post to get the SAS code for extracting the postal codes and the final spreadsheet that contains the postal codes of the hospitals in British Columbia!
The Raw Data File
A quick search on Google led me to a PDF file of the addresses of the hospitals in British Columbia. Thanks to Daphne Guh from the Centre for Health Evaluation and Outcome Sciences (CHEOS), I obtained an Excel spreadsheet of the same table – click here to download this spreadsheet. (I later learned that you can convert tables from PDF to Excel using Adobe Acrobat X Pro.). This was my input for my SAS program for extracting the postal codes. The hospitals are uniquely identified by their facility numbers.
The address information is very confusing. The address of the hospital is in one column, but the name of the hospital is in another column. Even more confusingly, the name of the hospital is stuck with the address of the health authority. Thus, I have written long and explicit column names to distinguish between the 2 address columns.
The Extraction Strategy
Canadian postal codes are sets of 2 character triplets that are separated by a space:
- The first triplet has a letter-number-letter permutation.
- The second triplet has a number-letter-number permutation.
In British Columbia, all postal codes start with the letter “V” in the first triplet. My strategy was to look for any string in the hospital address column that starts with a capital “V”, followed by a number. Since there are only 10 possible numbers (0-9), typing 10 if/else statements was a quick and easy task. Once I identified the position of that first “V”, I then extracted the 7-character text starting from that position in the hospital address column.
As always, I checked my results to ensure that the correct text strings were extracted. Unfortunately, 2 postal codes have a capital V in the second triplet of characters. To account for these 2 exceptions, I manually typed their postal codes in the if-else statements. Thankfully, my strategy worked for the rest of the hospitals.
Astute programmers may wonder why I did not look for strings that started with a space, then a “V”, then a number. I tried that, but the space before the first “V” is actually a generic symbol for an unrecognizable character – it likely denotes the indentation before the postal codes for most of the addresses; you can see this in the output window (or in the .lst file) when you print a few rows of the raw data file.
The SAS Script
Here is my script for implement my above strategy to extract the postal codes. I renamed the columns in the spreadsheet to shorter names for convenience.
Note my use of the
- find() function for finding the position along the “address” variable that contains a “V” followed by a number.
- substr() function for extracting my string of interest from the “address” variable
- some useful formatting options at the beginning of my script
dm 'cle log; cle out;'; ods html close; ods html; dm 'odsresults; clear'; ods listing close; ods listing; options noovp linesize = 79 formdlim = '-' pageno = min; * import hospital address list from Ministry of Health; proc import datafile = "Your Directory Path Here\hospital names and codes.xlsx" dbms = xlsx out = hospitals replace; run; * extract postal codes from this file; * look for a string starting with V and then a number; * if such a string exists, then get the starting position of the postal code in the "name" column; data postal_codes; set hospitals ( rename = (City_and_Hospital_Address = address) keep = City_and_Hospital_Address Facility_Number ); if find(address, 'V0') > 0 then start_position = find(address, 'V0'); else if find(address, 'V1') > 0 then start_position = find(address, 'V1'); else if find(address, 'V2') > 0 then start_position = find(address, 'V2'); else if find(address, 'V3') > 0 then start_position = find(address, 'V3'); else if find(address, 'V4') > 0 then start_position = find(address, 'V4'); else if find(address, 'V5') > 0 then start_position = find(address, 'V5'); else if find(address, 'V6') > 0 then start_position = find(address, 'V6'); else if find(address, 'V7') > 0 then start_position = find(address, 'V7'); else if find(address, 'V8') > 0 then start_position = find(address, 'V8'); else if find(address, 'V9') > 0 then start_position = find(address, 'V9'); label Postal_Code = 'Postal Code'; * 2 hospitals have postal codes with "V" in the 2nd triplet; * let's manually enter the postal codes for them to avoid errors; if facility_number = '204' then Postal_Code = 'V8N 1V7'; else if facility_number = '508' then Postal_Code = 'V9W 3V1'; * all other postal codes can be extracted using the substr() function from the starting position; else Postal_Code = substr(address, start_position, 7); * delete any row without a postal code (i.e. a starting position); if start_position = . then delete; run;
If you run
proc print data = postal_codes (obs = 5) noobs; run;
you will see the first 5 rows of the data set “CODES”:
|ABBOTSFORD 32900 Marshall Road V2S 0C2 Ph: (604) 851-4700||609||34||V2S 0C2|
|ALERT BAY 49 School Rd V0N 1A0 Ph: (250) 974-5585 Fax: (250) 974-5422||507||26||V0N 1A0|
|ASHCROFT 700 Elm St Box 488 V0K 1A0 Ph: (250) 453-2211 Fax: (250) 453-1921||408||32||V0K 1A0|
|BELLA BELLA 88 Waglisla St Bella Bella V0T 1Z0 Ph: (250) 957-2314 Fax: (250) 957-2612||904||44||V0T 1Z0|
|BELLA COOLA (Mackay St) Box 220 V0T 1C0 Ph: (250) 799-5311 Fax: (250) 799-5635||906||35||V0T 1C0|
You can download the full Excel spreadsheet of the output table here.
BC Open License Acknowledgment
I am grateful to the Government of British Columbia for allowing me to share the original PDF file, its equivalent Excel spreadsheet, and my output Excel spreadsheet with the extracted postal codes. These documents contain information licensed under the Open Government License of British Columbia.