← Eric’s Enlightenment for Friday, June 5, 2015

Organic Chemistry Lesson of the Day – The 4 Conformational Isomers of Butane →

How to Extract a String Between 2 Characters in R and SAS

June 18, 2015 24 Comments

Introduction

I recently needed to work with date values that look like this:

mydate

Jan 23/2

Aug 5/20

Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash. I needed to think about how to program this carefully in both R and SAS, because

the length of the day could be 1 or 2 characters long
I needed a code that adapted to this varying length from observation to observation
there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS. I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

Extracting a String Between 2 Characters in R

I will write a function called getstr() in R to extract a string between 2 characters. The strategy is simple:

Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.

##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician

# clear all variables in workspace
rm(list=ls(all=TRUE))

# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
 
     # check that all 3 inputs are character variables
     if (!is.character(mystring))
     {
          stop('The parent string must be a character variable.')
     }
 
     if (!is.character(initial.character))
     {
          stop('The initial character must be a character variable.')
     }
 
 
     if (!is.character(final.character))
     {
          stop('The final character must be a character variable.')
     }

  
 
     # pre-allocate a vector to store the extracted strings
     snippet = rep(0, length(mystring))
 
 
 
     for (i in 1:length(mystring))
     {
          # extract the initial position
          initial.position = gregexpr(initial.character, mystring[i])[[1]][1] + 1
  
          # extract the final position
          final.position = gregexpr(final.character, mystring[i])[[1]][1] - 1
 
          # extract the substring between the initial and final positions, inclusively
          snippet[i] = substr(mystring[i], initial.position, final.position)
     }
 
     return(snippet)
}


# use the getstr() function to extract the day between the comma and the slash in "mydate"
getstr(mydate, ' ', '/')

Here is the output from getstr() on the vector “mydate”

> getstr(mydate, ' ', '/')
 [1] "23" "5" "17"

Extracting a String Between 2 Characters in SAS

I will write a macro program called %getstr(). It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.

The only tricky part in this macro program was creating a new data set name. The input data set is called “dates”, and I wanted to create a new data set called “dates2”. I accomplished that by appending %dataset with “.2” within the macro.

First, let’s create the input data set. Notice my use of the “#” as a delimiter when inputting the dates.

data dates;
     infile datalines dlm = '#';
     input mydate $;
     datalines;
          Jan 23/2015#
          Aug 5/2001#
          Dec 17/2007
     ;
run;

Let’s now write the macro program %getstr(). It will create a new data set with the appendix “2”.

%macro getstr(dataset, string_variable);

data &dataset.2;
          set &dataset;

          * search the string for the position of the space after the month;
           space_position = INDEX(&string_variable, ' ');

          * search the string for the position of the slash after the month;
           slash_position = INDEX(&string_variable, '/');

          * calculate the length between the space and the slash;
           space_to_slash = slash_position - space_position;

          * extract the day from the original string (the character(s) between the space and the slash;
           day = substr(&string_variable, space_position, space_to_slash);
run;

%mend getstr;

Let’s use the %getstr() macro program to create a new data set called “dates2” that contains the day of each date. I’ll print the results afteward.

%getstr(dates, mydate);

proc print
     data = dates2
          noobs;
 run;

Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.

mydate	space_position	substring_afterspace	day
Jan 23/2	4	23/2	23
Aug 5/20	4	5/20	5
Dec 17/2	4	17/2	17

Filed under Data Analysis, R programming, SAS Programming Tagged with dates, macro, macro program, R, R programming, SAS, text, text processing

24 Responses to How to Extract a String Between 2 Characters in R and SAS

Christoph says:

June 18, 2015 at 8:34 pm

I do not know SAS, but why not use the built-in R Datetime functions? Much faster, and standardised!

# test dates mydate <- c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')
# make sure to have the correct locate (as.Date %b works on abbreviated month name by locale) # you do not need this if you already have an English computer setup Sys.setlocale("LC_TIME", "English") # convert to R Date my.R.date <- as.Date(mydate, format="%b %e/%Y") # check vector output my.R.date

Reply
- Christoph says:
  
  June 18, 2015 at 8:50 pm
  
  ops, forgot to extract the day…
  
  # need lubridate for day() function library("lubridate") day(my.R.date)
  
  Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 9:47 am
  
  Thanks for this, Christoph. I wanted to write a function that worked for any string, not just for dates, but you taught me a new set of functions! Thanks!
  
  Reply
Dean Attali (@daattali) says:

June 18, 2015 at 8:50 pm

Why not just use `sub` with a single regular expression? Is this solution any less robust?

getstr = function(mystring, initial.character, final.character) {
sub(sprintf(“.*%s(.*)%s.*”, initial.character, final.character), “\\1”, mystring)
}

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 9:48 am
  
  I didn’t know the sprintf() function before. Thanks for sharing, Dean! (I don’t know if this is more or less robust than my function – anyone have ideas?)
  
  Reply
- Ian Kyle says:
  
  June 19, 2015 at 4:12 pm
  
  damn. you beat me to it.
  
  Reply
Tiffany Timbers says:

June 18, 2015 at 10:23 pm

Could you also use str_extract() and sub() with regular expressions in R? For example, something like this?

library(stringr)

mydate = c(‘Jan 23/2012’, ‘Aug 5/2011’, ‘Dec 17/2011’)

getstr <- function(mystring, initial.character, final.character) {
the_pattern <- paste(initial.character,".*",final.character, sep="")
snippet <- str_extract(mystring, the_pattern)
snippet <- sub("^.", "", snippet)
snippet <- sub(".$", "", snippet)
return(snippet)
}

getstr(mydate, " ", "/")

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 9:48 am
  
  I didn’t know about the “stringr” package – thanks for sharing, Tiffany!
  
  Reply
anspiess says:

June 19, 2015 at 1:36 am

How about a one-liner using regular expression in which we eliminate word + space and / + digits and extract the numbers inbetween by backreferencing:

> sub(“\\w* (\\d*)/\\d*”, “\\1”, mydate, perl = T)
[1] “23” “5” “17”

Cheers,
Andrej

Reply
Fëanor says:

June 19, 2015 at 2:06 am

Or would this work?

sapply(strsplit(mydate, “/”), function(x) unlist(strsplit(x[[1]], ” “)))[2,]

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 10:00 am
  
  Yup – that works! Thanks for the alternative method, Fëanor!
  
  Reply
Stefan says:

June 19, 2015 at 6:33 am

May I recommend using stringr? It’s now based on the superior ICU library and has the more consistent syntax. But otherwise it’s the same as Andrej’s solution.

library(stringr)
str_replace(mydate, “.* (.*)/.*”, “\\1”)

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 10:01 am
  
  Yes, Tiffany recommended stringr, too. Thanks, Stefan!
  
  Reply
William Doane says:

June 19, 2015 at 6:47 am

Regular Expressions really are the right tool for the job, per anspiess.

Also in R, similar to anspiess solution, but preserving your original data and with the option of extracting each piece of the date:

library(stringr)
mydates <- c("Jan 23/2", "Aug 5/20", "Dec 17/2")
str_match(mydates, "^.*\\s([^/]*).*$")

[,1] [,2]
[1,] "Jan 23/2" "23"
[2,] "Aug 5/20" "5"
[3,] "Dec 17/2" "17"

str_match(mydates, "^(.*)\\s([^/]*)/(.*)$")

[,1] [,2] [,3] [,4]
[1,] "Jan 23/2" "Jan" "23" "2"
[2,] "Aug 5/20" "Aug" "5" "20"
[3,] "Dec 17/2" "Dec" "17" "2"

Reply
- William Doane says:
  
  June 19, 2015 at 6:50 am
  
  ^ Starting at the beginning of the string
  .* match zero or more characters
  \\s followed by a space
  ( then remember….
  [^/]* zero or more characters that aren’t a slash
  )
  .* followed by zero or more characters
  $ before reaching the end of the string
  
  Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 10:02 am
  
  Gotta learn more about regular expressions – thanks, William!
  
  Reply
Bryan Britten says:

June 19, 2015 at 10:20 am

Eric,

I agree with Tiffany about the use of str_extract to accomplish this. I’d go one step further though and use str_exact without a function since regular expressions can always handle any pattern we’re thinking of. In this example, we can just use str_extract(my_dates, ‘[0-9]{1,2}(?=/[0-9]{4}’)) to get the values, where (?=/[0-9]{4}) is called a “negative look ahead”. The whole regular expression just says “I’m looking for 1-2 digits that are followed by a slash and then four digits”. str_extract will return only the 1-2 digits and not the pattern found in the negative look ahead. Negative look aheads (and negative look behinds) are very powerful and more information on those can be found here: http://www.regular-expressions.info/lookaround.html

However, let’s go back to your initial code. We all love writing functions and coming up with creative ways to do things, it’s why we code. At least for me anyway. And I’ve always found it helpful when people not only provide alternative methods for doing what I’m trying to do, but also help me understand how my code can be better. So with that in mind, here’s how I would have written your code, just so you can see a slightly different approach:

#new coding best practices suggest not using ‘.’ to separate words, but instead use “_”
#for simplicity reasons, I’m going to refer to your second and third parameters as ic and fc
get_str = function(my_string, ic, fc)
{

# check that all 3 inputs are character variables
if (!all(is.character(my_string), is.character(ic), is.character(fc)))
{
stop(‘All three arguments much be character variables’)
}

# initializing an empty variable is more memory efficient than creating one with zeroes
# initializing with zeroes is, in my opinion, only useful if we need to index during a for loop
snippet <- NULL

for (i in 1:length(mystring))
{
# grep( ) is a bit more powerful here because it's default return value is the index
initial_position = grep(ic, my_string[i]) + 1
final_position = grep(fc, my_string[i]) – 1

# we can now just use the c( ) function to append our value to our snippet vector
snippet = c(snippet, substr(my_string[i], initial_position, final_position))
}

return(snippet)
}

Again, my goal in rewriting your function was to give a new perspective on it to maybe shine a light on some R feature you might not have known about. Hope it helps!

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 10:53 am
  
  Wow – thanks for taking the time to write this detailed reply, Bryan! I really appreciate your feedback!
  
  I ran your code. The function was fine, but, when I tried to use get_str(), I got the following error.
  
  Error in get_str(mydate, ” “, “/”) : object ‘mystring’ not found
  
  Any idea why this is happening? I can’t figure it out.
  
  Thanks again!
  
  Reply
  - Bryan Britten says:
    
    June 19, 2015 at 11:05 am
    
    My pleasure! The issue here is that in the function definition, I changed the variable to be “my_string” instead of “mystring”. When I went through and changed all of the other variables to be “my_string”, I missed one. If you look at the beginning of the for loop, you’ll see that it still says “for (i in 1:length(mystring))” when it should say “for (i in 1:length(my_string))”. Changing that should fix the function and allow it to work as intended.
  - Eric Cai - The Chemical Statistician says:
    
    June 19, 2015 at 11:08 am
    
    Ahh – I missed that, too. Thanks, Bryan! I really appreciate your time!
Jeremy says:

June 19, 2015 at 10:44 am

Cunningham’s Law in full effect in these comments.

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 19, 2015 at 10:57 am
  
  Cunningham’s law: “The best way to get the right answer on the Internet is not to ask a question, it’s to post the wrong answer.”
  
  Hi Jeremy,
  
  I get the gist of what you’re saying. However, my functions do work – it’s just that the kind commenters in this post have shared better solutions. I take great care in ensuring that my functions work as intended when I post them. There may be better solutions – and I always hope so, as I can learn from the smart commenters on my blog – but mine still do work.
  
  Reply
msanregret says:

June 21, 2015 at 11:05 pm

Awesome work Eric! Love the SAS blogs, keep it up!

Reply
- Eric Cai - The Chemical Statistician says:
  
  June 22, 2015 at 9:50 am
  
  Thanks for visiting my blog, MItchell!
  
  Reply

	Eric Cai - The Chemi… on Convert multiple variables bet…
	Jack on Convert multiple variables bet…
	Eric Cai - The Chemi… on Getting the names, types, form…
	Emily V on Getting the names, types, form…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Convert multiple variables bet…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Exploratory Data Analysis: Com…
	CK on Exploratory Data Analysis: Com…
	Eric Cai - The Chemi… on Video Tutorial: Breaking Down…

The Chemical Statistician

How to Extract a String Between 2 Characters in R and SAS

Introduction

Extracting a String Between 2 Characters in R

Extracting a String Between 2 Characters in SAS

24 Responses to How to Extract a String Between 2 Characters in R and SAS

Your thoughtful comments are much appreciated! Cancel reply

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories

The Chemical Statistician

How to Extract a String Between 2 Characters in R and SAS

Introduction

Extracting a String Between 2 Characters in R

Extracting a String Between 2 Characters in SAS

Share this:

Related

24 Responses to How to Extract a String Between 2 Characters in R and SAS

Your thoughtful comments are much appreciated! Cancel reply

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories