How to Extract a String Between 2 Characters in R and SAS

Introduction

I recently needed to work with date values that look like this:

mydate
Jan 23/2
Aug 5/20
Dec 17/2

I wanted to extract the day, and the obvious strategy is to extract the text between the space and the slash.  I needed to think about how to program this carefully in both R and SAS, because

  1. the length of the day could be 1 or 2 characters long
  2. I needed a code that adapted to this varying length from observation to observation
  3. there is no function in either language that is suited exactly for this purpose.

In this tutorial, I will show you how to do this in both R and SAS.  I will write a function in R and a macro program in SAS to do so, and you can use the function and the macro program as you please!

Extracting a String Between 2 Characters in R

I will write a function called getstr() in R to extract a string between 2 characters.  The strategy is simple:

  1. Find the position of the initial character and add 1 to it – that is the initial position of the desired string.
  2. Find the position of the final character and subtract 1 from it – that is the final position of the desired string.
  3. Use the substr() function to extract the desired string inclusively between the initial position and final position as found in Steps 1-2.

 

##### Extracting a String Between 2 Characters in R
##### By Eric Cai - The Chemical Statistician

# clear all variables in workspace
rm(list=ls(all=TRUE))

# create a vector of 3 example dates
mydate = c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

# getstr() is my customized function
# it extracts a string between 2 characters in a string variable
getstr = function(mystring, initial.character, final.character)
{
 
     # check that all 3 inputs are character variables
     if (!is.character(mystring))
     {
          stop('The parent string must be a character variable.')
     }
 
     if (!is.character(initial.character))
     {
          stop('The initial character must be a character variable.')
     }
 
 
     if (!is.character(final.character))
     {
          stop('The final character must be a character variable.')
     }

  
 
     # pre-allocate a vector to store the extracted strings
     snippet = rep(0, length(mystring))
 
 
 
     for (i in 1:length(mystring))
     {
          # extract the initial position
          initial.position = gregexpr(initial.character, mystring[i])[[1]][1] + 1
  
          # extract the final position
          final.position = gregexpr(final.character, mystring[i])[[1]][1] - 1
 
          # extract the substring between the initial and final positions, inclusively
          snippet[i] = substr(mystring[i], initial.position, final.position)
     }
 
     return(snippet)
}


# use the getstr() function to extract the day between the comma and the slash in "mydate"
getstr(mydate, ' ', '/')

 

Here is the output from getstr() on the vector “mydate”

> getstr(mydate, ' ', '/')
 [1] "23" "5" "17"

 

 

Extracting a String Between 2 Characters in SAS

I will write a macro program called %getstr().  It will accept a data set and the string variable as inputs, and it will create a new data set with the day extracted as a new variable.

The only tricky part in this macro program was creating a new data set name.  The input data set is called “dates”, and I wanted to create a new data set called “dates2”.  I accomplished that by appending %dataset with “.2” within the macro.

 

First, let’s create the input data set.  Notice my use of the “#” as a delimiter when inputting the dates.

 

data dates;
     infile datalines dlm = '#';
     input mydate $;
     datalines;
          Jan 23/2015#
          Aug 5/2001#
          Dec 17/2007
     ;
run;

 

Let’s now write the macro program %getstr().  It will create a new data set with the appendix “2”.

%macro getstr(dataset, string_variable);

data &dataset.2;
          set &dataset;

          * search the string for the position of the space after the month;
           space_position = INDEX(&string_variable, ' ');

          * search the string for the position of the slash after the month;
           slash_position = INDEX(&string_variable, '/');

          * calculate the length between the space and the slash;
           space_to_slash = slash_position - space_position;

          * extract the day from the original string (the character(s) between the space and the slash;
           day = substr(&string_variable, space_position, space_to_slash);
run;

%mend getstr;

 

Let’s use the %getstr() macro program to create a new data set called “dates2” that contains the day of each date.  I’ll print the results afteward.

%getstr(dates, mydate);
proc print
     data = dates2
          noobs;
 run;

 

Here is the output; if you prefer, you can modify the macro program to drop the variables “space_position” and “substring_afterspace”.

mydate space_position substring_afterspace day
Jan 23/2 4 23/2 23
Aug 5/20 4 5/20 5
Dec 17/2 4 17/2 17

24 Responses to How to Extract a String Between 2 Characters in R and SAS

  1. Christoph says:

    I do not know SAS, but why not use the built-in R Datetime functions? Much faster, and standardised!


    # test dates
    mydate <- c('Jan 23/2012', 'Aug 5/2011', 'Dec 17/2011')

    # make sure to have the correct locate (as.Date %b works on abbreviated month name by locale)
    # you do not need this if you already have an English computer setup
    Sys.setlocale("LC_TIME", "English")
    # convert to R Date
    my.R.date <- as.Date(mydate, format="%b %e/%Y")
    # check vector output
    my.R.date

  2. Why not just use `sub` with a single regular expression? Is this solution any less robust?

    getstr = function(mystring, initial.character, final.character) {
    sub(sprintf(“.*%s(.*)%s.*”, initial.character, final.character), “\\1”, mystring)
    }

  3. Tiffany Timbers says:

    Could you also use str_extract() and sub() with regular expressions in R? For example, something like this?

    library(stringr)

    mydate = c(‘Jan 23/2012’, ‘Aug 5/2011’, ‘Dec 17/2011’)

    getstr <- function(mystring, initial.character, final.character) {
    the_pattern <- paste(initial.character,".*",final.character, sep="")
    snippet <- str_extract(mystring, the_pattern)
    snippet <- sub("^.", "", snippet)
    snippet <- sub(".$", "", snippet)
    return(snippet)
    }

    getstr(mydate, " ", "/")

  4. anspiess says:

    How about a one-liner using regular expression in which we eliminate word + space and / + digits and extract the numbers inbetween by backreferencing:

    > sub(“\\w* (\\d*)/\\d*”, “\\1”, mydate, perl = T)
    [1] “23” “5” “17”

    Cheers,
    Andrej

  5. Fëanor says:

    Or would this work?

    sapply(strsplit(mydate, “/”), function(x) unlist(strsplit(x[[1]], ” “)))[2,]

  6. Stefan says:

    May I recommend using stringr? It’s now based on the superior ICU library and has the more consistent syntax. But otherwise it’s the same as Andrej’s solution.

    library(stringr)
    str_replace(mydate, “.* (.*)/.*”, “\\1”)

  7. Regular Expressions really are the right tool for the job, per anspiess.

    Also in R, similar to anspiess solution, but preserving your original data and with the option of extracting each piece of the date:

    library(stringr)
    mydates <- c("Jan 23/2", "Aug 5/20", "Dec 17/2")
    str_match(mydates, "^.*\\s([^/]*).*$")

    [,1] [,2]
    [1,] "Jan 23/2" "23"
    [2,] "Aug 5/20" "5"
    [3,] "Dec 17/2" "17"

    str_match(mydates, "^(.*)\\s([^/]*)/(.*)$")

    [,1] [,2] [,3] [,4]
    [1,] "Jan 23/2" "Jan" "23" "2"
    [2,] "Aug 5/20" "Aug" "5" "20"
    [3,] "Dec 17/2" "Dec" "17" "2"

  8. Eric,

    I agree with Tiffany about the use of str_extract to accomplish this. I’d go one step further though and use str_exact without a function since regular expressions can always handle any pattern we’re thinking of. In this example, we can just use str_extract(my_dates, ‘[0-9]{1,2}(?=/[0-9]{4}’)) to get the values, where (?=/[0-9]{4}) is called a “negative look ahead”. The whole regular expression just says “I’m looking for 1-2 digits that are followed by a slash and then four digits”. str_extract will return only the 1-2 digits and not the pattern found in the negative look ahead. Negative look aheads (and negative look behinds) are very powerful and more information on those can be found here: http://www.regular-expressions.info/lookaround.html

    However, let’s go back to your initial code. We all love writing functions and coming up with creative ways to do things, it’s why we code. At least for me anyway. And I’ve always found it helpful when people not only provide alternative methods for doing what I’m trying to do, but also help me understand how my code can be better. So with that in mind, here’s how I would have written your code, just so you can see a slightly different approach:

    #new coding best practices suggest not using ‘.’ to separate words, but instead use “_”
    #for simplicity reasons, I’m going to refer to your second and third parameters as ic and fc
    get_str = function(my_string, ic, fc)
    {

    # check that all 3 inputs are character variables
    if (!all(is.character(my_string), is.character(ic), is.character(fc)))
    {
    stop(‘All three arguments much be character variables’)
    }

    # initializing an empty variable is more memory efficient than creating one with zeroes
    # initializing with zeroes is, in my opinion, only useful if we need to index during a for loop
    snippet <- NULL

    for (i in 1:length(mystring))
    {
    # grep( ) is a bit more powerful here because it's default return value is the index
    initial_position = grep(ic, my_string[i]) + 1
    final_position = grep(fc, my_string[i]) – 1

    # we can now just use the c( ) function to append our value to our snippet vector
    snippet = c(snippet, substr(my_string[i], initial_position, final_position))
    }

    return(snippet)
    }

    Again, my goal in rewriting your function was to give a new perspective on it to maybe shine a light on some R feature you might not have known about. Hope it helps!

    • Wow – thanks for taking the time to write this detailed reply, Bryan! I really appreciate your feedback!

      I ran your code. The function was fine, but, when I tried to use get_str(), I got the following error.

      Error in get_str(mydate, ” “, “/”) : object ‘mystring’ not found

      Any idea why this is happening? I can’t figure it out.

      Thanks again!

      • My pleasure! The issue here is that in the function definition, I changed the variable to be “my_string” instead of “mystring”. When I went through and changed all of the other variables to be “my_string”, I missed one. If you look at the beginning of the for loop, you’ll see that it still says “for (i in 1:length(mystring))” when it should say “for (i in 1:length(my_string))”. Changing that should fix the function and allow it to work as intended.

      • Ahh – I missed that, too. Thanks, Bryan! I really appreciate your time!

  9. Jeremy says:

    Cunningham’s Law in full effect in these comments.

    • Cunningham’s law: “The best way to get the right answer on the Internet is not to ask a question, it’s to post the wrong answer.”

      Hi Jeremy,

      I get the gist of what you’re saying. However, my functions do work – it’s just that the kind commenters in this post have shared better solutions. I take great care in ensuring that my functions work as intended when I post them. There may be better solutions – and I always hope so, as I can learn from the smart commenters on my blog – but mine still do work.

  10. msanregret says:

    Awesome work Eric! Love the SAS blogs, keep it up!

Your thoughtful comments are much appreciated!