Useful Functions in R for Manipulating Text Data
February 27, 2014
In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from the HIV’s RNA.) In this post, I describe some common functions in R that I often use for text processing.
Obtaining Basic Information about Character Variables
Use the is.character() function to check whether a variable is a character variable.
> year = 2014
> is.character(year)
[1] FALSE
If a variable is not a character variable, you can convert it to a character variable using the as.character() function.
> year.char = as.character(year)
> is.character(year.char)
[1] TRUE
A basic piece of information about a character variable is the number of characters in the string. Use the nchar() function to obtain this information.
> nchar(year.char)
[1] 4
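Both as.character() and nchar() are vectorized, so they can process several values at once. Here is a minimal sketch; the vector seqs is a made-up example, not real data.

```r
# Hypothetical example sequences (not real patient data)
seqs = c('ATCG', 'GGACT', 'TT')

# nchar() returns one length per element of the vector
seq.lengths = nchar(seqs)
seq.lengths

# as.character() converts every element of a numeric vector
years.char = as.character(c(2013, 2014))
is.character(years.char)
```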
Combining and Splitting Text
I often need to combine several character variables into one string, and the paste() function is useful for that. Notice my use of the “sep =” option to specify that I want to separate the variables with 1 space.
> first = 'The'
> second = 'Chemical'
> third = 'Statistician'
> my.name = paste(first, second, third, sep = ' ')
> my.name
[1] "The Chemical Statistician"
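Two related tools are worth knowing: the "collapse =" option of paste(), which joins the elements of a single vector, and paste0(), which pastes with no separator. A minimal sketch follows; the names words, one.string and no.space are placeholders chosen for illustration.

```r
# The same three words, stored in one vector this time
words = c('The', 'Chemical', 'Statistician')

# collapse joins the elements of one vector into a single string
one.string = paste(words, collapse = ' ')
one.string

# paste0() behaves like paste() with sep = ''
no.space = paste0('HIV', '-', 1)
no.space
```

Use "sep =" when the pieces are separate arguments, and "collapse =" when they already sit in one vector.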
The opposite task is to split 1 string into many shorter strings. The strsplit() function is useful for that. Notice my use of the “split = ” option to specify that a single space is the delimiter. You can split multiple variables simultaneously by combining them with the c() function. Allow me to first create a new variable called real.name to illustrate this capability.
> real.name = paste('Eric', 'Cai', sep = ' ')
> parts = strsplit(c(my.name, real.name), split = ' ')
> parts
[[1]]
[1] "The" "Chemical" "Statistician"

[[2]]
[1] "Eric" "Cai"

The output is a list with one character vector per input string. To obtain an individual string, first slice the relevant vector out of the list with double square brackets, then extract the string from that vector with single square brackets. For example, to obtain the 2nd string from the decomposition of my.name,
> parts[[1]][2]
[1] "Chemical"
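To apply the same indexing to every element of the list at once, sapply() is convenient. This is a minimal sketch rather than the only way to do it; second.words and bases are names I made up for illustration. It also shows that an empty "split = ''" breaks a string into single characters, which is handy for nucleotide sequences.

```r
my.name = 'The Chemical Statistician'
real.name = 'Eric Cai'
parts = strsplit(c(my.name, real.name), split = ' ')

# Take the 2nd string from every vector in the list
second.words = sapply(parts, function(p) p[2])
second.words

# Splitting on the empty string gives one character per element
bases = strsplit('ATCG', split = '')[[1]]
bases
```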
Another way to split a string is to use the scan() function. Note, however, that this option is much slower, and I discourage using it with apply() to split many strings at once, especially when the number of strings is very large.
> my.name.parts = scan(text = my.name, what = 'character', sep = ' ')
Read 3 items
> my.name.parts
[1] "The" "Chemical" "Statistician"
> my.name.parts[2]
[1] "Chemical"
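If the "Read 3 items" message is unwanted, scan() accepts a "quiet = TRUE" option. The sketch below also checks that, for this simple input, scan() and strsplit() return the same pieces; parts.scan and parts.strsplit are names chosen for illustration.

```r
my.name = 'The Chemical Statistician'

# quiet = TRUE suppresses the "Read 3 items" message
parts.scan = scan(text = my.name, what = 'character', sep = ' ', quiet = TRUE)

# The same split with strsplit(); [[1]] pulls the vector out of the list
parts.strsplit = strsplit(my.name, split = ' ')[[1]]

# Compare the two results
identical(parts.scan, parts.strsplit)
```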
Pattern Matching and Manipulation
A common task in my job is determining whether a short sequence of nucleotides or amino acids is present in a much longer sequence. Essentially, I want to determine if a pattern of text exists in a character variable. The grepl() function is useful for that; in fact, the pattern of interest can be searched for in multiple character variables simultaneously – just combine the variables using the c() function!
> x = 'ATCG'
> y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
> z = 'CTATCGGGTAGCT'
> grepl(x, c(y, z))
[1] TRUE TRUE
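By default, grepl() treats the pattern as a regular expression. When the pattern is a literal sequence, the "fixed = TRUE" option makes that explicit. A minimal sketch, in which the third sequence in seqs is a made-up example that does not contain the pattern:

```r
x = 'ATCG'
seqs = c('GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT',  # y from above
         'CTATCGGGTAGCT',                         # z from above
         'GGGGCCCC')                              # hypothetical, no match

# fixed = TRUE matches the pattern literally, not as a regular expression
found = grepl(x, seqs, fixed = TRUE)
found

# The logical vector can be used to keep only the matching sequences
seqs[found]

# Why it matters: '.' is a wildcard in a regular expression
regex.hit = grepl('A.CG', 'ATCG')                 # '.' matches 'T'
fixed.hit = grepl('A.CG', 'ATCG', fixed = TRUE)   # literal '.', not present
```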
If you want to determine precisely where “x” is located along “y” and along “z”, use the gregexpr() function.
> gregexpr(x, c(y, z))
[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 3
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
The output of gregexpr(x, c(y, z)) is a list of 2 objects.
- The first object contains the positional information about the pattern “x” in the variable “y”.
- “x” appears twice in the variable “y” – at positions 19 and 25. (Specifically, the “A” in x = ‘ATCG’ appears at positions 19 and 25.)
- The second object contains the positional information about the pattern “x” in the variable “z”.
To extract these positions, you must first slice the list into its 2 objects – use double square brackets to do this. Then, you can extract the positions from each object – use single square brackets to do this. For simplicity, let’s assign the output of gregexpr(x, c(y, z)) to a variable named “pos”.
> pos = gregexpr(x, c(y, z))
> pos[[1]]
[1] 19 25
attr(,"match.length")
[1] 4 4
attr(,"useBytes")
[1] TRUE
> pos[[1]][1]
[1] 19
> pos[[1]][2]
[1] 25
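If only the positions are needed, the attributes can be stripped with as.integer(). The sketch below is one possible approach; starts, counts and no.match are illustrative names. Note that gregexpr() returns -1 when the pattern is not found, so counts should only include positive positions.

```r
x = 'ATCG'
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'
z = 'CTATCGGGTAGCT'
pos = gregexpr(x, c(y, z))

# Drop the attributes and keep only the starting positions
starts = lapply(pos, as.integer)
starts

# Number of matches per sequence; -1 means "no match", so count values > 0
counts = sapply(pos, function(p) sum(as.integer(p) > 0))
counts

# A pattern that is absent from z gives a single -1
no.match = as.integer(gregexpr('TTTTTTTT', z)[[1]])
no.match
```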
If you want to extract a portion of a string, use the substr() function. For example, if I know that the first 3 nucleotides of a particular DNA sequence are junk, I would want to discard them and extract the rest of that sequence only. Let’s use the variable “y” to illustrate this.
> y
[1] "GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
> substr(y, 4, nchar(y))
[1] "CTCTAAATCCGTACTATCGTCATCGTTTTTCCT"
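The closely related substring() function has a default end position, and it accepts vectors of start and end positions. Here is a minimal sketch that reuses the match positions 19 and 25 found earlier; rest and hits are names chosen for illustration.

```r
y = 'GGACTCTAAATCCGTACTATCGTCATCGTTTTTCCT'

# substring() runs to the end of the string when no end position is given
rest = substring(y, 4)
rest

# Vectors of start and end positions extract several pieces at once;
# here, the 4-character windows beginning at positions 19 and 25
hits = substring(y, c(19, 25), c(22, 28))
hits
```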
John Myles White, who co-wrote the excellent “Machine Learning for Hackers” with Drew Conway, has a nice blog entry on some other useful functions for text processing in R. If you have any more suggestions, please share them in the comments!