Extract substrings defined by regular expressions from a vector of strings

Usage

extractSubstring(pattern, x, index, stringsAsFactors = FALSE)

Arguments

pattern

regular expression in which pairs of opening and closing parentheses delimit the part(s) to be extracted

x

vector of character strings

index

index(es) of the parenthesized subexpression(s) to be extracted. If the length of index is greater than one, a data frame is returned in which each column contains the substrings matching the subexpression at the corresponding index; with a single index, a character vector is returned. If index is named, the names are used as column names; unnamed entries get the default name subexp.<i>, where <i> is the index.

stringsAsFactors

if TRUE and a data frame is returned, the columns of the returned data frame are factors; otherwise they are character vectors. The default is FALSE.
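The return shapes described above can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the helper name extract_groups is hypothetical, and returning NA for strings that do not match is an assumption of this sketch, since the behaviour of extractSubstring() on non-matching input is not documented here.

```r
# Hypothetical helper (NOT the implementation of extractSubstring):
# mimics the documented return shapes using base R only.
extract_groups <- function(pattern, x, index) {
  # regexec() locates the full match and each parenthesized subexpression;
  # regmatches() turns those positions into text, one vector per string.
  matches <- regmatches(x, regexec(pattern, x))

  # Pick subexpression i from every string.
  # Assumption of this sketch: strings that do not match yield NA.
  pick <- function(i) {
    vapply(
      matches,
      function(m) if (length(m) > i) m[i + 1] else NA_character_,
      character(1),
      USE.NAMES = FALSE
    )
  }

  # A single index gives a character vector
  if (length(index) == 1) {
    return(pick(index))
  }

  # Several indices give a data frame with one column per index
  columns <- lapply(index, pick)

  # Use the names given in index; fall back to "subexp.<i>" otherwise
  given_names <- names(index)
  if (is.null(given_names)) {
    given_names <- rep("", length(index))
  }
  names(columns) <- ifelse(
    given_names == "", paste0("subexp.", index), given_names
  )

  as.data.frame(columns, stringsAsFactors = FALSE)
}
```

Called with a single index, for example extract_groups(pattern, x, 2), the sketch returns a character vector; called with index = c(weekday = 1, 2, month = 3) it returns a data frame with the columns weekday, subexp.2 and month.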

Examples

# Define pattern matching a date
pattern <- "([^ ]+), ([0-9]+) of ([^ ]+)"

# Extract single subexpressions from one string
datestring <- "Thursday, 8 of December"
extractSubstring(pattern, datestring, 1) # "Thursday"
#> [1] "Thursday"
extractSubstring(pattern, datestring, 2) # "8"
#> [1] "8"
extractSubstring(pattern, datestring, 3) # "December"
#> [1] "December"

# Extract single subexpressions from a vector of strings
datestrings <- c("Thursday, 8 of December", "Tuesday, 14 of January")
extractSubstring(pattern, datestrings, 1) # "Thursday" "Tuesday"
#> [1] "Thursday" "Tuesday" 
extractSubstring(pattern, datestrings, 2) # "8"  "14"
#> [1] "8"  "14"
extractSubstring(pattern, datestrings, 3) # "December" "January" 
#> [1] "December" "January" 

# Extract more than one subexpression at once -> data.frame
extractSubstring(pattern, datestrings, 1:3)
#>   subexp.1 subexp.2 subexp.3
#> 1 Thursday        8 December
#> 2  Tuesday       14  January

# Name the subexpressions by naming the elements of index (3rd argument)
extractSubstring(pattern, datestrings, index = c(weekday = 1, 2, month = 3))
#>    weekday subexp.2    month
#> 1 Thursday        8 December
#> 2  Tuesday       14  January