Extract substrings defined by regular expressions from a vector of strings
Arguments
- pattern
regular expression in which each part to be extracted is enclosed in a pair of opening and closing parentheses
- x
vector of character strings
- index
index(es) of parenthesized subexpression(s) to be extracted. If more than one index is given, a data frame is returned with each column containing the substrings matching the subexpression at the corresponding index. If index is named, the names are used as column names.
- stringsAsFactors
if TRUE (default is FALSE) and a data frame is returned, the columns of the returned data frame are factors; otherwise they are character vectors.
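The arguments above can be illustrated with a minimal base R sketch. It is not the function's actual implementation: the name extract_substring_sketch is hypothetical, the use of regexec() and regmatches() is an assumption, and returning NA for strings that do not match is a choice made for this sketch only.

```r
# Hypothetical helper illustrating the documented arguments; NOT the
# implementation of extractSubstring()
extract_substring_sketch <- function(pattern, x, index, stringsAsFactors = FALSE) {
  # For each string: the full match followed by one element per
  # parenthesized subexpression (character(0) if there is no match)
  matches <- regmatches(x, regexec(pattern, x))

  # Pick subexpression i from every string; element 1 is the full match.
  # Assumption of this sketch: non-matching strings yield NA
  pick <- function(i) {
    vapply(
      matches,
      function(m) if (length(m) > i) m[i + 1] else NA_character_,
      character(1),
      USE.NAMES = FALSE
    )
  }

  # A single index gives a plain character vector
  if (length(index) == 1) {
    return(pick(index))
  }

  # Several indices give one column per index
  columns <- lapply(index, pick)

  # Use the names of index where given, "subexp.<i>" otherwise
  col_names <- names(index)
  if (is.null(col_names)) {
    col_names <- rep("", length(index))
  }
  unnamed <- col_names == ""
  col_names[unnamed] <- paste0("subexp.", index[unnamed])
  names(columns) <- col_names

  data.frame(columns, stringsAsFactors = stringsAsFactors)
}

pattern <- "([^ ]+), ([0-9]+) of ([^ ]+)"
datestrings <- c("Thursday, 8 of December", "Tuesday, 14 of January")

extract_substring_sketch(pattern, datestrings, 2)
extract_substring_sketch(pattern, datestrings, c(weekday = 1, 2, month = 3))
```

The first call returns a character vector because only one index is given; the second returns a data frame whose second column falls back to the name subexp.2.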
Examples
# Define pattern matching a date
pattern <- "([^ ]+), ([0-9]+) of ([^ ]+)"
# Extract single sub expressions from one string
datestring <- "Thursday, 8 of December"
extractSubstring(pattern, datestring, 1) # "Thursday"
#> [1] "Thursday"
extractSubstring(pattern, datestring, 2) # "8"
#> [1] "8"
extractSubstring(pattern, datestring, 3) # "December"
#> [1] "December"
# Extract single sub expressions from a vector of strings
datestrings <- c("Thursday, 8 of December", "Tuesday, 14 of January")
extractSubstring(pattern, datestrings, 1) # "Thursday" "Tuesday"
#> [1] "Thursday" "Tuesday"
extractSubstring(pattern, datestrings, 2) # "8" "14"
#> [1] "8" "14"
extractSubstring(pattern, datestrings, 3) # "December" "January"
#> [1] "December" "January"
# Extract more than one subexpression at once -> data.frame
extractSubstring(pattern, datestrings, 1:3)
#>   subexp.1 subexp.2 subexp.3
#> 1 Thursday        8 December
#> 2  Tuesday       14  January
# Name the sub expressions by naming their number in index (3rd argument)
extractSubstring(pattern, datestrings, index = c(weekday = 1, 2, month = 3))
#>    weekday subexp.2    month
#> 1 Thursday        8 December
#> 2  Tuesday       14  January
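Because the extracted substrings are returned as character (or factor) columns, numeric parts such as the day usually need converting before further use. The following is a sketch only: the data frame parts is built by hand to mirror the result printed above so that the snippet runs without the package, and the column names day and month_number are hypothetical.

```r
# Built by hand to mirror the printed result (assumption for this sketch)
parts <- data.frame(
  weekday = c("Thursday", "Tuesday"),
  subexp.2 = c("8", "14"),
  month = c("December", "January"),
  stringsAsFactors = FALSE
)

# The day was extracted as text; convert it to integer
parts$day <- as.integer(parts$subexp.2)

# Look up the month number in base R's built-in constant month.name
parts$month_number <- match(parts$month, month.name)

parts[, c("weekday", "day", "month_number")]
```

If stringsAsFactors = TRUE was used, convert with as.character() first, since as.integer() applied to a factor returns the internal level codes rather than the values.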