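Handling strings in base R often feels awkward: strings cannot be split with vector-style operations, nor can their characters be traversed by index in a loop. The grep() family of functions is hard to remember, and paste() separates its arguments with a space by default, so nothing quite comes naturally. As R is used in more and more scenarios, string processing becomes unavoidable. The stringr package is designed as a consistent, simple, easy-to-use set of string tools. All of its functions and arguments follow the same conventions, which makes them easier to remember and use.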

String calculation functions

str_length: string length; str_count: count matches; str_order: sort strings

str_length: string length

  • Function definition:
    str_length(string)
  • Arguments:
    string: a string or a character vector.
  • Example:
    > str_length(c("I", "am", NA))
    [1]  1  2 NA
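A small sketch of how str_length treats empty strings, missing values, and multi-byte text. The input vector is made up for illustration; "\u4f60\u597d" is the two-character string "你好" written with escapes so the snippet does not depend on the file encoding.

```r
library(stringr)

# Hypothetical input: empty string, NA, CJK text, ASCII text
x <- c("", NA, "\u4f60\u597d", "stringr")

# str_length counts characters, not bytes; NA stays NA
lens <- str_length(x)

# Base R byte count of the CJK string, for comparison (UTF-8 uses 3 bytes each)
bytes <- nchar(x[3], type = "bytes")

print(lens)
print(bytes)
```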

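str_count: count matches

  • Function definition:
    str_count(string, pattern = "")
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
  • Examples:

Count the matches in a single string

    > str_count('aaa444sssddd', "a")
    [1] 3

Count the matches in each element of a character vector

    > fruit <- c("apple", "banana", "pear", "pineapple")
    > str_count(fruit, "a")
    [1] 1 3 1 1

The pattern is interpreted as a regular expression, so special characters such as "." must be escaped (for example "\\.") or wrapped in fixed() to be matched literally.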
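To illustrate the point about special characters, here is a minimal sketch comparing an unescaped ".", fixed("."), and the escaped pattern "\\.". The input vector is only an illustration.

```r
library(stringr)

x <- c("a.", ".", ".a.", NA)

# As a regular expression, "." matches any single character
any_char <- str_count(x, ".")

# fixed() switches to literal matching
literal_fixed <- str_count(x, fixed("."))

# Escaping the dot keeps regex matching but makes the dot literal
literal_escaped <- str_count(x, "\\.")

print(any_char)
print(literal_fixed)
print(literal_escaped)
```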

str_order: sort order (indices) of strings

  • Function definition:
    str_order(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)
    str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)
  • Arguments:
    x: a string or a character vector.
    decreasing: sort direction; FALSE (the default) sorts ascending, TRUE descending.
    na_last: where NA values are placed; three values are allowed: TRUE puts them last, FALSE puts them first, NA drops them.
    locale: the locale whose collation rules are used for sorting.
    numeric: whether runs of digits are compared as numbers instead of character by character.
  • Examples:
    > str_sort(letters[1:5])
    [1] "a" "b" "c" "d" "e"
    > str_order(letters[1:5])
    [1] 1 2 3 4 5
    ## Sort digits numerically
    > x <- c("100a10", "100a5", "2b", "2a")
    > str_sort(x)
    [1] "100a10" "100a5"  "2a"     "2b"
    > str_sort(x, numeric = TRUE)
    [1] "2a"     "2b"     "100a5"  "100a10"
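The relation between str_order and str_sort, and the three settings of na_last, can be sketched as follows. The input vector is a made-up example.

```r
library(stringr)

x <- c("banana", NA, "apple", "cherry")

# str_order returns the positions that would sort x (NA last by default)
idx <- str_order(x)

# Indexing with those positions gives the same result as str_sort
sorted_via_index <- x[idx]
sorted_direct <- str_sort(x)

# na_last = FALSE moves NA to the front; na_last = NA drops it
na_first <- str_sort(x, na_last = FALSE)
na_dropped <- str_sort(x, na_last = NA)

print(idx)
print(na_first)
print(na_dropped)
```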

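String joining functions

str_c: join strings; str_trim: remove leading and trailing whitespace, including tabs (\t); str_pad: pad a string to a given width; str_dup: duplicate strings; str_wrap: control the output format of a string; str_sub: extract substrings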

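str_c: join strings

  • Function definition:
    str_c(..., sep = "", collapse = NULL)
  • Arguments:
    ...: one or more character vectors to join.
    sep: the separator inserted between the arguments when they are joined element by element.
    collapse: if not NULL, the separator used to combine the resulting vector into a single string.
  • Examples:

Join several strings into one string.

    > str_c('a','b')
    [1] "ab"
    > str_c('a','b',sep='-')
    [1] "a-b"
    > str_c(c('a','a1'),c('b','b1'),sep='-')
    [1] "a-b"   "a1-b1"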

Collapse vector arguments into a single string.

    > str_c(head(letters), collapse = "")
    [1] "abcdef"
    # collapse has no effect when every argument is a single string, because the result has only one element
    > str_c('a','b',collapse = "-")
    [1] "ab"
    > str_c(c('a','a1'),c('b','b1'),collapse='-')
    [1] "ab-a1b1"
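sep and collapse can also be combined: sep joins the arguments element by element, and collapse then joins the resulting elements. A minimal sketch, with hypothetical key and value vectors:

```r
library(stringr)

# Hypothetical keys and values
keys <- c("a", "b", "c")
values <- c("1", "2", "3")

# sep works across arguments, element by element
pairs <- str_c(keys, values, sep = "=")

# collapse then merges the resulting vector into one string
query <- str_c(keys, values, sep = "=", collapse = "&")

print(pairs)
print(query)
```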

When a character vector containing NA is joined, the NA elements stay NA.

    > str_c(c("a", NA, "b"), "-d")
    [1] "a-d" NA    "b-d"
    # Use str_replace_na to display literal NAs:
    > str_c(str_replace_na(c("a", NA, "b")), "-d")
    [1] "a-d"  "NA-d" "b-d"

str_flatten: collapse a character vector into a single string

  • Function definition:
    str_flatten(string, collapse = "")
  • Arguments:
    string: a string or a character vector.
    collapse: the string inserted between the joined elements.
  • Examples:
    > str_flatten(letters)
    [1] "abcdefghijklmnopqrstuvwxyz"
    > str_flatten(letters, "-")
    [1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"

str_trim: remove whitespace, including tabs (\t)

str_trim() removes whitespace from the start and end of a string; str_squish() additionally reduces repeated whitespace inside the string to a single space.

  • Function definition:
    str_trim(string, side = c("both", "left", "right"))
    str_squish(string)
  • Arguments:
    string: a string or a character vector.
    side: which side to trim; both trims both sides, left trims the left side, right trims the right side.
  • Examples:
    > str_trim("\n\nString with trailing and leading white space\n\n")
    [1] "String with trailing and leading white space"
    > str_squish("\n\nString with excess,  trailing and leading white   space\n\n")
    [1] "String with excess, trailing and leading white space"
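The effect of the side argument, and the difference between str_trim and str_squish, can be sketched on a small made-up string:

```r
library(stringr)

x <- "  padded   text  "

left_trimmed <- str_trim(x, side = "left")    # only leading whitespace removed
right_trimmed <- str_trim(x, side = "right")  # only trailing whitespace removed
both_trimmed <- str_trim(x)                   # default: both ends
squished <- str_squish(x)                     # ends trimmed, inner runs collapsed

print(c(left_trimmed, right_trimmed, both_trimmed, squished))
```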

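str_pad: pad a string to a given width

  • Function definition:
    str_pad(string, width, side = c("left", "right", "both"), pad = " ")
  • Arguments:
    string: a string or a character vector.
    width: the length of the string after padding.
    side: where padding is added; both pads both sides, left pads the left side, right pads the right side.
    pad: the character used for padding.
  • Examples:
    # Pad with spaces on the left until the string is 20 characters long
    > str_pad("conan", 20, "left")
    [1] "               conan"
    # Pad with spaces on the right until the string is 20 characters long
    > str_pad("conan", 20, "right")
    [1] "conan               "
    # Pad with spaces on both sides until the string is 20 characters long
    > str_pad("conan", 20, "both")
    [1] "       conan        "
    # Pad with the character x on both sides until the string is 20 characters long
    > str_pad("conan", 20, "both",'x')
    [1] "xxxxxxxconanxxxxxxxx"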
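A common use of str_pad is zero-padding identifiers to a fixed width. The sketch below uses hypothetical ID strings and also shows that strings already longer than width are returned unchanged.

```r
library(stringr)

# Hypothetical identifiers stored as strings
ids <- c("7", "42", "123", "12345")

# Left-pad with "0" to width 3; longer strings are not truncated
padded <- str_pad(ids, width = 3, side = "left", pad = "0")

print(padded)
```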

str_dup: duplicate strings

  • Function definition:
    str_dup(string, times)
  • Arguments:
    string: a string or a character vector.
    times: the number of times to repeat each string.
  • Example:
    > fruit <- c("apple", "pear", "banana")
    > str_dup(fruit, 2)
    [1] "appleapple"   "pearpear"     "bananabanana"
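The times argument is vectorised as well, so each element can be repeated a different number of times. A short sketch:

```r
library(stringr)

# One string, repeated 1, 2 and 3 times
steps <- str_dup("a", 1:3)

# Combined with str_c: "na" repeated 0, 1 and 2 times after "ba"
words <- str_c("ba", str_dup("na", 0:2))

print(steps)
print(words)
```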

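str_wrap: control the output format of a string

  • Function definition:
    str_wrap(string, width = 80, indent = 0, exdent = 0)
  • Arguments:
    string: a string or a character vector.
    width: the target width of a line.
    indent: the indentation of the first line of each paragraph.
    exdent: the indentation of the following lines of each paragraph.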
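This function has no example above, so here is a minimal sketch. The sample sentence and the values chosen for width, indent and exdent are arbitrary; the exact line breaks depend on the wrapping algorithm, so no output is shown.

```r
library(stringr)

# Arbitrary sample text, longer than the target width
text <- "stringr is a consistent, simple and easy to use set of wrappers around the stringi package."

# Wrap at about 30 characters, indent the first line by 2 and later lines by 4
wrapped <- str_wrap(text, width = 30, indent = 2, exdent = 4)

# str_wrap returns one string with embedded newlines; cat() shows the layout
cat(wrapped, "\n")
```

Collapsing the whitespace again with str_squish(wrapped) should give back the original sentence, since wrapping only inserts newlines and indentation.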

str_sub: extract substrings

  • Function definition:
    str_sub(string, start = 1L, end = -1L)
    str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value
  • Arguments:
    string: a string or a character vector.
    start: the start position; negative values count from the end of the string.
    end: the end position; negative values count from the end of the string.
  • Examples:
    > hw <- "Hadley Wickham"
    > str_sub(hw, 1, 6)
    [1] "Hadley"
    > str_sub(hw, end = 6)
    [1] "Hadley"
    > str_sub(hw, c(1, 8), c(6, 14))
    [1] "Hadley"  "Wickham"
    > str_sub(hw, -1)
    [1] "m"
    > str_sub(hw, -7)
    [1] "Wickham"
    > str_sub(hw, end = -7)
    [1] "Hadley W"
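The second form in the function definition, str_sub(...) <- value, replaces part of a string in place. A minimal sketch using the same name as the examples above:

```r
library(stringr)

x <- "Hadley Wickham"

# Replace characters 1 to 6
str_sub(x, 1, 6) <- "HADLEY"
after_first <- x

# Replace the last 7 characters (negative positions count from the end)
str_sub(x, -7, -1) <- "W."
after_second <- x

print(after_first)
print(after_second)
```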

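String matching functions

str_split: split strings; str_subset: return matching strings; word: extract words from text; str_detect: test whether a pattern is present; str_match: extract matched groups; str_replace: replace matches; str_replace_na: turn NA into the string "NA"; str_locate: find the positions of matches; str_extract: extract matching text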

str_split: split strings

  • Function definition:
    str_split(string, pattern, n = Inf, simplify = FALSE)
    str_split_fixed(string, pattern, n)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to split on.
    n: the maximum number of pieces to return.
    simplify: FALSE returns a list, TRUE returns a character matrix.
  • Examples:
    > fruits <- c("apples and oranges and pears and bananas", "pineapples and mangos and guavas")
    > str_split(fruits, " and ")
    [[1]]
    [1] "apples"  "oranges" "pears"   "bananas"
    [[2]]
    [1] "pineapples" "mangos"     "guavas"
    > str_split(fruits, " and ", simplify = TRUE)
         [,1]         [,2]      [,3]     [,4]
    [1,] "apples"     "oranges" "pears"  "bananas"
    [2,] "pineapples" "mangos"  "guavas" ""
    > str_split(fruits, " and ", n = 3)
    [[1]]
    [1] "apples"            "oranges"           "pears and bananas"
    [[2]]
    [1] "pineapples" "mangos"     "guavas"
    > str_split_fixed(fruits, " and ", 3)
         [,1]         [,2]      [,3]
    [1,] "apples"     "oranges" "pears and bananas"
    [2,] "pineapples" "mangos"  "guavas"
    > str_split_fixed(fruits, " and ", 4)
         [,1]         [,2]      [,3]     [,4]
    [1,] "apples"     "oranges" "pears"  "bananas"
    [2,] "pineapples" "mangos"  "guavas" ""
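Because pattern is a regular expression, one call can split on several delimiters at once. A small sketch with a made-up input that mixes commas, semicolons and irregular spacing:

```r
library(stringr)

# Hypothetical input with inconsistent delimiters
x <- "apples, oranges;pears ,  bananas"

# Split on a comma or semicolon, swallowing any surrounding whitespace
pieces <- str_split(x, "\\s*[,;]\\s*")[[1]]

print(pieces)
```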

str_subset: return the matching strings

  • Function definition:
    str_subset(string, pattern, negate = FALSE)
    str_which(string, pattern, negate = FALSE)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
    negate: if TRUE, return the elements that do not match.
  • Examples:
    > fruit <- c("apple", "banana", "pear", "pineapple")
    > str_subset(fruit, "a")
    [1] "apple"     "banana"    "pear"      "pineapple"
    > str_which(fruit, "a")
    [1] 1 2 3 4
    # Regular expressions are supported
    > str_subset(fruit, "^a")
    [1] "apple"
    > str_subset(fruit, "a$")
    [1] "banana"
word: extract words from text

  • Function definition:
    word(string, start = 1L, end = start, sep = fixed(" "))
  • Arguments:
    string: a string or a character vector.
    start: the position of the first word to extract.
    end: the position of the last word to extract.
    sep: the separator between words.
  • Examples:
    > sentences <- c("Jane saw a cat", "Jane sat down")
    > word(sentences, 1)
    [1] "Jane" "Jane"
    > word(sentences, 2)
    [1] "saw" "sat"
    > word(sentences, -1)
    [1] "cat"  "down"
    > word(sentences, 2, -1)
    [1] "saw a cat" "sat down"
    # Extract words separated by ".."
    > str <- 'abc.def..123.4568.999'
    > word(str, 1, sep = fixed('..'))
    [1] "abc.def"

str_detect: test whether a pattern is present

  • Function definition:
    str_detect(string, pattern, negate = FALSE)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
    negate: if TRUE, return TRUE for the elements that do not match.
  • Examples:
    > fruit <- c("apple", "banana", "pear", "pineapple")
    > str_detect(fruit, "a")
    [1] TRUE TRUE TRUE TRUE
    > str_detect(fruit, "^a")
    [1]  TRUE FALSE FALSE FALSE
    > str_detect(fruit, "^p", negate = TRUE)
    [1]  TRUE  TRUE FALSE FALSE

str_match: extract matched groups from a string

  • Function definition:
    str_match(string, pattern)
    str_match_all(string, pattern)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
  • Examples:
    > val <- c("abc", 123, "cba")
    > str_match(val, "a")
         [,1]
    [1,] "a"
    [2,] NA
    [3,] "a"
    > str_match(val, "[0-9]")
         [,1]
    [1,] NA
    [2,] "1"
    [3,] NA
    > str_match_all(val, "a")
    [[1]]
         [,1]
    [1,] "a"
    [[2]]
         [,1]
    [[3]]
         [,1]
    [1,] "a"
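The examples above use patterns without capture groups, so the result has a single column. With parentheses in the pattern, str_match returns one extra column per group. A sketch with made-up date strings:

```r
library(stringr)

# Hypothetical input: two strings with a date, one without
dates <- c("2023-01-15", "n/a", "released 1999-12-31")

# Column 1 is the full match, columns 2 to 4 are the three groups
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")

# Pull out one group across all inputs; non-matching rows are NA
years <- m[, 2]

print(m)
print(years)
```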

str_replace: replace matches in a string

  • Function definition:
    str_replace(string, pattern, replacement)
    str_replace_all(string, pattern, replacement)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
    replacement: the replacement text.
  • Examples:
    > fruits <- c("one apple", "two pears", "three bananas")
    > str_replace(fruits, "[aeiou]", "-")
    [1] "-ne apple"     "tw- pears"     "thr-e bananas"
    > str_replace_all(fruits, "[aeiou]", "-")
    [1] "-n- -ppl-"     "tw- p--rs"     "thr-- b-n-n-s"
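Two further uses that may be helpful: the replacement can refer to capture groups with \\1, \\2, and str_replace_all accepts a named vector to apply several replacements in one call. A minimal sketch:

```r
library(stringr)

fruits <- c("one apple", "two pears", "three bananas")

# Swap the two words using backreferences to the capture groups
swapped <- str_replace(fruits, "(\\w+) (\\w+)", "\\2 \\1")

# Named vector: names are patterns, values are replacements
digits <- str_replace_all(fruits, c("one" = "1", "two" = "2", "three" = "3"))

print(swapped)
print(digits)
```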

str_replace_na: turn NA into the string "NA"

  • Function definition:
    str_replace_na(string, replacement = "NA")
  • Arguments:
    string: a string or a character vector.
    replacement: the string that replaces NA.
  • Example:
    > str_replace_na(c(NA, "abc", "def"))
    [1] "NA"  "abc" "def"

str_locate: find the positions of a pattern in a string

  • Function definition:
    str_locate(string, pattern)
    str_locate_all(string, pattern)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
  • Examples:
    > fruit <- c("apple", "banana", "pear", "pineapple")
    > str_locate(fruit, "a") # position of the first "a" in each string
         start end
    [1,]     1   1
    [2,]     2   2
    [3,]     3   3
    [4,]     5   5
    > str_locate_all(fruit, "a") # positions of every "a" in each string
    [[1]]
         start end
    [1,]     1   1
    [[2]]
         start end
    [1,]     2   2
    [2,]     4   4
    [3,]     6   6
    [[3]]
         start end
    [1,]     3   3
    [[4]]
         start end
    [1,]     5   5
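The start and end columns returned by str_locate can be passed straight to str_sub. A small sketch on a made-up vector, including an element with no match:

```r
library(stringr)

x <- c("pineapple", "apple pie", "pear")

# One row per input; rows without a match contain NA
loc <- str_locate(x, "apple")

# Use the located positions to cut out the matched text
found <- str_sub(x, loc[, "start"], loc[, "end"])

print(loc)
print(found)
```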

str_extract: extract the text that matches a pattern

  • Function definition:
    str_extract(string, pattern)
    str_extract_all(string, pattern, simplify = FALSE)
  • Arguments:
    string: a string or a character vector.
    pattern: the pattern to match.
    simplify: the return type; TRUE returns a character matrix, FALSE returns a list of character vectors.
  • Examples:
    > shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
    > str_extract(shopping_list, "\\d")
    [1] "4" NA  NA  "2"
    > str_extract_all(shopping_list, "\\d")
    [[1]]
    [1] "4"
    [[2]]
    character(0)
    [[3]]
    character(0)
    [[4]]
    [1] "2"
    > str_extract_all(shopping_list, "\\d", simplify = TRUE)
         [,1]
    [1,] "4"
    [2,] ""
    [3,] ""
    [4,] "2"
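The pattern "\\d" matches a single digit only. To pull out whole numbers, "\\d+" is the usual choice, and the result can be converted with as.integer(). A sketch with a made-up shopping list:

```r
library(stringr)

# Hypothetical items; one quantity has two digits, one item has none
items <- c("apples x14", "bag of flour", "milk x2")

# First run of digits per element, converted to integer (NA where absent)
qty <- as.integer(str_extract(items, "\\d+"))

# All runs of digits in a single string
all_numbers <- str_extract_all("3 apples and 12 pears", "\\d+")[[1]]

print(qty)
print(all_numbers)
```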

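String transformation functions

str_conv: convert character encoding; str_to_upper: convert to upper case; str_to_lower: convert to lower case (same rules as str_to_upper); str_to_title: convert to title case (same rules as str_to_upper)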

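str_conv: convert character encoding

  • Function definition:
    str_conv(string, encoding)
  • Arguments:
    string: a string or a character vector.
    encoding: the name of the encoding the input is currently in.
  • Example:
    > x <- charToRaw('你好')   # bytes of the string in the session encoding (UTF-8 here)
    > x
    [1] e4 bd a0 e5 a5 bd
    > str_conv(x, "UTF-8")
    [1] "你好"
    # Decoding these UTF-8 bytes as "GBK" or "GB2312" would give garbled text.
    # Those names are only correct when the bytes really are GBK-encoded,
    # for example c4 e3 ba c3 on a system whose native encoding is GBK.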
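To make the role of the encoding argument concrete, the sketch below starts from raw bytes that are assumed to be the GBK encoding of "你好" (c4 e3 ba c3) and decodes them with str_conv. The bytes are written out explicitly so the result does not depend on the session locale.

```r
library(stringr)

# Assumed GBK-encoded bytes of the two-character string "你好"
gbk_bytes <- as.raw(c(0xc4, 0xe3, 0xba, 0xc3))

# encoding names the encoding the input is in; the result is UTF-8
decoded <- str_conv(gbk_bytes, "GBK")

# The same text as UTF-8 bytes, for comparison with the GBK bytes above
utf8_bytes <- charToRaw(enc2utf8(decoded))

print(decoded)
print(utf8_bytes)
```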

str_to_upper: convert the case of a string

  • Function definition:
    str_to_upper(string, locale = "en")
    str_to_lower(string, locale = "en")
    str_to_title(string, locale = "en")
    str_to_sentence(string, locale = "en")
  • Arguments:
    string: a string or a character vector.
    locale: the locale whose case-mapping rules are applied.
  • Examples:
    > dog <- "The quick brown dog"
    > str_to_upper(dog)
    [1] "THE QUICK BROWN DOG"
    > str_to_lower(dog)
    [1] "the quick brown dog"
    > str_to_title(dog)
    [1] "The Quick Brown Dog"
    > str_to_sentence("the quick brown dog")
    [1] "The quick brown dog"
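The locale argument matters for languages whose case rules differ from English. Turkish, which has a dotted and a dotless i, is the usual illustration. The escapes "\u0130" and "\u0131" stand for the dotted capital İ and the dotless small ı.

```r
library(stringr)

# English rules: "i" becomes "I"
upper_en <- str_to_upper("i")

# Turkish rules: "i" becomes dotted capital I, "I" becomes dotless small i
upper_tr <- str_to_upper("i", locale = "tr")
lower_tr <- str_to_lower("I", locale = "tr")

print(c(upper_en, upper_tr, lower_tr))
```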