These notes cover the basic usage of the stringr package.

Building a string

Install the stringr package if it is missing, load it, and create an example string.

    > rm(list = ls())  # clear the workspace
    > if (!require(stringr)) install.packages("stringr")  # install only if missing
    > library(stringr)
    > x <- "The birch canoe slid on the smooth planks."
    > x
    [1] "The birch canoe slid on the smooth planks."

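Checking length

The main function here is str_length(), which counts the characters in each string. By contrast, length() counts the elements of the vector.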

    > str_length(x)  # spaces and punctuation are counted too
    [1] 42
    > length(x)
    [1] 1
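
To make the difference between character count and element count concrete, here is a small sketch on a vector with several elements. The vector `words` is a made-up example, not part of the original notes.

```r
library(stringr)

# Hypothetical example vector: a word, an empty string, and a missing value
words <- c("apple", "", NA)

# str_length() counts characters per element; NA stays NA
lens <- str_length(words)

# length() counts the elements of the vector, not characters
n_elements <- length(words)

print(lens)        # 5 0 NA
print(n_elements)  # 3
```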


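Splitting and combining strings

The main function is str_split(). It returns a list, so take the first element with [[1]] to get a character vector that can then be subset.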

    > x
    [1] "The birch canoe slid on the smooth planks."
    > str_split(x, " ")
    [[1]]
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > x2 <- str_split(x, " ")[[1]]
    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
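
As a short sketch of subsetting after a split, the lines below take the first three words and count the words. The `n` argument of str_split(), which caps the number of pieces, is not used in the notes above and is shown here only as an extra illustration.

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."
words <- str_split(x, " ")[[1]]

# Ordinary vector subsetting works on the split result
first_three <- words[1:3]
n_words <- length(words)

# n limits the number of pieces; the remainder stays in the last piece
head_tail <- str_split(x, " ", n = 2)[[1]]

print(first_three)  # "The" "birch" "canoe"
print(n_words)      # 8
print(head_tail)    # "The" "birch canoe slid on the smooth planks."
```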

Note the simplify argument: with simplify = TRUE, str_split() returns a character matrix instead of a list.

    > y <- c("jimmy 150", "nicker 140", "tony 152")
    > str_split(y, " ")
    [[1]]
    [1] "jimmy" "150"
    [[2]]
    [1] "nicker" "140"
    [[3]]
    [1] "tony" "152"
    > str_split(y, " ", simplify = TRUE)
         [,1]     [,2]
    [1,] "jimmy"  "150"
    [2,] "nicker" "140"
    [3,] "tony"   "152"
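
One possible use of the matrix form is to turn the pieces into a data frame. This is only a sketch: the column names `name` and `score` are assumptions chosen for illustration.

```r
library(stringr)

y <- c("jimmy 150", "nicker 140", "tony 152")
parts <- str_split(y, " ", simplify = TRUE)

# Column 1 holds the names, column 2 the numbers (still stored as text)
scores <- data.frame(
  name  = parts[, 1],
  score = as.numeric(parts[, 2])
)

print(scores)
```

Converting the second column with as.numeric() matters because every cell of the matrix returned by str_split() is a character string.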

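str_c() joins strings together.
The sep argument is the separator placed between the arguments being joined, element by element, so each element of a vector is paired with the other string. The collapse argument is the separator used to merge all elements of the result into a single string.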

    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > str_c(x2, collapse = " ")
    [1] "The birch canoe slid on the smooth planks."
    > str_c(x2, 1234, sep = "+")
    [1] "The+1234" "birch+1234" "canoe+1234" "slid+1234" "on+1234"
    [6] "the+1234" "smooth+1234" "planks.+1234"
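
The two arguments can also be combined in one call. The sketch below uses a made-up vector `letters3` to show that sep is applied first, element by element, and collapse then merges the results.

```r
library(stringr)

# Hypothetical example vector
letters3 <- c("a", "b", "c")

# sep joins the arguments element by element
pairs <- str_c(letters3, 1:3, sep = "-")

# collapse then merges the joined elements into one string
joined <- str_c(letters3, 1:3, sep = "-", collapse = "+")

print(pairs)   # "a-1" "b-2" "c-3"
print(joined)  # "a-1+b-2+c-3"
```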

Extracting part of a string

str_sub() extracts the characters from a start position to an end position, both inclusive.

    > x
    [1] "The birch canoe slid on the smooth planks."
    > str_sub(x, 5, 9)
    [1] "birch"
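
str_sub() also accepts negative positions, which count from the end of the string, and it has an assignment form. Neither appears in the notes above, so treat the following as a supplementary sketch.

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

# Negative positions count backwards from the last character
last_word <- str_sub(x, -7, -1)

# The assignment form replaces the selected range in a copy of x
z <- x
str_sub(z, 1, 3) <- "A"

print(last_word)  # "planks."
print(z)          # "A birch canoe slid on the smooth planks."
```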

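Locating characters

str_locate() returns the start and end positions of the first match of a pattern in each element of a vector; elements with no match give NA. Matching is case-sensitive, which is why "The" does not match "th" below.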

    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > str_locate(x2, "th")
         start end
    [1,]    NA  NA
    [2,]    NA  NA
    [3,]    NA  NA
    [4,]    NA  NA
    [5,]    NA  NA
    [6,]     1   2
    [7,]     5   6
    [8,]    NA  NA
    > str_locate(x2, "h")
         start end
    [1,]     2   2
    [2,]     5   5
    [3,]    NA  NA
    [4,]    NA  NA
    [5,]    NA  NA
    [6,]     2   2
    [7,]     6   6
    [8,]    NA  NA
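
Two variations may be useful here, sketched below as additions to the original notes: wrapping the pattern in regex(..., ignore_case = TRUE) to match regardless of case, and str_locate_all() to get every match rather than only the first.

```r
library(stringr)

# Case-insensitive matching: "The" now matches "th" at positions 1 to 2
ci <- str_locate("The", regex("th", ignore_case = TRUE))

# str_locate_all() returns a list with one matrix per input string,
# holding one row per match
all_o <- str_locate_all("smooth", "o")[[1]]

print(ci)
print(all_o)
```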

Detecting characters

str_detect() checks whether each string in a vector contains the pattern and returns TRUE or FALSE for each element.

    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > str_detect(x2, "h")
    [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
    > ### combined with sum() and mean(), this gives the number and proportion of matches
    > sum(str_detect(x2, "h"))
    [1] 4
    > mean(str_detect(x2, "h"))
    [1] 0.5
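
Because str_detect() returns a logical vector, it can also be used to filter. The sketch below shows that, along with str_subset() and str_count(), two related stringr functions that the notes above do not cover.

```r
library(stringr)

x2 <- c("The", "birch", "canoe", "slid", "on", "the", "smooth", "planks.")

# Keep only the elements that contain "h"
with_h <- x2[str_detect(x2, "h")]

# str_subset() is a shortcut for the line above
with_h2 <- str_subset(x2, "h")

# str_count() counts the matches inside each element
o_counts <- str_count(x2, "o")

print(with_h)    # "The" "birch" "the" "smooth"
print(o_counts)  # 0 0 1 0 1 0 2 0
```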

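Replacing characters

str_replace() replaces only the first match in each string; str_replace_all() replaces every match.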

    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > str_replace(x2, "o", "A")
    [1] "The" "birch" "canAe" "slid" "An" "the" "smAoth" "planks."
    > str_replace_all(x2, "o", "A")
    [1] "The" "birch" "canAe" "slid" "An" "the" "smAAth" "planks."
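
One point worth keeping in mind: stringr treats the pattern as a regular expression by default, so characters such as "." have a special meaning. The sketch below is an addition to the notes and shows the difference on the last word of the example, using fixed() or an escaped pattern to match a literal period.

```r
library(stringr)

w <- "planks."

# "." as a regular expression matches any character
as_regex <- str_replace_all(w, ".", "A")

# fixed() matches the literal period only
as_fixed <- str_replace_all(w, fixed("."), "!")

# Escaping the period has the same effect
as_escaped <- str_replace_all(w, "\\.", "!")

print(as_regex)    # "AAAAAAA"
print(as_fixed)    # "planks!"
print(as_escaped)  # "planks!"
```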

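Extracting matched characters

str_extract() returns the first match in each string (NA if there is none); str_extract_all() returns every match, as a list or, with simplify = TRUE, as a matrix padded with empty strings.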

    > x2
    [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
    > str_extract(x2, "o|e")
    [1] "e" NA "o" NA "o" "e" "o" NA
    > str_extract_all(x2, "o|e")
    [[1]]
    [1] "e"
    [[2]]
    character(0)
    [[3]]
    [1] "o" "e"
    [[4]]
    character(0)
    [[5]]
    [1] "o"
    [[6]]
    [1] "e"
    [[7]]
    [1] "o" "o"
    [[8]]
    character(0)
    > str_extract_all(x2, "o|e", simplify = TRUE)
         [,1] [,2]
    [1,] "e"  ""
    [2,] ""   ""
    [3,] "o"  "e"
    [4,] ""   ""
    [5,] "o"  ""
    [6,] "e"  ""
    [7,] "o"  "o"
    [8,] ""   ""
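
Extraction becomes more useful with character classes. As a sketch, the lines below pull the number and the name out of each element of the `y` vector from the splitting section; the patterns "[0-9]+" and "[a-z]+" are illustrative choices that assume lowercase names followed by digits.

```r
library(stringr)

y <- c("jimmy 150", "nicker 140", "tony 152")

# First run of digits in each string, converted to numbers
scores <- as.numeric(str_extract(y, "[0-9]+"))

# First run of lowercase letters in each string
names_only <- str_extract(y, "[a-z]+")

print(scores)      # 150 140 152
print(names_only)  # "jimmy" "nicker" "tony"
```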

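Removing characters

str_remove() deletes the first match in each string; str_remove_all() deletes every match.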

    > x
    [1] "The birch canoe slid on the smooth planks."
    > str_remove(x, " ")
    [1] "Thebirch canoe slid on the smooth planks."
    > str_remove_all(x, " ")
    [1] "Thebirchcanoeslidonthesmoothplanks."
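
Removing a match gives the same result as replacing it with an empty string. The sketch below checks this on the example string and also removes the trailing period, using fixed() so that "." is read literally.

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

# Removing all spaces equals replacing each space with ""
no_spaces <- str_remove_all(x, " ")
same <- str_replace_all(x, " ", "")

# fixed() makes "." a literal period instead of "any character"
no_period <- str_remove(x, fixed("."))

print(identical(no_spaces, same))  # TRUE
print(no_period)                   # "The birch canoe slid on the smooth planks"
```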