基本用法

查看长度

x <- "The birch canoe slid on the smooth planks."
length(x)
str_length(x)
> length(x)
[1] 1
> str_length(x)
[1] 42

length 只会返回出x 中的元素数（长度为1 的字符串类型的向量）。

str_length 才会返回字符串长度。

拆分与组合

拆分

需要注意的是，提取拆分后的元素需要使用 [[]] 双括号选择。

x <- "The birch canoe slid on the smooth planks."
str_split(x," ")
x2 = str_split(x," ")[[1]]
# 此时x2 是一个包含x中所有单词（用空格拆分了）的向量。
> length(x2)
[1] 8

合并

collapse 设定合并向量中内容使用的分隔符。

str_c(x2,collapse = " ")

collapse 参数设定分离的元素结合成一个字符串分离的符号。

还可以将两个向量中的元素，或向量和另外一个字符串进行合并。

str_c(x2,1234,sep = "+")

sep 参数设定某两个分隔的元素连接，使用某符号。、

如果是长度不相等的两个向量合并，则会循环连接（挨个对上，而非全部对上）：

> c
 [1]  9 10  1  8  4  5  6  2 12 11  7  3
> b
[1] "a"     "ppple"
> str_c(c,b, sep='+')
 [1] "9+a"      "10+ppple" "1+a"      "8+ppple"  "4+a"      "5+ppple" 
 [7] "6+a"      "2+ppple"  "12+a"     "11+ppple" "7+a"      "3+ppple"

提取字符串

str_sub(string, start = 1L, end = -1L) ，从start 到end，为提取的string 中字符在字符串中的位置。（空格也占位！）

x <- "The birch canoe slid on the smooth planks."
str_sub(x,5,9)

大小写转换

upper 大写，lower 大写，title 首字母大写。

str_to_upper(x2)
str_to_lower(x2)
str_to_title(x2)

字符串排序

默认按照英文字母或数字大小顺序。

str_sort(x2)

空白处理

stringr::str_trim(string, side) 返回删去字符型向量 string 每个元素的首尾空格的结果，可以用 side 指定删除首尾空格(“both”)、开头空格(“left”)、末尾空格(“right”)。如:

006.02 stringr 处理字符串数据 - 图1

stringr::str_squish(string) 对字符型向量 string 每个元素，将重复空格变成单个，返回变换后的结果。

高级用法

字符检测

对字符串分隔后的向量与待检测的字符进行比较，生成等长的逻辑值向量。

detect 检测全字符，starts 检测首字母，ends 检测末字母。

str_detect(x2,"h")
str_starts(x2,"T")
str_ends(x2,"e")
> str_detect(x2,"h")
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE

还可以联合 table 统计TRUE 与FALSE 的数目。

> table(str_detect(x2,"h"))
FALSE  TRUE 
    4     4

另外，也可以联合 sum 获取TRUE 数目， mean 获得TRUE 比例。

> sum(str_detect(x2,"h"))
[1] 4
> mean(str_detect(x2,"h"))
[1] 0.5

提取匹配字符

将向量中符合要求的元素提取为一个新的向量。

> x <- str_subset(x2,"h")
> x
[1] "The"    "birch"  "the"    "smooth"

ps：匹配和检测支持正则：

006.02 stringr 处理字符串数据 - 图2

字符计数

计算字符串内指定字符出现次数。

str_count(x," ")
str_count(x2,"o")
> str_count(x," ")
[1] 7
> str_count(x2,"o")
[1] 0 0 1 0 1 0 2 0

字符替换与删除

replace 会替换字符串中指定的第一个字符。

而replace_all 则替换全部符合的字符。

str_replace(x2,"o","A")
str_replace_all(x2,"o","A")
> str_replace(x2,"o","A")
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"    
[7] "smAoth"  "planks."
> str_replace_all(x2,"o","A")
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"    
[7] "smAAth"  "planks."
> > str_remove(x2, "[.]")
[1] "The"    "birch"  "canoe"  "slid"   "on"     "the"   
[7] "smooth" "planks"

str_remove 可以将指定的某个字符串从字符串中删除。

练习题

6-2

#练习6-2
library(stringr)
#Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community.
tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."
#1.将上面这句话作为一个长字符串，赋值给tmp
#2.拆分为一个由单词组成的向量，赋值给tmp2(注意标点符号)
tmp2 <- tmp %>% 
  str_replace(',', ' ') %>% 
  str_replace('\\.','') %>% 
  str_split(" ") 
# 直接用或连字符， str_split(" |,") 可以省去替换, 操作
tmp2 <- tmp2[[1]]
tmp2
#3.用函数返回这句话中有多少个单词。
length(tmp2)
#4.用函数返回这句话中每个单词由多少个字母组成。
str_length(tmp2)
#5.统计tmp2有多少个单词中含有字母"e"
sum(str_detect(tmp2, "e"))