See: https://blog.csdn.net/weixin_38008864/article/details/108138494

https://www.jianshu.com/p/6abaf3167cfe

[Figure 1]

The tidytext package installs directly. There is also a book devoted specifically to it:

https://www.tidytextmining.com/

A quick first try

Let's first create a simple data frame with two columns: one for the line number and one for the text itself:

  1. poem <- c("Roses are red,", "Violets are blue,",
  2. "Sugar is sweet,", "And so are you.")
  3. poem_df <- tibble(line = 1:4, text = poem)
  4. > poem_df
  5. # A tibble: 4 x 2
  6. line text
  7. <int> <chr>
  8. 1 1 Roses are red,
  9. 2 2 Violets are blue,
  10. 3 3 Sugar is sweet,
  11. 4 4 And so are you.

Now we can use the function unnest_tokens, which splits the text held in one column of a data frame into individual tokens. Its main arguments are:

  • tbl: the input data frame
  • output: the name of the column to create
  • input: the name of the column holding the text
    > result <- unnest_tokens(tbl = poem_df, output = word, input = text)
    > result
    # A tibble: 13 x 2
        line word
       <int> <chr>
     1     1 roses
     2     1 are
     3     1 red
     4     2 violets
     5     2 are
     6     2 blue
     7     3 sugar
     8     3 is
     9     3 sweet
    10     4 and
    11     4 so
    12     4 are
    13     4 you
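Note that unnest_tokens lowercases tokens and strips punctuation by default, which is why the output above is all lowercase with the commas gone. A small sketch of two of its other documented options (the bigram output column name here is just my own choice):

    # keep the original capitalization
    unnest_tokens(poem_df, output = word, input = text, to_lower = FALSE)
    # split into two-word sequences (bigrams) instead of single words
    unnest_tokens(poem_df, output = bigram, input = text, token = "ngrams", n = 2)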

Next we can drop common words such as "the", "of", and "to". These stop words ship with tidytext in the stop_words dataset, and anti_join removes them:

    > stop_words
    # A tibble: 1,149 x 2
       word        lexicon
       <chr>       <chr>
     1 a           SMART
     2 a's         SMART
     3 able        SMART
     4 about       SMART
     5 above       SMART
     6 according   SMART
     7 accordingly SMART
     8 across      SMART
     9 actually    SMART
    10 after       SMART
    # ... with 1,139 more rows
    > h <- anti_join(result, stop_words)
    Joining, by = "word"
    > h
    # A tibble: 6 x 2
       line word
      <int> <chr>
    1     1 roses
    2     1 red
    3     2 violets
    4     2 blue
    5     3 sugar
    6     3 sweet

Then a simple frequency count:

    > table(h$word)
       blue     red   roses   sugar   sweet violets
          1       1       1       1       1       1
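Since dplyr is already loaded, the same tally can be written in a tidier way; a one-line sketch using dplyr::count, which returns a tibble sorted by frequency:

    # same counts as table(), but returned as a sorted tibble
    count(h, word, sort = TRUE)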

A real example

I originally planned to mine Jane Austen's novels, and there happens to be a package for them:

    library(janeaustenr)
    > original_books <- austen_books()
    > head(original_books)
    # A tibble: 6 x 2
      text                    book
      <chr>                   <fct>
    1 "SENSE AND SENSIBILITY" Sense & Sensibility
    2 ""                      Sense & Sensibility
    3 "by Jane Austen"        Sense & Sensibility
    4 ""                      Sense & Sensibility
    5 "(1811)"                Sense & Sensibility
    6 ""                      Sense & Sensibility
    > table(original_books$book)
    Sense & Sensibility   Pride & Prejudice      Mansfield Park
                  12624               13030               15349
                   Emma    Northanger Abbey          Persuasion
                  16235                7856                8328
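If we stayed with the Austen novels, the usual tidy workflow would look roughly like this (a sketch following the conventions of the tidytextmining.com book; the linenumber column is that book's convention, not something austen_books() provides):

    # tokenize all six novels, drop stop words, and count words per book
    tidy_books <- original_books %>%
      group_by(book) %>%
      mutate(linenumber = row_number()) %>%   # remember each word's original line
      ungroup() %>%
      unnest_tokens(output = word, input = text) %>%
      anti_join(stop_words)
    count(tidy_books, book, word, sort = TRUE)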

But that felt a bit too easy. It is more interesting to start from a raw txt file and do the data cleaning ourselves:

http://www.qcenglish.com/ebook/239.html

Here we use Jane Eyre as the example:

[Figure 2]

Let's take a quick look at the highest-ranked non-stop words:

    # load packages (p_load() comes from the pacman package)
    library(pacman)
    p_load(rvest, pengToolkit, wordcloud2, jiebaR, tidytext, dplyr, janeaustenr); beepr::beep(sound = "coin")
    # read the full text
    lines <- readLines("Jane Eyre - Charlotte Bronte.txt")
    # turn the character vector into a tibble
    text_df <- tibble(line = 1:length(lines), text = lines)
    # split every line into words
    result <- unnest_tokens(tbl = text_df, output = word, input = text)
    # remove stop words
    result <- anti_join(result, stop_words)
    # word frequencies; tail() of the sorted table gives the most frequent words
    tail(sort(table(result$word)))
      door   reed   time    day bessie   miss
        54     65     66     67     86    149
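Since the whole point of using a raw txt file was the cleaning, here is a small sketch of extra steps one might slot in around the code above. Both filters are my own assumptions about what this particular file needs, not something the original workflow did:

    # before building the tibble: drop lines that are empty or whitespace-only
    lines <- lines[nchar(trimws(lines)) > 0]
    # after tokenizing: drop tokens that are pure numbers (page or chapter numbers)
    result <- filter(result, !grepl("^[0-9]+$", word))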

We can also draw a word cloud:

    # visualize the result as a word cloud
    word_freq_db <- as.data.frame(sort(table(result$word), decreasing = TRUE))
    wordcloud2(subset(word_freq_db, Freq >= 20))

[Figure 3]
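If the word cloud feels too playful, the same frequencies can go into a plain bar chart. A sketch; ggplot2 is not in the p_load() list above, so loading it here is an extra assumption (Var1 and Freq are the default column names produced by as.data.frame() on a table):

    library(ggplot2)
    # bar chart of the 15 most frequent words
    top15 <- head(word_freq_db, 15)
    ggplot(top15, aes(x = reorder(Var1, Freq), y = Freq)) +
      geom_col() +
      coord_flip() +                        # horizontal bars are easier to read
      labs(x = "word", y = "frequency")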

Quite simple after all!