See: https://blog.csdn.net/weixin_38008864/article/details/108138494

https://www.jianshu.com/p/6abaf3167cfe

[Figure 1]

The tidytext package installs directly. There is also a book devoted specifically to it:

https://www.tidytextmining.com/

A quick first try

Let's first create a simple data frame with two columns: one for the line number and one for the text itself:

  1. poem <- c("Roses are red,", "Violets are blue,",
  2. "Sugar is sweet,", "And so are you.")
  3. poem_df <- tibble(line = 1:4, text = poem)
  4. > poem_df
  5. # A tibble: 4 x 2
  6. line text
  7. <int> <chr>
  8. 1 1 Roses are red,
  9. 2 2 Violets are blue,
  10. 3 3 Sugar is sweet,
  11. 4 4 And so are you.

Now we can use the function unnest_tokens, which splits the text held in one column of a data frame into individual tokens. Its main arguments are:

  • tbl: the input data frame
  • output: the name of the column to create
  • input: the name of the column holding the text
    > result <- unnest_tokens(tbl = poem_df, output = word, input = text)
    > result
    # A tibble: 13 x 2
        line word
       <int> <chr>
     1     1 roses
     2     1 are
     3     1 red
     4     2 violets
     5     2 are
     6     2 blue
     7     3 sugar
     8     3 is
     9     3 sweet
    10     4 and
    11     4 so
    12     4 are
    13     4 you
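Note that unnest_tokens lowercases tokens and strips punctuation by default, which is why the output above is all lowercase with the commas gone. A small sketch of two of its other documented options (the bigram output column name here is just my own choice):

    # keep the original capitalization
    unnest_tokens(poem_df, output = word, input = text, to_lower = FALSE)
    # split into two-word sequences (bigrams) instead of single words
    unnest_tokens(poem_df, output = bigram, input = text, token = "ngrams", n = 2)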

Next we can drop common words such as "the", "of", and "to". These stop words ship with tidytext in the stop_words dataset, and anti_join removes them:

    > stop_words
    # A tibble: 1,149 x 2
       word        lexicon
       <chr>       <chr>
     1 a           SMART
     2 a's         SMART
     3 able        SMART
     4 about       SMART
     5 above       SMART
     6 according   SMART
     7 accordingly SMART
     8 across      SMART
     9 actually    SMART
    10 after       SMART
    # ... with 1,139 more rows
    > h <- anti_join(result, stop_words)
    Joining, by = "word"
    > h
    # A tibble: 6 x 2
       line word
      <int> <chr>
    1     1 roses
    2     1 red
    3     2 violets
    4     2 blue
    5     3 sugar
    6     3 sweet

Then a simple frequency count:

    > table(h$word)
       blue     red   roses   sugar   sweet violets
          1       1       1       1       1       1
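Since dplyr is already loaded, the same tally can be written in a tidier way; a one-line sketch using dplyr::count, which returns a tibble sorted by frequency:

    # same counts as table(), but returned as a sorted tibble
    count(h, word, sort = TRUE)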

A real example

I originally planned to mine Jane Austen's novels, and there happens to be a package for them:

    library(janeaustenr)
    > original_books <- austen_books()
    > head(original_books)
    # A tibble: 6 x 2
      text                    book
      <chr>                   <fct>
    1 "SENSE AND SENSIBILITY" Sense & Sensibility
    2 ""                      Sense & Sensibility
    3 "by Jane Austen"        Sense & Sensibility
    4 ""                      Sense & Sensibility
    5 "(1811)"                Sense & Sensibility
    6 ""                      Sense & Sensibility
    > table(original_books$book)
    Sense & Sensibility   Pride & Prejudice      Mansfield Park
                  12624               13030               15349
                   Emma    Northanger Abbey          Persuasion
                  16235                7856                8328
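If we stayed with the Austen novels, the usual tidy workflow would look roughly like this (a sketch following the conventions of the tidytextmining.com book; the linenumber column is that book's convention, not something austen_books() provides):

    # tokenize all six novels, drop stop words, and count words per book
    tidy_books <- original_books %>%
      group_by(book) %>%
      mutate(linenumber = row_number()) %>%   # remember each word's original line
      ungroup() %>%
      unnest_tokens(output = word, input = text) %>%
      anti_join(stop_words)
    count(tidy_books, book, word, sort = TRUE)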

But that felt a bit too easy. It is more interesting to start from a raw txt file and do the data cleaning ourselves:

http://www.qcenglish.com/ebook/239.html

Here we use Jane Eyre as the example:

[Figure 2]

Let's take a quick look at the highest-ranked non-stop words:

    # load packages (p_load() comes from the pacman package)
    library(pacman)
    p_load(rvest, pengToolkit, wordcloud2, jiebaR, tidytext, dplyr, janeaustenr); beepr::beep(sound = "coin")
    # read the full text
    lines <- readLines("Jane Eyre - Charlotte Bronte.txt")
    # turn the character vector into a tibble
    text_df <- tibble(line = 1:length(lines), text = lines)
    # split every line into words
    result <- unnest_tokens(tbl = text_df, output = word, input = text)
    # remove stop words
    result <- anti_join(result, stop_words)
    # word frequencies; tail() of the sorted table gives the most frequent words
    tail(sort(table(result$word)))
      door   reed   time    day bessie   miss
        54     65     66     67     86    149
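Since the whole point of using a raw txt file was the cleaning, here is a small sketch of extra steps one might slot in around the code above. Both filters are my own assumptions about what this particular file needs, not something the original workflow did:

    # before building the tibble: drop lines that are empty or whitespace-only
    lines <- lines[nchar(trimws(lines)) > 0]
    # after tokenizing: drop tokens that are pure numbers (page or chapter numbers)
    result <- filter(result, !grepl("^[0-9]+$", word))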

We can also draw a word cloud:

    # visualize the result as a word cloud
    word_freq_db <- as.data.frame(sort(table(result$word), decreasing = TRUE))
    wordcloud2(subset(word_freq_db, Freq >= 20))

[Figure 3]
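If the word cloud feels too playful, the same frequencies can go into a plain bar chart. A sketch; ggplot2 is not in the p_load() list above, so loading it here is an extra assumption (Var1 and Freq are the default column names produced by as.data.frame() on a table):

    library(ggplot2)
    # bar chart of the 15 most frequent words
    top15 <- head(word_freq_db, 15)
    ggplot(top15, aes(x = reorder(Var1, Freq), y = Freq)) +
      geom_col() +
      coord_flip() +                        # horizontal bars are easier to read
      labs(x = "word", y = "frequency")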

Quite simple after all!