Yuque: 左手柳叶刀右手炭火烧
WeChat official account: 研平方 | Jianshu: 研平方

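Recently, while combing through the literature, I came across an application that really caught my interest: text mining can be used to assess the association between selected genes and cancer. Anything involving text mining is something I have to dig into.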

1. The original passage

Literature evidence for the identified target genes in cancer

We used OncoScore, a text mining tool to assess the associations between each gene and specific cancers based on the literature. A cutoff value of 21.09 was suggested to determine true positives and the true negatives in cancer gene identification.

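2. Looking it up

As usual, I opened a browser to get to the bottom of it, and was pleasantly surprised to find that OncoScore is a ready-made R package hosted on Bioconductor, so it can be installed and used directly. Although the paper was published in Sci Rep, I think it is still worth a try.

[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 1

[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 2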

3. What it can do

The OncoScore analysis consists of two parts. One can estimate a score to assess the oncogenic potential of a set of genes, given the literature knowledge at the time of the analysis, or one can study the trend of such a score over time.

In other words, OncoScore can score the oncogenic potential of a given list of target genes based on what the literature says, and it can also track how that score changes over time.

4. Showtime: grab a seat and watch

4.1 Preparation

    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("OncoScore")

    # load the library
    library(OncoScore)

    # Define a query
    query = perform.query(c("ASXL1","IDH1","IDH2","SETBP1","TET2"))
    ### Starting the queries for the selected genes.
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 923
    Number of papers found in PubMed for IDH1 was: 3691
    Number of papers found in PubMed for IDH2 was: 1318
    Number of papers found in PubMed for SETBP1 was: 177
    Number of papers found in PubMed for TET2 was: 1609
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 1018
    Number of papers found in PubMed for IDH1 was: 3902
    Number of papers found in PubMed for IDH2 was: 1499
    Number of papers found in PubMed for SETBP1 was: 229
    Number of papers found in PubMed for TET2 was: 2117

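As the output shows, the query returns two counts for each gene: the number of cancer-related papers that mention it, and the total number of papers that mention it.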
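To make the two sets of counts concrete, here is a small base-R sketch that copies the numbers printed above into a data frame and computes what share of each gene's papers is cancer-related. It does not call PubMed or OncoScore. The column names `CitationsGene` and `CitationsGeneInCancer` follow the labels OncoScore prints in its result table; `PercentInCancer` is a made-up name used only for this illustration.

```r
# Counts copied by hand from the perform.query() output above (not re-queried).
counts <- data.frame(
  CitationsGene         = c(1018, 3902, 1499, 229, 2117),
  CitationsGeneInCancer = c(923, 3691, 1318, 177, 1609),
  row.names = c("ASXL1", "IDH1", "IDH2", "SETBP1", "TET2")
)

# Share of each gene's papers that are cancer-related, in percent.
# PercentInCancer is a hypothetical column name, not part of OncoScore.
counts$PercentInCancer <- 100 * counts$CitationsGeneInCancer / counts$CitationsGene

print(round(counts, 2))
```

All five genes are mentioned mostly in cancer papers, which is what one would expect for this particular gene list.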

OncoScore provides a function to merge gene names if requested by the user. This function is useful when there are aliases in the gene list.

    combine.query.results(query, c('IDH1', 'IDH2'), 'new_gene')
             CitationsGene CitationsGeneInCancer
    ASXL1             1018                   923
    SETBP1             229                   177
    TET2              2117                  1609
    new_gene          5401                  5009
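The merged row equals the sum of the IDH1 and IDH2 rows from the earlier query (3902 + 1499 = 5401 and 3691 + 1318 = 5009). A minimal base-R sketch of that behaviour, assuming plain row-wise summation, looks like this; `merge_rows` is a hypothetical helper written for illustration and is not the package's implementation.

```r
# Counts copied from the perform.query() output (not re-queried).
counts <- matrix(
  c(1018, 3902, 1499, 229, 2117,
    923, 3691, 1318, 177, 1609),
  ncol = 2,
  dimnames = list(
    c("ASXL1", "IDH1", "IDH2", "SETBP1", "TET2"),
    c("CitationsGene", "CitationsGeneInCancer")
  )
)

# Hypothetical helper: replace the rows named in `genes` by one summed row.
merge_rows <- function(m, genes, new_name) {
  summed <- colSums(m[genes, , drop = FALSE])
  kept <- m[!(rownames(m) %in% genes), , drop = FALSE]
  out <- rbind(kept, summed)
  rownames(out)[nrow(out)] <- new_name
  out
}

merged <- merge_rows(counts, c("IDH1", "IDH2"), "new_gene")
print(merged)
```

Under this scheme a paper that mentions both names is counted twice, which is worth keeping in mind when merging anything other than true aliases.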

OncoScore can also retrieve genes by chromosomal location; that is not demonstrated here.

4.2 The main event

4.2.1 Computing the oncogenic score of each gene
    result = compute.oncoscore(query)
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 81.59349
    IDH1 -> 86.66355
    IDH2 -> 79.59096
    SETBP1 -> 67.43283
    TET2 -> 69.12424
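The printed scores can be reproduced by hand from the counts returned by the query. The sketch below assumes the score is the percentage of cancer-related papers, reduced by a penalty of 1/log2(total citations). I inferred this formula by matching it against the numbers above, so treat it as a working assumption and check the package documentation before relying on it. `score_from_counts` is a hypothetical helper, and 21.09 is the cutoff quoted from the paper in section 1.

```r
# Hypothetical helper: percentage of cancer papers, penalised for genes
# with few citations (assumed formula, inferred from the printed results).
score_from_counts <- function(all_cit, cancer_cit) {
  perc <- 100 * cancer_cit / all_cit
  perc - perc / log2(all_cit)
}

# Counts copied from the perform.query() output (not re-queried).
genes      <- c("ASXL1", "IDH1", "IDH2", "SETBP1", "TET2")
all_cit    <- c(1018, 3902, 1499, 229, 2117)
cancer_cit <- c(923, 3691, 1318, 177, 1609)

scores <- setNames(score_from_counts(all_cit, cancer_cit), genes)
print(round(scores, 5))

# Apply the cutoff quoted from the paper.
cutoff <- 21.09
above_cutoff <- scores > cutoff
print(above_cutoff)
```

For example, ASXL1 has 923/1018 = 90.67% cancer papers and log2(1018) is about 9.99, so the score is 90.67 - 90.67/9.99, roughly 81.59. All five genes sit well above the 21.09 cutoff.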

4.2.2 OncoScore timeline analysis
    query.timepoints = perform.query.timeseries(
        c("ASXL1","IDH1","IDH2","SETBP1","TET2"),
        c("2012/03/01", "2013/03/01", "2014/03/01", "2015/03/01", "2016/03/01"))
    ### Starting the queries for the selected genes.
    ### Quering PubMed for timepoint 2012/03/01
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 86
    Number of papers found in PubMed for IDH1 was: 409
    Number of papers found in PubMed for IDH2 was: 173
    Number of papers found in PubMed for SETBP1 was: 5
    Number of papers found in PubMed for TET2 was: 173
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 92
    Number of papers found in PubMed for IDH1 was: 489
    Number of papers found in PubMed for IDH2 was: 235
    Number of papers found in PubMed for SETBP1 was: 10
    Number of papers found in PubMed for TET2 was: 197
    ### Quering PubMed for timepoint 2013/03/01
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 135
    Number of papers found in PubMed for IDH1 was: 662
    Number of papers found in PubMed for IDH2 was: 267
    Number of papers found in PubMed for SETBP1 was: 11
    Number of papers found in PubMed for TET2 was: 258
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 150
    Number of papers found in PubMed for IDH1 was: 753
    Number of papers found in PubMed for IDH2 was: 336
    Number of papers found in PubMed for SETBP1 was: 18
    Number of papers found in PubMed for TET2 was: 303
    ### Quering PubMed for timepoint 2014/03/01
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 188
    Number of papers found in PubMed for IDH1 was: 904
    Number of papers found in PubMed for IDH2 was: 365
    Number of papers found in PubMed for SETBP1 was: 29
    Number of papers found in PubMed for TET2 was: 347
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 209
    Number of papers found in PubMed for IDH1 was: 1003
    Number of papers found in PubMed for IDH2 was: 440
    Number of papers found in PubMed for SETBP1 was: 36
    Number of papers found in PubMed for TET2 was: 431
    ### Quering PubMed for timepoint 2015/03/01
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 257
    Number of papers found in PubMed for IDH1 was: 1198
    Number of papers found in PubMed for IDH2 was: 468
    Number of papers found in PubMed for SETBP1 was: 51
    Number of papers found in PubMed for TET2 was: 461
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 286
    Number of papers found in PubMed for IDH1 was: 1304
    Number of papers found in PubMed for IDH2 was: 551
    Number of papers found in PubMed for SETBP1 was: 66
    Number of papers found in PubMed for TET2 was: 583
    ### Quering PubMed for timepoint 2016/03/01
    ### Performing queries for cancer literature
    Number of papers found in PubMed for ASXL1 was: 323
    Number of papers found in PubMed for IDH1 was: 1506
    Number of papers found in PubMed for IDH2 was: 569
    Number of papers found in PubMed for SETBP1 was: 68
    Number of papers found in PubMed for TET2 was: 587
    ### Performing queries for all the literature
    Number of papers found in PubMed for ASXL1 was: 359
    Number of papers found in PubMed for IDH1 was: 1625
    Number of papers found in PubMed for IDH2 was: 661
    Number of papers found in PubMed for SETBP1 was: 89
    Number of papers found in PubMed for TET2 was: 745

perform.query.timeseries() retrieves the same two literature counts for each of the specified time points.

    result.timeseries = compute.oncoscore.timeseries(query.timepoints)
    ### Computing oncoscore for timepoint 2012/03/01
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 79.14893
    IDH1 -> 74.27776
    IDH2 -> 64.27063
    SETBP1 -> 34.9485
    TET2 -> 76.29579
    ### Computing oncoscore for timepoint 2013/03/01
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 77.54983
    IDH1 -> 78.71551
    IDH2 -> 69.99559
    SETBP1 -> 46.4559
    TET2 -> 74.81894
    ### Computing oncoscore for timepoint 2014/03/01
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 78.28121
    IDH1 -> 81.08963
    IDH2 -> 73.50788
    SETBP1 -> 64.97398
    TET2 -> 71.31087
    ### Computing oncoscore for timepoint 2015/03/01
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 78.84769
    IDH1 -> 82.99363
    IDH2 -> 75.60886
    SETBP1 -> 64.48853
    TET2 -> 70.46695
    ### Computing oncoscore for timepoint 2016/03/01
    ### Processing data
    ### Computing frequencies scores
    ### Estimating oncogenes
    ### Results:
    ASXL1 -> 79.37202
    IDH1 -> 83.9881
    IDH2 -> 76.89328
    SETBP1 -> 64.60591
    TET2 -> 70.53378

4.2.3 Visualization
    ## Oncogenetic potential of the considered genes
    plot.oncoscore(result, col = 'darkblue')

    ## Absolute values of the oncogenetic potential of the considered genes over time
    plot.oncoscore.timeseries(result.timeseries)

    ## Variations of the oncogenetic potential of the considered genes over time
    plot.oncoscore.timeseries(result.timeseries,
                              incremental = TRUE,
                              ylab = 'absolute variation')

    ## Variations as relative values of the oncogenetic potential of the considered genes over time
    plot.oncoscore.timeseries(result.timeseries,
                              incremental = TRUE,
                              relative = TRUE,
                              ylab = 'relative variation')
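To get a feel for what the two "variation" plots show, the sketch below computes year-to-year changes directly from the scores printed in section 4.2.2, using base R only. This post does not show how plot.oncoscore.timeseries defines its variations, so the definitions here are my assumptions: absolute variation as the difference between consecutive time points, and relative variation as that difference divided by the earlier score. The names `abs_var` and `rel_var` are illustrative.

```r
# Scores copied from the compute.oncoscore.timeseries() output above.
timepoints <- c("2012/03/01", "2013/03/01", "2014/03/01",
                "2015/03/01", "2016/03/01")
scores <- rbind(
  ASXL1  = c(79.14893, 77.54983, 78.28121, 78.84769, 79.37202),
  IDH1   = c(74.27776, 78.71551, 81.08963, 82.99363, 83.9881),
  IDH2   = c(64.27063, 69.99559, 73.50788, 75.60886, 76.89328),
  SETBP1 = c(34.9485, 46.4559, 64.97398, 64.48853, 64.60591),
  TET2   = c(76.29579, 74.81894, 71.31087, 70.46695, 70.53378)
)
colnames(scores) <- timepoints

# Assumed definition: change between consecutive time points, per gene.
# Each column is labelled with the later of the two time points.
abs_var <- t(apply(scores, 1, diff))

# Assumed definition: the same change divided by the earlier score.
rel_var <- abs_var / scores[, -ncol(scores)]

print(round(abs_var, 3))
print(round(rel_var, 3))
```

The largest jump belongs to SETBP1 between 2013 and 2014, which is consistent with a gene that had very few papers at the first time points and gained cancer literature quickly.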

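[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 3

[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 4

[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 5

[Bioinformatics analysis] Text-mining target genes to assess their oncogenic potential - Figure 6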
Using the ONCOSCORE package.pdf