背景

当联合甲基化和转录组数据时,有不少文献直接用交集求出基因。但还有个更科学的方法——MethlMix包通过Beta混合模型(beta mixture model)识别低/高甲基化状态,并通过约对应基因的表达水平的相关性分析从而找到疾病中DNA甲基化驱动的基因。
DNA 甲基化是一种将甲基添加到 CpG 位点的机制。这些 CpG 位点的甲基化与基因沉默有关,是正常组织发育的重要机制,通常与癌症等疾病有关。最近已经产生了许多高通量数据,在全基因组基础上分析 CpG 位点甲基化。这为许多疾病创造了大量关于 DNA 甲基化的数据。需要对 DNA 甲基化数据进行计算分析,以识别与正常组织相比潜在的异常 DNA 甲基化。
MethylMix (Gevaert 2015; Gevaert, Tibshirani & Plevritis 2015) 的开发是为了使用计算方法解决这个问题。MethylMix 通过使用 Beta 混合模型来识别与正常组织相比具有不同 DNA 甲基化的样本亚群,从而识别差异和功能性 DNA 甲基化。功能性 DNA 甲基化是指基于匹配的基因表达数据的显着负相关。

软件发表和说明书

  1. 文章发表在Bioinformatics,Published: 14 April 2018,标题为:MethylMix 2.0: an R package for identifying DNA methylation genes


  1. 使用说明书

示例流程代码复现

  1. # 主题:MethylMix识别甲基化驱动基因------
  2. # Tools: MethylMix包
  3. # R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
  4. # 运行前的一些参数设置--------------
  5. # rm(list = ls())
  6. options(stringsAsFactors = F)
  7. # 如果要想从TCGA下载目标癌症的数据,下载时间可能会非常久
  8. # 那么可能需要将反应时间延长
  9. # To retrieve an option
  10. getOption('timeout')
  11. # To set an option
  12. options(timeout=10000)
  13. # 安装R包-------------
  14. # if (!requireNamespace("BiocManager", quietly=TRUE))
  15. # install.packages("BiocManager")
  16. # BiocManager::install("MethylMix")
  17. library("MethylMix")
  18. # 多核并行,不推荐单核运行,因为花的时间太久了
  19. library(doParallel)
  20. cl <- makeCluster(15)
  21. registerDoParallel(cl)
  22. # 获取TCGA的目标癌症数据的27K/450K甲基化和基因表达数据(GetData)---------
  23. # 方法一:GetData函数一次性获取和预处理不同的组学数据
  24. # 注意查看这个函数做的内容,此函数耗时非常非常非常久,嫌时间长可以直接使用测试数据
  25. help(GetData)
  26. cancerSite <- "LIHC"
  27. targetDirectory <- "/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/"
  28. GetData(cancerSite, targetDirectory)
  29. stopCluster(cl)
  30. # 方法二:单独的函数获取不同的组学数据
  31. cancerSite <- "LIHC"
  32. home="/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/21211231_test/"
  33. targetDirectory <- paste0(home, "/")
  34. # Downloading methylation data
  35. METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
  36. # Processing methylation data
  37. METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
  38. # Saving methylation processed data
  39. saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
  40. # Downloading gene expression data
  41. GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
  42. # Processing gene expression data
  43. GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
  44. # Saving gene expression processed data
  45. saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
  46. # Clustering probes to genes methylation data
  47. METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
  48. res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
  49. # Putting everything together in one file
  50. toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
  51. GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
  52. saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
  53. stopCluster(cl)
  54. # 4. Data input for MethylMix------
  55. library(MethylMix)
  56. data(METcancer)
  57. data(METnormal)
  58. data(GEcancer)
  59. dim(METcancer);head(METcancer[, 1:4]) #14 251
  60. dim(METnormal);head(METnormal) #14 4
  61. dim(GEcancer);head(GEcancer[, 1:4]) #14 251
  62. # 5. Running MethylMix---------
  63. MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
  64. # MethylMixResults是最重要的计算结果,它存储以下6个对象,分别为:
  65. # 1. MethylationDrivers:得到driver gene(差异甲基化,转录调控)
  66. # 2. NrComponents: 每个driver gene的对应的甲基化状态;
  67. # 3. MixtureStates: 每个driver—gene的差异甲基化值(即肿瘤样本与正常样本之间的平均甲基化的差值);
  68. # 4. MethylationStates: 以样本为列,driver gene为行的差异甲基化值的矩阵;
  69. # 5. Classifications: Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene
  70. # 6. Models:每个driver gene的beta mixture model参数;
  71. # 简单查看这些对象
  72. MethylMixResults$MethylationDrivers
  73. length(MethylMixResults$NrComponents)
  74. MethylMixResults$NrComponents
  75. MethylMixResults$MixtureStates
  76. # 6. Graphical output---------------
  77. # Plot the most famous methylated gene for glioblastoma
  78. plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)
  79. plots$MixtureModelPlot
  80. # Plot MGMT also with its normal methylation variation
  81. plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)
  82. plots$MixtureModelPlot
  83. # Plot a MethylMix model for another gene
  84. plots <- MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)
  85. plots$MixtureModelPlot
  86. # Also plot the inverse correlation with gene expression (creates two separate
  87. # plots)
  88. plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)
  89. plots$MixtureModelPlot
  90. plots$CorrelationPlot
  91. # Plot all functional and differential genes
  92. pdf("demo_MethylMix_MixtureModelPlot.pdf")
  93. for (gene in MethylMixResults$MethylationDrivers) {
  94. plots <- MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)
  95. plots$MixtureModelPlot
  96. plots$CorrelationPlot
  97. }
  98. dev.off()

分步:1、获取TCGA上目标癌种的甲基化芯片和基因表达数据

注意:下载数据的环境配置
问题描述:R语言默认的timeout时间为60 seconds,这么短的时间如果要下载大文件,经常容易因为下载期间长期没反馈导致time out报错,如:Timeout of 60 seconds was reached
解决办法:设置timeout时间更久。

方法一:使用GetData函数一次性获取转录组和DNA甲基化芯片的数据
函数:GetData
注意查看这个函数做的内容(help(GetData)),如果下载的TCGA癌症类型为样本数目较多的癌症(癌症缩略词见附表),此函数耗时非常非常非常久,久到四五天也没有结果,放后台运行了至少十天左右,最终获得可以用于下游分析的文件!嫌时间长可以直接使用测试数据。

  1. #!/path/to/Rscript
  2. # 主题:MethylMix识别甲基化驱动基因------
  3. # Tools: MethylMix包
  4. # R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
  5. # 安装R包-------------
  6. # rm(list = ls())
  7. options(stringsAsFactors = F)
  8. # if (!requireNamespace("BiocManager", quietly=TRUE))
  9. # install.packages("BiocManager")
  10. # BiocManager::install(c("MethylMix", "doParallel"))
  11. library(MethylMix)
  12. # 获取TCGA的27K/450K甲基化和基因表达数据(GetData)---------
  13. # 单核运行(不推荐)
  14. # cancerSite <- "LIHC"
  15. # targetDirectory <- paste0(getwd(), "/")
  16. # GetData(cancerSite, targetDirectory)
  17. # 多核并行
  18. library(doParallel)
  19. cl <- makeCluster(5)
  20. registerDoParallel(cl)
  21. cancerSite <- "LIHC"
  22. targetDirectory <- paste0(getwd(), "/")
  23. GetData(cancerSite, targetDirectory)
  24. stopCluster(cl)

方法二:使用MethylMix包的其它函数单独下载不同的组学数据
代码如下:

  1. cancerSite <- "OV"
  2. targetDirectory <- paste0(getwd(), "/")
  3. cl <- makeCluster(5)
  4. registerDoParallel(cl)
  5. # Downloading methylation data
  6. METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
  7. # Processing methylation data
  8. METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
  9. # Saving methylation processed data
  10. saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
  11. # Downloading gene expression data
  12. GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
  13. # Processing gene expression data
  14. GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
  15. # Saving gene expression processed data
  16. saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
  17. # Clustering probes to genes methylation data
  18. METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
  19. res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
  20. # Putting everything together in one file
  21. toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
  22. GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
  23. saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
  24. stopCluster(cl)

网上教程

image.png

附录

TCGA 癌症研究类型和Codes

TCGA / 癌症简称 / 缩写 / TCGA癌症中英文对照,链接:TCGA Study Abbreviations

Study Abbreviation Study Name 中文全称
ACC Adrenocortical carcinoma 肾上腺皮质癌
BLCA Bladder Urothelial Carcinoma 膀胱尿路上皮癌
BRCA Breast invasive carcinoma 乳腺浸润癌
CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma 宫颈鳞癌和腺癌
CHOL Cholangiocarcinoma 胆管癌
COAD Colon adenocarcinoma 结肠癌
COADREAD Colon adenocarcinoma/Rectum adenocarcinoma Esophageal carcinoma 结直肠癌
DLBC Lymphoid Neoplasm Diffuse Large B-cell Lymphoma 弥漫性大B细胞淋巴瘤
ESCA Esophageal carcinoma 食管癌
FPPP FFPE Pilot Phase II FFPE试点二期
GBM Glioblastoma multiforme 多形成性胶质细胞瘤
GBMLGG Glioma 胶质瘤
HNSC Head and Neck squamous cell carcinoma 头颈鳞状细胞癌
KICH Kidney Chromophobe 肾嫌色细胞癌
KIPAN Pan-kidney cohort (KICH+KIRC+KIRP) 混合肾癌
KIRC Kidney renal clear cell carcinoma 肾透明细胞癌
KIRP Kidney renal papillary cell carcinoma 肾乳头状细胞癌
LAML Acute Myeloid Leukemia 急性髓细胞样白血病
LGG Brain Lower Grade Glioma 脑低级别胶质瘤
LIHC Liver hepatocellular carcinoma 肝细胞肝癌
LUAD Lung adenocarcinoma 肺腺癌
LUSC Lung squamous cell carcinoma 肺鳞癌
MESO Mesothelioma 间皮瘤
OV Ovarian serous cystadenocarcinoma 卵巢浆液性囊腺癌
PAAD Pancreatic adenocarcinoma 胰腺癌
PCPG Pheochromocytoma and Paraganglioma 嗜铬细胞瘤和副神经节瘤
PRAD Prostate adenocarcinoma 前列腺癌
READ Rectum adenocarcinoma 直肠腺癌
SARC Sarcoma 肉瘤
SKCM Skin Cutaneous Melanoma 皮肤黑色素瘤
STAD Stomach adenocarcinoma 胃癌
STES Stomach and Esophageal carcinoma 胃和食管癌
TGCT Testicular Germ Cell Tumors 睾丸癌
THCA Thyroid carcinoma 甲状腺癌
THYM Thymoma 胸腺癌
UCEC Uterine Corpus Endometrial Carcinoma 子宫内膜癌
UCS Uterine Carcinosarcoma 子宫肉瘤
UVM Uveal Melanoma 葡萄膜黑色素瘤