DNAme array - MethylMix识别甲基化驱动基因 - 《DNA 甲基化数据分析工具和数据库》

背景
软件发表和说明书
示例流程代码复现
- 分步：1、获取TCGA上目标癌种的甲基化芯片和基因表达数据
网上教程
附录
- TCGA 癌症研究类型和Codes

背景

当联合甲基化和转录组数据时，有不少文献直接用交集求出基因。但还有个更科学的方法——MethlMix包通过Beta混合模型（beta mixture model）识别低/高甲基化状态，并通过约对应基因的表达水平的相关性分析从而找到疾病中DNA甲基化驱动的基因。
DNA 甲基化是一种将甲基添加到 CpG 位点的机制。这些 CpG 位点的甲基化与基因沉默有关，是正常组织发育的重要机制，通常与癌症等疾病有关。最近已经产生了许多高通量数据，在全基因组基础上分析 CpG 位点甲基化。这为许多疾病创造了大量关于 DNA 甲基化的数据。需要对 DNA 甲基化数据进行计算分析，以识别与正常组织相比潜在的异常 DNA 甲基化。
MethylMix (Gevaert 2015; Gevaert, Tibshirani & Plevritis 2015) 的开发是为了使用计算方法解决这个问题。MethylMix 通过使用 Beta 混合模型来识别与正常组织相比具有不同 DNA 甲基化的样本亚群，从而识别差异和功能性 DNA 甲基化。功能性 DNA 甲基化是指基于匹配的基因表达数据的显着负相关。

软件发表和说明书

文章发表在Bioinformatics，Published: 14 April 2018，标题为：MethylMix 2.0: an R package for identifying DNA methylation genes

使用说明书

示例流程代码复现

# 主题：MethylMix识别甲基化驱动基因------
# Tools: MethylMix包
# R包教程地址：https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
# 运行前的一些参数设置--------------
# rm(list = ls()) 
options(stringsAsFactors = F)
# 如果要想从TCGA下载目标癌症的数据，下载时间可能会非常久
# 那么可能需要将反应时间延长
# To retrieve an option
getOption('timeout')
# To set an option
options(timeout=10000)
# 安装R包-------------
# if (!requireNamespace("BiocManager", quietly=TRUE))
#   install.packages("BiocManager")
# BiocManager::install("MethylMix")
library("MethylMix")
# 多核并行，不推荐单核运行，因为花的时间太久了
library(doParallel)
cl <- makeCluster(15)
registerDoParallel(cl)
# 获取TCGA的目标癌症数据的27K/450K甲基化和基因表达数据(GetData)---------
# 方法一：GetData函数一次性获取和预处理不同的组学数据
# 注意查看这个函数做的内容，此函数耗时非常非常非常久，嫌时间长可以直接使用测试数据
help(GetData)
cancerSite <- "LIHC"
targetDirectory <- "/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/"
GetData(cancerSite, targetDirectory)
stopCluster(cl)
# 方法二：单独的函数获取不同的组学数据
cancerSite <- "LIHC"
home="/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/21211231_test/"
targetDirectory <- paste0(home, "/")
# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
               GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
stopCluster(cl)
# 4. Data input for MethylMix------
library(MethylMix)
data(METcancer)
data(METnormal)
data(GEcancer)
dim(METcancer);head(METcancer[, 1:4]) #14 251
dim(METnormal);head(METnormal) #14  4
dim(GEcancer);head(GEcancer[, 1:4]) #14 251
# 5. Running MethylMix---------
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
# MethylMixResults是最重要的计算结果，它存储以下6个对象，分别为：
# 1. MethylationDrivers:得到driver gene（差异甲基化，转录调控）
# 2. NrComponents: 每个driver gene的对应的甲基化状态；
# 3. MixtureStates: 每个driver—gene的差异甲基化值（即肿瘤样本与正常样本之间的平均甲基化的差值）；
# 4. MethylationStates: 以样本为列，driver gene为行的差异甲基化值的矩阵；
# 5. Classifications: Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene
# 6. Models：每个driver gene的beta mixture model参数；
# 简单查看这些对象
MethylMixResults$MethylationDrivers
length(MethylMixResults$NrComponents)
MethylMixResults$NrComponents
MethylMixResults$MixtureStates
# 6. Graphical output---------------
# Plot the most famous methylated gene for glioblastoma
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)
plots$MixtureModelPlot
# Plot MGMT also with its normal methylation variation
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
# Plot a MethylMix model for another gene
plots <- MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
# Also plot the inverse correlation with gene expression (creates two separate
# plots)
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)
plots$MixtureModelPlot
plots$CorrelationPlot
# Plot all functional and differential genes
pdf("demo_MethylMix_MixtureModelPlot.pdf")
for (gene in MethylMixResults$MethylationDrivers) {
  plots <- MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)
  plots$MixtureModelPlot
  plots$CorrelationPlot
}
dev.off()

分步：1、获取TCGA上目标癌种的甲基化芯片和基因表达数据

注意：下载数据的环境配置
问题描述：R语言默认的timeout时间为60 seconds，这么短的时间如果要下载大文件，经常容易因为下载期间长期没反馈导致time out报错，如：Timeout of 60 seconds was reached
解决办法：设置timeout时间更久。

方法一：使用GetData函数一次性获取转录组和DNA甲基化芯片的数据
函数：GetData
注意查看这个函数做的内容（help(GetData)），如果下载的TCGA癌症类型为样本数目较多的癌症（癌症缩略词见附表），此函数耗时非常非常非常久，久到四五天也没有结果，放后台运行了至少十天左右，最终获得可以用于下游分析的文件！嫌时间长可以直接使用测试数据。

#!/path/to/Rscript
# 主题：MethylMix识别甲基化驱动基因------
# Tools: MethylMix包
# R包教程地址：https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
# 安装R包-------------
# rm(list = ls()) 
options(stringsAsFactors = F)
# if (!requireNamespace("BiocManager", quietly=TRUE))
#   install.packages("BiocManager")
# BiocManager::install(c("MethylMix", "doParallel"))
library(MethylMix)
# 获取TCGA的27K/450K甲基化和基因表达数据(GetData)---------
# 单核运行（不推荐）
# cancerSite <- "LIHC"
# targetDirectory <- paste0(getwd(), "/")
# GetData(cancerSite, targetDirectory)
# 多核并行
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
cancerSite <- "LIHC"
targetDirectory <- paste0(getwd(), "/")
GetData(cancerSite, targetDirectory)
stopCluster(cl)

方法二：使用MethylMix包的其它函数单独下载不同的组学数据
代码如下：

cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
cl <- makeCluster(5)
registerDoParallel(cl)
# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
    GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
stopCluster(cl)

网上教程

MethylMix识别甲基化驱动基因
太猛了！2021年这个套路发了134篇5+生信文章，还不快学！
使用MethylMix包识别甲基化驱动的癌症基因，下面这个数据挖掘路线图就是来自文章：Methylation-Driven Genes Identified as Novel Prognostic Indicators for Thyroid Carcinoma

附录

TCGA 癌症研究类型和Codes

TCGA / 癌症简称 / 缩写 / TCGA癌症中英文对照，链接：TCGA Study Abbreviations

Study Abbreviation	Study Name	中文全称
ACC	Adrenocortical carcinoma	肾上腺皮质癌
BLCA	Bladder Urothelial Carcinoma	膀胱尿路上皮癌
BRCA	Breast invasive carcinoma	乳腺浸润癌
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma	宫颈鳞癌和腺癌
CHOL	Cholangiocarcinoma	胆管癌
COAD	Colon adenocarcinoma	结肠癌
COADREAD	Colon adenocarcinoma/Rectum adenocarcinoma Esophageal carcinoma	结直肠癌
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	弥漫性大B细胞淋巴瘤
ESCA	Esophageal carcinoma	食管癌
FPPP	FFPE Pilot Phase II	FFPE试点二期
GBM	Glioblastoma multiforme	多形成性胶质细胞瘤
GBMLGG	Glioma	胶质瘤
HNSC	Head and Neck squamous cell carcinoma	头颈鳞状细胞癌
KICH	Kidney Chromophobe	肾嫌色细胞癌
KIPAN	Pan-kidney cohort (KICH+KIRC+KIRP)	混合肾癌
KIRC	Kidney renal clear cell carcinoma	肾透明细胞癌
KIRP	Kidney renal papillary cell carcinoma	肾乳头状细胞癌
LAML	Acute Myeloid Leukemia	急性髓细胞样白血病
LGG	Brain Lower Grade Glioma	脑低级别胶质瘤
LIHC	Liver hepatocellular carcinoma	肝细胞肝癌
LUAD	Lung adenocarcinoma	肺腺癌
LUSC	Lung squamous cell carcinoma	肺鳞癌
MESO	Mesothelioma	间皮瘤
OV	Ovarian serous cystadenocarcinoma	卵巢浆液性囊腺癌
PAAD	Pancreatic adenocarcinoma	胰腺癌
PCPG	Pheochromocytoma and Paraganglioma	嗜铬细胞瘤和副神经节瘤
PRAD	Prostate adenocarcinoma	前列腺癌
READ	Rectum adenocarcinoma	直肠腺癌
SARC	Sarcoma	肉瘤
SKCM	Skin Cutaneous Melanoma	皮肤黑色素瘤
STAD	Stomach adenocarcinoma	胃癌
STES	Stomach and Esophageal carcinoma	胃和食管癌
TGCT	Testicular Germ Cell Tumors	睾丸癌
THCA	Thyroid carcinoma	甲状腺癌
THYM	Thymoma	胸腺癌
UCEC	Uterine Corpus Endometrial Carcinoma	子宫内膜癌
UCS	Uterine Carcinosarcoma	子宫肉瘤
UVM	Uveal Melanoma	葡萄膜黑色素瘤