背景
当联合甲基化和转录组数据时,有不少文献直接用交集求出基因。但还有个更科学的方法——MethlMix
包通过Beta混合模型(beta mixture model)识别低/高甲基化状态,并通过约对应基因的表达水平的相关性分析从而找到疾病中DNA甲基化驱动的基因。
DNA 甲基化是一种将甲基添加到 CpG 位点的机制。这些 CpG 位点的甲基化与基因沉默有关,是正常组织发育的重要机制,通常与癌症等疾病有关。最近已经产生了许多高通量数据,在全基因组基础上分析 CpG 位点甲基化。这为许多疾病创造了大量关于 DNA 甲基化的数据。需要对 DNA 甲基化数据进行计算分析,以识别与正常组织相比潜在的异常 DNA 甲基化。
MethylMix (Gevaert 2015; Gevaert, Tibshirani & Plevritis 2015) 的开发是为了使用计算方法解决这个问题。MethylMix 通过使用 Beta 混合模型来识别与正常组织相比具有不同 DNA 甲基化的样本亚群,从而识别差异和功能性 DNA 甲基化。功能性 DNA 甲基化是指基于匹配的基因表达数据的显着负相关。
软件发表和说明书
- 文章发表在Bioinformatics,Published: 14 April 2018,标题为:MethylMix 2.0: an R package for identifying DNA methylation genes
- 使用说明书
- bioconductor: MethylMix
- https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
- https://github.com/gevaertlab/MethylMix
示例流程代码复现
# 主题:MethylMix识别甲基化驱动基因------
# Tools: MethylMix包
# R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
# 运行前的一些参数设置--------------
# rm(list = ls())
options(stringsAsFactors = F)
# 如果要想从TCGA下载目标癌症的数据,下载时间可能会非常久
# 那么可能需要将反应时间延长
# To retrieve an option
getOption('timeout')
# To set an option
options(timeout=10000)
# 安装R包-------------
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("MethylMix")
library("MethylMix")
# 多核并行,不推荐单核运行,因为花的时间太久了
library(doParallel)
cl <- makeCluster(15)
registerDoParallel(cl)
# 获取TCGA的目标癌症数据的27K/450K甲基化和基因表达数据(GetData)---------
# 方法一:GetData函数一次性获取和预处理不同的组学数据
# 注意查看这个函数做的内容,此函数耗时非常非常非常久,嫌时间长可以直接使用测试数据
help(GetData)
cancerSite <- "LIHC"
targetDirectory <- "/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/"
GetData(cancerSite, targetDirectory)
stopCluster(cl)
# 方法二:单独的函数获取不同的组学数据
cancerSite <- "LIHC"
home="/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/21211231_test/"
targetDirectory <- paste0(home, "/")
# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
stopCluster(cl)
# 4. Data input for MethylMix------
library(MethylMix)
data(METcancer)
data(METnormal)
data(GEcancer)
dim(METcancer);head(METcancer[, 1:4]) #14 251
dim(METnormal);head(METnormal) #14 4
dim(GEcancer);head(GEcancer[, 1:4]) #14 251
# 5. Running MethylMix---------
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
# MethylMixResults是最重要的计算结果,它存储以下6个对象,分别为:
# 1. MethylationDrivers:得到driver gene(差异甲基化,转录调控)
# 2. NrComponents: 每个driver gene的对应的甲基化状态;
# 3. MixtureStates: 每个driver—gene的差异甲基化值(即肿瘤样本与正常样本之间的平均甲基化的差值);
# 4. MethylationStates: 以样本为列,driver gene为行的差异甲基化值的矩阵;
# 5. Classifications: Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene
# 6. Models:每个driver gene的beta mixture model参数;
# 简单查看这些对象
MethylMixResults$MethylationDrivers
length(MethylMixResults$NrComponents)
MethylMixResults$NrComponents
MethylMixResults$MixtureStates
# 6. Graphical output---------------
# Plot the most famous methylated gene for glioblastoma
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)
plots$MixtureModelPlot
# Plot MGMT also with its normal methylation variation
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
# Plot a MethylMix model for another gene
plots <- MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
# Also plot the inverse correlation with gene expression (creates two separate
# plots)
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)
plots$MixtureModelPlot
plots$CorrelationPlot
# Plot all functional and differential genes
pdf("demo_MethylMix_MixtureModelPlot.pdf")
for (gene in MethylMixResults$MethylationDrivers) {
plots <- MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
plots$CorrelationPlot
}
dev.off()
分步:1、获取TCGA上目标癌种的甲基化芯片和基因表达数据
注意:下载数据的环境配置
问题描述:R语言默认的timeout时间为60 seconds,这么短的时间如果要下载大文件,经常容易因为下载期间长期没反馈导致time out报错,如:Timeout of 60 seconds was reached
解决办法:设置timeout时间更久。
方法一:使用GetData函数一次性获取转录组和DNA甲基化芯片的数据
函数:GetData
注意查看这个函数做的内容(help(GetData)),如果下载的TCGA癌症类型为样本数目较多的癌症(癌症缩略词见附表),此函数耗时非常非常非常久,久到四五天也没有结果,放后台运行了至少十天左右,最终获得可以用于下游分析的文件!嫌时间长可以直接使用测试数据。
#!/path/to/Rscript
# 主题:MethylMix识别甲基化驱动基因------
# Tools: MethylMix包
# R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
# 安装R包-------------
# rm(list = ls())
options(stringsAsFactors = F)
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install(c("MethylMix", "doParallel"))
library(MethylMix)
# 获取TCGA的27K/450K甲基化和基因表达数据(GetData)---------
# 单核运行(不推荐)
# cancerSite <- "LIHC"
# targetDirectory <- paste0(getwd(), "/")
# GetData(cancerSite, targetDirectory)
# 多核并行
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
cancerSite <- "LIHC"
targetDirectory <- paste0(getwd(), "/")
GetData(cancerSite, targetDirectory)
stopCluster(cl)
方法二:使用MethylMix包的其它函数单独下载不同的组学数据
代码如下:
cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
cl <- makeCluster(5)
registerDoParallel(cl)
# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
stopCluster(cl)
网上教程
- MethylMix识别甲基化驱动基因
- 太猛了!2021年这个套路发了134篇5+生信文章,还不快学!
- 使用MethylMix包识别甲基化驱动的癌症基因,下面这个数据挖掘路线图就是来自文章:Methylation-Driven Genes Identified as Novel Prognostic Indicators for Thyroid Carcinoma
附录
TCGA 癌症研究类型和Codes
TCGA / 癌症简称 / 缩写 / TCGA癌症中英文对照,链接:TCGA Study Abbreviations
Study Abbreviation | Study Name | 中文全称 |
---|---|---|
ACC | Adrenocortical carcinoma | 肾上腺皮质癌 |
BLCA | Bladder Urothelial Carcinoma | 膀胱尿路上皮癌 |
BRCA | Breast invasive carcinoma | 乳腺浸润癌 |
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 宫颈鳞癌和腺癌 |
CHOL | Cholangiocarcinoma | 胆管癌 |
COAD | Colon adenocarcinoma | 结肠癌 |
COADREAD | Colon adenocarcinoma/Rectum adenocarcinoma Esophageal carcinoma | 结直肠癌 |
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 弥漫性大B细胞淋巴瘤 |
ESCA | Esophageal carcinoma | 食管癌 |
FPPP | FFPE Pilot Phase II | FFPE试点二期 |
GBM | Glioblastoma multiforme | 多形成性胶质细胞瘤 |
GBMLGG | Glioma | 胶质瘤 |
HNSC | Head and Neck squamous cell carcinoma | 头颈鳞状细胞癌 |
KICH | Kidney Chromophobe | 肾嫌色细胞癌 |
KIPAN | Pan-kidney cohort (KICH+KIRC+KIRP) | 混合肾癌 |
KIRC | Kidney renal clear cell carcinoma | 肾透明细胞癌 |
KIRP | Kidney renal papillary cell carcinoma | 肾乳头状细胞癌 |
LAML | Acute Myeloid Leukemia | 急性髓细胞样白血病 |
LGG | Brain Lower Grade Glioma | 脑低级别胶质瘤 |
LIHC | Liver hepatocellular carcinoma | 肝细胞肝癌 |
LUAD | Lung adenocarcinoma | 肺腺癌 |
LUSC | Lung squamous cell carcinoma | 肺鳞癌 |
MESO | Mesothelioma | 间皮瘤 |
OV | Ovarian serous cystadenocarcinoma | 卵巢浆液性囊腺癌 |
PAAD | Pancreatic adenocarcinoma | 胰腺癌 |
PCPG | Pheochromocytoma and Paraganglioma | 嗜铬细胞瘤和副神经节瘤 |
PRAD | Prostate adenocarcinoma | 前列腺癌 |
READ | Rectum adenocarcinoma | 直肠腺癌 |
SARC | Sarcoma | 肉瘤 |
SKCM | Skin Cutaneous Melanoma | 皮肤黑色素瘤 |
STAD | Stomach adenocarcinoma | 胃癌 |
STES | Stomach and Esophageal carcinoma | 胃和食管癌 |
TGCT | Testicular Germ Cell Tumors | 睾丸癌 |
THCA | Thyroid carcinoma | 甲状腺癌 |
THYM | Thymoma | 胸腺癌 |
UCEC | Uterine Corpus Endometrial Carcinoma | 子宫内膜癌 |
UCS | Uterine Carcinosarcoma | 子宫肉瘤 |
UVM | Uveal Melanoma | 葡萄膜黑色素瘤 |