基因表达标准化

不同样品的测序量会有差异,最简单的标准化方式是计算 counts per million (CPM),即原始reads count除以总reads数乘以1,000,000。
这种计算方式的缺点是容易受到极高表达且在不同样品中存在差异表达的基因的影响;这些基因的打开或关闭会影响到细胞中总的分子数目,可能导致这些基因标准化之后就不存在表达差异了,而原本没有差异的基因标准化之后却有差异了。
RPKM、FPKM和TPM是CPM按照基因或转录本长度归一化后的表达,也会受到这一影响。

  1. calc_cpm <- function (expr_mat, spikes = NULL){
  2. norm_factor <- colSums(expr_mat[-spikes,])
  3. return(t(t(expr_mat)/norm_factor)) * 10^6
  4. }

为了解决这一问题,研究者提出了其它的标准化方法。
量化因子 (size factor, SF)是由DESeq提出的。其方法是首先计算每个基因在所有样品中表达的几何平均值。每个细胞的量化因子(size factor)是所有基因与其在所有样品中的表达值的几何平均值的比值的中位数。由于几何平均值的使用,只有在所有样品中表达都不为0的基因才能用来计算。这一方法又被称为 RLE (relative log expression)

  1. calc_sf <- function (expr_mat, spikes=NULL){
  2. geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
  3. SF <- function(cnts){
  4. median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
  5. }
  6. norm_factor <- apply(expr_mat[-spikes,],2,SF)
  7. return(t(t(expr_mat)/norm_factor))
  8. }

上四分位数 (upperquartile, UQ)是样品中所有基因的表达除以处于上四分位数的基因的表达值。同时为了保证表达水平的相对稳定,计算得到的上四分位数值要除以所有样品中上四分位数值的中位数。

  1. calc_uq <- function (expr_mat, spikes=NULL){
  2. UQ <- function(x) {
  3. quantile(x[x>0],0.75)
  4. }
  5. uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
  6. norm_factor <- uq/median(uq)
  7. return(t(t(expr_mat)/norm_factor))
  8. }

TMM是M-值的加权截尾均值。选定一个样品为参照,其它样品中基因的表达相对于参照样品中对应基因表达倍数的log2值定义为M-值。随后去除M-值中最高和最低的30%,剩下的M值计算加权平均值。每一个非参照样品的基因表达值都乘以计算出的TMM。这个方法假设大部分基因的表达是没有差异的。

DESeq2 差异基因鉴定一步法

为了简化差异基因的运算,易生信做了脚本封装,DESeq2.sh,只需提供原始基因表达矩阵、样品分组信息表即可进行差异基因分析和鉴定。

  1. DESeq2.sh
  2. OPTIONS:
  3. -f Data file [A gene count matrix, NECESSARY]
  4. CHECK ABOVE FOR DETAILS
  5. -s Sample file [A multiple columns file with header line,
  6. For <timeseries>, one <conditions> columns is needed.
  7. NECESSARY]
  8. CHECK ABOVE FOR DETAILS
  9. -d The design formula for DESeqDataSetFromMatrix.
  10. [Default <conditions>,
  11. accept <cell+time+cell:time> for example 2.]
  12. -D The reduced design formula for DESeq.
  13. [Only applicable to <timeseries> analysis,
  14. accept <cell+time> or <time> or <cell> for example 2.]
  15. -m Specify the comparasion mode.
  16. [Default <pairwise>, accept <timeseries>,
  17. <pairwise> comparasion will still be done in <timeseries>
  18. mode.
  19. NECESSARY]
  20. -p A file containing the pairs needed to do comparasion.
  21. CHECK ABOVE FOR DETAILS
  22. All samples will be compared in <pairwise> mode if not specified here.
  23. -F Log2 Fold change for screening DE genes.
  24. Default 1
  25. -P FDR for screening DE genes.
  26. Default 0.01
  27. -q FDR for screening time-series DE genes.
  28. Default 0.1
  29. -e Execute programs
  30. Default TRUE
  31. -i Install packages of not exist.
  32. Default FALSE
  33. Eg.
  34. DESeq2.sh -f matirx -s sample -d conditions
  35. DESeq2.sh -f matirx -s sample -p compare_pair

具体使用

  1. cd ~/data
  2. # ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
  3. DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile
  4. [1] "Perform pairwise comparasion using <design=~conditions>"
  5. estimating size factors
  6. estimating dispersions
  7. gene-wise dispersion estimates
  8. mean-dispersion relationship
  9. final dispersion estimates
  10. fitting model and testing
  11. [1] "Output normalized counts"
  12. [1] "Output rlog transformed normalized counts"
  13. [1] "Performing sample clustering"
  14. null device
  15. 1
  16. [1] "DE genes between untrt trt"
  17. [1] "PCA analysis"
  18. null device

运行结束即可获得以下结果文件

  1. # 具体的文件内容和图的样式见后面的分步法文档
  2. # 原始输入文件
  3. ehbio_trans.Count_matrix.xls
  4. sampleFile
  5. # 所有差异基因列表
  6. ehbio_trans.Count_matrix.xls.DESeq2.all.DE
  7. # PCA结果
  8. ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf
  9. # 样品相关性层级聚类结果
  10. ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf
  11. # rlog转换后的标准化后的表达结果
  12. ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls
  13. # 标准化后的表达结果
  14. ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls
  15. # 运行脚本
  16. ehbio_trans.Count_matrix.xls.DESeq2.r
  17. # 差异基因结果
  18. ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
  19. ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
  20. ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
  21. ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls
  22. # 火山图和火山图输入数据
  23. ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
  24. ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf

DESeq2 差异基因鉴定分步法

安装包 DESeq2安装方法如下

  1. source("https://bioconductor.org/biocLite.R")
  2. biocLite('BiocInstaller')
  3. biocLite("DESeq2")
  4. install.packages(c("gplots", "amap", "ggplot2"))

A distributional assumption is needed because we want to estimate the probability of extreme events (large fold change just appearing by chance) from limited replicates. The negative binomial (a.k.a. Gamma-Poisson) is a good choice for RNA-seq experiments because

  • The observed data at gene level is inherently counts or estimated counts of fragments for each feature and
  • The spread of values among biological replicates is more than given by a simpler, one parameter distribution, the Poisson; and it seems to be captured by the NB sufficiently well

    加载包

    ```bash library(DESeq2)

    Loading required package: S4Vectors

    Loading required package: stats4

    Loading required package: BiocGenerics

    Loading required package: parallel

    Attaching package: ‘BiocGenerics’

    The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,

    clusterExport, clusterMap, parApply, parCapply, parLapply,

    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, cbind, colMeans,

colnames, colSums, do.call, duplicated, eval, evalq, Filter,

Find, get, grep, grepl, intersect, is.unsorted, lapply,

lengths, Map, mapply, match, mget, order, paste, pmax,

pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,

rowMeans, rownames, rowSums, sapply, setdiff, sort, table,

tapply, union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges

Loading required package: GenomicRanges

Loading required package: GenomeInfoDb

Loading required package: SummarizedExperiment

Loading required package: Biobase

Welcome to Bioconductor

Vignettes contain introductory material; view with

‘browseVignettes()’. To cite Bioconductor, see

‘citation(“Biobase”)’, and for packages ‘citation(“pkgname”)’.

Loading required package: DelayedArray

Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

anyMissing, rowMedians

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

apply

library(“RColorBrewer”) library(“gplots”)

Attaching package: ‘gplots’

The following object is masked from ‘package:IRanges’:

space

The following object is masked from ‘package:S4Vectors’:

space

The following object is masked from ‘package:stats’:

lowess

library(“amap”) library(“ggplot2”) library(“BiocParallel”)

多线程加速, 使用4个核

如果你电脑性能强大,可以把这个数变大

register(MulticoreParam(4))

  1. <a name="kL6UE"></a>
  2. ### 读入数据
  3. ```bash
  4. # 注意文件的位置,程序自己不知道文件在什么地方
  5. # 如果只给文件名,不给路径,程序只会在当前目录查找文件
  6. # 若找不到,则会报错
  7. # 可以使用下面命令设置工作目录到文件所在目录
  8. # 或提供文件全路径
  9. # setwd("~/data")
  10. data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1,
  11. com='', quote='', check.names=F, sep="\t")
  12. # 撇掉在多于两个样本中count<1的值,如果样本数多,这个数值可以适当增加
  13. # 排除极低表达基因的干扰
  14. data <- data[rowSums(data)>2,]
  15. head(data)
  16. ## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
  17. ## ENSG00000227232 13 25 23 24
  18. ## ENSG00000278267 0 5 3 4
  19. ## ENSG00000268903 0 2 0 0
  20. ## ENSG00000269981 0 3 0 1
  21. ## ENSG00000241860 3 11 1 5
  22. ## ENSG00000279457 46 90 73 49
  23. ## trt_N61311 trt_N052611 trt_N080611 trt_N061011
  24. ## ENSG00000227232 12 12 22 22
  25. ## ENSG00000278267 2 4 3 1
  26. ## ENSG00000268903 0 2 0 3
  27. ## ENSG00000269981 0 2 0 0
  28. ## ENSG00000241860 3 2 0 2
  29. ## ENSG00000279457 52 46 89 31
  30. # 读入分组信息
  31. sample <- read.table("sampleFile", header=T, row.names=1, com='',
  32. quote='', check.names=F, sep="\t", colClasses="factor")
  33. sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
  34. sample_rowname <- rownames(sample)
  35. # 下面的可以忽略,如果没遇到错误不需要执行
  36. # 目的是做因子转换
  37. sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
  38. rownames(sample) <- sample_rowname
  39. sample
  40. ## conditions
  41. ## untrt_N61311 untrt
  42. ## untrt_N052611 untrt
  43. ## untrt_N080611 untrt
  44. ## untrt_N061011 untrt
  45. ## trt_N61311 trt
  46. ## trt_N052611 trt
  47. ## trt_N080611 trt
  48. ## trt_N061011 trt

产生DESeq数据集

DESeq2包采用DESeqDataSet存储原始的read count和中间计算的统计量。
每个DESeqDataSet对象都要有一个实验设计formula,用于对数据进行分组,以便计算表达值的离散度和估计表达倍数差异,通常格式为~ batch + conditions (为了方便后续计算,最为关注的分组信息放在最后一位)。
countData: 表达矩阵
colData: 样品分组信息表
design: 实验设计信息,conditions必须是colData中的一列

  1. ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
  2. colData = sample, design= ~ conditions)
  3. dds <- DESeq(ddsFullCountTable)
  4. ## estimating size factors
  5. ## estimating dispersions
  6. ## gene-wise dispersion estimates
  7. ## mean-dispersion relationship
  8. ## final dispersion estimates
  9. ## fitting model and testing

获取标准化后的数据

  1. # ?counts查看此函数功能
  2. # normalized=T, 返回标准化的数据
  3. normalized_counts <- counts(dds, normalized=TRUE)
  4. head(normalized_counts)
  5. ## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
  6. ## ENSG00000227232 12.730963 21.179286 19.4979982 25.994727
  7. ## ENSG00000278267 0.000000 4.235857 2.5432172 4.332454
  8. ## ENSG00000268903 0.000000 1.694343 0.0000000 0.000000
  9. ## ENSG00000269981 0.000000 2.541514 0.0000000 1.083114
  10. ## ENSG00000241860 2.937915 9.318886 0.8477391 5.415568
  11. ## ENSG00000279457 45.048024 76.245430 61.8849507 53.072567
  12. ## trt_N61311 trt_N052611 trt_N080611 trt_N061011
  13. ## ENSG00000227232 13.423907 17.885811 15.750664 23.250144
  14. ## ENSG00000278267 2.237318 5.961937 2.147818 1.056825
  15. ## ENSG00000268903 0.000000 2.980969 0.000000 3.170474
  16. ## ENSG00000269981 0.000000 2.980969 0.000000 0.000000
  17. ## ENSG00000241860 3.355977 2.980969 0.000000 2.113649
  18. ## ENSG00000279457 58.170264 68.562276 63.718596 32.761567

根据基因在不同的样本中表达变化的差异程度mad值对数据排序,差异越大的基因排位越前。

  1. normalized_counts_mad <- apply(normalized_counts, 1, mad)
  2. normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]
  3. # 标准化后的数据输出
  4. write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
  5. quote=F, sep="\t", row.names=T, col.names=T)
  6. # 只在Linux下能运行,目的是填补表格左上角内容
  7. system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))
  8. # log转换后的结果
  9. rld <- rlog(dds, blind=FALSE)
  10. rlogMat <- assay(rld)
  11. rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]
  12. write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
  13. quote=F, sep="\t", row.names=T, col.names=T)
  14. # 只在Linux下能运行,目的是填补表格左上角内容
  15. system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))

样品层级聚类分析,判断样品的相似性和组间组内差异

  1. # 生成颜色
  2. hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
  3. # 计算相关性pearson correlation
  4. pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))
  5. # 层级聚类
  6. hc <- hcluster(t(rlogMat), method="pearson")
  7. # 热图绘制
  8. ## 在命令行下运行时,需要去除下面开头的#号,把输出的图保存到文件中
  9. ## 输出到文件时,dev.off()命令是关闭输出,从而完成图片的写入。如果不做这一步,则图片则不能写入,从而不能打开
  10. ## 如果在Rstudio或其它可视化界面时,可以直接把图输出到屏幕
  11. #pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
  12. heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
  13. col=hmcol, margins=c(11,11), main="The pearson correlation of each
  14. sample")
  15. #dev.off()

image.png
样品PCA分析

  1. pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)

差异基因分析

  1. # 定义变量的好处是下面的代码就独立了
  2. # 如果想比较不同的组,只需在这修改即可,后面代码都可以不动
  3. sampleA = "untrt"
  4. sampleB = "trt"

results函数提取差异基因分析结果,包含log2 fold changes, p values和adjusted p values. constrast用于指定比较的两组的信息,输出的log2FoldChange为log2(SampleA/SampleB)。

  1. contrastV <- c("conditions", sampleA, sampleB)
  2. res <- results(dds, contrast=contrastV)
  3. res
  4. ## log2 fold change (MLE): conditions untrt vs trt
  5. ## Wald test p-value: conditions untrt vs trt
  6. ## DataFrame with 26280 rows and 6 columns
  7. ## baseMean log2FoldChange lfcSE stat pvalue
  8. ## <numeric> <numeric> <numeric> <numeric> <numeric>
  9. ## ENSG00000227232 18.7141876 0.17695376 0.3809406 0.46451794 0.6422767
  10. ## ENSG00000278267 2.8144283 0.02802497 1.0249422 0.02734297 0.9781862
  11. ## ENSG00000268903 0.9807232 -1.71669654 2.3186888 -0.74037385 0.4590732
  12. ## ENSG00000269981 0.8256996 0.59077960 2.4538911 0.24075217 0.8097472
  13. ## ENSG00000241860 3.3713378 1.21773503 1.0789466 1.12863330 0.2590526
  14. ## ... ... ... ... ... ...
  15. ## 29256 0.3507226 1.8471273 3.484683 0.53007036 0.5960631
  16. ## 36308 0.9279831 0.1394489 1.598343 0.08724589 0.9304761
  17. ## 8414 0.7459738 -0.2678522 2.053434 -0.13044109 0.8962175
  18. ## 28837 1.2626343 0.2689443 1.405958 0.19128899 0.8482992
  19. ## 35320 1.0041030 0.5493562 1.717832 0.31979625 0.7491228
  20. ## padj
  21. ## <numeric>
  22. ## ENSG00000227232 0.8538515
  23. ## ENSG00000278267 NA
  24. ## ENSG00000268903 NA
  25. ## ENSG00000269981 NA
  26. ## ENSG00000241860 NA
  27. ## ... ...
  28. ## 29256 NA
  29. ## 36308 NA
  30. ## 8414 NA
  31. ## 28837 NA
  32. ## 35320 NA

给DESeq2的原始输出结果增加样品平均表达信息,使得结果更容易理解和解析。

  1. # 获得第一组数据均值
  2. baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
  3. if (is.vector(baseA)){
  4. baseMeanA <- as.data.frame(baseA)
  5. } else {
  6. baseMeanA <- as.data.frame(rowMeans(baseA))
  7. }
  8. colnames(baseMeanA) <- sampleA
  9. head(baseMeanA)
  10. ## untrt
  11. ## ENSG00000227232 19.8507436
  12. ## ENSG00000278267 2.7778822
  13. ## ENSG00000268903 0.4235857
  14. ## ENSG00000269981 0.9061570
  15. ## ENSG00000241860 4.6300269
  16. ## ENSG00000279457 59.0627429
  17. # 获得第二组数据均值
  18. baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
  19. if (is.vector(baseB)){
  20. baseMeanB <- as.data.frame(baseB)
  21. } else {
  22. baseMeanB <- as.data.frame(rowMeans(baseB))
  23. }
  24. colnames(baseMeanB) <- sampleB
  25. head(baseMeanB)
  26. ## trt
  27. ## ENSG00000227232 17.5776316
  28. ## ENSG00000278267 2.8509744
  29. ## ENSG00000268903 1.5378607
  30. ## ENSG00000269981 0.7452421
  31. ## ENSG00000241860 2.1126487
  32. ## ENSG00000279457 55.8031756
  33. # 结果组合
  34. res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
  35. head(res)
  36. ## untrt trt baseMean log2FoldChange lfcSE
  37. ## ENSG00000227232 19.8507436 17.5776316 18.7141876 0.17695376 0.3809406
  38. ## ENSG00000278267 2.7778822 2.8509744 2.8144283 0.02802497 1.0249422
  39. ## ENSG00000268903 0.4235857 1.5378607 0.9807232 -1.71669654 2.3186888
  40. ## ENSG00000269981 0.9061570 0.7452421 0.8256996 0.59077960 2.4538911
  41. ## ENSG00000241860 4.6300269 2.1126487 3.3713378 1.21773503 1.0789466
  42. ## ENSG00000279457 59.0627429 55.8031756 57.4329592 0.08926886 0.2929300
  43. ## stat pvalue padj
  44. ## ENSG00000227232 0.46451794 0.6422767 0.8538515
  45. ## ENSG00000278267 0.02734297 0.9781862 NA
  46. ## ENSG00000268903 -0.74037385 0.4590732 NA
  47. ## ENSG00000269981 0.24075217 0.8097472 NA
  48. ## ENSG00000241860 1.12863330 0.2590526 NA
  49. ## ENSG00000279457 0.30474470 0.7605606 0.9130580
  50. # 增加ID信息
  51. res <- cbind(ID=rownames(res), as.data.frame(res))
  52. res$baseMean <- rowMeans(cbind(baseA, baseB))
  53. # 校正后p-value为NA的复制为1
  54. res$padj[is.na(res$padj)] <- 1
  55. # 按pvalue排序, 把差异大的基因放前面
  56. res <- res[order(res$pvalue),]
  57. head(res)
  58. ## ID untrt trt baseMean
  59. ## ENSG00000152583 ENSG00000152583 77.8512 1900.412 989.1314
  60. ## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
  61. ## ENSG00000179094 ENSG00000179094 151.7491 1380.881 766.3151
  62. ## ENSG00000134686 ENSG00000134686 1554.0390 4082.440 2818.2395
  63. ## ENSG00000125148 ENSG00000125148 1298.0392 5958.946 3628.4926
  64. ## ENSG00000120129 ENSG00000120129 773.9769 5975.522 3374.7494
  65. ## log2FoldChange lfcSE stat pvalue
  66. ## ENSG00000152583 -4.608311 0.21090819 -21.84984 7.800059e-106
  67. ## ENSG00000148175 -1.429585 0.08639329 -16.54741 1.671380e-61
  68. ## ENSG00000179094 -3.184674 0.20065896 -15.87108 1.005030e-56
  69. ## ENSG00000134686 -1.392749 0.09190323 -15.15451 7.073605e-52
  70. ## ENSG00000125148 -2.199191 0.14796048 -14.86337 5.698869e-50
  71. ## ENSG00000120129 -2.948122 0.19931242 -14.79146 1.663007e-49
  72. ## padj
  73. ## ENSG00000152583 1.331002e-101
  74. ## ENSG00000148175 1.426021e-57
  75. ## ENSG00000179094 5.716612e-53
  76. ## ENSG00000134686 3.017600e-48
  77. ## ENSG00000125148 1.944910e-46
  78. ## ENSG00000120129 4.729591e-46

整体分析结果输出到文件

  1. comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")
  2. # 生成文件名
  3. file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
  4. file_base1 <- paste(file_base, "results.xls", sep=".")
  5. write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)

提取差异表达基因

  1. # 差异基因筛选,padj<0.1
  2. res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
  3. sampleB, 'log2FoldChange', 'padj'))
  4. # foldchang > 1
  5. res_de_up <- subset(res_de, res_de$log2FoldChange>=1)
  6. file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
  7. 'xls', sep=".")
  8. write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)
  9. res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)
  10. file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB,
  11. 'xls', sep=".")
  12. write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)
  13. # 差异基因ID
  14. res_de_up_id = data.frame(ID=res_de_up$ID,
  15. type=paste(sampleA,"_higherThan_", sampleB, sep="."))
  16. res_de_dw_id = data.frame(ID=res_de_dw$ID,
  17. type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
  18. de_id = rbind(res_de_up_id, res_de_dw_id)
  19. file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
  20. write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)

绘制火山图

  1. # 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP绘制火山图
  2. # 或者使用我们的s-plot https://github.com/Tong-Chen/s-plot
  3. logCounts <- log2(res$baseMean+1)
  4. logFC <- res$log2FoldChange
  5. FDR <- res$padj
  6. #png(filename=paste(file_base, "Volcano.png", sep="."))
  7. plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
  8. xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
  9. #dev.off()

image.png
差异基因热图

  1. res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
  2. res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))
  3. red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
  4. red_de_top20
  5. ## [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
  6. ## [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
  7. ## [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
  8. ## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
  9. ## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
  10. ## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
  11. ## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
  12. ## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
  13. ## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
  14. ## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
  15. ## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
  16. ## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
  17. ## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
  18. ## [40] "ENSG00000164647"
  19. # 可以把矩阵存储,提交到www.ehbio.com/ImageGP绘制火山图
  20. # 或者使用s-plot https://github.com/Tong-Chen/s-plot
  21. red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
  22. library(pheatmap)
  23. pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)

image.png

关于批次效应

每个DESeqDataSet对象都要有一个实验设计formula,用于对数据进行分组,以便计算表达值的离散度和估计表达倍数差异,通常格式为~ batch + conditions (为了方便后续计算,最为关注的分组信息放在最后一位)。
如果记录了样本的批次信息,或者其它需要抹除的信息可以定义在design参数中,在下游回归的分析中会根据design formula来估计batch effect的影响,并在下游分析时减去这个影响。这是处理batch effect的推荐方式。
在模型中考虑batch effect并没有在数据矩阵中移除bacth effect,如果下游处理时,确实有需要可以使用limma包的removeBatchEffect来处理。
countData: 表达矩阵
colData: 样品分组信息表
design: 实验设计信息,batch和conditions必须是colData中的一列

  1. ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
  2. colData = sample, design= ~ batch + conditions)
  3. dds <- DESeq(ddsFullCountTable)
  4. rld <- rlog(dds, blind=FALSE)
  5. rlogMat <- assay(rld)
  6. rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))

Just to be clear, there’s an important difference between removing a batch effect and modelling a batch effect. Including the batch in your design formula will model the batch effect in the regression step, which means that the raw data are not modified (so the batch effect is not removed), but instead the regression will estimate the size of the batch effect and subtract it out when performing all other tests. In addition, the model’s residual degrees of freedom will be reduced appropriately to reflect the fact that some degrees of freedom were “spent” modelling the batch effects. This is the preferred approach for any method that is capable of using it (this includes DESeq2). You would only remove the batch effect (e.g. using limma’s removeBatchEffect function) if you were going to do some kind of downstream analysis that can’t model the batch effects, such as training a classifier.
批次效应模拟

  1. #Make some simulated data with a batch effect:
  2. dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
  3. dds$batch <- factor(rep(c("A","B"),each=6))
  4. #VST, remove batch effect, then plotPCA:
  5. vsd <- vst(dds)
  6. plotPCA(vsd, "batch")

image.png

  1. assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
  2. plotPCA(vsd, "batch")

image.png
SVA(批次未记录时,寻找潜在影响因子,并矫正)

  1. dat <- counts(dds, normalized=TRUE)
  2. idx <- rowMeans(dat) > 1
  3. dat <- dat[idx,]
  4. mod <- model.matrix(~ dex, colData(dds))
  5. mod0 <- model.matrix(~ 1, colData(dds))
  6. ##calculating the variables
  7. n.sv <- num.sv(dat,mod,method="leek") #gives 11
  8. ### using 4
  9. lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
  10. co <- lnj.corr$corrected
  11. plPCA(co)

http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects

参考资料