[Figure 1. © Mark D. Robinson]

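Syntax rendering problems may affect the reading experience here; please read this post on the blog instead.
GitPage address of this post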

edgeR: empirical analysis of DGE in R

cite: Mark D. Robinson, Davis J. McCarthy, Gordon K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, Volume 26, Issue 1, 1 January 2010, Pages 139–140, https://doi.org/10.1093/bioinformatics/btp616

  • An overdispersed Poisson model is used to account for both biological and technical variability.
  • Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
  • The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated.

Why edgeR

  • For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, effectively a continuous response.
  • For digital gene expression (DGE) data, the abundance is observed as a count.
  • Therefore, procedures that are successful for microarray data are not directly applicable to DGE data.
  • edgeR is designed for the analysis of replicated count-based expression data and is an implementation of methodology developed by Robinson and Smyth[1][2].
  • It was initially developed for serial analysis of gene expression (SAGE).
    Beyond SAGE, edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, proteomics experiments where spectral counts are used to summarize the peptide abundance[4], or barcoding experiments where several species are counted[3].

Digital gene expression: Digital gene expression (DGE) is a sequence-based approach to gene expression analysis that generates a digital output at an unparalleled level of sensitivity[5].

Serial analysis of gene expression (SAGE): Serial analysis of gene expression, or SAGE, is an experimental technique designed to gain a direct and quantitative measure of gene expression. The SAGE method is based on the isolation of unique sequence tags (9-10 bp in length) from individual mRNAs and concatenation of tags serially into long DNA molecules for a lump-sum sequencing[6].


Method

In limma (Smyth, 2004), an empirical Bayes model is used to moderate the probe-wise variances.
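
For orientation, the moderation in limma can be written as a weighted average of each probe's sample variance and a prior variance. The sketch below assumes the prior degrees of freedom `d0` and the prior variance `s0_sq` are already known (limma estimates them from all probes); the function name and the numbers are purely illustrative.

```python
def moderated_variance(s_sq, d, s0_sq, d0):
    """Shrink a probe-wise sample variance s_sq (d degrees of freedom)
    toward a prior variance s0_sq (d0 prior degrees of freedom)."""
    return (d0 * s0_sq + d * s_sq) / (d0 + d)


# Hypothetical probe-wise variances, each with 2 residual degrees of freedom.
sample_variances = [0.02, 0.50, 3.00]
moderated = [
    moderated_variance(s, d=2, s0_sq=0.5, d0=4) for s in sample_variances
]
print(moderated)
```

Very small and very large variances are both pulled toward the prior value, which is what makes the resulting test statistics more stable when there are few replicates.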

In edgeR:
We assume the data can be summarized into a table of counts
We model the data as negative binomial (NB) distributed

$$ Y_{gi} \sim \mathrm{NB}(M_i p_{gj}, \phi_g) $$

For gene $g$ and sample $i$:
$M_i$: the library size (total number of reads),
$\phi_g$: the dispersion,
$p_{gj}$: the relative abundance of gene $g$ in experimental group $j$ to which sample $i$ belongs.
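
To make the notation concrete, here is a small sketch on a made-up count table. The gene names, group labels and counts are hypothetical, and the pooled estimate of the relative abundance is only a simple illustration, not necessarily how edgeR estimates it.

```python
# Hypothetical count table: rows are genes, columns are samples.
counts = {
    "geneA": [10, 12, 30, 28],
    "geneB": [100, 90, 95, 105],
    "geneC": [0, 1, 5, 4],
}
groups = ["ctrl", "ctrl", "treat", "treat"]  # group j of each sample i

n_samples = len(groups)
# Library size M_i: total number of reads in sample i (column sum).
library_sizes = [
    sum(row[i] for row in counts.values()) for i in range(n_samples)
]


def relative_abundance(gene, group):
    """Pooled estimate of p_gj: reads of the gene in the group divided by
    all reads in the group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    gene_total = sum(counts[gene][i] for i in idx)
    group_total = sum(library_sizes[i] for i in idx)
    return gene_total / group_total


print(library_sizes)
print(relative_abundance("geneA", "ctrl"), relative_abundance("geneA", "treat"))
```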

We use the NB parameterization where:

  • the mean is $\mu_{gi} = M_i p_{gj}$
  • the variance is $\mu_{gi}(1 + \mu_{gi} \phi_g)$
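
As a quick numerical check of this parameterization, the sketch below writes the NB probability mass function in terms of the mean and the dispersion and sums it to recover the mean and variance. The mapping to a "size" of 1/dispersion and the chosen values are assumptions of this example, not something taken from the edgeR code.

```python
import math


def nb_pmf(k, mu, phi):
    """NB probability of count k with mean mu and dispersion phi (phi > 0)."""
    r = 1.0 / phi        # "size" parameter
    p = r / (r + mu)     # success probability
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


mu, phi = 20.0, 0.1
ks = range(2000)  # truncation point; the tail beyond it is negligible here
mean = sum(k * nb_pmf(k, mu, phi) for k in ks)
var = sum((k - mean) ** 2 * nb_pmf(k, mu, phi) for k in ks)
print(mean, var, mu * (1 + mu * phi))
```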

For differential expression analysis:

  • the parameters of interest are $p_{gj}$.

The NB distribution reduces to the Poisson distribution when $\phi_g = 0$.
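
The Poisson limit can be checked numerically. This sketch compares the NB and Poisson probabilities for a fixed mean as the dispersion shrinks; the mean of 5 and the dispersion values are arbitrary choices for illustration.

```python
import math


def nb_pmf(k, mu, phi):
    """NB probability of count k with mean mu and dispersion phi (phi > 0)."""
    r = 1.0 / phi
    p = r / (r + mu)
    return math.exp(
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )


def poisson_pmf(k, mu):
    """Poisson probability of count k with mean mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


mu = 5.0
gaps = {}
for phi in (1.0, 0.1, 0.001):
    # Largest absolute difference between the two distributions over k = 0..99.
    gaps[phi] = max(
        abs(nb_pmf(k, mu, phi) - poisson_pmf(k, mu)) for k in range(100)
    )
    print(phi, gaps[phi])
```

The largest gap between the two distributions gets smaller as the dispersion approaches zero.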

In some DGE applications, technical variation can be treated as Poisson.
In general, $\phi_g$ represents the squared coefficient of variation of biological variation between the samples (its square root is the biological coefficient of variation). In this way, our model is able to separate biological from technical variation.
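
One way to see this separation, using only the NB mean $\mu_{gi}$ and variance $\mu_{gi}(1 + \mu_{gi}\phi_g)$ from above, is to divide the variance by the squared mean:

$$ \mathrm{CV}^2(Y_{gi}) = \frac{\mu_{gi}(1 + \mu_{gi}\phi_g)}{\mu_{gi}^2} = \frac{1}{\mu_{gi}} + \phi_g $$

The first term is what a Poisson model alone would give and shrinks as the expected count grows, so it can be read as the technical part. The second term, $\phi_g$, does not depend on sequencing depth and carries the biological variation.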

Typical workflow in edgeR: dispersion estimates -> topTags: tabulate the top differentially expressed genes -> plotSmear: MA plot
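
For readers unfamiliar with MA plots, the sketch below computes the two coordinates for a single gene: A, the average log2 abundance, and M, the log2 fold change between two groups. It is a generic illustration and not plotSmear's actual computation; the function name, the counts-per-million scaling and the `prior` offset are assumptions of this example.

```python
import math


def ma_values(count_a, lib_a, count_b, lib_b, prior=0.5):
    """Return (A, M) for one gene.

    A is the average log2 counts-per-million over the two groups and M is
    the log2 fold change of group b over group a. `prior` is a small offset
    that avoids log(0); it is an assumption of this sketch.
    """
    cpm_a = (count_a + prior) / lib_a * 1e6
    cpm_b = (count_b + prior) / lib_b * 1e6
    m = math.log2(cpm_b) - math.log2(cpm_a)
    a = 0.5 * (math.log2(cpm_a) + math.log2(cpm_b))
    return a, m


# Hypothetical gene: 100 reads in group a, 400 in group b, equal library sizes.
a, m = ma_values(100, 1_000_000, 400, 1_000_000, prior=0.0)
print(a, m)
```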

More

There are a few terms and algorithms I do not understand yet, so I’ll update this page later.


Enjoy~

This post is updated automatically by a Python script (GitHub/Yuque).

GitHub: Karobben
Blog: Karobben
BiliBili: 史上最不正經的生物狗


  1. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 2007, vol. 23 (pg. 2881-2887).

  2. Robinson MD, Smyth GK. Small sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 2008, vol. 9 (pg. 321-332).

  3. Andersson AF, et al. Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS ONE, 2008, vol. 3, pg. e2836.

  4. Wong JWH, et al. Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments. Brief. Bioinform., 2008, vol. 9 (pg. 156-165).

  5. Rodríguez-Esteban G, González-Sastre A, Rojo-Laguna JI, et al. Digital gene expression approach over multiple RNA-Seq data sets to detect neoblast transcriptional changes in Schmidtea mediterranea. BMC Genomics, 2015, vol. 16, pg. 361. https://doi.org/10.1186/s12864-015-1533-1

  6. Yamamoto M, Wakatsuki T, Hada A, Ryo A. Use of serial analysis of gene expression (SAGE) technology. J Immunol Methods, 2001, vol. 250(1-2) (pg. 45-66). doi: 10.1016/s0022-1759(01)00305-2. PMID: 11251221.