1. Quality control (QC) of the raw sequencing data

Software: FastQC

Command-line options:

  1. -o --outdir: output directory
  2. --extract: decompress the result files
  3. --noextract: keep the result files compressed
  4. -f --format: input file format; supports bam, sam, and fastq
  5. -t --threads: number of threads
  6. -c --contaminants: specify contaminant sequences; file format name[tab]sequence
  7. -a --adapters: specify adapter sequences; file format name[tab]sequence
  8. -k --kmers: kmer length (2-10 bp, default 7 bp)
  9. -q --quiet: quiet mode

Command:

    fastqc -f fastq -t 8 -o ./QCreport/ fastq1 fastq2

Note: when there are many samples, multiqc can merge the individual fastqc reports into one:

    multiqc ./*.zip

2. Filter the reads after QC

Software: fastp

Command-line options:

    usage: fastp -i <in1> -o <out1> [-I <in2> -O <out2>] [options...]
    options:
    # I/O options
    -i, --in1 read1 input file name (string)
    -o, --out1 read1 output file name (string [=])
    -I, --in2 read2 input file name (string [=])
    -O, --out2 read2 output file name (string [=])
    -6, --phred64 indicates the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
    -z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 2. (int [=2])
    --reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
    # adapter trimming options
    -A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
    -a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
    --adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=])
    # global trimming options (trim fixed numbers of bases from read ends)
    -f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
    -t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
    -F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
    -T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
    # polyG tail trimming, useful for NextSeq/NovaSeq data
    -g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
    --poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
    -G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
    # polyX tail trimming
    -x, --trim_poly_x enable polyX trimming in 3' ends.
    --poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
    # per read cutting by quality options (sliding-window trimming)
    -5, --cut_by_quality5 enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere with deduplication for both PE/SE data)
    -3, --cut_by_quality3 enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere with deduplication for SE data)
    -W, --cut_window_size the size of the sliding window for sliding window trimming, default is 4 (int [=4])
    -M, --cut_mean_quality the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])
    # quality filtering options
    -Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
    -q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
    -u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
    -n, --n_base_limit if one read's number of N bases is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
    # length filtering options
    -L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
    -l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
    # low complexity filtering
    -y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of bases that differ from the next base (base[i] != base[i+1]).
    -Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
    # filter reads with unwanted indexes (to remove possible contamination)
    --filter_by_index1 specify a file containing a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
    --filter_by_index2 specify a file containing a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
    --filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
    # base correction by overlap analysis options
    -c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
    # UMI processing
    -U, --umi enable unique molecular identifier (UMI) preprocessing
    --umi_loc specify the location of UMI, can be index1/index2/read1/read2/per_index/per_read, default is none (string [=])
    --umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
    --umi_prefix if specified, an underscore will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
    --umi_skip if the UMI is in read1/read2, fastp can skip several bases following the UMI, default is 0 (int [=0])
    # overrepresented sequence analysis
    -p, --overrepresentation_analysis enable overrepresented sequence analysis.
    -P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
    # reporting options
    -j, --json the json format report file name (string [=fastp.json])
    -h, --html the html format report file name (string [=fastp.html])
    -R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
    # threading options
    -w, --thread worker thread number, default is 3 (int [=3])
    # output splitting options
    -s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output names (0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
    -S, --split_by_lines split output by limiting lines of each file with this option (>=1000), a sequential number prefix will be added to output names (0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
    -d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
    # help
    -?, --help print this message

Command (note that -o/-O take output file names, not a directory):

    fastp -c -i fastq1 -I fastq2 -o ./cleandata/clean_1.fq.gz -O ./cleandata/clean_2.fq.gz

3. Remove rRNA reads from the clean data (of debatable value)

Download the rRNA sequences from NCBI and save them as hg19_rRNA.fasta.

Build a BWA index:

    bwa index hg19_rRNA.fasta

Align the reads to the rRNA database with BWA, keep only the reads that did not map (samtools flag 4), and convert them back to FASTQ:

    bwa mem hg19_rRNA.fasta fastq1 fastq2 | samtools view -b -f 4 - | samtools fastq -c 9 -1 paired1.fq.gz -2 paired2.fq.gz -

4. Align the rRNA-depleted reads to hg19/hg38

Software: Subread

Build the index:

    subread-buildindex -o hg19 ./database/hg19.fasta

Align:

    subjunc -T 5 -i hg19 -r fastq1 -R fastq2 -o out.bam

Compute read counts:

    featureCounts -p -T 6 -t exon -g transcript_id -a all_lncRNA_know.gtf -o sample.lncRNA.readcount out.bam

Note: a gene can contain multiple transcripts, and a transcript in turn contains multiple exons and introns, so for accurate counting we count at the exon level (-t exon); the software then automatically merges the counts of exons belonging to the same transcript according to the annotation, and reports each transcript's length in the output.

Also note that, precisely because of alternative splicing, we cannot simply sum transcript counts to obtain the count of the corresponding gene; that would be inaccurate. To get gene-level counts, change the grouping parameter to -g gene_id and run a separate count, as shown below.
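For example, the gene-level run only swaps the grouping attribute (the output file name here is just illustrative):

    featureCounts -p -T 6 -t exon -g gene_id -a all_lncRNA_know.gtf -o sample.lncRNA.gene.readcount out.bam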

Extract the count column from the results:

    awk -F '\t' '{print $1,$7}' OFS='\t' sample.lncRNA.readcount > ./tiqu/sample.counts

Note: some people take the length column in the output for the effective length, but in fact it is merely the sum of the lengths of all exons in a transcript, so it cannot serve as the effective length. A formula circulating online estimates the effective length as:

    efflength = transcripts_length - LFD + 1

LFD is the mean fragment length of the sequencing library. It can be estimated, but not accurately; the best approach is to ask the people who did the sequencing for the number. As a quick example, a 2,000 bp transcript sequenced with a mean fragment length of 200 bp has efflength = 2000 - 200 + 1 = 1801.

5. Merge all samples into an expression matrix and compute TPM

Once the read counts are computed, the groundwork of the RNA-seq pipeline is done and all downstream analyses build on it. For the convenience of those analyses, though, it pays to consolidate all the data into a single expression matrix.

You can use Python, R, or shell; I suggest Python because it is simple, fast, and easy to control:

    import pandas as pd
    import glob

    # pandas is faster here; pre-load one sample as the skeleton of the dataframe
    df = pd.read_csv("sample.lncRNA.trans.counts", sep="\t", skiprows=1)
    # the extracted featureCounts results usually carry a header like this
    df = df[['Geneid', 'LncRNA/bidui/normal/sample.subjunc.bam']]
    df.columns = ['id', 'counts']
    # iterate over all samples and join them into one big table
    for name in glob.glob("*.counts"):
        sample = name.replace(".lncRNA.trans.counts", "")
        df2 = pd.read_csv(name, sep="\t", skiprows=1)
        df2.columns = ['id', 'counts', 'ot']
        df2 = df2[['counts']]
        df2.columns = [sample]
        df = pd.concat([df, df2], axis=1)
    # drop the pre-loaded sample's column and write the matrix to a file
    df.drop('counts', axis=1).set_index("id").to_csv("normal_lncRNA_trans_allcounts.csv", sep="\t")

There are many mainstream normalization methods: RPKM, FPKM, TPM, CPM, and the downstream differential-analysis tool DESeq2 also has its own normalization. To preserve biological meaning while keeping the data comparable, I still recommend TPM.

Read count

Definition: the number of reads mapped to an exon in high-throughput sequencing. It can be computed with featureCounts and similar tools.

Advantages: it shows directly whether a region is really expressed and at what raw abundance, approximates the true expression level, and is convenient for experimental validation.

Disadvantages: exon lengths differ, so abundances are hard to compare across exons; total read counts differ, so different samples are hard to compare with each other.

RPKM/FPKM

Definition:
RPKM: Reads Per Kilobase of exon model per Million mapped reads;
FPKM: Fragments Per Kilobase of exon model per Million mapped fragments.
Formula:

    RPKM = ExonMappedReads * 10^9 / (TotalMappedReads * ExonLength)    (formula 1)

The formula above can be derived from the one below:

    RPKM = (ExonMappedReads / ExonLength) / (TotalMappedReads / GenomeLength)    (formula 2)

Explanation: ExonMappedReads is the read count mapped to that exon; TotalMappedReads is the total count of reads mapped to the genome; ExonLength is the length of the exon; GenomeLength is the total genome length, which cancels out because the same genome is used throughout.

Advantages: the tophat-cufflinks pipeline is standardized and widely applied. In theory it compensates for the weaknesses of raw read counts by removing between-sample and between-gene differences.

Discussion: some people say RPKM/FPKM normalization is deeply unsound, among them apparently the well-known YellowTree. In formula 2, TotalMappedReads/GenomeLength is the sequencing depth, and ExonMappedReads/ExonLength can loosely be read as the "sequencing depth" on that exon. Dividing the two normalizes the exon by overall depth, so the biases introduced by exon length and by per-sample sequencing depth should both be removed. Because the same species and hence the same genome is typically used, GenomeLength cancels and formula 2 reduces to the form of formula 1. I cannot see where this reasoning fails, so I venture to question the claim: why exactly can RPKM/FPKM not remove these two types of bias? That said, some papers report that RPKM struggles to remove between-group sequencing differences; perhaps they used the reads mapped to all exons, rather than to the whole genome, as TotalMappedReads. I am not sure.

FPKM: computed much like RPKM, with a single difference: RPKM counts reads, FPKM counts fragments. Read counts can be obtained from both single-end and paired-end data, while fragment counts can only be derived from paired-end data. In paired-end data, two mates mapping to the same region in opposite orientations are counted as one fragment; if only a single read maps to the region, that read is also counted as one fragment. The read count is therefore close to, but never more than, twice the fragment count.

RPM

Definition: RPM/CPM: Reads/Counts of exon model per Million mapped reads.

Formula:

    RPM = ExonMappedReads / TotalMappedReads * 10^6

Advantages: good for between-sample comparison. It normalizes by the total read count mapped to the genome; whatever that total actually is, it is scaled to 10^6.

Disadvantages: it does not remove the expression differences caused by exon length, so comparing exon expression within a sample remains difficult.

TPM

Definition: TPM: Transcripts Per Kilobase of exon model per Million mapped reads.

Formula:

    TPM_i = (N_i / L_i) * 10^6 / (N_1/L_1 + N_2/L_2 + ... + N_n/L_n)

Explanation: N_i is the number of reads mapped to exon i; L_i is the length of exon i; N_1/L_1 + N_2/L_2 + ... + N_n/L_n is the sum of the length-normalized values over all n exons.

Computation: first normalize each exon's read count by its length, P_i = N_i / L_i; the rest then works like RPM (treat P_i as the ExonMappedReads in the RPM formula to obtain TPM).

Advantages: it removes the exon-length effect first, then the effect of differing total read counts between samples.

Disadvantages: because the denominator is not the total reads mapped to the genome, it can be inaccurate in special cases; for example, when a mutant affects exons globally, the differences are hard to detect.

How they relate

Assessment: these methods of computing exon abundance do not differ dramatically. If an effect is significant, almost any of them will find it; when expression differences are less pronounced, the classes are harder to separate and the normalization method should be chosen to fit the actual question.

Note: for TotalMappedReads the recommended first choice is the total reads mapped to the genome, not the total mapped to exons or genes. This, too, depends on the situation.

How do you compute TPM from counts?

Here is a snippet of code taken from the web:

    countToTpm <- function(counts, effLen)
    {
        rate <- log(counts) - log(effLen)
        denom <- log(sum(exp(rate)))
        exp(rate - denom + log(1e6))
    }

    countToFpkm <- function(counts, effLen)
    {
        N <- sum(counts)
        exp( log(counts) + log(1e9) - log(effLen) - log(N) )
    }

    fpkmToTpm <- function(fpkm)
    {
        exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
    }

    countToEffCounts <- function(counts, len, effLen)
    {
        counts * (len / effLen)
    }

    ################################################################################
    # An example
    ################################################################################
    cnts <- c(4250, 3300, 200, 1750, 50, 0)
    lens <- c(900, 1020, 2000, 770, 3000, 1777)
    countDf <- data.frame(count = cnts, length = lens)
    # assume a mean(FLD) = 203.7
    countDf$effLength <- countDf$length - 203.7 + 1
    countDf$tpm <- with(countDf, countToTpm(count, effLength))
    countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
    with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
    countDf$effCounts <- with(countDf, countToEffCounts(count, length, effLength))

Clearly, computing TPM from counts requires both the count and the effective length, and the effective length is the hard part.

What is the effective length of a feature?

I traced this back to a paper from Lior Pachter's group: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

It is quite mathematical, but the general idea is:

If we take the fragment length to be fixed, then the effective length is how many fragments can occur in the transcript. This turns out to be length - frag_len +1. The number of fragments coming from a transcript will be proportional to this number, regardless of whether you sequenced one or both ends of the fragment. In turn, when it comes to probabilistically assigning reads to transcripts the effective length plays a similar role again. Thus for short transcripts, there can be quite a difference between two fragment lengths. To go back to your example if you have transcript of length 310, your effective length is 10 (if fragment length is 300) or 160 (if fragment length is 150) in either case, which explains the discrepancy you see.

From @Rob

The effective length is computed by using the fragment length distribution to determine the effective number of positions that can be sampled on each transcript. You can think of this as convolving the fragment length distribution with the characteristic function (the function that simply takes a value of 1) over the transcript. For example if we observe fragments of length 50 to 1000, a position more than 1000 bases from the end of the transcript will contribute a value of 1 to the effective length, while a position 150 bases will contribute a value of F(150), where F is the cumulative distribution function of the fragment length distribution. For single end data, where we can't learn an empirical FLD, we use a gaussian whose mean and standard deviation can be set with --fldMean and --fldSD respectively.

From Harold Pimentel’s post above. He is in Lior Pachter’s group.

Effective length refers to the number of possible start sites a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as:

    effLength = length - mean(FLD) + 1

where mean(FLD) is the mean of the fragment length distribution, which was learned from the aligned reads. If the abundance estimation method you're using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

So computing efflength yourself is quite tedious and error-prone. Is there an easier way?

There is: salmon quantifies by quasi-mapping ("pseudo-alignment") and reports TPM and counts together with each transcript's effective length, so we can simply extract the effective lengths from its results; a minimal run is sketched below.
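For reference, a salmon run might look roughly like this (the index directory and read file names are placeholders); the per-transcript effective lengths then sit in the EffectiveLength column of sample_quant/quant.sf:

    salmon index -t all_lncRNA_know.fa -i lnc_salmon_index
    salmon quant -i lnc_salmon_index -l A -1 clean_1.fq.gz -2 clean_2.fq.gz -p 8 -o sample_quant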

First, attach the corresponding effective lengths to the featureCounts results:

    import pandas as pd
    import os, glob

    fc_dir = "lncRNA/normal/tiqu_trans"
    sm_dir = "LncRNA/salmon/normal/lncRNA"
    fc = glob.glob(os.path.join(fc_dir, "*.lncRNA.trans.counts"))
    for fs in fc:
        sample_name = fs.replace("lncRNA/normal/tiqu_trans/", "").replace(".lncRNA.trans.counts", "")
        df = pd.read_csv(fs, sep="\t", header=0, skiprows=1).iloc[:, [0, 1]].set_index("Geneid")
        df.columns = ['count']
        df2 = pd.read_csv(os.path.join(sm_dir, "{}_quant".format(sample_name), "quant.sf"), sep="\t", header=0).set_index("Name")
        df1 = df.join(df2[['EffectiveLength']])
        df1.to_csv(os.path.join(fc_dir, "{}.lncRNA.trans_eff.counts".format(sample_name)), sep="\t")

Then compute TPM:

    options(scipen = 200)
    options(digits = 3)
    countToTpm <- function(counts, effLen){
        rate <- log(counts) - log(effLen)
        denom <- log(sum(exp(rate)))
        exp(rate - denom + log(1e6))
    }
    for (name in Sys.glob("*.lncRNA.trans_eff.counts")){
        countdata <- read.table(name, sep = "\t", header = TRUE, row.names = "Geneid")
        countdata <- na.omit(countdata)
        countdata$TPM <- with(countdata, countToTpm(countdata$count, countdata$EffectiveLength))
        id <- unlist(strsplit(name, "[.]"))[1]
        write.table(countdata, paste(id, ".lncRNA.trans.TPM", sep = ""), quote = FALSE, sep = "\t")
    }

Then merge all the TPM values into an expression matrix again:

    import pandas as pd
    import glob

    df = pd.read_csv("sample.lncRNA.trans.TPM", sep="\t")
    df = df[['TPM']]
    for name in glob.glob("*.TPM"):
        sample = name.replace(".lncRNA.trans.TPM", "")
        df2 = pd.read_csv(name, sep="\t")
        df2 = df2[['TPM']]
        df2.columns = [sample]
        df = df.join(df2)
    df.drop(['TPM'], axis=1).to_csv("normal_lncRNA_trans_TPM.csv", sep="\t")

Now another problem appears. …
How do we compute gene-level expression? A gene's effective length cannot be computed directly, because salmon only quantifies transcripts. We can, however, infer a gene's effective length from its transcripts:

    import argparse
    import pandas as pd
    import numpy as np
    import sys

    def main(args):
        gtable = pd.read_csv(args.ginput, skiprows=1, sep="\t").set_index('Geneid')
        gtable = gtable.iloc[:, 0:1]
        gtable.columns = ['Count']
        ttable = pd.read_table(args.tinput).set_index('Name')
        tgmap = pd.read_table(args.tgmap, names=['t', 'g']).set_index('t')
        gene_lengths = {}
        j = 0
        # Map over all gene groups (a gene and its associated transcripts)
        for g, txps in tgmap.groupby('g').groups.items():
            if j % 500 == 1:
                print("Processed {} genes".format(j))
            j += 1
            # The set of transcripts present in our salmon index
            tset = []
            for t in txps:
                if t in ttable.index:
                    tset.append(t)
            # If at least one of the transcripts was present
            if len(tset) > 0:
                # The denominator is the sum of all TPMs
                totlen = ttable.loc[tset, 'TPM'].sum()
                # Turn the relative TPMs into a proper partition of unity
                if totlen > 0:
                    tpm_fracs = ttable.loc[tset, 'TPM'].values / ttable.loc[tset, 'TPM'].sum()
                else:
                    tpm_fracs = np.ones(len(tset)) / float(len(tset))
                # Compute the gene's effective length as the abundance-weighted
                # sum of the transcript lengths
                elen = 0.0
                for i, t in enumerate(tset):
                    elen += tpm_fracs[i] * ttable.loc[t, 'EffectiveLength']
                gene_lengths[g] = elen
        # Give the table an effective length field
        gtable['EffectiveLength'] = gtable.apply(lambda r: gene_lengths[r.name] if r.name in gene_lengths else 1.0, axis=1)
        # Write it to the output file
        gtable.to_csv(args.output, sep='\t', index_label='Name')

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Compute gene-level effective lengths")
        parser.add_argument('--ginput', type=str, help='gene level input table')
        parser.add_argument('--tinput', type=str, help='transcript level input table')
        parser.add_argument('--tgmap', type=str, help='transcript -> gene mapping')
        parser.add_argument('--output', type=str, help='output table with extra column')
        args = parser.parse_args()
        main(args)

Run it like:

    python inferlength.py --ginput sample.lncRNA.counts --tinput sample.quant.sf --tgmap lnc_tx2gene.txt --output sample.eff.lnc.counts

Then compute TPM the old way and merge as before.

6. Differential expression analysis

Differential analysis comes in two flavors: between-group and between-sample. Between-group means two groups of samples are compared and each group is treated as a set of replicates. Between-sample means that although two groups are still compared, the samples within each group are independent and differ from one another.

For the two situations, I recommend DESeq2 for between-group analysis and DEGseq for between-sample analysis.

Typically, differences that DESeq2 cannot capture will be picked up by DEGseq, but whether they match expectations needs further assessment: in a heatmap, the differential genes from DEGseq rarely form a clean two-block contrast, and in clustering the samples and groups rarely cluster together.

DESeq2:
    library(DESeq2)
    library(BiocParallel)  # provides register() and MulticoreParam()
    path = getwd()
    normal <- read.table(paste(path, "/", "normal_lncRNA_smcounts.csv", sep = ""), header = TRUE, row.names = "Name", check.names = FALSE)
    sick <- read.table(paste(path, "/", "sick_lncRNA_smcounts.csv", sep = ""), header = TRUE, row.names = "Name", check.names = FALSE)
    mydata <- cbind(normal, sick)
    sample <- colnames(mydata)
    group_lst = c(rep("normal", length(colnames(normal))), rep("sick", length(colnames(sick))))
    colData <- data.frame(row.names = sample, group_list = group_lst)
    dds <- DESeqDataSetFromMatrix(countData = round(mydata), colData = colData, design = ~ group_list)
    register(MulticoreParam(4))
    dds2 <- DESeq(dds, parallel = TRUE)
    saveRDS(dds2, paste(path, "/", "lncRNA_normlized_by_deseq2_sm.rds", sep = ""))
    #suppressMessages(dds2)
    res <- results(dds2, contrast = c("group_list", "normal", "sick"))
    resOrdered <- res[order(res$padj), ]
    resOrdered = as.data.frame(resOrdered)
    diff_gene_deseq <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
    write.table(diff_gene_deseq, file = paste(path, "/", "lncRNA_top_diff2_sm.txt", sep = ""), quote = FALSE, sep = "\t")
    plotdata <- counts(dds2, normalized = TRUE)
    write.table(plotdata, file = paste(path, "/", "lncRNA_normlized_by_deseq2_sm.txt", sep = ""), quote = FALSE, sep = "\t")

DEGseq:

    library(qvalue)
    library(samr)
    library(impute)
    library(matrixStats)
    library("DEGseq")
    path = getwd()
    normal <- read.table(paste(path, "/", "normal_lncRNA_smcounts.csv", sep = ""), header = TRUE, row.names = "Name", check.names = FALSE)
    sick <- read.table(paste(path, "/", "sick_lncRNA_smcounts.csv", sep = ""), header = TRUE, row.names = "Name", check.names = FALSE)
    mydata <- cbind(normal, sick)
    write.table(mydata, paste(path, "/", "lncRNA_all_smcounts.txt", sep = ""), quote = FALSE, sep = "\t")
    matrix1 <- readGeneExp(paste(path, "/", "lncRNA_all_smcounts.txt", sep = ""), geneCol = 1, valCol = c(2:83))
    matrix2 <- readGeneExp(paste(path, "/", "lncRNA_all_smcounts.txt", sep = ""), geneCol = 1, valCol = c(84:207))
    DEGexp(geneExpMatrix1 = matrix1, geneCol1 = 1, expCol1 = c(2:83), groupLabel1 = "normal",
           geneExpMatrix2 = matrix2, geneCol2 = 1, expCol2 = c(2:125), groupLabel2 = "sick",
           pValue = 1e-3, zScore = 4, qValue = 1e-3, foldChange = 2, thresholdKind = 5,
           method = "MARS", outputDir = path)

Merge the case and control TPM into one matrix:
    path = getwd()
    normal <- read.table(paste(path, "/", "normal_mRNA_smTPM.csv", sep = ""), row.names = "Name", header = TRUE, check.names = FALSE)
    sick <- read.table(paste(path, "/", "sick_mRNA_smTPM.csv", sep = ""), row.names = "Name", header = TRUE, check.names = FALSE)
    mydata <- cbind(normal, sick)
    write.table(mydata, "mRNA.salmon.TPM.txt", sep = "\t", quote = FALSE)

Filter out entries with zero expression:
    path = getwd()
    mydata <- read.table(paste(path, "mRNA.salmon.TPM.txt", sep = "/"), header = TRUE)
    # alternatively, pass check.names=FALSE to read.table
    colnames(mydata) <- gsub("X", "", colnames(mydata))
    top_m <- read.table(paste(path, "top_mRNA_degseq_fliter.csv", sep = "/"), header = TRUE, sep = "\t")
    top_m_TPM <- mydata[top_m$GeneNames, ]
    top_m_filter <- top_m_TPM[apply(top_m_TPM, 1, mean) > 0, ]
    write.table(top_m_filter, "top_m_filter.txt", sep = "\t", quote = FALSE)

7. GO and pathway enrichment analysis of the differential transcripts

I recommend Y叔 (Guangchuang Yu)'s clusterProfiler: it is easy to use, its annotations are very comprehensive, and its functionality for enrichment analysis and annotation is complete; a minimal sketch follows. Online database annotation also works, and a few very good sites are listed after the sketch:
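A minimal clusterProfiler sketch, assuming human data and a vector gene_ids of Entrez gene IDs from the differential analysis (both of these are assumptions; adjust the organism and key type to your own IDs):

    library(clusterProfiler)
    library(org.Hs.eg.db)
    # GO enrichment of the differential gene list
    ego <- enrichGO(gene = gene_ids, OrgDb = org.Hs.eg.db, ont = "BP",
                    pAdjustMethod = "BH", qvalueCutoff = 0.05)
    # KEGG pathway enrichment
    ekegg <- enrichKEGG(gene = gene_ids, organism = "hsa")
    dotplot(ego)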

DAVID: long criticized for its slowly updated database, but in practice it still delivers.

metascape: the best site I have used so far. It accepts almost every kind of ID and offers both standard and custom analyses; the custom analysis can pick out exactly the pathways we want and return their genes, which is genuinely useful, and it even ships an Excel add-in that runs directly inside Excel.

webgestalt: perhaps I have not discovered its real strengths yet; I find it awkward and its enrichment incomplete, but its figures are well made, no complaints there.

BioDbNet: this site is practically omnipotent; it does ID conversion and full annotation. Unbeatable!

For annotating lncRNAs there is the R package LncPath; it is simple to use, but is best combined with WGCNA.

8. Visualize the differential genes

My visualizations so far come in two kinds: clustered heatmaps and box plots or violin plots, the former for large numbers of genes and the latter for small numbers.

Heatmaps are straightforward: feed in the expression matrix and the group information with default parameters, ideally after a log2(n+1) transform; a sketch follows.
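A minimal sketch with pheatmap, assuming expr is a genes-by-samples expression matrix and groups holds the per-sample group labels (both names are placeholders):

    library(pheatmap)
    mat <- log2(expr + 1)  # the log2(n+1) transform mentioned above
    ann <- data.frame(group = groups, row.names = colnames(mat))
    pheatmap(mat, annotation_col = ann, scale = "row", show_rownames = FALSE)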

Box plots are the same idea; ggplot2 handles them, as sketched below.
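A box-plot sketch, assuming a long-format data frame plotdf with columns gene, group, and TPM (hypothetical names):

    library(ggplot2)
    ggplot(plotdf, aes(x = gene, y = log2(TPM + 1), fill = group)) +
        geom_boxplot() +
        theme_bw()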

The other kind of figure is an interaction network, drawn with igraph; a sketch follows.
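A minimal igraph sketch, assuming pairs is a two-column data frame of lncRNA-mRNA pairs, such as the correlation pairs produced in step 9 below (hypothetical name):

    library(igraph)
    g <- graph_from_data_frame(pairs, directed = FALSE)
    plot(g, vertex.size = 5, vertex.label.cex = 0.6)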

9. Finding lncRNA target genes

A key point in lncRNA research is determining the lncRNA's target genes; a way to predict them reliably would cut our workload substantially. How to predict lncRNA targets follows from the lncRNA's different modes of action.

lncRNAs act broadly and through very complex mechanisms. Their modes of action divide into cis and trans, or into Signal, Decoy, Guide, and Scaffold; they can also be classified by the molecule they engage: DNA, RNA, or protein. Overall they operate at both the transcriptional and post-transcriptional levels.

Prediction based on the cis and trans modes of action

RNA polymerase transcribes a lncRNA that associates with regulatory proteins and thereby influences the coding genes in the nearby region.

Cis target prediction assumes that a lncRNA's function is tied to the protein-coding genes adjacent to its genomic coordinates; a lncRNA upstream or downstream of a coding gene may intersect promoters or other cis-acting elements of co-expressed genes, regulating expression at the transcriptional or post-transcriptional level. Calling a lncRNA cis-regulatory usually requires all of the following:

  • the expression of nearby genes is consistent with that of the lncRNA;
  • inactivating the lncRNA affects the expression of surrounding genes;
  • it affects the expression of genes at the same nearby locus.

For lncRNAs meeting these conditions, first find the protein-coding genes located within about 10 kb upstream or downstream, then run functional enrichment on those genes to predict the lncRNA's main functions and lay the groundwork for the subsequent cis analysis.

A lncRNA can also associate with DNA-binding proteins and regulate the expression of the corresponding target genes.

Trans target prediction assumes that a lncRNA's function has nothing to do with the position of coding genes and instead relates to the protein-coding genes it is co-expressed with. In other words, when a lncRNA's expression is positively or negatively correlated with distant genes, its targets can be predicted by correlating lncRNA and coding-gene expression across samples, or by WGCNA co-expression analysis.

Note that trans target prediction is only suitable when the sample size is large; with too few samples (say fewer than 6), the analysis is unreliable.

So my approach is: first run the correlation analysis and use the FDR-adjusted p value and the correlation coefficient to shortlist strongly correlated genes as candidates.

Then take lncRNAs within 10 kb upstream and 20 kb downstream of an mRNA as cis candidates; for pairs outside that window, compute the binding energy with RNAplex and keep pairs with binding energy below -30 as trans candidates.

Compute the correlations:

  1. library("fdrtool")
  2. base<-getwd()
  3. mydata<-read.table(paste(base,"top_m_filter.txt",sep="/"),header=TRUE,check.names=FALSE)
  4. mydata<-as.data.frame(t(mydata))
  5. chayi<-read.table(paste(base,"top_lnc_filter.txt",sep="/"),header=TRUE,check.names=FALSE)
  6. chayi<-as.data.frame(t(chayi))
  7. core<-function(x,y){
  8. cor.test(x,y)$estimate
  9. }
  10. pval<-function(x,y){
  11. cor.test(x,y)$p.value
  12. }
  13. fdr<-function(x){
  14. fdr<-fdrtool(x,statistic="pvalue")
  15. fdr$qval
  16. }
  17. mycor<-data.frame(test=rep(0,length(colnames(mydata))))
  18. for (i in 1:length(colnames(chayi))){
  19. haha<-0
  20. for (j in 1:length(colnames(mydata))){
  21. xiangguan<-core(chayi[,i],mydata[,j])
  22. haha<-c(haha,xiangguan)
  23. }
  24. mycor<-cbind(mycor,as.data.frame(haha)[-1,])
  25. }
  26. mycor<-mycor[,-1]
  27. colnames(mycor)<-colnames(chayi)
  28. rownames(mycor)<-colnames(mydata)
  29. write.table(mycor,paste(base,"correlation2.txt",sep="/"),quote=FALSE)
  30. myp<-data.frame(test=rep(0,length(colnames(mydata))))
  31. for (i in 1:length(colnames(chayi))){
  32. haha<-0
  33. for (j in 1:length(colnames(mydata))){
  34. xiangguan<-pval(chayi[,i],mydata[,j])
  35. haha<-c(haha,xiangguan)
  36. }
  37. myp<-cbind(myp,as.data.frame(haha)[-1,])
  38. }
  39. myp<-myp[,-1]
  40. colnames(myp)<-colnames(chayi)
  41. rownames(myp)<-colnames(mydata)
  42. mfdr<-apply(myp,2,fdr)
  43. write.table(mfdr,paste(base,"p_values(FDR)2.txt",sep="/"),quote=FALSE)
  44. ##guolv
  45. haha<-data.frame("correlation"=0,"mRNA"=0,"lncRNA"=0)
  46. xiangguan<-read.table(paste(base,"correlation2.txt",sep="/"),header = TRUE)
  47. for (i in colnames(xiangguan)){
  48. if (length(row.names(subset(xiangguan,abs(xiangguan[,i])>0.6,select = i)))==0){
  49. next
  50. }else{
  51. xiang<-subset(xiangguan,abs(xiangguan[,i])>0.6,select = i)
  52. colnames(xiang)<-"correlation"
  53. xiang$mRNA<-rownames(xiang)
  54. xiang$lncRNA<-i
  55. rownames(xiang)<-NULL
  56. haha<-rbind(haha,xiang)
  57. }
  58. }
  59. heihei<-data.frame('p_values(FDR)'=0,"mRNA"=0,"lncRNA"=0)
  60. colnames(heihei)[1]<-"p_values(FDR)"
  61. xiangguan<-read.table(paste(base,"p_values(FDR)2.txt",sep="/"),header = TRUE)
  62. for (i in colnames(xiangguan)){
  63. if (length(row.names(subset(xiangguan,xiangguan[,i]<0.05,select = i)))==0){
  64. next
  65. }else{
  66. xiang<-subset(xiangguan,xiangguan[,i]<0.05,select = i)
  67. colnames(xiang)<-"p_values(FDR)"
  68. xiang$mRNA<-rownames(xiang)
  69. xiang$lncRNA<-i
  70. rownames(xiang)<-NULL
  71. heihei<-rbind(heihei,xiang)
  72. }
  73. }
  74. haha<-haha[-1,]
  75. heihei<-heihei[-1,]
  76. fin<-merge(haha,heihei,by.x = c("lncRNA","mRNA"),by.y = c("lncRNA","mRNA"))
  77. write.table(fin,paste(base,"xiangguanxing2.txt",sep="/"),quote = FALSE,row.names = FALSE)

Note: this method iterates with nested for loops, which is inefficient and unsuited to large data sets. Also, the two tables must share exactly the same sample order, otherwise the computed correlations and p values will differ markedly. Finally, the FDR correction here is applied only within each lncRNA, across its p values against the mRNAs.

    library(Hmisc)
    library(reshape2)
    library(fdrtool)
    lncRNA <- read.table("lncRNA_difference.txt", header = TRUE, check.names = FALSE, stringsAsFactors = FALSE, sep = "\t")
    mRNA <- read.table("mRNA_difference.txt", header = TRUE, check.names = FALSE, stringsAsFactors = FALSE, sep = "\t")
    n_lnc <- nrow(lncRNA)
    n_mRNA <- nrow(mRNA)
    alldata <- rbind(lncRNA, mRNA)
    mydata <- as.matrix(t(alldata))
    p <- rcorr(mydata)
    cor <- p[["r"]][1:n_lnc, (n_lnc + 1):nrow(alldata)]
    pvalue <- p[["P"]][1:n_lnc, (n_lnc + 1):nrow(alldata)]
    cor <- cbind(rownames(cor), cor)
    colnames(cor)[1] <- "LncRNA"
    pvalue <- cbind(rownames(pvalue), pvalue)
    colnames(pvalue)[1] <- "LncRNA"
    new_cor <- melt(as.data.frame(cor), id.vars = "LncRNA", variable.name = "mRNA", value.name = "correlation")
    new_p <- melt(as.data.frame(pvalue), id.vars = "LncRNA", variable.name = "mRNA", value.name = "p.value")
    jiaozheng <- function(x){fdrtool(x, statistic = "pvalue", plot = FALSE, verbose = FALSE)$qval}
    all_cor <- merge(new_cor, new_p, by.x = c("LncRNA", "mRNA"), by.y = c("LncRNA", "mRNA"))
    all_cor <- na.omit(all_cor)
    all_cor$p.value <- as.numeric(all_cor$p.value)
    all_cor$correlation <- as.numeric(all_cor$correlation)
    all_cor$p.asjust <- jiaozheng(all_cor$p.value)
    all_fli <- subset(all_cor, abs(all_cor$correlation) > 0.6 & all_cor$p.asjust < 0.05)
    write.table(all_fli, "xiangguanxing.txt", quote = FALSE, row.names = FALSE)

This script removes the iteration problem and avoids the sample-order pitfall.
Computing cis pairs

First, build the BED files:

    anno_m <- read.table("LncRNA/readscounts/gene_hit/hg19_all_transcript.bed", sep = "\t", header = FALSE)
    anno_lnc <- read.table("LncRNA/readscounts/gene_hit/NONCODE.bed", sep = "\t", header = FALSE)
    chayi <- read.table("xiangguanxing2.txt", header = TRUE)
    lncRNA <- subset(anno_lnc, V4 %in% chayi$lncRNA)
    mRNA <- subset(anno_m, V4 %in% chayi$mRNA)
    write.table(lncRNA, "lncRNA.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
    write.table(mRNA, "mRNA.bed", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

The genome-wide BED files, noncode.bed and hg19.bed, can be downloaded online, or you can build them yourself by parsing the GTF file with the Python gtfparse package; a sketch follows.
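A minimal sketch, assuming the gtfparse package (whose read_gtf parses a GTF into a dataframe; this sketch assumes an older, pandas-returning release, since newer ones return polars) and a transcript-level GTF; the file names are placeholders:

    from gtfparse import read_gtf

    # keep transcript records and write a BED6 file
    gtf = read_gtf("all_lncRNA_know.gtf")
    tx = gtf[gtf["feature"] == "transcript"]
    bed = tx[["seqname", "start", "end", "transcript_id", "strand"]].copy()
    bed["start"] = bed["start"] - 1   # GTF is 1-based, BED is 0-based
    bed.insert(4, "score", ".")       # BED score placeholder
    bed.to_csv("noncode.bed", sep="\t", header=False, index=False)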

Then compute the positional relationships with bedtools:

    windowBed -a mRNA.bed -b lncRNA.bed -l 10240 -r 20480 -sm > cis.txt

To compute the trans binding energies, first obtain the transcript sequences, then run RNAplex:

    import pandas as pd
    import subprocess, os, re
    from Bio import SeqIO

    hg19 = "LncRNA/testdir/all_mRNA_know.fa"
    noncode = "LncRNA/testdir/all_lncRNA_know.fa"
    base = "LncRNA/readscounts/salmon/gene_hit"
    hg19_fasta = {seq_record.id: str(seq_record.seq) for seq_record in SeqIO.parse(hg19, 'fasta')}
    lnc_fasta = {seq_record.id: str(seq_record.seq) for seq_record in SeqIO.parse(noncode, 'fasta')}

    def get_energy(lncRNA, mRNA):
        fa = "lnc_tmp.fasta"
        fa2 = "mRNA_tmp.fasta"
        with open(fa2, 'w') as f:
            f.write(">{}\n{}".format(mRNA, hg19_fasta[mRNA]))
        with open(fa, 'w') as l:
            l.write(">{}\n{}".format(lncRNA, lnc_fasta[lncRNA]))
        cmd = ['app/RNAplex-0.2/RNAplex/bin/RNAplex', '-t', fa2, '-q', fa]
        cm3 = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = cm3.communicate()
        ans = 0
        for line in out.decode("utf-8").splitlines():
            if not line.startswith(">"):
                # the binding energy is the parenthesized value at the end of the line
                ans = re.findall(r"\(-?\w+.?\w+\)", line)[0].replace("(", "").replace(")", "")
        os.remove(fa2)
        os.remove(fa)
        return ans

    with open("xiangguanxing2.txt") as f:
        for line2 in f:
            if line2.startswith("NONH"):
                lncRNA, mRNA, *ot = line2.split(" ")
                print("{}--{}:".format(lncRNA, mRNA), get_energy(lncRNA, mRNA))

Note: to get the transcript sequences, do not use bedtools getfasta; it extracts the genomic sequence of the whole transcript span, introns included. Use the gffread command shipped with Cufflinks instead:

    gffread -w out.fa -g hg19.fa in.gtf

Prediction based on the nucleic acid sequences a lncRNA binds

This is actually the most obvious approach: a lncRNA is a nucleic acid, and nucleic acids base-pair with one another, whether lncRNA with RNA (e.g. microRNA or mRNA) or lncRNA with DNA, so this complementarity can drive prediction. Three examples:

A. lncRNA-microRNA-mRNA

The way microRNA targets are predicted on mRNAs can be used in reverse to predict lncRNA targets. The applicable model is the lncRNA acting as a microRNA sponge, or going further and regulating targets through a ceRNA mechanism; in both cases the principle is sequence binding of the microRNA to the lncRNA and the mRNA. This is currently the most used prediction route; see for example starBase (http://starbase.sysu.edu.cn/).

B. lncRNA-mRNA

In the mode above, the lncRNA regulates its mRNA target through a microRNA, but direct lncRNA-mRNA complementarity can also identify targets. As discussed before, a common case is an antisense lncRNA binding an mRNA to form an RNA duplex (https://www.jianshu.com/p/92451bd7c030), protecting the mRNA from RNase degradation. Prediction in this mode therefore scores the free energy of lncRNA-mRNA binding: the lower the free energy, the easier the binding. An online tool for this mode is lncRNATargets (http://www.herbbol.org:8001/lrt/).

C. lncRNA-DNA

In earlier articles we also covered lncRNAs that exert regulation by binding single-stranded DNA.

By the same principle, such interactions can in theory also be predicted; lncRNATargets supports this as well.

Prediction based on protein-binding properties:

lncRNA-nucleic acid binding follows base pairing, and some protein-nucleic acid binding also follows rules, e.g. RNA-binding proteins. We can therefore use those rules to predict the proteins a lncRNA binds, with tools such as RBPDB (http://rbpdb.ccbr.utoronto.ca/index.php) and RNAcommender (http://rnacommender.disi.unitn.it/).