image.png

准备

image.png

参考基因组 4大常用数据库

Ensembl: www.ensembl.org (主要)
NCBI:https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml
UCSC:http://www.genome.ucsc.edu/
gencode: https://www.gencodegenes.org/

Ensembl

www.ensembl.org
image.png

image.png物种数据表 http://asia.ensembl.org/info/about/species.html

用ctrl + F 浏览 注意 物种名, Taxon ID, 和 组装版本 和 NCBI ID 对应关系
image.png

http://asia.ensembl.org/Homo_sapiens/Info/Annotation
人的基因注释统计信息
image.png

ucsc

https://hgdownload.soe.ucsc.edu/downloads.html
image.pngimage.png

gencode

https://www.gencodegenes.org/human/stats.html
可查询统计信息 和 ensembl有差别
image.png

参考数据下载

image.pngimage.png


image.png
image.png
http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/
http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/cdna/
http://ftp.ensembl.org/pub/release-106/gtf/
http://ftp.ensembl.org/pub/release-106/gff3/

shell下载数据

  1. ###最好最新release 版本 ###
  2. ## 4个文件
  3. # 下载基因组序列axel curl 文件较大 458M
  4. nohup wget -c http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz >dna.log &
  5. tail dna.log # 查看
  6. # 下载转录组cDNA参考序列 72M
  7. nohup wget -c http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz >rna.log &
  8. tail rna.log # 查看
  9. # 下载基因组注释文件
  10. nohup wget -c http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gtf.gz >gtf.log &
  11. nohup wget -c http://ftp.ensembl.org/pub/release-105/gff3/homo_sapiens/Homo_sapiens.GRCh38.105.chr.gff3.gz >gff.log&

解压

  1. # 再次强调,一定要在文件下载完后再进行解压!!!
  2. nohup gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Homo_sapiens.GRCh38.cdna.all.fa.gz >unzip.log &

fasta 介绍

主要看DNA和cDNA 的fasta
image.png

基因注释文件 GTF, GFF3

GFF3 基因组注释

image.png
image.png
第3列 重要,基因的属性,

  1. zless -S XXX |grep -v '#'| cut -f 3 | sort | uniq -c

GTF文件 注释基因

image.png

基因ID解析

人默认没有物种前缀
image.png

思考题

fasta官方参考 https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
参考基因N的问题
N是正常的,’N’表示gap或者未注释区域,常见于端粒和着丝点区域,GRCH37共有234350281个‘N’,而GRCH38中有150630719个,减少了83719562个,占比35.7%。每次版本更新,N都会越少,所以尽量选择较新版的参考数据,结果会更加准确。可能以后长度长测序技术会慢慢克服N的问题。
参考基因的大小写区别
官方允许小写字母,并会被转作大写字母,有些程序对大小写有明确要求。
参考基因数据库的readme
包括对每个文件的描述,表头注释,举例说明,例外情况

  1. #### README ####
  2. IMPORTANT: Please note you can download subsets of data via the
  3. BioMart data mining tool.
  4. See https://www.ensembl.org/info/data/biomart/ for more information.
  5. The genome assembly represented here corresponds to GenBank Assembly ID
  6. GCA_000001405.28
  7. #######################
  8. Fasta DNA dumps
  9. #######################
  10. -----------
  11. FILE NAMES
  12. ------------
  13. The files are consistently named following this pattern:
  14. <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz
  15. <species>: The systematic name of the species.
  16. <assembly>: The assembly build name.
  17. <sequence type>:
  18. * 'dna' - unmasked genomic DNA sequences.
  19. * 'dna_rm' - masked genomic DNA. Interspersed repeats and low
  20. complexity regions are detected with the RepeatMasker tool and masked
  21. by replacing repeats with 'N's.
  22. * 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
  23. have been replaced with lowercased versions of their nucleic base
  24. <id type> One of the following:
  25. * 'chromosome' - The top-level coordinate system in most species in Ensembl
  26. * 'nonchromosomal' - Contains DNA that has not been assigned a chromosome
  27. * 'seqlevel' - This is usually sequence scaffolds, chunks or clones.
  28. -- 'scaffold' - Larger sequence contigs from the assembly of shorter
  29. sequencing reads (often from whole genome shotgun, WGS) which could
  30. not yet be assembled into chromosomes. Often more genome sequencing
  31. is needed to narrow gaps and establish a tiling path.
  32. -- 'chunk' - While contig sequences can be assembled into large entities,
  33. they sometimes have to be artificially broken down into smaller entities
  34. called 'chunks'. This is due to limitations in the annotation
  35. pipeline and the finite record size imposed by MySQL which stores the
  36. sequence and annotation information.
  37. -- 'clone' - In general this is the smallest sequence entity. It is often
  38. identical to the sequence of one BAC clone, or sequence region
  39. of one BAC clone which forms the tiling path.
  40. <id>: The actual sequence identifier. Depending on the <id type> the <id>
  41. could represent the name of a chromosome, a scaffold, a contig, a clone ..
  42. Field is empty for seqlevel files
  43. fa: All files in these directories represent FASTA database files
  44. gz: All files are compacted with GNU Zip for storage efficiency.
  45. EXAMPLES
  46. The genomic sequence of human chromosome 1:
  47. Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
  48. The masked version of the genome sequence on human chromosome 1
  49. (contains '_rm' or '_sm' in the name):
  50. Homo_sapiens.GRCh37.dna_rm.chromosome.1.fa.gz
  51. Homo_sapiens.GRCh37.dna_sm.chromosome.1.fa.gz
  52. Non-chromosomal assembly sequences:
  53. e.g. mitochondrial genome, sequence contigs not yet mapped on chromosomes
  54. Homo_sapiens.GRCh37.dna.nonchromosomal.fa.gz
  55. Homo_sapiens.GRCh37.dna_rm.nonchromosomal.fa.gz
  56. Homo_sapiens.GRCh37.dna_sm.nonchromosomal.fa.gz
  57. ---------
  58. TOPLEVEL
  59. ---------
  60. These files contains all sequence regions flagged as toplevel in an Ensembl
  61. schema. This includes chromsomes, regions not assembled into chromosomes and
  62. N padded haplotype/patch regions.
  63. EXAMPLES
  64. Toplevel sequences unmasked:
  65. Homo_sapiens.GRCh37.dna.toplevel.fa.gz
  66. Toplevel soft/hard masked sequences:
  67. Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
  68. Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz
  69. -----------------
  70. PRIMARY ASSEMBLY
  71. -----------------
  72. Primary assembly contains all toplevel sequence regions excluding haplotypes
  73. and patches. This file is best used for performing sequence similarity searches
  74. where patch and haplotype sequences would confuse analysis. If the primary
  75. assembly file is not present, that indicates that there are no haplotype/patch
  76. regions, and the 'toplevel' file is equivalent.
  77. EXAMPLES
  78. Primary assembly sequences unmasked:
  79. Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
  80. Primary assembly soft/hard masked sequences:
  81. Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
  82. Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
  83. --------------
  84. SPECIAL CASES
  85. --------------
  86. Some chromosomes have alternate haplotypes which are presented in files with
  87. the haplotype sequence only:
  88. Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR6_MHC_QBL.fa.gz
  89. Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR17_1.fa.gz
  90. All alternative assembly and patch regions have their sequence padded
  91. with N's to ensure alignment programs can report the correct index
  92. regions
  93. e.g. A patch region with a start position of 1,000,001 will have 1e6 N's added
  94. its start so an alignment program will report coordinates with respect to the
  95. whole chromosome.
  96. Human has sequenced Y chromosomes and the pseudoautosomal region (PAR)
  97. on the Y is annotated. By definition the PAR region is identical on the
  98. X and Y chromosome. The Y chromosome file contains the Y chromsome
  99. minus these repeated PAR regions i.e. the unique portion of Y.

练习

1fastq->fasta

zless -S SRR1039511_1_val_1.fq.gz |awk '{ if(NR%4==1){print">" substr($0,2)} if(NR%4==2){print}  }' | less -S

因为不同软件要求的输入不同格式

2.从gff或者gft文件中获取基因的ID与symbol对应关系,以及biotype类型

免包map

zless -S gtf.gz  |awk -F'\t' '{if($3=="gene"){print$9}}' |\
awk -F';' '{print$1,$3,$5}' |awk '{if($6!=""){print$2"\t"$4"\t"$6}}' |\
sed 's/"//g' > ID_symbol_biotype.txt

数据比对

概述

image.png

  • 跨越内含子
  • 不止一个片段比对到一个基因上
  • 碱基序列匹配
  • 需要索引

    Hisat2 比对软件

    image.png

    索引构建

    image.png
    ## ----构建索引
    # 进入参考基因组目录
    cd $HOME/database/GRCh38.105
    # Hisat2构建索引,构建索引时间比较长,建议提交后台运行,一般会运行20多分钟左右
    # vim Hisat2Index.sh
    mkdir Hisat2Index
    fa=Homo_sapiens.GRCh38.dna.primary_assembly.fa
    fa_baseName=GRCh38.dna
    hisat2-build -p 12 -f ${fa} Hisat2Index/${fa_baseName}
    # -p 可以设大一点,cpu多核的话
    # 运行
    nohup sh Hisat2Index.sh >Hisat2Index.sh.log &
    
    image.png
    sa文件是临时文件 不能删除

    单样本比对,sam->bam, bam建索引

    ```shell

    ——比对

    进入比对文件夹

    cd $HOME/project/Human-16-Asthma-Trans/Mapping/Hisat2

单个样本比对,步骤分解

index=/home/t_rna/database/GRCh38.104/Hisat2Index/GRCh38.dna # 索引 inputdir=$HOME/project/Human-16-Asthma-Trans/data/cleandata/trim_galore/ # 清洗后数据 outdir=$HOME/project/Human-16-Asthma-Trans/Mapping/Hisat2 # 比对结果

hisat2 -p 10 -x ${index} \ -1 ${inputdir}/SRR1039510_1_val_1.fq.gz \ -2 ${inputdir}/SRR1039510_2_val_2.fq.gz \ -S ${outdir}/SRR1039510.Hisat_aln.sam

sam转bam

samtools sort -@ 15 -o SRR1039510.Hisat_aln.sorted.bam SRR1039510.Hisat_aln.sam

对bam建索引

samtools index SRR1039510.Hisat_aln.sorted.bam SRR1039510.Hisat_aln.sorted.bam.bai

<a name="Ic9vu"></a>
### 批量 比对+sam->bam + bam建索引
&& 串联任务<br />![image.png](https://cdn.nlark.com/yuque/0/2022/png/22928625/1652971245323-796689a2-8d59-497b-b7c2-17f9d3f78f49.png#clientId=u290f7c54-e8cf-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=517&id=ua9c513db&margin=%5Bobject%20Object%5D&name=image.png&originHeight=646&originWidth=1256&originalType=binary&ratio=1&rotation=0&showTitle=false&size=106382&status=done&style=none&taskId=u9d5f90a7-40ac-4ab6-b8a5-81a02c3d64f&title=&width=1004.8)<br />![image.png](https://cdn.nlark.com/yuque/0/2022/png/22928625/1652972527791-bb86fb50-c401-428e-89e1-969541de37d1.png#clientId=u290f7c54-e8cf-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=488&id=uddb21d7b&margin=%5Bobject%20Object%5D&name=image.png&originHeight=610&originWidth=1155&originalType=binary&ratio=1&rotation=0&showTitle=false&size=548517&status=done&style=none&taskId=u7ee48054-10ef-4f58-bebf-e5d2cedb5a1&title=&width=924)
```python
# 多个样本批量进行比对,排序,建索引
# Hisat.sh内容: 注意命令中的-,表示占位符,表示|管道符前面的输出。
## 此处索引直接使用服务器上已经构建好的进行练习
# vim Hisat.sh
index=/home/t_rna/database/GRCh38.104/Hisat2Index/GRCh38.dna # 索引
inputdir=$HOME/project/Human-16-Asthma-Trans/data/cleandata/trim_galore/ # 洗后数据文件夹
outdir=$HOME/project/Human-16-Asthma-Trans/Mapping/Hisat2 # 结果目录

cat ../../data/cleandata/trim_galore/ID | while read id 
do
hisat2 -p 10 -x ${index} -1 ${inputdir}/${id}_1_val_1.fq.gz -2 ${inputdir}/${id}_2_val_2.fq.gz 2>${id}.log  | samtools sort -@ 3 -o ${outdir}/${id}.Hisat_aln.sorted.bam - &&  samtools index ${outdir}/${id}.Hisat_aln.sorted.bam ${outdir}/${id}.Hisat_aln.sorted.bam.bai
done

# 提交后台运行
nohup sh Hisat.sh >Hisat.log &

# 统计比对情况
multiqc -o ./ SRR*log

结果质量关注点

image.png
一般80%以上,

练习:

image.png
image.png

比对结果 sam/bam

image.pngimage.png
image.png
image.png

Flag 是重要字段

image.png
flag: http://broadinstitute.github.io/picard/explain-flags.html
image.png

samtools flags 99

CIGAR

image.png

samtools查看 sam /bamimage.png

##----view查看bam文件
samtools view SRR1039510.Hisat_aln.sorted.bam |less
samtools view -H SRR1039510.Hisat_aln.sorted.bam #只看头部
samtools view -h SRR1039510.Hisat_aln.sorted.bam # 都看

##----index对bam文件建索引
samtools index SRR1039510.Hisat_aln.sorted.bam SRR1039510.Hisat_aln.sorted.bam.bai

##----flagstat统计比对结果
samtools flagstat -@ 3 SRR1039510.Hisat_aln.sorted.bam #-@ 线程

##----sort排序 sam转bam并排序
samtools sort -@ 3 -o SRR1039510.Hisat_aln.sorted.bam SRR1039510.Hisat_aln.sam

##----depth统计测序深度
# 得到的结果中,一共有3列以指标分隔符分隔的数据,第一列为染色体名称,第二列为位点,第三列为覆盖深度
samtools depth SRR1039510.Hisat_aln.sorted.bam >SRR1039510.Hisat_aln.sorted.bam.depth.txt

1    14729    1
1    14730    1
1    14731    1
1    14732    1

##----计算某一个基因的测序深度
# 找到基因的坐标
zless -S /home/t_rna/database/GRCh38.104/Homo_sapiens.GRCh38.104.chr.gff3.gz |awk '{if($3=="gene")print}' |grep 'ID=gene:ENSG00000186092' |awk '{print $1"\t"$4"\t"$5}' >ENSG00000186092.bed

$ cat ENSG00000186092.bed
1    65419    71585

samtools depth  -b  ENSG00000186092.bed SRR1039510.Hisat_aln.sorted.bam >ENSG00000186092.bed.depth

# 如何找到多比对的reads,flag值的理解
# (0x100) 代表着多比对情况,所以直接用samtools view -f 0x100可以提取 multiple比对的 情况

samtools view -f 0x100 SRR1039510.Hisat_aln.sam |head -n 2  

###例:比对到 7号染色体 的 104669496 下一片段在104669431 序列边长到125 因为考虑两个片段信息
# flag =  339  read paired (0x1) +    read mapped in proper pair (0x2)+    read reverse strand (0x10)+    first in pair (0x40)+    not primary alignment (0x100)

SRR1039510.8    339    7    104669496    1    60M    =    104669431    -125    AAGTTCAGTTCTGCAAGCAAGTATGCTGCTCTCTCTGTTGATGGTGAAGATGAAAATGAG    JGIGGIJJJJJJIJJJJJIIIIJJGIGIIIIIIHJIJJJJIIGIJIJIIIIIJJJJIIIH    AS:i:0    ZS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:60    YS:i:0    YT:Z:CP    NH:i:3
SRR1039510.8    419    7    104669431    1    63M    =    104669496    125    GCAAAAAGGATCAAGACTCCAGATCTGCACCTGAGCCAAAGAAACCTGAGGAAAATCCAGCTT
GJJJJJJJIJJIJJJJIJJJJJGHIIJJIJJJJJJJIGHIJIJJJJJJJIJJJJHHHFHHFFE    AS:i:0    ZS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:63    YS:i:0    YT:Z:CP    NH:i:3

subjunc

image.png
image.png

## ----构建索引
# 进入参考基因组目录
cd $HOME/database/GRCh38.105
# subjunc构建索引,构建索引时间比较长大约40分钟左右,建议携程sh脚本提交后台运行
## 后续索引可直接使用服务器上已经构建好的进行练习
# vim SubjuncIndex.sh
mkdir SubjuncIndex
fa=Homo_sapiens.GRCh38.dna.primary_assembly.fa
fa_baseName=GRCh38.dna
subread-buildindex -o SubjuncIndex/${fa_baseName} ${fa}

# 运行
nohup sh SubjuncIndex.sh >SubjuncIndex.sh.log &


## ----比对
# 进入文件夹
cd $HOME/project/Human-16-Asthma-Trans/Mapping/Subjunc

# vim subjunc.sh
index=/home/t_rna/database/GRCh38.104/SubjuncIndex/GRCh38.dna
inputdir=$HOME/project/Human-16-Asthma-Trans/data/cleandata/trim_galore
outdir=$HOME/project/Human-16-Asthma-Trans/Mapping/Subjunc

cat ../../data/cleandata/trim_galore/ID | while read id
do
  subjunc -T 10 -i ${index} -r ${inputdir}/${id}_1_val_1.fq.gz -R ${inputdir}/${id}_2_val_2.fq.gz -o ${outdir}/${id}.Subjunc.bam 1>${outdir}/${id}.Subjunc.log 2>&1 && samtools sort -@ 6 -o ${outdir}/${id}.Subjunc.sorted.bam  ${outdir}/${id}.Subjunc.bam && samtools index ${outdir}/${id}.Subjunc.sorted.bam ${outdir}/${id}.Subjunc.sorted.bam.bai
done
## 三任务串联
# 运行
nohup sh subjunc.sh >subjunc.log &

log: 主要看mapped 率
image.png