## fasta
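FASTA: a text-based format for representing nucleotide or peptide sequences; the usual file extension is `.fa` (most often seen for reference sequences).
Features: each record has two parts, an ID line and one or more sequence lines.
- ID line: begins with `>`, followed by the sequence identifier (e.g. chr1, chr2 …) and sometimes additional annotation.
- Sequence lines: one letter per base or amino acid, i.e. A, T, C, G for nucleotides or the 20 amino-acid letters for peptides.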
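
The two-part structure described above can be read with a few lines of code. The following is a minimal sketch rather than a replacement for an established parser: `read_fasta` is a hypothetical helper name, the records are toy data, and the code assumes that the ID is the first whitespace-delimited token after `>` and that a sequence may wrap across several lines.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into {id: sequence}.

    Assumption: the ID is the first token after '>'; the rest of the
    header line is treated as annotation and dropped.
    """
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Wrapped sequence lines belong to the most recent ID line.
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


# Toy input (not taken from Data/example.fa).
example = [
    ">chr1 toy record",
    "AGCTTTTCAT",
    "TCTGACTGCA",
    ">chr2",
    "ATCG",
]
records = read_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

To read a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.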
```bash
$ less -S Data/example.fa | head -n 5
>gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
```
## fastq
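FASTQ: a text format that stores biological sequences (usually nucleotide sequences) together with their sequencing quality scores; the usual file extension is `.fq`. In a FASTQ file, each record normally spans four lines:
- Line 1: begins with `@`, followed by the sequence identifier and an optional description.
- Line 2: the sequence itself, e.g. `ATCG`.
- Line 3: begins with `+`, optionally followed by the identifier and description again (a reserved separator line).
- Line 4: the base quality scores, one character per base in line 2, so its length must equal that of line 2.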
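
The four-line layout maps directly onto a small parser. This is a sketch under stated assumptions: `read_fastq` and `phred_scores` are hypothetical helper names, each record is assumed to occupy exactly four lines (no wrapped sequences), and the quality characters are assumed to use the Phred+33 (Sanger) encoding, which the description above does not specify.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from strictly 4-line records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters; offset=33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# Toy record (not taken from Data/example.fq).
example = ["@read1 toy record", "ACGT", "+", "II5!"]
parsed = list(read_fastq(example))
for name, seq, qual in parsed:
    print(name, seq, phred_scores(qual))
```

The length check mirrors the rule that line 4 must be exactly as long as line 2. Files written with a different quality offset would need another `offset` value.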
```bash
$ head -n 8 Data/example.fq
@ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
AAAAAATTGGTGTTATAAGACTTCTGGACCCTGAAGATGTCGATGTCTCCTCACCTGATGAAAAATCAGT
+
HIIIIIIHIIHIHIIIGEIIIIIIIIIIIIIIHEHIGIIHHHIIIHIGIIIIIIGGIEHIDEIHBEBEFB
@ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
CATGTTGTCACTTTTTCCATGAGCCACGTAGTACAGAGAACGCGGCACTCCATAAGGACCATTTGTCCTG
+
GGEECDGGE@GGGGGGGGBGEDBGGHHGHGEBGDDDB@DGHDHFBGBDBDD@D2DCECEB@>?C@BECEC
```
## gtf (Gene transfer format)
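GTF: one of the most commonly used genome annotation file formats. Each record has nine tab-separated columns:
- seqname: chromosome or scaffold name
- source: program or database that produced the feature
- feature: feature type, e.g. gene, transcript, exon
- start: start position (1-based)
- end: end position (inclusive)
- score: a score, or `.` if none
- strand: `+` or `-`
- frame: reading frame (0, 1 or 2) for coding features, or `.`
- attribute: semicolon-separated `key "value"` pairs such as `gene_id` and `transcript_id`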
Example GTF file:
```bash
$ head Data/example.gtf | column -t | less -S
chr1 ENSEMBL UTR 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.
chr1 ENSEMBL exon 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.
chr1 ENSEMBL transcript 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.
chr1 HAVANA gene 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.
chr1 HAVANA exon 1873 1920 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.
```
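
To work with such records programmatically, each line can be split into its nine fields, with the last field broken into key/value attributes. The following is a minimal sketch, assuming the standard GTF column order (seqname, source, feature, start, end, score, strand, frame, attribute), tab-separated input, and quoted `key "value";` attributes. `parse_gtf_line` is a hypothetical helper, and the sample line reuses values from the first exon line of the example with a shortened attribute list.

```python
import re
from typing import Dict, Optional

GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]

# Matches pairs such as: gene_id "ENSG00000223972";
# Assumption: values are double-quoted; unquoted values are ignored.
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one GTF line; return None for blank or comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record: Dict[str, object] = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(fields[3])
    record["end"] = int(fields[4])
    # A repeated key keeps only its last value in this sketch.
    record["attribute"] = dict(_ATTR_RE.findall(fields[8]))
    return record


sample = "\t".join([
    "chr1", "ENSEMBL", "exon", "1737", "2090", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";',
])
rec = parse_gtf_line(sample)
# GTF coordinates are 1-based and inclusive, hence the +1.
print(rec["feature"], rec["end"] - rec["start"] + 1)
print(rec["attribute"]["gene_id"])
```

Note that the `column -t` view above shows space-aligned columns for readability; the sketch expects the raw file, where fields are separated by tabs.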