## fasta

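fasta: a text-based format for representing nucleotide or peptide sequences. The usual file extension is fa (most often used for reference sequences).
Features: each record has two parts, an ID line followed by one or more sequence lines.

  • ID line: starts with ">" and sometimes carries annotation, e.g. chr1, chr2 …
  • Sequence lines: one letter per base or amino acid, i.e. A/T/C/G or the 20 amino acid codes

```bash
$ less -S Data/example.fa | head -n 5
>gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
```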
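
The two-part record structure (an ID line starting with ">" followed by sequence lines) can be read with a few lines of Python. This is only a minimal sketch, not a replacement for an established parser: the function name `parse_fasta` is hypothetical, and using the first whitespace-delimited token of the ID line as the record key is an assumption made for illustration.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID to its sequence, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # ID line: keep the first token as the key (assumption of this sketch)
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # sequence line: append to the record currently being read
            records[current_id] += line
    return records


# First three lines of the example file shown above
example = [
    ">gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome",
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC",
    "TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA",
]

records = parse_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

Holding every sequence in a dictionary is convenient for small files; for a whole genome, a generator that yields one record at a time would probably use less memory.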

<a name="eZsJ2"></a>
## fastq
fastq: a text format that stores biological sequences (usually nucleotide sequences) together with their sequencing quality scores. The usual file extension is fq.
In a FASTQ file, each sequence normally occupies four lines:

  • Line 1: starts with @, followed by the sequence identifier and description
  • Line 2: the sequence itself, e.g. ATCG
  • Line 3: starts with +, optionally followed by the sequence identifier and description again (reserved line)
  • Line 4: the base quality values, one character per base of line 2, so its length must equal that of line 2

```bash
$ head -n 8 Data/example.fq
@ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
AAAAAATTGGTGTTATAAGACTTCTGGACCCTGAAGATGTCGATGTCTCCTCACCTGATGAAAAATCAGT
+
HIIIIIIHIIHIHIIIGEIIIIIIIIIIIIIIHEHIGIIHHHIIIHIGIIIIIIGGIEHIDEIHBEBEFB
@ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
CATGTTGTCACTTTTTCCATGAGCCACGTAGTACAGAGAACGCGGCACTCCATAAGGACCATTTGTCCTG
+
GGEECDGGE@GGGGGGGGBGEDBGGHHGHGEBGDDDB@DGHDHFBGBDBDD@D2DCECEB@>?C@BECEC
```
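
The four-line layout can be checked and decoded in Python. The sketch below is illustrative only: `parse_fastq` and `PHRED_OFFSET` are hypothetical names, the record is a toy one rather than data from the example file, and the quality characters are assumed to use the Phred+33 (Sanger) encoding, which the text above does not state.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding of quality characters


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # The next three lines belong to the same record; a missing line
        # is read as "" so that the checks below report it.
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Convert each quality character to its integer score
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:].split()[0], seq, scores


# Toy record for illustration (not taken from Data/example.fq)
toy = ["@read1 toy example", "ACGTN", "+", "II5#!"]
records = list(parse_fastq(toy))
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

The length check mirrors the rule that line 4 must be exactly as long as line 2; a file that fails it is usually truncated or corrupted.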

## gtf (Gene transfer format)

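gtf: one of the most commonly used genome annotation formats. Each record has 9 tab-separated columns: seqname, source, feature, start, end, score, strand, frame, and attribute.
Example GTF file: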

$ head Data/example.gtf |column -t| less -S

chr1  ENSEMBL  UTR         1737  2090  .  +  .  gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  gene_type  "protein_coding";  gene_status  "KNOWN";  gene_name  "RP11-34P13.
chr1  ENSEMBL  exon        1737  2090  .  +  .  gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  gene_type  "protein_coding";  gene_status  "KNOWN";  gene_name  "RP11-34P13.
chr1  ENSEMBL  transcript  1737  4275  .  +  .  gene_id  "ENSG00000223972";  transcript_id  "ENST00000456328";  gene_type  "protein_coding";  gene_status  "KNOWN";  gene_name  "RP11-34P13.
chr1  HAVANA   gene        1737  4275  .  +  .  gene_id  "ENSG00000223972";  transcript_id  "ENSG00000223972";  gene_type  "protein_coding";  gene_status  "KNOWN";  gene_name  "RP11-34P13.
chr1  HAVANA   exon        1873  1920  .  +  .  gene_id  "ENSG00000223972";  transcript_id  "ENST00000450305";  gene_type  "protein_coding";  gene_status  "KNOWN";  gene_name  "RP11-34P13.
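
A single GTF record can be split into its nine columns as follows. This is a minimal sketch under stated assumptions: `GtfRecord` and `parse_gtf_line` are hypothetical names, the input is assumed to be tab-separated (the `column -t` view above replaces tabs with spaces), and the attribute column is split naively on ";", which would break on a quoted value that itself contains a semicolon. The sample line reuses only the attributes that are fully visible in the output above.

```python
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one tab-separated GTF line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    attributes: Dict[str, str] = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, _, value = item.partition(" ")
        # A repeated key overwrites the earlier value in this sketch
        attributes[key] = value.strip().strip('"')
    return GtfRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7],
        attributes,
    )


line = "\t".join([
    "chr1", "ENSEMBL", "exon", "1737", "2090", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_type "protein_coding"; gene_status "KNOWN";',
])
rec = parse_gtf_line(line)
print(rec.feature, rec.start, rec.end, rec.attributes["gene_id"])
```

Because both coordinates are inclusive, the length of a feature is `end - start + 1`.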