Introduction to common bioinformatics formats
FASTA and FASTQ files

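The FASTA format originated with the FASTA software package. It is a plain-text format that stores nucleotide or protein sequences using single-letter codes and allows a comment line before each sequence. A record has two parts: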
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGV
HLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIV
IGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
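Part 1: the header line starts with ">", followed immediately by the sequence identifier. The identifier is case-sensitive and must not contain spaces, because the first space marks the end of the identifier; the rest of the line is the sequence description.
Part 2: the sequence itself, written with the standard nucleotide or amino acid codes in either upper or lower case, continuing until the next ">" is reached. Every sequence from NCBI carries a GI number of the form "gi|gi_identifier"; a GI number consists of digits and is unique. When a nucleotide or protein sequence changes, it is assigned a new GI number (the accession number may stay the same). After the GI number comes the sequence identifier, which is made up of a source-database tag and a sequence tag (such as the accession number or name), separated by "|". If a field is missing it may be left empty, but the "|" must not be omitted.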
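The two-part record structure described above can be checked with standard tools. The following is a minimal sketch; the file name demo.fa and both records are invented for illustration and are not taken from any real dataset.

```shell
# Build a tiny FASTA file; both records are made up for illustration.
tmp=$(mktemp -d)
cat > "$tmp/demo.fa" <<'EOF'
>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
GGGCCC
EOF

# Every record starts with '>', so counting header lines counts sequences.
n_seqs=$(grep -c '^>' "$tmp/demo.fa")

# The identifier is the text between '>' and the first space.
ids=$(awk '/^>/ {print substr($1, 2)}' "$tmp/demo.fa")

# A sequence may be wrapped over several lines; sum them up per record.
lengths=$(awk '/^>/ {if (id != "") print id, len; id = substr($1, 2); len = 0; next}
               {len += length($0)}
               END {if (id != "") print id, len}' "$tmp/demo.fa")

echo "$n_seqs"
echo "$ids"
echo "$lengths"
rm -rf "$tmp"
```

Because the sequence runs until the next ">", the length has to be accumulated across lines rather than read from a single line.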

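FASTQ is a text format for storing biological sequences (usually nucleotide sequences) together with their quality scores. For compactness, both the sequence and the quality values are encoded as single ASCII characters. The format was originally developed at the Sanger Institute to bundle a FASTA sequence with its quality data, and it has become the de facto standard for high-throughput sequencing output. Each sequence in a FASTQ file normally occupies four lines: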

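Line 1: the sequence identifier, starting with "@". The format is fairly free; comments and other descriptive information may be added, separated from the identifier by a space. In the original example, the description carried a second NCBI ID and the read length.
Line 2: the sequence. Tabs and spaces are not allowed. It normally consists of unambiguous DNA or RNA characters, usually in upper case; in some files lower case, mixed case, or gap symbols carry a special meaning.
Line 3: separates the sequence from the quality values. It starts with "+", optionally followed by a description. If anything follows the "+", it must be identical to the text after "@" on line 1.
Line 4: the quality values. Each character corresponds to one base on line 2 and is converted by a fixed rule into a base quality score, which in turn reflects that base's error probability, so line 4 must contain exactly as many characters as line 2.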
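A short sketch of how the four-line layout can be used in practice. The records are invented, and the conversion rule is an assumption: the text does not name the encoding, so the example assumes the common Phred+33 offset (quality = ASCII code minus 33, error probability = 10^(-Q/10)). The helper name phred is hypothetical.

```shell
# Toy FASTQ file with two invented records.
tmp=$(mktemp -d)
cat > "$tmp/demo.fq" <<'EOF'
@read1 toy record
ACGT
+
II5!
@read2 toy record
GGCC
+
ABCD
EOF

# Four lines per record, so the read count is the line count divided by 4.
n_reads=$(awk 'END {print NR / 4}' "$tmp/demo.fq")

# Count records whose quality string length differs from the sequence length.
bad=$(awk 'NR % 4 == 2 {n = length($0)}
           NR % 4 == 0 && length($0) != n {c++}
           END {print c + 0}' "$tmp/demo.fq")

# Decode one quality character, ASSUMING the Phred+33 offset.
phred() {
  code=$(printf '%d' "'$1")   # ASCII code of the character
  echo $((code - 33))
}
q_I=$(phred I)
q_bang=$(phred '!')

# Error probability for that score: P = 10^(-Q/10).
p_err=$(awk -v q="$q_I" 'BEGIN {printf "%.4f\n", 10 ^ (-q / 10)}')

echo "$n_reads reads, $bad malformed"
echo "I -> Q$q_I (P=$p_err), ! -> Q$q_bang"
rm -rf "$tmp"
```

Older Illumina pipelines used a +64 offset instead, so the offset should be confirmed for the data at hand before decoding.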
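GFF/GTF
GFF and GTF are the two most widely used annotation formats. When building a reference database for an analysis, you usually need one of these files in addition to the FASTA file, and the required information is extracted from it for annotation.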
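As a rough illustration of extracting information from an annotation file, the sketch below tallies feature types. It relies on the format convention (not stated in the text above) that GTF lines have nine tab-separated columns with the feature type in column 3; the three lines in demo.gtf are invented.

```shell
tmp=$(mktemp -d)
# Toy GTF lines (tab-separated, nine columns); values are illustrative only.
printf 'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1";\n'  > "$tmp/demo.gtf"
printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n' >> "$tmp/demo.gtf"
printf 'chr1\tHAVANA\texon\t300\t500\t.\t+\t.\tgene_id "G1";\n' >> "$tmp/demo.gtf"

# Column 3 holds the feature type; count how often each type occurs.
feature_counts=$(cut -f 3 "$tmp/demo.gtf" | sort | uniq -c | awk '{print $2, $1}')

echo "$feature_counts"
rm -rf "$tmp"
```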

The troika: grep, sed, awk
grep: primarily a text search tool

-r  search for the pattern recursively in a directory
July8 15:29:05 ~/Data$ grep -r ATCGATC ./
./example.fa:ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
./example.fa:CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
./example.fa:ACCTGGAACGTTGCCGCGTCCTGTTGCACCTCATCGATATCGATCCGATTGACGGCACCGATCCGGTTGA
./example.fa:GTGTTCAACAAGATCGATCTGCTGGATAAGGTAGAAGCCGAAGAGAAAGCGAAAGCGATCGCTGAGGCGC
./example.fq:TTTTGAACACATTCCCCTTCACCTTCAGGTACAGGCTGTGATACATGTGGCGATCGATCTTCTTAGATTC
July8 15:29:17 ~/Data$ grep -r -n ATCGATC ./
./example.fa:4:ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
./example.fa:12:CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
./example.fa:205:ACCTGGAACGTTGCCGCGTCCTGTTGCACCTCATCGATATCGATCCGATTGACGGCACCGATCCGGTTGA
./example.fa:207:GTGTTCAACAAGATCGATCTGCTGGATAAGGTAGAAGCCGAAGAGAAAGCGAAAGCGATCGCTGAGGCGC
./example.fq:1046:TTTTGAACACATTCCCCTTCACCTTCAGGTACAGGCTGTGATACATGTGGCGATCGATCTTCTTAGATTC
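-w  treat the search pattern as a whole word
Of the two commands compared here, the first omits -w and the second includes it.
-c  count the lines that match
-v  print the lines that do not match. The example here printed nothing, meaning that no line qualified.
-e  search for several patterns at once (logical OR)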
zcat Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz | grep -e 'exon' -e 'mRNA' | less -SN
1 Y ensembl exon 2784749 2784853 . + . Parent=transcript:ENST00000516032;Name=ENSE00002088309;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
2 Y ensembl_havana mRNA 2786855 2787682 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS1477
3 Y ensembl_havana exon 2786855 2787682 . - . Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;e
4 Y havana exon 2789827 2790328 . + . Parent=transcript:ENST00000454281;Name=ENSE00001772499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
5 Y havana exon 2827982 2828218 . + . Parent=transcript:ENST00000430735;Name=ENSE00001614266;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
6 Y havana exon 2828192 2828735 . - . Parent=transcript:ENST00000651710;Name=ENSE00003843322;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
7 Y havana exon 2829526 2829751 . - . Parent=transcript:ENST00000651710;Name=ENSE00003846102;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
8 Y havana exon 2840471 2840851 . - . Parent=transcript:ENST00000651710;Name=ENSE00003844499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
9 Y ensembl_havana mRNA 2841602 2867268 . + . ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS1
10 Y ensembl_havana exon 2841602 2841627 . + . Parent=transcript:ENST00000250784;Name=ENSE00002490412;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;ex
11 Y ensembl_havana exon 2842165 2842242 . + . Parent=transcript:ENST00000250784;Name=ENSE00001709586;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exo
12 Y ensembl_havana exon 2844077 2844257 . + . Parent=transcript:ENST00000250784;Name=ENSE00001738202;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exo
13 Y ensembl_havana exon 2845646 2845743 . + . Parent=transcript:ENST00000250784;Name=ENSE00001602849;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exo
14 Y ensembl_havana exon 2854600 2854771 . + . Parent=transcript:ENST00000250784;Name=ENSE00001601989;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exo
15 Y ensembl_havana exon 2865088 2865245 . + . Parent=transcript:ENST00000250784;Name=ENSE00003667463;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exo
16 Y ensembl_havana exon 2866793 2867268 . + . Parent=transcript:ENST00000250784;Name=ENSE00003636667;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;ex
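The options listed above can be compared side by side on a small input. This is a sketch only; the file words.txt and its four lines are invented so that the difference between the flags is easy to see.

```shell
tmp=$(mktemp -d)
printf 'gene\ngenes\nexon\nmRNA\n' > "$tmp/words.txt"

plain=$(grep gene "$tmp/words.txt")              # substring match: gene and genes
whole=$(grep -w gene "$tmp/words.txt")           # whole word only: gene
count=$(grep -c gene "$tmp/words.txt")           # number of matching lines
inverse=$(grep -v gene "$tmp/words.txt")         # lines that do NOT match
either=$(grep -e exon -e mRNA "$tmp/words.txt")  # exon OR mRNA

# -r searches a directory tree; -n adds the line number to each hit.
recursive=$(cd "$tmp" && grep -r -n genes .)

echo "$plain"; echo "$whole"; echo "$count"
echo "$inverse"; echo "$either"; echo "$recursive"
rm -rf "$tmp"
```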
To search for many patterns at once, write them into a file, one pattern per line,
and use the -f option to read the patterns from that file.
July8 15:44:21 ~/Data$ cat > file
gene
UTR
start_codon
stop_codon
^C
July8 15:45:29 ~/Data$ cat file
gene
UTR
start_codon
stop_codon
$ less example.gtf | grep -w -f file | less -S
chr1 ENSEMBL UTR 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
chr1 HAVANA gene 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
chr1 ENSEMBL UTR 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
chr1 ENSEMBL UTR 3084 4021 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
chr1 ENSEMBL start_codon 4022 4024 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
chr1 ENSEMBL UTR 4226 4561 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000438504"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
chr1 ENSEMBL UTR 4226 4692 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
chr1 HAVANA gene 4226 19433 . - . gene_id "ENSG00000227232"; transcript_id "ENSG00000227232"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
chr1 ENSEMBL stop_codon 4250 4252 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
chr1 ENSEMBL UTR 4250 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
chr1 ENSEMBL stop_codon 4559 4561 . - 0 gene_id "ENSG00000227232"; transcript_id "ENST00000438504"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
chr1 ENSEMBL UTR 4833 4901 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
chr1 ENSEMBL UTR 5659 5810 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
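The same pattern-file approach in a self-contained form. This is a sketch under invented data: patterns.txt and features.txt are hypothetical files, and the third column is only a marker that makes the selected lines easy to recognise.

```shell
tmp=$(mktemp -d)
# One pattern per line.
printf 'gene\nUTR\n' > "$tmp/patterns.txt"
# Toy tab-separated table; "genes" is there to show the effect of -w.
printf 'chr1\tgene\t1\nchr1\texon\t2\nchr1\tUTR\t3\nchr1\tgenes\t4\n' > "$tmp/features.txt"

# -f reads the patterns from the file; -w keeps whole-word matches only.
hits=$(grep -w -f "$tmp/patterns.txt" "$tmp/features.txt" | cut -f 3)

echo "$hits"
rm -rf "$tmp"
```

A blank line in the pattern file is worth avoiding, since an empty pattern can match every line.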
Regular expressions


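Find lines that end with )
Find lines that start with H
Find lines that contain fe, then any single character, then l
By default grep uses basic regular expressions, in which operators such as ?, +, {} and | must be preceded by a backslash (\) to be treated as operators; without it, grep takes them as ordinary characters to search for. Alternatively, pass -E to enable extended regular expressions, where no backslash is needed.
? matches the preceding item zero or one time, so ee is matched as well
+ matches the preceding character one or more times
* matches the preceding item zero or more times, so Biotrainee is matched as well
{2} matches exactly two occurrences of the preceding character
[Bb] matches either lowercase b or uppercase B
[^Bb] matches any character except lowercase b and uppercase B
| can also be used to express OR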
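A sketch that exercises several of these patterns on the five-line readme.txt used elsewhere in this tutorial. Counting matching lines with -c keeps the output short; the variable names are arbitrary.

```shell
tmp=$(mktemp -d)
cat > "$tmp/readme.txt" <<'EOF'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
EOF
f="$tmp/readme.txt"

ends_paren=$(grep -c ')$' "$f")        # lines ending with )
starts_H=$(grep -c '^H' "$f")          # lines starting with H
fe_any_l=$(grep -c 'fe.l' "$f")        # fe + any one character + l (feel)
ee_ere=$(grep -c -E 'e{2}' "$f")       # extended syntax: no backslashes
ee_bre=$(grep -c 'e\{2\}' "$f")        # basic syntax: braces need backslashes
either_b=$(grep -c '[Bb]io' "$f")      # Bio or bio
alt=$(grep -c -E 'Welcome|Please' "$f")  # alternation with |

echo "$ends_paren $starts_H $fe_any_l $ee_ere $ee_bre $either_b $alt"
rm -rf "$tmp"
```

The two e{2} variants give the same count, which shows that -E only changes how the operators are written, not what they match.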
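sed: a stream editor, generally used to insert, delete, modify, and look up text
What is printed to the screen is the standard output stream; sed edits text as it flows through to that stream, hence the name stream editor.
It can process several files at once.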


July8 20:10:54 ~$ cat readme.txt
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)

July8 20:11:00 ~$ cat readme.txt | sed '2a i am beautful'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
i am beautful
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Append the line "i am beautful" after line 2

July8 20:12:07 ~$ cat readme.txt | sed '3i i am beautful'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
i am beautful
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Insert the line "i am beautful" before line 3

July8 20:12:35 ~$ cat readme.txt | sed '1,2i i am beautful'
i am beautful
Welcome to Biotrainee() !
i am beautful
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Insert "i am beautful" before each of lines 1 to 2

July8 20:15:40 ~$ cat readme.txt | sed '2,5i i am beautful'
Welcome to Biotrainee() !
i am beautful
This is your personal account in our Cloud.
i am beautful
Have a fun with it.
i am beautful
Please feel free to contact with me( email to jmzeng1314@163.com )
i am beautful
(http://www.biotrainee.com/thread-1376-1-1.html)
# Insert "i am beautful" before each of lines 2 to 5

July8 20:18:34 ~$ cat readme.txt | sed '2,4d'
Welcome to Biotrainee() !
(http://www.biotrainee.com/thread-1376-1-1.html)
# Delete lines 2 to 4

July8 20:18:46 ~$ cat readme.txt | sed -e '1a UUUUUUUU' -e '3i KKKKKKKKKKK'
Welcome to Biotrainee() !
UUUUUUUU
This is your personal account in our Cloud.
KKKKKKKKKKK
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# The -e option allows several operations in a single call

$ cat readme.txt | sed '2,5c OOOOOOOOOOOOOO'
Welcome to Biotrainee() !
OOOOOOOOOOOOOO
# Replace lines 2 to 5 as a block with the single line OOOOOOOOOOOOOO

July8 20:21:49 ~$ cat readme.txt | sed '2,5c OOOOOOOOOOOOOO\nOOOOOOOOOOOOOO\nOOOOOOOOOOOOOO'
Welcome to Biotrainee() !
OOOOOOOOOOOOOO
OOOOOOOOOOOOOO
OOOOOOOOOOOOOO

July8 20:23:07 ~$ cat readme.txt | sed -e '2,5i OOOOOOOOOOOOOO ' -e '2,5d'
Welcome to Biotrainee() !
OOOOOOOOOOOOOO
OOOOOOOOOOOOOO
OOOOOOOOOOOOOO
OOOOOOOOOOOOOO
# Two ways of replacing lines 2 to 5 with OOOOOOOOOOOOOO

July8 20:31:31 ~$ cat readme.txt | sed s/s/S/g
Welcome to Biotrainee() !
ThiS iS your perSonal account in our Cloud.
Have a fun with it.
PleaSe feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# This works without quotes too; the trailing g (global) replaces every occurrence

July8 20:31:50 ~$ cat readme.txt | sed '2 s/s/S/'
Welcome to Biotrainee() !
ThiS is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Better to keep the quotes. Only line 2 is changed, and only its first match

July8 20:32:24 ~$ cat readme.txt | sed '2 s/s/S/2'
Welcome to Biotrainee() !
This iS your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Replace the second s on line 2 with S

July8 20:32:53 ~$ cat readme.txt | sed '1~3 s/o/****/'
Welc****me to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free t**** contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Starting at line 1, replace the first o with **** on every third line, i.e. lines 1, 4, 7, 10, ...

July8 20:35:47 ~$ cat readme.txt | sed '/Please/ s/o/****/g'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free t**** c****ntact with me( email t**** jmzeng1314@163.c****m )
(http://www.biotrainee.com/thread-1376-1-1.html)
# An address can also be a /pattern/: on the line containing Please, replace every o with ****

$ cat Data/example.fq | sed -n '1~4 p'| head
@ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
@ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
@ERR329499.3 HWUSI-EAS697:8:109:15856:9893#ACAGTG/1
@ERR329499.4 HWUSI-EAS697:8:112:11677:17310#ACAGTG/1
@ERR329499.5 HWUSI-EAS697:8:107:15127:3214#ACAGTG/1
@ERR329499.6 HWUSI-EAS697:8:107:2618:15051#ACAGTG/1
@ERR329499.7 HWUSI-EAS697:8:115:16789:7248#ACAGTG/1
@ERR329499.8 HWUSI-EAS697:8:109:5676:19198#ACAGTG/1
@ERR329499.9 HWUSI-EAS697:8:118:11989:2132#ACAGTG/1
@ERR329499.10 HWUSI-EAS697:8:109:2951:9799#ACAGTG/1
# Starting at line 1, print every fourth line

July8 20:43:38 ~$ cat readme.txt | sed -n '/ee/p'
Welcome to Biotrainee() !
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
# Find the lines containing ee and print them

July8 20:43:47 ~$ cat readme.txt | sed '/ee/p'
Welcome to Biotrainee() !
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
(http://www.biotrainee.com/thread-1376-1-1.html)
# Without -n the matching lines are printed twice, because sed also prints every line by default.
# -n (no) suppresses the default output, so only lines that sed processes or matches are shown (commonly used)

July8 20:44:07 ~$ cat readme.txt | sed 's/ee/888888888/p'
Welcome to Biotrain888888888() !
Welcome to Biotrain888888888() !
This is your personal account in our Cloud.
Have a fun with it.
Please f888888888l free to contact with me( email to jmzeng1314@163.com )
Please f888888888l free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrain888888888.com/thread-1376-1-1.html)
(http://www.biotrain888888888.com/thread-1376-1-1.html)
# Without -n, every modified line is printed twice

July8 20:46:56 ~$ cat readme.txt | sed 's/ee/888888888/'
Welcome to Biotrain888888888() !
This is your personal account in our Cloud.
Have a fun with it.
Please f888888888l free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrain888888888.com/thread-1376-1-1.html)
# Without p, all lines are printed once, including the modified ones

July8 20:47:29 ~$ cat readme.txt | sed -n 's/ee/888888888/p'
Welcome to Biotrain888888888() !
Please f888888888l free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrain888888888.com/thread-1376-1-1.html)
# With -n, only the modified lines are printed

July8 20:52:20 ~$ cat readme.txt | tr 'a-z' 'A-Z'
WELCOME TO BIOTRAINEE() !
THIS IS YOUR PERSONAL ACCOUNT IN OUR CLOUD.
HAVE A FUN WITH IT.
PLEASE FEEL FREE TO CONTACT WITH ME( EMAIL TO JMZENG1314@163.COM )
(HTTP://WWW.BIOTRAINEE.COM/THREAD-1376-1-1.HTML)
# The tr command converts lowercase to uppercase

July8 20:52:37 ~$ cat readme.txt | sed 'y/abcde/ABCDE/'
WElComE to BiotrAinEE() !
This is your pErsonAl ACCount in our ClouD.
HAvE A fun with it.
PlEAsE fEEl frEE to ContACt with mE( EmAil to jmzEng1314@163.Com )
(http://www.BiotrAinEE.Com/thrEAD-1376-1-1.html)
# sed y maps the five lowercase letters to uppercase, one to one
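The session above can be condensed into a script that recreates readme.txt and captures a few results. It is a sketch that assumes GNU sed: the one-line form of a (append) and the first~step address (1~4) are GNU extensions and may behave differently in other sed implementations. The inserted text and the toy FASTQ lines are invented.

```shell
tmp=$(mktemp -d)
cat > "$tmp/readme.txt" <<'EOF'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
EOF
f="$tmp/readme.txt"

deleted=$(sed '2,4d' "$f")                        # drop lines 2-4
first_s=$(sed -n '2s/s/S/p' "$f")                 # line 2, first s only
all_s=$(sed -n '2s/s/S/gp' "$f")                  # line 2, every s
ee_lines=$(sed -n '/ee/=' "$f")                   # line numbers containing ee
mapped=$(sed 'y/abcde/ABCDE/' "$f" | sed -n '3p') # one-to-one letter mapping
appended=$(sed '2a i am new' "$f" | sed -n '3p')  # GNU one-line append

# GNU first~step address: every 4th line from line 1 (FASTQ header lines).
headers=$(printf '@r1\nAC\n+\nII\n@r2\nGG\n+\nII\n' | sed -n '1~4p')

echo "$deleted"; echo "$first_s"; echo "$all_s"
echo "$ee_lines"; echo "$mapped"; echo "$appended"; echo "$headers"
rm -rf "$tmp"
```

Reading the file directly, as in sed '2,4d' readme.txt, gives the same result as piping it in with cat.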
awk: a programming language for processing text and data




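Taking column 9 with cut and with awk gives different results, because awk by default treats spaces (as well as tabs) as field separators.
With -F set to the tab character, awk's output matches that of cut.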
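The difference is easy to reproduce on a single GTF-like line. In this sketch the line is invented; its ninth tab-separated column contains spaces, which is what makes the two tools disagree.

```shell
# One toy GTF-like line: nine tab-separated columns, the last one containing spaces.
line=$(printf 'chr1\tHAVANA\tgene\t1\t9\t.\t+\t.\tgene_id "G1"; gene_name "X";')

by_cut=$(printf '%s\n' "$line" | cut -f 9)                        # splits on tabs only
by_awk_default=$(printf '%s\n' "$line" | awk '{print $9}')        # splits on spaces and tabs
by_awk_tab=$(printf '%s\n' "$line" | awk -F '\t' '{print $9}')    # tabs only, like cut

echo "cut:         $by_cut"
echo "awk:         $by_awk_default"
echo "awk -F tab:  $by_awk_tab"
```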

Take columns 9 and 10
July8 21:22:24 ~/Data$ cat example.gtf | awk '{print $1,$3,$5,$10}' | head
chr1 UTR 2090 "ENSG00000223972";
chr1 exon 2090 "ENSG00000223972";
chr1 transcript 4275 "ENSG00000223972";
chr1 gene 4275 "ENSG00000223972";
chr1 exon 1920 "ENSG00000223972";
chr1 transcript 3533 "ENSG00000223972";
chr1 exon 2090 "ENSG00000223972";
chr1 exon 2560 "ENSG00000223972";
chr1 UTR 2584 "ENSG00000223972";
chr1 exon 2584 "ENSG00000223972";
# Take columns 1, 3, 5, and 10

July8 21:22:56 ~/Data$ cat example.gtf | awk '{print $3,$5,$10,$1}' | head
UTR 2090 "ENSG00000223972"; chr1
exon 2090 "ENSG00000223972"; chr1
transcript 4275 "ENSG00000223972"; chr1
gene 4275 "ENSG00000223972"; chr1
exon 1920 "ENSG00000223972"; chr1
transcript 3533 "ENSG00000223972"; chr1
exon 2090 "ENSG00000223972"; chr1
exon 2560 "ENSG00000223972"; chr1
UTR 2584 "ENSG00000223972"; chr1
exon 2584 "ENSG00000223972"; chr1
# The output order can be changed as well; cut can only output columns in their original order

July8 21:23:14 ~/Data$ cat example.gtf | awk '{print $1,$3,$5,$10,$1}' | head
chr1 UTR 2090 "ENSG00000223972"; chr1
chr1 exon 2090 "ENSG00000223972"; chr1
chr1 transcript 4275 "ENSG00000223972"; chr1
chr1 gene 4275 "ENSG00000223972"; chr1
chr1 exon 1920 "ENSG00000223972"; chr1
chr1 transcript 3533 "ENSG00000223972"; chr1
chr1 exon 2090 "ENSG00000223972"; chr1
chr1 exon 2560 "ENSG00000223972"; chr1
chr1 UTR 2584 "ENSG00000223972"; chr1
chr1 exon 2584 "ENSG00000223972"; chr1
# A column can also be printed more than once

Match every line that contains UTR and print those lines

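First print a banner such as "find UTR ..." (in the BEGIN block), then match every line that contains UTR and print it, and finally print "end" (in the END block)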
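The banner, match, and footer steps just described might look like the following. The original command is not preserved in this text, so this is a reconstruction on three invented input lines rather than the author's exact command.

```shell
# BEGIN runs once before any input, the /UTR/ rule runs per line, END runs once at the end.
out=$(printf 'chr1\tUTR\t1\nchr1\texon\t2\nchr1\tUTR\t3\n' |
  awk 'BEGIN {print "find UTR"}
       /UTR/ {print}
       END   {print "end"}')

echo "$out"
```

Dropping the BEGIN and END blocks leaves awk '/UTR/ {print}', which behaves like grep UTR.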
awk built-in variables

$ less -S example.gtf | awk '{print $3,$4,$5}'| head
UTR 1737 2090
exon 1737 2090
transcript 1737 4275
gene 1737 4275
exon 1873 1920
transcript 1873 3533
exon 2042 2090
exon 2476 2560
UTR 2476 2584
exon 2476 2584
# Print columns 3, 4, and 5

$ less -S example.gtf | awk 'BEGIN{OFS=":"} {print $3,$4,$5}'| head
UTR:1737:2090
exon:1737:2090
transcript:1737:4275
gene:1737:4275
exon:1873:1920
transcript:1873:3533
exon:2042:2090
exon:2476:2560
UTR:2476:2584
exon:2476:2584
# Change the separator between output columns to ":". Remember that the double quotes are required.

July8 22:22:08 ~/Data$ less -S example.gtf | awk '{print $3":"$4"\t"$5}'| head | cat -A
UTR:1737^I2090$
exon:1737^I2090$
transcript:1737^I4275$
gene:1737^I4275$
exon:1873^I1920$
transcript:1873^I3533$
exon:2042^I2090$
exon:2476^I2560$
UTR:2476^I2584$
exon:2476^I2584$
# Separators can also be written directly into the print statement

$ less -S example.gtf | awk '{print $3":"$4"\t"$5}'| head
UTR:1737 2090
exon:1737 2090
transcript:1737 4275
gene:1737 4275
exon:1873 1920
transcript:1873 3533
exon:2042 2090
exon:2476 2560
UTR:2476 2584
exon:2476 2584
# Displayed normally

The input field separator is set to tab; NR is the line (record) number
awk conditionals and loops
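A compact sketch of the built-in variables mentioned in this section: FS (input separator), OFS (output separator), NR (current line number), plus NF (number of fields), which is standard awk but not named in the text. The two input lines are invented.

```shell
out=$(printf 'chr1\tgene\t100\nchr1\texon\t200\n' |
  awk 'BEGIN {FS = "\t"; OFS = ":"}   # split input on tabs, join output with ":"
       {print NR, NF, $2, $3}')       # line number, field count, columns 2 and 3

echo "$out"
```

OFS only takes effect between comma-separated arguments of print; writing print $2 $3 without commas concatenates the fields with nothing in between.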


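The condition is written inside ()
If the condition is true, the whole line is printed; otherwise the third column is printed followed by "is not gene"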
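Since the original command is not preserved here, the following is a reconstruction of what such an if/else might look like, assuming the test is on the third (feature type) column. The two input lines are invented.

```shell
out=$(printf 'chr1\tHAVANA\tgene\nchr1\tHAVANA\texon\n' |
  awk -F '\t' '{
    if ($3 == "gene") {        # the condition goes inside ( )
      print $0                 # true: print the whole line
    } else {
      print $3 " is not gene"  # false: print column 3 plus a note
    }
  }')

echo "$out"
```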

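The for loop runs once per input line: on the first line it prints the first field, then i++ and it prints the second field, then i++ again and it prints the third field, and so on. It then starts over on the second line, and so forth.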
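The per-line loop can be sketched as below. The original command is not shown in this text, so the loop bound NF (the number of fields on the current line) is an assumption; the input is invented.

```shell
# For every input line, walk through fields 1..NF and print each on its own line.
out=$(printf 'a\tb\tc\nd\te\tf\n' |
  awk -F '\t' '{for (i = 1; i <= NF; i++) print $i}')

echo "$out"
```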
paste: give it three slots (each - reads one line from standard input)
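Assuming the "three slots" refer to the command paste - - -, the sketch below shows how it folds a one-item-per-line stream back into rows of three tab-separated columns. The six input lines are invented.

```shell
# Each "-" takes the next line from standard input, so three of them build rows of three.
out=$(printf 'a\nb\nc\nd\ne\nf\n' | paste - - -)

echo "$out"
```

This is the inverse of printing every field on its own line with an awk for loop.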


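Compute exon lengths
Division
Truncating to an integer
Truncation is not the same as rounding to the nearest integer, so add 0.5 to each result and then truncate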
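The arithmetic steps above can be sketched in one awk call. The coordinates are taken from the first exon shown earlier (1737 to 2090); the length formula end - start + 1 assumes inclusive 1-based coordinates as used by GTF, and the divisor 100 is arbitrary, chosen only so that truncation and rounding differ.

```shell
out=$(printf 'chr1\tHAVANA\texon\t1737\t2090\n' |
  awk -F '\t' '$3 == "exon" {
    len = $5 - $4 + 1            # inclusive coordinates, so add 1
    ratio = len / 100            # awk division is floating point
    print len, ratio, int(ratio), int(ratio + 0.5)
  }')

echo "$out"   # length, ratio, truncated, rounded
```

The +0.5 trick rounds correctly only for non-negative values, because int() truncates toward zero.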

Exercises
Source: 生信技能树
