Introduction to common bioinformatics file formats

FASTA and FASTQ files

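The FASTA format originated with the FASTA software package. It is a plain-text format that stores nucleotide or protein sequences using single-letter codes and allows a comment line before each sequence. Each record has two parts: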

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGV
HLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIV
IGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
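Part 1: the header line starts with >, immediately followed by the sequence identifier. The identifier is case-sensitive and cannot contain spaces, because the first space marks its end; the rest of the line is a free-text description.
Part 2: the sequence itself, written with the standard nucleotide or amino acid codes in either upper or lower case. It continues until the next > is reached. Sequences from NCBI traditionally carried a GI number in the form "gi|gi_identifier", a unique identifier made up of digits only. Whenever a nucleotide or protein sequence changed, it received a new GI number (while its accession number could stay the same). The GI number is followed by the sequence identifier, which consists of a source database tag and a sequence tag (such as the accession number or a name) separated by "|"; a missing item may be left empty, but the "|" cannot be omitted.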
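The two-part structure described above can be sketched with a few shell commands. This is only an illustration: the file name demo.fa and its two toy records are invented here, and the snippet relies on nothing more than the rules that a header starts with > and that the identifier ends at the first space.

```shell
# Hypothetical two-record FASTA file, created only for this illustration.
cat > demo.fa <<'EOF'
>seq1 first toy record
ACGTACGTAC
GTACGT
>seq2 second toy record
TTGACA
EOF

# Count records: every record has exactly one header line starting with ">".
n_records=$(grep -c '^>' demo.fa)

# Extract identifiers: drop the leading ">" and everything after the first space.
ids=$(grep '^>' demo.fa | sed 's/^>//; s/ .*//')

# Sequence length per record: sum the lengths of the lines between two headers.
lengths=$(awk '/^>/ {if (id != "") print id, len; id = substr($1, 2); len = 0; next}
               {len += length($0)}
               END {if (id != "") print id, len}' demo.fa)

echo "$n_records"
echo "$ids"
echo "$lengths"
```

Because a sequence may be wrapped over several lines, the length has to be accumulated until the next header appears, which is why the awk program prints a record only when it meets the following > or the end of the file.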

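FASTQ is a text format for storing biological sequences (usually nucleotide sequences) together with their quality scores. For compactness, both the sequence and the quality information are encoded as ASCII characters. The format was originally developed at the Sanger Institute to bundle a FASTA sequence with its quality data, and it has become the de facto standard for high-throughput sequencing output. Each sequence in a FASTQ file normally takes four lines:
[Figure 3: an example FASTQ record]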
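Line 1: the sequence identifier, starting with '@'. The format is fairly free and may carry comments and other descriptive fields, separated by spaces. In the example figure, the description adds a second NCBI identifier and the read length.
Line 2: the sequence. Tabs and spaces are not allowed. It normally consists of unambiguous DNA or RNA letters, usually in upper case; in some files, lower case, mixed case, or gap symbols carry a special meaning.
Line 3: a separator between the sequence and the quality values. It starts with '+', optionally followed by a description; if anything follows the '+', it must be identical to what follows the '@' on line 1.
Line 4: the quality values. Each character corresponds to one base on line 2 and is converted by a fixed rule into a base quality score, which in turn reflects the error probability of that base; the number of characters must therefore equal that of line 2.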
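The four-line layout lends itself to simple line arithmetic. The sketch below assumes strict four-line records (no wrapped sequences); the file demo.fq and its two reads are made up for illustration.

```shell
# Hypothetical two-read FASTQ file, created only for this illustration.
cat > demo.fq <<'EOF'
@read1 toy description
ACGTAC
+
IIIIII
@read2
GGCAT
+read2
IIII#
EOF

# Number of reads = number of lines / 4 (assumes strict four-line records).
n_reads=$(( $(wc -l < demo.fq) / 4 ))

# Line 2 of every record is the sequence.
seqs=$(awk 'NR % 4 == 2' demo.fq)

# Count records whose quality string (line 4) differs in length from the sequence (line 2).
bad=$(awk 'NR % 4 == 2 {n = length($0)}
           NR % 4 == 0 && length($0) != n {c++}
           END {print c + 0}' demo.fq)

echo "$n_reads reads, $bad with mismatched quality length"
```

Selecting lines by NR % 4 is safer than searching for '@' or '+', since both characters can also appear inside a quality string.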

GFF/GTF

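GFF and GTF are the two most widely used genome annotation formats. When building a reference for an analysis, these files are usually needed alongside the FASTA file, and the required annotation information is extracted from them.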

The three workhorses: grep, sed, awk

grep: primarily a text search tool

-r searches for the pattern recursively in a directory

  July8 15:29:05 ~/Data
  $ grep -r ATCGATC ./
  ./example.fa:ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
  ./example.fa:CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
  ./example.fa:ACCTGGAACGTTGCCGCGTCCTGTTGCACCTCATCGATATCGATCCGATTGACGGCACCGATCCGGTTGA
  ./example.fa:GTGTTCAACAAGATCGATCTGCTGGATAAGGTAGAAGCCGAAGAGAAAGCGAAAGCGATCGCTGAGGCGC
  ./example.fq:TTTTGAACACATTCCCCTTCACCTTCAGGTACAGGCTGTGATACATGTGGCGATCGATCTTCTTAGATTC
  July8 15:29:17 ~/Data
  $ grep -r -n ATCGATC ./
  ./example.fa:4:ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
  ./example.fa:12:CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
  ./example.fa:205:ACCTGGAACGTTGCCGCGTCCTGTTGCACCTCATCGATATCGATCCGATTGACGGCACCGATCCGGTTGA
  ./example.fa:207:GTGTTCAACAAGATCGATCTGCTGGATAAGGTAGAAGCCGAAGAGAAAGCGAAAGCGATCGCTGAGGCGC
  ./example.fq:1046:TTTTGAACACATTCCCCTTCACCTTCAGGTACAGGCTGTGATACATGTGGCGATCGATCTTCTTAGATTC

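-w treats the search pattern as a whole word
[Screenshot: the same search, first without -w and then with -w]
-c counts the number of matching lines
-v prints the lines that do not match; the command in the original screenshot printed nothing, i.e. no line failed to match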
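A small sketch of how -w, -c, and -v behave. The word list demo_words.txt is invented for this illustration and is not the data from the original screenshots.

```shell
# Hypothetical word list, created only for this illustration.
printf 'gene\ngene_id\npseudogene\nexon\n' > demo_words.txt

# Without -w, "gene" matches every line that contains the substring
# (gene, gene_id, pseudogene); -c reports the number of matching lines.
plain=$(grep -c 'gene' demo_words.txt)

# With -w, the match must be a whole word. "_" counts as a word character,
# so neither "gene_id" nor "pseudogene" matches any more.
whole=$(grep -c -w 'gene' demo_words.txt)

# -v inverts the match: print the lines that do NOT contain "gene".
inverted=$(grep -v 'gene' demo_words.txt)

echo "$plain $whole $inverted"
```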
-e searches for several patterns at once, acting as a logical OR

  $ zcat Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz | grep -e 'exon' -e 'mRNA' | less -SN
  1 Y ensembl exon 2784749 2784853 . + . Parent=transcript:ENST00000516032;Name=ENSE00002088309;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  2 Y ensembl_havana mRNA 2786855 2787682 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS1477
  3 Y ensembl_havana exon 2786855 2787682 . - . Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;e
  4 Y havana exon 2789827 2790328 . + . Parent=transcript:ENST00000454281;Name=ENSE00001772499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  5 Y havana exon 2827982 2828218 . + . Parent=transcript:ENST00000430735;Name=ENSE00001614266;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  6 Y havana exon 2828192 2828735 . - . Parent=transcript:ENST00000651710;Name=ENSE00003843322;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  7 Y havana exon 2829526 2829751 . - . Parent=transcript:ENST00000651710;Name=ENSE00003846102;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  8 Y havana exon 2840471 2840851 . - . Parent=transcript:ENST00000651710;Name=ENSE00003844499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=E
  9 Y ensembl_havana mRNA 2841602 2867268 . + . ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS1
  10 Y ensembl_havana exon 2841602 2841627 . + . Parent=transcript:ENST00000250784;Name=ENSE00002490412;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;ex
  11 Y ensembl_havana exon 2842165 2842242 . + . Parent=transcript:ENST00000250784;Name=ENSE00001709586;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exo
  12 Y ensembl_havana exon 2844077 2844257 . + . Parent=transcript:ENST00000250784;Name=ENSE00001738202;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exo
  13 Y ensembl_havana exon 2845646 2845743 . + . Parent=transcript:ENST00000250784;Name=ENSE00001602849;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exo
  14 Y ensembl_havana exon 2854600 2854771 . + . Parent=transcript:ENST00000250784;Name=ENSE00001601989;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exo
  15 Y ensembl_havana exon 2865088 2865245 . + . Parent=transcript:ENST00000250784;Name=ENSE00003667463;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exo
  16 Y ensembl_havana exon 2866793 2867268 . + . Parent=transcript:ENST00000250784;Name=ENSE00003636667;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;ex

To search for several patterns in one go, write the patterns into a file, one per line,
and use the -f option to read the patterns from that file

  July8 15:44:21 ~/Data
  $ cat > file
  gene
  UTR
  start_codon
  stop_codon
  ^C
  July8 15:45:29 ~/Data
  $ cat file
  gene
  UTR
  start_codon
  stop_codon
  $ less example.gtf | grep -w -f file | less -S
  chr1 ENSEMBL UTR 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
  chr1 HAVANA gene 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
  chr1 ENSEMBL UTR 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
  chr1 ENSEMBL UTR 3084 4021 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
  chr1 ENSEMBL start_codon 4022 4024 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
  chr1 ENSEMBL UTR 4226 4561 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000438504"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
  chr1 ENSEMBL UTR 4226 4692 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
  chr1 HAVANA gene 4226 19433 . - . gene_id "ENSG00000227232"; transcript_id "ENSG00000227232"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
  chr1 ENSEMBL stop_codon 4250 4252 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
  chr1 ENSEMBL UTR 4250 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-
  chr1 ENSEMBL stop_codon 4559 4561 . - 0 gene_id "ENSG00000227232"; transcript_id "ENST00000438504"; gene_type "protein_coding"; gene_status "KNOWN"; gene_nam
  chr1 ENSEMBL UTR 4833 4901 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
  chr1 ENSEMBL UTR 5659 5810 . - . gene_id "ENSG00000227232"; transcript_id "ENST00000423562"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "WASH5
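The session above reads the patterns from a file with -f. As a minimal sketch on invented data (demo_features.txt and demo_patterns.txt are hypothetical names), the snippet below shows that repeated -e options and a pattern file express the same OR search.

```shell
# Hypothetical GTF-like lines (tab-separated), created only for this illustration.
printf 'chr1\tHAVANA\tgene\t100\t900\n'         > demo_features.txt
printf 'chr1\tHAVANA\texon\t100\t300\n'        >> demo_features.txt
printf 'chr1\tENSEMBL\tUTR\t100\t150\n'        >> demo_features.txt
printf 'chr1\tENSEMBL\tstop_codon\t880\t882\n' >> demo_features.txt

# Pattern file: one pattern per line, as expected by -f.
printf 'gene\nUTR\n' > demo_patterns.txt

# Several -e options and a pattern file select the same lines.
with_e=$(grep -w -e 'gene' -e 'UTR' demo_features.txt)
with_f=$(grep -w -f demo_patterns.txt demo_features.txt)
n_hits=$(printf '%s\n' "$with_f" | wc -l)

echo "$with_f"
```

The -w option matters here as much as in the session above: without it, a pattern such as gene would also hit any line whose attributes mention gene_id.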

Regular expressions

Find lines ending with )
Find lines starting with H
Find lines containing fe, then any single character, then l
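The three searches listed above can be sketched as follows. The original commands were shown only as screenshots, so the exact patterns here are an assumption; the input is a copy of the readme.txt shown in the sed section of these notes, saved under the hypothetical name demo_readme.txt.

```shell
# Copy of the readme.txt used elsewhere in these notes.
cat > demo_readme.txt <<'EOF'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
EOF

# $ anchors the match at the end of the line: lines ending with ")".
ends_paren=$(grep -c ')$' demo_readme.txt)

# ^ anchors the match at the start of the line: lines starting with "H".
starts_h=$(grep '^H' demo_readme.txt)

# . matches any single character: "fe", one character, then "l" (as in "feel").
fe_any_l=$(grep -c 'fe.l' demo_readme.txt)

echo "$ends_paren / $starts_h / $fe_any_l"
```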

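With grep's default basic regular expressions, the operators ?, +, {} and | must be preceded by a backslash (\) to be treated as regular-expression operators; without it, grep searches for them as ordinary literal characters. Alternatively, pass the -E option to use extended regular expressions, where no backslash is needed.
? matches the preceding item zero or one time, so ee is matched as well
+ matches the preceding character one or more times
* matches the preceding character zero or more times, so Biotrainee is matched as well
{2} matches the preceding character exactly twice
[Bb] matches either a lowercase b or an uppercase B
[^Bb] matches any character other than a lowercase b or an uppercase B
| can also be used to express OR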
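The operators above can be tried out on a small word list. This is a sketch on invented input (demo_regex.txt is hypothetical, not the data from the screenshots); it uses -E for the extended operators and shows one backslash-escaped basic form for comparison.

```shell
# Hypothetical word list, created only for this illustration.
printf 'Biotraine\nBiotrainee\nBiotraineee\nbook\ncab\n' > demo_regex.txt

zero_or_one=$(grep -c -E '^Biotraine?$' demo_regex.txt)  # ? : final e zero or one time
one_or_more=$(grep -c -E '^Biotraine+$' demo_regex.txt)  # + : final e one or more times
zero_or_more=$(grep -c '^Biotraine*$' demo_regex.txt)    # * : needs neither -E nor a backslash
two_e=$(grep -c -E 'e{2}' demo_regex.txt)                # {2}: two consecutive e
two_e_basic=$(grep -c 'e\{2\}' demo_regex.txt)           # same search in basic syntax
starts_b=$(grep -c '^[Bb]' demo_regex.txt)               # first character is b or B
not_b=$(grep '^[^Bb]' demo_regex.txt)                    # first character is neither b nor B
either=$(grep -c -E 'book|cab' demo_regex.txt)           # | : either alternative

# Without -E or a backslash, "+" is a literal plus sign, so nothing matches.
# grep exits non-zero when there is no match, hence the "|| true".
literal_plus=$(grep -c 'e+' demo_regex.txt || true)

echo "$zero_or_one $one_or_more $zero_or_more $two_e $starts_b $not_b $either $literal_plus"
```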

sed: a stream editor, generally used to insert, delete, modify, and search text

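sed reads its input as a stream, line by line, and writes the result to standard output (the screen by default), which is why it is called a stream editor.
It can process multiple files.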

  July8 20:10:54 ~
  $ cat readme.txt
  Welcome to Biotrainee() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  July8 20:11:00 ~
  $ cat readme.txt | sed '2a i am beautful'
  Welcome to Biotrainee() !
  This is your personal account in our Cloud.
  i am beautful
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Append the line "i am beautful" after line 2
  July8 20:12:07 ~
  $ cat readme.txt | sed '3i i am beautful'
  Welcome to Biotrainee() !
  This is your personal account in our Cloud.
  i am beautful
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Insert the line "i am beautful" before line 3
  July8 20:12:35 ~
  $ cat readme.txt | sed '1,2i i am beautful'
  i am beautful
  Welcome to Biotrainee() !
  i am beautful
  This is your personal account in our Cloud.
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Insert "i am beautful" before each of lines 1 to 2
  July8 20:15:40 ~
  $ cat readme.txt | sed '2,5i i am beautful'
  Welcome to Biotrainee() !
  i am beautful
  This is your personal account in our Cloud.
  i am beautful
  Have a fun with it.
  i am beautful
  Please feel free to contact with me( email to jmzeng1314@163.com )
  i am beautful
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Insert "i am beautful" before each of lines 2 to 5
  July8 20:18:34 ~
  $ cat readme.txt | sed '2,4d'
  Welcome to Biotrainee() !
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Delete lines 2-4
  July8 20:18:46 ~
  $ cat readme.txt | sed -e '1a UUUUUUUU' -e '3i KKKKKKKKKKK'
  Welcome to Biotrainee() !
  UUUUUUUU
  This is your personal account in our Cloud.
  KKKKKKKKKKK
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # The -e option applies several operations in a single run
  $ cat readme.txt | sed '2,5c OOOOOOOOOOOOOO'
  Welcome to Biotrainee() !
  OOOOOOOOOOOOOO
  # Replace lines 2 to 5 as a whole with OOOOOOOOOOOOOO
  July8 20:21:49 ~
  $ cat readme.txt | sed '2,5c OOOOOOOOOOOOOO\nOOOOOOOOOOOOOO\nOOOOOOOOOOOOOO'
  Welcome to Biotrainee() !
  OOOOOOOOOOOOOO
  OOOOOOOOOOOOOO
  OOOOOOOOOOOOOO
  July8 20:23:07 ~
  $ cat readme.txt | sed -e '2,5i OOOOOOOOOOOOOO ' -e '2,5d'
  Welcome to Biotrainee() !
  OOOOOOOOOOOOOO
  OOOOOOOOOOOOOO
  OOOOOOOOOOOOOO
  OOOOOOOOOOOOOO
  # Two ways to replace lines 2-5 with lines of OOOOOOOOOOOOOO
  July8 20:31:31 ~
  $ cat readme.txt | sed s/s/S/g
  Welcome to Biotrainee() !
  ThiS iS your perSonal account in our Cloud.
  Have a fun with it.
  PleaSe feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Works without quotes as well; the trailing g means global, i.e. replace every occurrence
  July8 20:31:50 ~
  $ cat readme.txt | sed '2 s/s/S/'
  Welcome to Biotrainee() !
  ThiS is your personal account in our Cloud.
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Quoting is still safer; this changes line 2 only, and only the first occurrence
  July8 20:32:24 ~
  $ cat readme.txt | sed '2 s/s/S/2'
  Welcome to Biotrainee() !
  This iS your personal account in our Cloud.
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Replace the second s on line 2 with S
  July8 20:32:53 ~
  $ cat readme.txt | sed '1~3 s/o/****/'
  Welc****me to Biotrainee() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please feel free t**** contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Starting at line 1, replace the first o with **** on every third line, i.e. lines 1, 4, 7, 10, ...
  July8 20:35:47 ~
  $ cat readme.txt | sed '/Please/ s/o/****/g'
  Welcome to Biotrainee() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please feel free t**** c****ntact with me( email t**** jmzeng1314@163.c****m )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # An address can also be a /pattern/: replace every o with **** on the line that contains Please
  $ cat Data/example.fq | sed -n '1~4 p'| head
  @ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
  @ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
  @ERR329499.3 HWUSI-EAS697:8:109:15856:9893#ACAGTG/1
  @ERR329499.4 HWUSI-EAS697:8:112:11677:17310#ACAGTG/1
  @ERR329499.5 HWUSI-EAS697:8:107:15127:3214#ACAGTG/1
  @ERR329499.6 HWUSI-EAS697:8:107:2618:15051#ACAGTG/1
  @ERR329499.7 HWUSI-EAS697:8:115:16789:7248#ACAGTG/1
  @ERR329499.8 HWUSI-EAS697:8:109:5676:19198#ACAGTG/1
  @ERR329499.9 HWUSI-EAS697:8:118:11989:2132#ACAGTG/1
  @ERR329499.10 HWUSI-EAS697:8:109:2951:9799#ACAGTG/1
  # Starting at line 1, print every 4th line (the FASTQ header lines)
  July8 20:43:38 ~
  $ cat readme.txt | sed -n '/ee/p'
  Welcome to Biotrainee() !
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Find the lines that contain ee and print them
  July8 20:43:47 ~
  $ cat readme.txt | sed '/ee/p'
  Welcome to Biotrainee() !
  Welcome to Biotrainee() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please feel free to contact with me( email to jmzeng1314@163.com )
  Please feel free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrainee.com/thread-1376-1-1.html)
  (http://www.biotrainee.com/thread-1376-1-1.html)
  # Without -n, each matching line is printed twice, because sed also prints every line by default.
  # -n suppresses the default output, so only lines printed explicitly (for example with p) are shown (commonly used)
  July8 20:44:07 ~
  $ cat readme.txt | sed 's/ee/888888888/p'
  Welcome to Biotrain888888888() !
  Welcome to Biotrain888888888() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please f888888888l free to contact with me( email to jmzeng1314@163.com )
  Please f888888888l free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrain888888888.com/thread-1376-1-1.html)
  (http://www.biotrain888888888.com/thread-1376-1-1.html)
  # Without -n, every modified line is printed twice
  July8 20:46:56 ~
  $ cat readme.txt | sed 's/ee/888888888/'
  Welcome to Biotrain888888888() !
  This is your personal account in our Cloud.
  Have a fun with it.
  Please f888888888l free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrain888888888.com/thread-1376-1-1.html)
  # Without p, every line is printed once, modified lines included
  July8 20:47:29 ~
  $ cat readme.txt | sed -n 's/ee/888888888/p'
  Welcome to Biotrain888888888() !
  Please f888888888l free to contact with me( email to jmzeng1314@163.com )
  (http://www.biotrain888888888.com/thread-1376-1-1.html)
  # With -n, only the modified lines are printed
  July8 20:52:20 ~
  $ cat readme.txt | tr 'a-z' 'A-Z'
  WELCOME TO BIOTRAINEE() !
  THIS IS YOUR PERSONAL ACCOUNT IN OUR CLOUD.
  HAVE A FUN WITH IT.
  PLEASE FEEL FREE TO CONTACT WITH ME( EMAIL TO JMZENG1314@163.COM )
  (HTTP://WWW.BIOTRAINEE.COM/THREAD-1376-1-1.HTML)
  # The tr command converts lowercase to uppercase
  July8 20:52:37 ~
  $ cat readme.txt | sed 'y/abcde/ABCDE/'
  WElComE to BiotrAinEE() !
  This is your pErsonAl ACCount in our ClouD.
  HAvE A fun with it.
  PlEAsE fEEl frEE to ContACt with mE( EmAil to jmzEng1314@163.Com )
  (http://www.BiotrAinEE.Com/thrEAD-1376-1-1.html)
  # sed's y command converts the five lowercase letters to uppercase, mapping them one to one
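The interplay of addresses, the -n option, and the p and g flags can be checked by counting output lines. The sketch below works on a copy of the readme.txt from the session above (saved as demo_readme.txt, a hypothetical name) and deliberately sticks to POSIX sed syntax, so GNU-only forms from the session, such as the first~step address and the one-line a/i/c text, are left out.

```shell
# Copy of the readme.txt used in the session above.
cat > demo_readme.txt <<'EOF'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to jmzeng1314@163.com )
(http://www.biotrainee.com/thread-1376-1-1.html)
EOF

# Deleting lines 2-4 leaves 2 of the 5 lines.
after_delete=$(sed '2,4d' demo_readme.txt | wc -l)

# With -n, only the lines printed by p appear: the 3 lines containing "ee".
with_n=$(sed -n '/ee/p' demo_readme.txt | wc -l)

# Without -n, the default output is added: 5 lines plus 3 repeats.
without_n=$(sed '/ee/p' demo_readme.txt | wc -l)

# Substitution flags on line 3: first match, second match, every match.
first=$(sed -n '3s/a/A/p' demo_readme.txt)
second=$(sed -n '3s/a/A/2p' demo_readme.txt)
global=$(sed -n '3s/a/A/gp' demo_readme.txt)

echo "$after_delete $with_n $without_n"
printf '%s\n' "$first" "$second" "$global"
```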

awk: a programming language for processing text and data

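Taking column 9 with cut and with awk gives different results, because awk splits fields on any whitespace by default (spaces as well as tabs), whereas cut splits on tabs only.
Setting the field separator to a tab with -F makes awk's output match that of cut.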
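The difference is easy to reproduce on a single line. The sketch below uses one invented GTF-like record (demo_one.gtf is a hypothetical file) whose ninth column contains spaces.

```shell
# One hypothetical GTF-like line: 9 tab-separated columns, with spaces inside column 9.
printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "X";\n' > demo_one.gtf

# cut splits on tabs only, so column 9 is the whole attribute string.
by_cut=$(cut -f9 demo_one.gtf)

# awk splits on any run of spaces or tabs by default, so $9 is just the first word.
by_awk_default=$(awk '{print $9}' demo_one.gtf)

# With -F'\t', awk splits on tabs only and agrees with cut.
by_awk_tab=$(awk -F'\t' '{print $9}' demo_one.gtf)

printf '%s\n' "$by_cut" "$by_awk_default" "$by_awk_tab"
```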

Take columns 9 and 10

  July8 21:22:24 ~/Data
  $ cat example.gtf | awk '{print $1,$3,$5,$10}' | head
  chr1 UTR 2090 "ENSG00000223972";
  chr1 exon 2090 "ENSG00000223972";
  chr1 transcript 4275 "ENSG00000223972";
  chr1 gene 4275 "ENSG00000223972";
  chr1 exon 1920 "ENSG00000223972";
  chr1 transcript 3533 "ENSG00000223972";
  chr1 exon 2090 "ENSG00000223972";
  chr1 exon 2560 "ENSG00000223972";
  chr1 UTR 2584 "ENSG00000223972";
  chr1 exon 2584 "ENSG00000223972";
  # Take columns 1, 3, 5, and 10
  July8 21:22:56 ~/Data
  $ cat example.gtf | awk '{print $3,$5,$10,$1}' | head
  UTR 2090 "ENSG00000223972"; chr1
  exon 2090 "ENSG00000223972"; chr1
  transcript 4275 "ENSG00000223972"; chr1
  gene 4275 "ENSG00000223972"; chr1
  exon 1920 "ENSG00000223972"; chr1
  transcript 3533 "ENSG00000223972"; chr1
  exon 2090 "ENSG00000223972"; chr1
  exon 2560 "ENSG00000223972"; chr1
  UTR 2584 "ENSG00000223972"; chr1
  exon 2584 "ENSG00000223972"; chr1
  # The output order can also be changed; cut always outputs fields in their original order
  July8 21:23:14 ~/Data
  $ cat example.gtf | awk '{print $1,$3,$5,$10,$1}' | head
  chr1 UTR 2090 "ENSG00000223972"; chr1
  chr1 exon 2090 "ENSG00000223972"; chr1
  chr1 transcript 4275 "ENSG00000223972"; chr1
  chr1 gene 4275 "ENSG00000223972"; chr1
  chr1 exon 1920 "ENSG00000223972"; chr1
  chr1 transcript 3533 "ENSG00000223972"; chr1
  chr1 exon 2090 "ENSG00000223972"; chr1
  chr1 exon 2560 "ENSG00000223972"; chr1
  chr1 UTR 2584 "ENSG00000223972"; chr1
  chr1 exon 2584 "ENSG00000223972"; chr1
  # A column can also be printed more than once

Match every line that contains UTR and print it

First print a header line such as "find UTR ...", then match and print every line that contains UTR, and finally print "end"
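In awk, this header / match / footer pattern is usually written with a BEGIN block, a /pattern/ rule, and an END block. The original command was shown only as a screenshot, so the following is a sketch under that assumption; demo_utr.gtf and the exact header text are invented for the illustration.

```shell
# Hypothetical tab-separated GTF-like lines, created only for this illustration.
printf 'chr1\tENSEMBL\tUTR\t1737\t2090\n'   > demo_utr.gtf
printf 'chr1\tENSEMBL\texon\t1737\t2090\n' >> demo_utr.gtf
printf 'chr1\tENSEMBL\tUTR\t2476\t2584\n'  >> demo_utr.gtf

# BEGIN runs once before any input is read, END once after the last line;
# the /UTR/ rule runs for every input line that contains "UTR".
report=$(awk 'BEGIN {print "find UTR"}
              /UTR/ {print $3, $4, $5}
              END   {print "end"}' demo_utr.gtf)

echo "$report"
```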

awk built-in variables


  $ less -S example.gtf | awk '{print $3,$4,$5}'| head
  UTR 1737 2090
  exon 1737 2090
  transcript 1737 4275
  gene 1737 4275
  exon 1873 1920
  transcript 1873 3533
  exon 2042 2090
  exon 2476 2560
  UTR 2476 2584
  exon 2476 2584
  # Print columns 3, 4, and 5
  $ less -S example.gtf | awk 'BEGIN{OFS=":"} {print $3,$4,$5}'| head
  UTR:1737:2090
  exon:1737:2090
  transcript:1737:4275
  gene:1737:4275
  exon:1873:1920
  transcript:1873:3533
  exon:2042:2090
  exon:2476:2560
  UTR:2476:2584
  exon:2476:2584
  # Set the output separator between columns to ":"; the double quotes around it are required
  July8 22:22:08 ~/Data
  $ less -S example.gtf | awk '{print $3":"$4"\t"$5}'| head | cat -A
  UTR:1737^I2090$
  exon:1737^I2090$
  transcript:1737^I4275$
  gene:1737^I4275$
  exon:1873^I1920$
  transcript:1873^I3533$
  exon:2042^I2090$
  exon:2476^I2560$
  UTR:2476^I2584$
  exon:2476^I2584$
  # Separators can also be written directly into the print statement
  $ less -S example.gtf | awk '{print $3":"$4"\t"$5}'| head
  UTR:1737 2090
  exon:1737 2090
  transcript:1737 4275
  gene:1737 4275
  exon:1873 1920
  transcript:1873 3533
  exon:2042 2090
  exon:2476 2560
  UTR:2476 2584
  exon:2476 2584
  # Displayed normally (without cat -A)

The input field separator is a tab; NR is the current line (record) number
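A compact sketch of the built-in variables FS, OFS, NR, and NF working together; demo_vars.tsv is a hypothetical two-line file made up for this illustration.

```shell
# Hypothetical tab-separated input, created only for this illustration.
printf 'chr1\tUTR\t1737\t2090\n'   > demo_vars.tsv
printf 'chr1\texon\t1737\t2090\n' >> demo_vars.tsv

# FS  : input field separator (tab here)
# OFS : output field separator, placed between the comma-separated print arguments
# NR  : number of the current line (record)
# NF  : number of fields on the current line
numbered=$(awk 'BEGIN {FS = "\t"; OFS = ":"} {print NR, NF, $2}' demo_vars.tsv)

echo "$numbered"
```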

awk conditionals and loops

The condition is written inside ()
If the condition is true, print the whole line; otherwise print the third column followed by "is not gene"
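A sketch of such an if/else rule. The original command was a screenshot, so the test on the third column ($3 == "gene") is an assumption inferred from the "is not gene" message; demo_if.gtf is a hypothetical file.

```shell
# Hypothetical tab-separated GTF-like lines, created only for this illustration.
printf 'chr1\tHAVANA\tgene\t1737\t4275\n'   > demo_if.gtf
printf 'chr1\tENSEMBL\texon\t1737\t2090\n' >> demo_if.gtf

# The condition goes inside ( ); $0 is the whole line, $3 the feature type.
labelled=$(awk -F'\t' '{
    if ($3 == "gene") {
        print $0
    } else {
        print $3, "is not gene"
    }
}' demo_if.gtf)

echo "$labelled"
```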

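A for loop. The loop runs once per input line: starting with the first line, it prints column 1, then i++ moves on to column 2, then i++ again to column 3; the same is then repeated for the second line, and so on.
paste is given three slots, so the values are placed three per line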
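The loop and the regrouping step can be sketched as below. The original commands were screenshots, so both the loop bounds and the use of `paste - - -` (three "-" arguments, each standing for standard input) are assumptions; demo_loop.txt is an invented input file.

```shell
# Hypothetical input with three columns per line, created only for this illustration.
printf 'a b c\nd e f\n' > demo_loop.txt

# For every input line, i runs from 1 to 3 and each field is printed on its own line.
one_per_line=$(awk '{ for (i = 1; i <= 3; i++) print $i }' demo_loop.txt)

# paste - - - reads three lines at a time from standard input
# and joins them with tabs, restoring three values per line.
regrouped=$(printf '%s\n' "$one_per_line" | paste - - -)

echo "$one_per_line"
echo "$regrouped"
```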

Compute exon lengths
Division
Truncation to an integer
Truncation is not the same as rounding to the nearest integer, so add 0.5 to each result before truncating
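The arithmetic can be sketched as follows. The original commands were screenshots, so the details here are assumptions: the length is computed as end - start + 1 because GTF coordinates are 1-based and inclusive, and the divisor 100 is arbitrary, chosen only so that the results are not whole numbers. demo_exon.gtf is a hypothetical file.

```shell
# Hypothetical tab-separated GTF-like lines, created only for this illustration.
printf 'chr1\tENSEMBL\texon\t1737\t2090\n'  > demo_exon.gtf
printf 'chr1\tENSEMBL\texon\t1873\t1920\n' >> demo_exon.gtf
printf 'chr1\tENSEMBL\tUTR\t2476\t2584\n'  >> demo_exon.gtf

# Exon length: end ($5) minus start ($4), plus 1 for inclusive coordinates.
lengths=$(awk -F'\t' '$3 == "exon" {print $5 - $4 + 1}' demo_exon.gtf)

# Division, truncation with int(), and rounding via int(x + 0.5).
rounded=$(awk -F'\t' '$3 == "exon" {
    x = ($5 - $4 + 1) / 100
    print x, int(x), int(x + 0.5)
}' demo_exon.gtf)

echo "$lengths"
echo "$rounded"
```

int() simply drops the fractional part, so int(x + 0.5) rounds to the nearest integer only for non-negative values, which is sufficient for lengths.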

[Screenshot: exercises]

Source: 生信技能树