前言

CRISPR因为具有NHEJ效应,在修复过程中会造成一定长度的突变,这种突变包括小片段插入(insertion),小片段缺失(deletion),单碱基突变(substitution)

为了有效准确的分析突变情况,我们需要对测序序列和参考序列进行全局比对,以期望发现不同种类的突变情况

为了加速双序列比对,我们采用cython模块来加速,这里就直接采用别人写好的开源程序Github: brentp / align

  1. git clone https://github.com/brentp/align.git
  2. cd align
  3. python3 setup.py build
  4. python3 setup.py install

这个模块的用法也很简单,但要注意该模块包含pythoncython两种比对,当然cython的速度会更快

  1. #for python align
  2. from align.align import aligner as pyaligner
  3. #for cython align
  4. from align.calign import aligner as caligner
  5. from align.matrix import DNAFULL

序列比对有三种方法,分别是global,glogal_cfe,glocal分别针对全局比对,全局长序列比对以及局部比对。

下面直接就是主函数

  1. from align.calign import aligner
  2. from align.matrix import DNAFULL
  3. import re
  4. from collections import Counter
  5. import argparse
  6. parser = argparse.ArgumentParser(
  7. description='This script was used to calculate indel of 37bp trap library')
  8. parser.add_argument(
  9. '-r', '--ref', help='Please input your reference file')
  10. parser.add_argument(
  11. '-f', '--fasta', help='Please input your filter sequence.')
  12. parser.add_argument('-e', '--edits', help='Please input your edits style. eg: insertion;deletion;sub')
  13. parser.add_argument('-m', '--methods', help='Please input your select method. eg: size;pos;ss')
  14. args = parser.parse_args()
  15. ##seq1 is ref
  16. def find_insertion(str1, str2):
  17. indel = []
  18. j = 0
  19. for i in re.finditer("-+", str1):
  20. pos = i.start() + 1 - j
  21. size = i.end() - i.start()
  22. ss = str2[i.start(): i.end()]
  23. j = j + size
  24. indel.append((pos, size, ss))
  25. return indel
  26. def find_deletion(str1, str2):
  27. indel = []
  28. new_s1 = str1.replace("-", "")
  29. new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])
  30. for i in re.finditer("-+", new_s2):
  31. pos = i.start() + 1
  32. size = i.end() - i.start()
  33. ss = new_s1[i.start(): i.end()]
  34. indel.append((pos, size, ss))
  35. return indel
  36. def find_sub(str1, str2):
  37. sub = []
  38. new_s1 = str1.replace("-", "")
  39. new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])
  40. for i in range(len(new_s1)):
  41. if new_s1[i] != new_s2[i] and new_s2[i] != "-":
  42. pos = i + 1
  43. size = 1
  44. ss = new_s2[i]
  45. sub.append((pos, size, ss))
  46. else:
  47. continue
  48. return sub
  49. if args.ref:
  50. lb_lst = []
  51. rf_lst = []
  52. with open(args.ref, "r") as f:
  53. for line in f:
  54. label, ref_seq = line.split(",")
  55. lb_lst.append(label.strip())
  56. rf_lst.append(ref_seq.strip("\n"))
  57. else:
  58. print("No ref input!")
  59. exit(0)
  60. if args.fasta:
  61. fasta = {}
  62. with open(args.fasta, 'r') as f:
  63. for line in f:
  64. if line.startswith(">"):
  65. ss_lst = []
  66. name = line.strip("\n").replace(">", "")
  67. elif line != "\n":
  68. ss_lst.append(line.strip("\n"))
  69. fasta[name] = ss_lst
  70. else:
  71. continue
  72. else:
  73. print("No sequence input!")
  74. exit(0)
  75. haha_lst = []
  76. # calculate insertion for every site
  77. for lb, rf in zip(lb_lst, rf_lst):
  78. for val in fasta[lb]:
  79. ag = aligner(val, rf, method="global_cfe", matrix=DNAFULL)
  80. bidui = ag.pop()
  81. ctrl = bidui.seq2.decode()
  82. case = bidui.seq1.decode()
  83. if args.edits == "insertion":
  84. haha_lst.extend(find_insertion(ctrl, case))
  85. elif args.edits == "deletion":
  86. haha_lst.extend(find_deletion(ctrl, case))
  87. elif args.edits == "sub":
  88. haha_lst.extend(find_sub(ctrl, case))
  89. else:
  90. print("please input correct call methods!eg:insertion;deletion;sub")
  91. if len(haha_lst) > 0:
  92. if args.methods == "pos":
  93. new_lst = [haha[0] for haha in haha_lst if len(haha) == 3]
  94. fin = Counter(new_lst)
  95. print("{} in every site is:".format(args.edits))
  96. for i in range(37):
  97. if fin.get((i+1), 0):
  98. print("{}\t{}".format((i + 1), fin[i + 1]))
  99. else:
  100. print("{}\t{}".format((i+1), 0))
  101. elif args.methods == "size":
  102. new_lst = [haha[1] for haha in haha_lst if len(haha) == 3]
  103. fin = Counter(new_lst)
  104. print("{} in every size is:".format(args.edits))
  105. for i in range(1, 10):
  106. if fin.get((i), 0):
  107. print("{}\t{}".format((i), fin[i]))
  108. else:
  109. print("{}\t{}".format((i), 0))
  110. elif args.methods == "ss":
  111. new_lst = [haha[2] for haha in haha_lst if len(haha) == 3]
  112. fin = Counter(new_lst)
  113. print("{} in every sequence is:".format(args.edits))
  114. for key, vals in fin.items():
  115. print("{}\t{}".format(key, vals))
  116. else:
  117. print("No {}".format(args.edits))

用法

如果想知道fasta序列中相对于ref序列,分别在1-37bp上的插入是多少,可以键入一下参数

  1. python3 cal_indel.py -r ref.txt -f fasta.fa -e insertion -m pos
  1. #cat ref.txt
  2. 1,AGAAGTGGAGCTGCAGCTGCAGGCAGCTCCCGGATCC
  3. 2,AGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCA

参考序列是两列文件,第一列代表序号,第二列代表参考序列

  1. #cat fasta.fa
  2. >1
  3. AGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCC
  4. AGAAGTGGAGCTGCAGATCC
  5. AGAAGTGGAGCTGCAGATCC
  6. AGAAGTGGAGCT
  7. AGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCC
  8. AGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCC
  9. AGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCC
  10. AGAAGTGGAGCT
  11. AGAAGTGGAGCTGCAGCTGCAGGCTGGCCCGGATCC
  12. >2
  13. AGTGGCTGAGTATGATCAGTGTCCAGTCTGGCCCA
  14. AGTGGCTGAGTATGATCAGTGTCTGGCCCA
  15. AGTGGCTGAGTATGATCAGCCCA
  16. AGTGGCTGAGTATGATCAGTGTCCA
  17. AGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCA
  18. AGTGGCTGAGTATGATCAGTGTGTCTGGCCCA
  19. AGTGGCTGAATATGATCAGTGTCCAGTTGTCTGGCCCA
  20. AGTGGCTGAGTATGATCAGTGTCCAG
  21. AGTGGCTGAGTATGATCAGTGTGTCTGGCCCA
  22. AATGGCTGAGTATGATCAGTGTCCAGATCTGGCCCA
  23. AGTGGCTGAATAGTCTGGCCCA

输入序列为序列名称为参考序列对应的编号,而后面跟着以换行符分隔的小片段序列

-e参数可以分别输入insertion,deletion,sub,分别对应插入,缺失以及单碱基替换
-m参数可以分别输入size,pos,ss分别对应按照大小,位置和序列进行统计

注意:如果报global_cfe的错误,就是因为比对序列太多了,更换比对参数为global即可!