前言
CRISPR因为具有NHEJ
效应,在修复过程中会造成一定长度的突变,这种突变包括小片段插入(insertion),小片段缺失(deletion),单碱基突变(substitution)。
为了有效准确的分析突变情况,我们需要对测序序列和参考序列进行全局比对,以期望发现不同种类的突变情况
为了加速双序列比对,我们采用cython
模块来加速,这里就直接采用别人写好的开源程序Github: brentp / align
git clone https://github.com/brentp/align.git
cd align
python3 setup.py build
python3 setup.py install
这个模块的用法也很简单,但要注意该模块包含python
和cython
两种比对,当然cython的速度会更快
#for python align
from align.align import aligner as pyaligner
#for cython align
from align.calign import aligner as caligner
from align.matrix import DNAFULL
序列比对有三种方法,分别是global
,glogal_cfe
,glocal
分别针对全局比对,全局长序列比对以及局部比对。
下面直接就是主函数
from align.calign import aligner
from align.matrix import DNAFULL
import re
from collections import Counter
import argparse
parser = argparse.ArgumentParser(
description='This script was used to calculate indel of 37bp trap library')
parser.add_argument(
'-r', '--ref', help='Please input your reference file')
parser.add_argument(
'-f', '--fasta', help='Please input your filter sequence.')
parser.add_argument('-e', '--edits', help='Please input your edits style. eg: insertion;deletion;sub')
parser.add_argument('-m', '--methods', help='Please input your select method. eg: size;pos;ss')
args = parser.parse_args()
##seq1 is ref
def find_insertion(str1, str2):
indel = []
j = 0
for i in re.finditer("-+", str1):
pos = i.start() + 1 - j
size = i.end() - i.start()
ss = str2[i.start(): i.end()]
j = j + size
indel.append((pos, size, ss))
return indel
def find_deletion(str1, str2):
indel = []
new_s1 = str1.replace("-", "")
new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])
for i in re.finditer("-+", new_s2):
pos = i.start() + 1
size = i.end() - i.start()
ss = new_s1[i.start(): i.end()]
indel.append((pos, size, ss))
return indel
def find_sub(str1, str2):
sub = []
new_s1 = str1.replace("-", "")
new_s2 = "".join([str2[i] for i in range(len(str1)) if str1[i] != "-"])
for i in range(len(new_s1)):
if new_s1[i] != new_s2[i] and new_s2[i] != "-":
pos = i + 1
size = 1
ss = new_s2[i]
sub.append((pos, size, ss))
else:
continue
return sub
if args.ref:
lb_lst = []
rf_lst = []
with open(args.ref, "r") as f:
for line in f:
label, ref_seq = line.split(",")
lb_lst.append(label.strip())
rf_lst.append(ref_seq.strip("\n"))
else:
print("No ref input!")
exit(0)
if args.fasta:
fasta = {}
with open(args.fasta, 'r') as f:
for line in f:
if line.startswith(">"):
ss_lst = []
name = line.strip("\n").replace(">", "")
elif line != "\n":
ss_lst.append(line.strip("\n"))
fasta[name] = ss_lst
else:
continue
else:
print("No sequence input!")
exit(0)
haha_lst = []
# calculate insertion for every site
for lb, rf in zip(lb_lst, rf_lst):
for val in fasta[lb]:
ag = aligner(val, rf, method="global_cfe", matrix=DNAFULL)
bidui = ag.pop()
ctrl = bidui.seq2.decode()
case = bidui.seq1.decode()
if args.edits == "insertion":
haha_lst.extend(find_insertion(ctrl, case))
elif args.edits == "deletion":
haha_lst.extend(find_deletion(ctrl, case))
elif args.edits == "sub":
haha_lst.extend(find_sub(ctrl, case))
else:
print("please input correct call methods!eg:insertion;deletion;sub")
if len(haha_lst) > 0:
if args.methods == "pos":
new_lst = [haha[0] for haha in haha_lst if len(haha) == 3]
fin = Counter(new_lst)
print("{} in every site is:".format(args.edits))
for i in range(37):
if fin.get((i+1), 0):
print("{}\t{}".format((i + 1), fin[i + 1]))
else:
print("{}\t{}".format((i+1), 0))
elif args.methods == "size":
new_lst = [haha[1] for haha in haha_lst if len(haha) == 3]
fin = Counter(new_lst)
print("{} in every size is:".format(args.edits))
for i in range(1, 10):
if fin.get((i), 0):
print("{}\t{}".format((i), fin[i]))
else:
print("{}\t{}".format((i), 0))
elif args.methods == "ss":
new_lst = [haha[2] for haha in haha_lst if len(haha) == 3]
fin = Counter(new_lst)
print("{} in every sequence is:".format(args.edits))
for key, vals in fin.items():
print("{}\t{}".format(key, vals))
else:
print("No {}".format(args.edits))
用法
如果想知道fasta序列中相对于ref序列,分别在1-37bp上的插入是多少,可以键入一下参数
python3 cal_indel.py -r ref.txt -f fasta.fa -e insertion -m pos
#cat ref.txt
1,AGAAGTGGAGCTGCAGCTGCAGGCAGCTCCCGGATCC
2,AGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCA
参考序列是两列文件,第一列代表序号,第二列代表参考序列
#cat fasta.fa
>1
AGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCC
AGAAGTGGAGCTGCAGATCC
AGAAGTGGAGCTGCAGATCC
AGAAGTGGAGCT
AGAAGTGGAGCTGCAGCTGCAGGCAGCCTCCCGGATCC
AGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCC
AGAAGTGGAGCTGCAGCTGCAGGCAGAGCTCCTGGATCC
AGAAGTGGAGCT
AGAAGTGGAGCTGCAGCTGCAGGCTGGCCCGGATCC
>2
AGTGGCTGAGTATGATCAGTGTCCAGTCTGGCCCA
AGTGGCTGAGTATGATCAGTGTCTGGCCCA
AGTGGCTGAGTATGATCAGCCCA
AGTGGCTGAGTATGATCAGTGTCCA
AGTGGCTGAGTATGATCAGTGTCCAGTGTCTGGCCCA
AGTGGCTGAGTATGATCAGTGTGTCTGGCCCA
AGTGGCTGAATATGATCAGTGTCCAGTTGTCTGGCCCA
AGTGGCTGAGTATGATCAGTGTCCAG
AGTGGCTGAGTATGATCAGTGTGTCTGGCCCA
AATGGCTGAGTATGATCAGTGTCCAGATCTGGCCCA
AGTGGCTGAATAGTCTGGCCCA
输入序列为序列名称为参考序列对应的编号,而后面跟着以换行符分隔的小片段序列
-e
参数可以分别输入insertion
,deletion
,sub
,分别对应插入,缺失以及单碱基替换
-m
参数可以分别输入size
,pos
,ss
分别对应按照大小,位置和序列进行统计
注意:如果报global_cfe的错误,就是因为比对序列太多了,更换比对参数为global即可!