前言

Crispr基因编辑正越来越广泛的应用在各个方面,包括科研,医疗等等,针对经过筛选的药物靶向基因设计gRNA,使其由原始的基因序列突变为终止密码子,从而无法表达蛋白,进而治疗疾病或者抵抗药物

算法

Gene 沉默效率计算方法 - 图1

初步过滤:

  1. 将12ktrap中,N1-N20内C>T成功的trap提取出来(所有编辑后的reads,不做trap的去重),定义为dataset#1
  2. 将dataset#1分成2部分,一部分是gRNA在基因的正义链(+)上,定义为dataset#1(+),另一部分在反义链(-)上,定义为dataset#1(-);

正义链的情况(CAG/CGA/CAA > TAG/TGA/TAA)

  1. 找出dataset#1(+)N1-N20中 C>T成功的位置,并往后推2个碱基,判断其是否为CAG或CGA或CAA(编辑前) ,满足其一即抽提出来,完成后定义为dataset#2(+),同时标记出突变位点在gRNA中的位置,譬如N=5, N=6等;
  2. 将dataset#2(+)中的gRNA(不包含我们人工添加的第一个碱基G)mapping回对应基因的CDS中,数出从ATG(翻译起始位置)的A到gRNA第一个碱基的碱基数M,随后算出 (M-1+N-1)/3, 判断结果是否为整数,如果是整数,判断该基因被stop成功,输出”TRAP ID + gRNA sequence + strand (+/-) + C>T position + C>T efficiency”;
  3. 举例说明4)中算法:如N=6, M=11, 则说明突变位点的C距离ATG位置是10+5=15个碱基,即15/3=5个氨基酸,到突变位点(如CAG)恰好是3的整数倍,该基因被stop成功;

反义链的情况(CCA > TCA (TGA) / CTA(TAG) / TTA (TAA))

  1. 找出dataset#1(-)N1-N20中 C>T成功的位置,并判断其是否在CCA这样的motif中(编辑前),如果在,输出trap放到dataset#2(-)中,并标记出CCA motif中A所在gRNA中的位置N,譬如N=5, N=6等;
  2. 将dataset#2(-)中的gRNA提取出来(此时gRNA的序列应该是和CDS正义链互补的),数出从ATG(翻译起始位置)的A到gRNA第一个碱基的碱基数M (注意此时gRNA为CCNN20,gRNA第一个碱基在最右端),随后算出(M-N)/3是否为整数,如果是整数,判断该基因被stop成功,输出”TRAP ID + gRNA sequence + strand (+/-) + C>T position + C>T efficiency”;

需求是:

  1. 获取每条TRAP序列所对应的stop效率,效率的计算方法是针对一条TRAP来说,至少有一个位点能够stop成功,那么这条对应的read为成功一次,一次类推,既不能重复计数,也不能少计数
  2. 获取每个位点对应的stop效率,如上原理所述
  3. 获取基因的stop率:stop-gene/all-gene

我使用的代码

代码一:此代码用于输出所有通过的stop reads并计算每条TRAP的stop效率

  1. # -*- coding=utf-8 -*-
  2. import re
  3. def complement(seq):
  4. return seq.translate(str.maketrans('ACGT', 'TGCA'))
  5. # Obtain reverse complementary sequence
  6. def revcomp(seq):
  7. return complement(seq)[::-1]
  8. # Input n site, all CDs sequences and trap37bp sequences, and check whether the site can stop
  9. # 1 for success, 0 for failure
  10. def stop_test_for_Z(n, cds_lst, trap):
  11. new_cds_lst = [cds for cds in cds_lst if trap in cds]
  12. score_lst = []
  13. if len(new_cds_lst) > 0:
  14. for haha in new_cds_lst:
  15. m = haha.find(trap) + 1 + 10
  16. score = (m - 1 + n - 1)
  17. score_lst.append(score)
  18. if len([new_score for new_score in score_lst if new_score % 3 == 0]) > 0:
  19. return 1
  20. else:
  21. return 0
  22. else:
  23. return 0
  24. def stop_test_for_F(n, cds_lst, trap):
  25. trap = revcomp(trap)
  26. new_cds_lst = [cds for cds in cds_lst if trap in cds]
  27. score_lst = []
  28. if len(new_cds_lst) > 0:
  29. for haha in new_cds_lst:
  30. m = haha.find(trap) + 1 + 10 + 23
  31. score = (m - n)
  32. score_lst.append(score)
  33. if len([new_score for new_score in score_lst if new_score % 3 == 0]) > 0:
  34. return 1
  35. else:
  36. return 0
  37. else:
  38. return 0
  39. # Read all CDS files and return a dictionary. The key is gene and the value is CDS sequence
  40. def read_cds(fs):
  41. fasta = {}
  42. with open(fs, 'r') as f:
  43. for line in f:
  44. if line.startswith(">"):
  45. seq_lst = []
  46. name = line.strip("\n").replace(">", "")
  47. else:
  48. seq_lst.append(line.strip("\n"))
  49. fasta[name] = "".join(seq_lst)
  50. new_fasta = {key: value for key,
  51. value in fasta.items() if "unavailable" not in value}
  52. return new_fasta
  53. # Read all the trap sequences and return a dictionary. The key is trap label and the value is 37bp
  54. def read_seq(fs):
  55. fasta = {}
  56. with open(fs, 'r') as f:
  57. for line in f:
  58. if line.startswith(">"):
  59. seq_lst = []
  60. name = line.strip("\n").replace(">", "")
  61. else:
  62. seq_lst.append(line.strip("\n"))
  63. fasta[name] = seq_lst
  64. return fasta
  65. # Read all reference sequences. The key is label and the value is 37bptrap sequence
  66. def read_ref(fs):
  67. ref = {}
  68. with open(fs, 'r') as f:
  69. for line in f:
  70. if line.startswith(">"):
  71. name = line.strip("\n").replace(">", "")
  72. else:
  73. ref[name] = line.strip("\n")[118:155]
  74. return ref
  75. # Read the corresponding relationship between gene and label and return to the dictionary
  76. def read_duiying(fs):
  77. duiying = {}
  78. with open(fs, 'r') as f:
  79. for line in f:
  80. lable, gene = line.split("\t")
  81. duiying[lable.strip("\n")] = gene.strip("\n")
  82. return duiying
  83. def motif_Z(trap, site):
  84. if trap[10:30][site: site + 3] in ["CAG", "CGA", "CAA"]:
  85. return 1
  86. else:
  87. return 0
  88. def motif_F(trap, site):
  89. if "CCA" in trap[10:30][site-2: site + 3]:
  90. return 1
  91. else:
  92. return 0
  93. lable_map_gene = read_duiying("../duiying.txt")
  94. all_cds = read_cds("../mart_export.txt")
  95. all_ref = read_ref("ref.fa")
  96. all_trapseq = read_seq("new508.filter.fa")
  97. eff_lst = []
  98. for name, traps in all_ref.items():
  99. gene = lable_map_gene[name]
  100. cds_lst = [vals for keys, vals in all_cds.items() if gene in keys]
  101. pass_lst = []
  102. if name in list(all_trapseq.keys()):
  103. alltrap = len(all_trapseq[name])
  104. seq_lst = all_trapseq[name]
  105. for m in re.finditer("C", traps[10:30]):
  106. if motif_Z(traps, m.start()):
  107. if stop_test_for_Z((m.start() + 1), cds_lst, traps):
  108. while seq_lst:
  109. gg=seq_lst.pop(0)
  110. if gg[10:30][m.start()] == "T":
  111. pass_lst.append(gg)
  112. if len(pass_lst) > 0:
  113. with open("pass_lst_z.fa", 'a') as wocao:
  114. wocao.write(">{}\n".format(name))
  115. for nima in pass_lst:
  116. wocao.write("{}\n".format(nima))
  117. if alltrap > 0:
  118. eff_lst.append([name, len(pass_lst) / alltrap])
  119. else:
  120. print(name)
  121. for name, traps in all_ref.items():
  122. gene = lable_map_gene[name]
  123. cds_lst = [vals for keys, vals in all_cds.items() if gene in keys]
  124. pass_lst = []
  125. if name in list(all_trapseq.keys()):
  126. alltrap = len(all_trapseq[name])
  127. seq_lst2 = all_trapseq[name]
  128. for m in re.finditer("C", traps[10:30]):
  129. if motif_F(traps, m.start()):
  130. if stop_test_for_F(((traps[10:30][m.start()-2: m.start() + 3]).find("CCA")+1+2), cds_lst, traps):
  131. while seq_lst2:
  132. gg=seq_lst2.pop(0)
  133. if gg[10:30][m.start()] == "T":
  134. pass_lst.append(gg)
  135. if len(pass_lst) > 0:
  136. with open("pass_lst_f.fa", 'a') as wocao:
  137. wocao.write(">{}\n".format(name))
  138. for nima in pass_lst:
  139. wocao.write("{}\n".format(nima))
  140. if alltrap > 0:
  141. eff_lst.append([name, len(pass_lst) / alltrap])
  142. else:
  143. print(name)
  144. with open("trap_stop_eff.txt", 'a') as l:
  145. l.write("trap-lable\ttrap-effective\n")
  146. for lable, effctive in eff_lst:
  147. l.write("{}\t{}\n".format(lable, effctive))

代码二:该代码用于计算需求2

  1. # -*- coding=utf-8 -*-
  2. import re
  3. def complement(seq):
  4. return seq.translate(str.maketrans('ACGT', 'TGCA'))
  5. # 获取反向互补序列
  6. def revcomp(seq):
  7. return complement(seq)[::-1]
  8. # 输入n位点,对应的所有CDS序列以及trap37bp序列,检查该位点能否stop
  9. # 成功返回1,失败返回0
  10. def stop_test_for_Z(n, cds_lst, trap):
  11. new_cds_lst = [cds for cds in cds_lst if trap in cds]
  12. score_lst = []
  13. if len(new_cds_lst) > 0:
  14. for haha in new_cds_lst:
  15. m = haha.find(trap) + 1 + 10
  16. score = (m - 1 + n - 1)
  17. score_lst.append(score)
  18. if len([new_score for new_score in score_lst if score % 3 == 0]) > 0:
  19. return 1
  20. else:
  21. return 0
  22. else:
  23. return 0
  24. def stop_test_for_F(n, cds_lst, trap):
  25. trap = revcomp(trap)
  26. new_cds_lst = [cds for cds in cds_lst if trap in cds]
  27. score_lst = []
  28. if len(new_cds_lst) > 0:
  29. for haha in new_cds_lst:
  30. m = haha.find(trap) + 1 + 10 + 23
  31. score = (m - n)
  32. score_lst.append(score)
  33. if len([new_score for new_score in score_lst if score % 3 == 0]) > 0:
  34. return 1
  35. else:
  36. return 0
  37. else:
  38. return 0
  39. # 读取所有的cds文件返回一个字典,键为gene,值为cds序列
  40. def read_cds(fs):
  41. fasta = {}
  42. with open(fs, 'r') as f:
  43. for line in f:
  44. if line.startswith(">"):
  45. seq_lst = []
  46. name = line.strip("\n").replace(">", "")
  47. else:
  48. seq_lst.append(line.strip("\n"))
  49. fasta[name] = "".join(seq_lst)
  50. new_fasta = {key: value for key,
  51. value in fasta.items() if "unavailable" not in value}
  52. return new_fasta
  53. # 读取所有的trap序列返回一个字典,键为trap-lable,值为37bp列表
  54. def read_seq(fs):
  55. fasta = {}
  56. with open(fs, 'r') as f:
  57. for line in f:
  58. if line.startswith(">"):
  59. seq_lst = []
  60. name = line.strip("\n").replace(">", "")
  61. else:
  62. seq_lst.append(line.strip("\n"))
  63. fasta[name] = seq_lst
  64. return fasta
  65. # 读取所有的参考序列,键为lable,值为37bptrap序列
  66. def read_ref(fs):
  67. ref = {}
  68. with open(fs, 'r') as f:
  69. for line in f:
  70. if line.startswith(">"):
  71. name = line.strip("\n").replace(">", "")
  72. else:
  73. ref[name] = line.strip("\n")[118:155]
  74. return ref
  75. # 读取基因和lable的对应关系,返回字典
  76. def read_duiying(fs):
  77. duiying = {}
  78. with open(fs, 'r') as f:
  79. for line in f:
  80. lable, gene = line.split("\t")
  81. duiying[lable.strip("\n")] = gene.strip("\n")
  82. return duiying
  83. def motif_Z(trap, site):
  84. if trap[10:30][site: site + 3] in ["CAG", "CGA", "CAA"]:
  85. return 1
  86. else:
  87. return 0
  88. def motif_F(trap, site):
  89. if "CCA" in trap[10:30][site-2: site + 3]:
  90. return 1
  91. else:
  92. return 0
  93. lable_map_gene = read_duiying("../duiying.txt")
  94. all_cds = read_cds("../mart_export.txt")
  95. all_ref = read_ref("ref.fa")
  96. all_trapseq = read_seq("new508.filter.fa")
  97. with open("pass_lst_f.txt", 'a') as wocao:
  98. wocao.write(
  99. "TRAP ID\tGene\tgRNA sequence\tstrand\tC>T position\tC>T efficiency\n")
  100. with open("pass_lst_z.txt", 'a') as wocao:
  101. wocao.write(
  102. "TRAP ID\tGene\tgRNA sequence\tstrand\tC>T position\tC>T efficiency\n")
  103. for name, traps in all_ref.items():
  104. gene = lable_map_gene[name]
  105. cds_lst = [vals for keys, vals in all_cds.items() if gene in keys]
  106. pass_lst = []
  107. for m in re.finditer("C", traps[10:30]):
  108. if motif_Z(traps, m.start()):
  109. if stop_test_for_Z((m.start() + 1), cds_lst, traps) and name in list(all_trapseq.keys()):
  110. i = 0
  111. for gg in all_trapseq[name]:
  112. if gg[10:30][m.start()] == "T":
  113. i = i + 1
  114. eff = i/len(all_trapseq[name])
  115. pass_lst.append(
  116. [name, traps[10:33], "+", (m.start() + 1), eff])
  117. if len(pass_lst) > 0:
  118. with open("pass_lst_z.txt", 'a') as wocao:
  119. for nima in pass_lst:
  120. [trap_id, trap_seq, strand, ctposition, efftion] = nima
  121. wocao.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(
  122. trap_id, gene, trap_seq, strand, ctposition, efftion))
  123. for name, traps in all_ref.items():
  124. gene = lable_map_gene[name]
  125. cds_lst = [vals for keys, vals in all_cds.items() if gene in keys]
  126. pass_lst = []
  127. for m in re.finditer("C", traps[10:30]):
  128. if motif_F(traps, m.start()):
  129. if stop_test_for_F(((traps[10:30][m.start()-2: m.start() + 3]).find("CCA")+1+2), cds_lst, traps) and name in list(all_trapseq.keys()):
  130. j = 0
  131. for gg in all_trapseq[name]:
  132. if gg[10:30][m.start()] == "T":
  133. j = j + 1
  134. eff = j/len(all_trapseq[name])
  135. pass_lst.append(
  136. [name, traps[10:33], "-", (m.start() + 1), eff])
  137. if len(pass_lst) > 0:
  138. with open("pass_lst_f.txt", 'a') as wocao:
  139. for nima in pass_lst:
  140. [trap_id, trap_seq, strand, ctposition, efftion] = nima
  141. wocao.write("{}\t{}\t{}\t{}\t{}\t{}\n".format(
  142. trap_id, gene, trap_seq, strand, ctposition, efftion))

该博客仅记录工作流程,以作为备忘录,请勿过分借鉴,也不要转载,感谢!