以下的突变检测都是基于NGS测序的结果,但是需要预先使用bwa等工具预先比对到参考序列或者基因组上,
然后根据要研究的目标区域提取比对上的序列作进一步的分析,一个有用的方式就是使用samtools.

  1. samtools view -b -F chrN:NNNNN-NNNNN > contig_1.bam

因为提取出来的都是短序列,因此,我们可以在下游使用全局比对(Needleman-Wunsch)来获取比对详情。
经过简单的处理,我将在bam文件中提取比对到参考序列上的短序列并存入Fasta格式
query:

  1. >10613
  2. CATTGAGATGGTGTGGGAAGGGGCCC
  3. CATTGAGATAGTGTGGCCCGTTTGAAGGGGCCC
  4. CATTGAGATAGAAGGGGCCC
  5. CATTGAGATAGAAGGGGCCC
  6. CATTGAGATAGTGTGG
  7. CATTGAGATAGGCCC
  8. CATTGAGATAGTGTGGGCCC
  9. CATTGAGATAGTGTGGGAAGGGGCCC
  10. CATTGAGATAGTGTGGGAAGGGGCCC
  11. CATTGAGATAGTGTGGGCCC
  12. CATTGAGATAGTGTGGGAAGGGGCCC
  13. CATTGAGATAGTGTGGGAAGGGGCCC
  14. CATTGAGATAGTGTGGGCCC
  15. CATTGAGATAGTGCCC
  16. CATTGGGCCC
  17. CATTGAGATAGGGCCC
  18. CATTGAGATATGTGGGGAAGGGGCCC
  19. '''''

ref:

  1. CATTGAGATAGTGTGGGGAAGGGGCCC

突变检测

目标:统计出indel,snp等常见突变的位置,突变量(nbp突变)以及突变频率

需要借助python多序列比对模块,这里我们采用github上的一个

align

  1. ##导入所有需要的模块
  2. from align.calign import aligner as caligner
  3. from align.matrix import DNAFULL, BLOSUM62
  4. from collections import Counter
  5. import re
  6. from pyfaidx import Fasta
  7. import pandas as pd
  8. import numpy as np
  9. ##定义读入fasta序列的函数
  10. def read_res(fs):
  11. fasta={}
  12. with open(fs,'r') as f:
  13. for line in f:
  14. if line.startswith(">"):
  15. seq_lst=[]
  16. name=line.strip("\n").replace(">","")
  17. else:
  18. seq_lst.append(line.strip("\n"))
  19. fasta[name]=seq_lst
  20. return fasta
  21. ###采用正则表达式来提取突变
  22. def clean_insertion(ref,query):
  23. ins_lst=[]
  24. delet=[]
  25. for site in re.finditer(r"-+",ref):
  26. (start,end)=site.span()
  27. length=end-start
  28. ins_lst.append("{}I{}".format(start+1,length))
  29. delet.append(list(range(start,end)))
  30. delet_2=[y for x in delet for y in x]
  31. query="".join([query[i] for i in range(len(query)) if i not in delet_2])
  32. ref=ref.replace("-","")
  33. ins_lst2=";".join(ins_lst)
  34. return (ref,query,ins_lst2)
  35. def detective_del(ref,query):
  36. deletion=[]
  37. for st in re.finditer(r"-+",query):
  38. (start,end)=st.span()
  39. length=end-start
  40. deletion.append("{}D{}".format(start+1,length))
  41. deletion2=";".join(deletion)
  42. return (ref,query,deletion2)
  43. def detective_snp(ref,query):
  44. snp_lst=["{}M1".format(st+1) for st in range(len(ref)) if (query[st]!=ref[st])&(query[st]!="-")&(ref[st]!="-")]
  45. snp=";".join(snp_lst)
  46. return (ref,query,snp)
  47. def crispr_det(querys,ref):
  48. query_lst=Counter(querys)
  49. alreads=sum(list(query_lst.values()))
  50. fin=[]
  51. for key,val in query_lst.items():
  52. bijiao=caligner(ref,key,method="global",matrix=DNAFULL)
  53. (ref,query,insertions)=clean_insertion(bijiao[0].seq1.decode(),bijiao[0].seq2.decode())
  54. (ref,query,deletions)=detective_del(ref,query)
  55. (ref,query,mutations)=detective_snp(ref,query)
  56. tmp={"reference":ref,"query":query,"ins":insertions,"del":deletions,"mut":mutations,"percentage":"{:.2f}".format(val/alreads*100)}
  57. fin.append(tmp)
  58. return fin
  59. ##读入query&ref
  60. test=read_res("query.fa")
  61. reference = Fasta("reference.fasta")
  62. querys=test['10613']
  63. ref=reference["10613"][113:140].seq ## 请查阅 pyfaidx 模块介绍
  64. df=pd.DataFrame(crispr_det(querys,ref))
  65. df['percentage']=df['percentage'].astype(np.float64)
  66. df=df.sort_values('percentage',axis=0,ascending=False)
  67. df.head(10)

运行完后可以查看最终获得的结果:

reference query ins del mut percentage
6 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAGTGTGGG-AAGGGGCCC 18D1 16.53
5 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAGTGTGGG———-CCC 18D7 10.93
10 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATA-TGTGGGGAAGGGGCCC 11D1 10.10
30 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAGTGTGGGG———CCC 19D6 7.60
2 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAG———-AAGGGGCCC 12D7 5.34
14 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAGTGTGG————CCC 17D8 4.42
3 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAGTGTGG—————- 17D11 3.67
9 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGATAG-G-G————-CCC 12D1;14D1;16D9 3.09
13 CATTGAGATAGTGTGGGGAAGGGGCCC CATTGAGA-AG-G-GG————CCC 9D1;12D1;14D1;17D8 2.92
18 CATTGAGATAGTGTGGGGAAGGGGCCC C—-GA-A—G-G-GG————CCC 2D3;7D1;9D2;12D1;14D1;17D8 2.67

对以上结果的可视化展示

因为R语言画图是很方便的,所以这里采用R语言热图展示突变情况

  1. ##数据处理
  2. data<-read_delim("ontarget6.mutation.tsv",delim = "\t")
  3. ref<-stringr::str_split_fixed(data$reference[1],"",n=27)
  4. ref<-as.character(ref)
  5. need<-data%>%filter(percentage>=1)%>%tidyr::replace_na(replace = list(ins="-",del="-",mut="-"))
  6. plotData<-stringr::str_split_fixed(need$query,"",n=27)
  7. gRNA<-c(rep("1",23),rep("2",4))
  8. ##画
  9. Heatmap(plotData,
  10. cluster_rows = FALSE,
  11. cluster_columns = FALSE,
  12. column_split = c(rep("1",17),rep("2",10)),
  13. column_title = NULL,
  14. col = c("A"="#d1e8cd","G"="#ffffda","C"="#fae6d1","T"="#e4daec","-"="#e8e9ea"),
  15. cell_fun = function(j,i,x,y,wt,ht,fill){
  16. if(plotData[i,j]!=ref[j]){
  17. grid.text(sprintf("%s", plotData[i, j]), x, y, gp = gpar(fontsize = 10,fontface="bold"))
  18. }else{
  19. grid.text(sprintf("%s", plotData[i, j]), x, y, gp = gpar(fontsize = 10))
  20. }
  21. },
  22. top_annotation = HeatmapAnnotation(ref=anno_simple(ref,pch=ref,col = c("A"="#d1e8cd","G"="#ffffda","C"="#fae6d1","T"="#e4daec")),
  23. sgRNA=anno_simple(gRNA,col = c("1"="grey","2"="white")),
  24. annotation_name_gp = gpar(fontsize=10,fontface="bold")),
  25. right_annotation = rowAnnotation(percentage=anno_text(sprintf("%s%%",need$percentage),gp=gpar(fontsize=10,fontface="bold")),
  26. insertion=anno_text(need$ins,gp=gpar(fontsize=10,fontface="bold")),
  27. deletion=anno_text(need$del,gp=gpar(fontsize=10,fontface="bold")),
  28. substitutions=anno_text(need$mut,gp=gpar(fontsize=10,fontface="bold")),
  29. gap=unit(2,'char')),
  30. show_heatmap_legend = FALSE
  31. )

结果

CRISPR 突变检测及可视化分析 - 图1