Problem

  1. Problem
  2. A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.
  3. A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).
  4. For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as st; we demand st to prevent directed loops in the overlap graph (although directed cycles may be present).

问题:
input: 对于给定的fasta文件,和给定的overlap size
output:每对具有overlap的序列名称

  1. input:
  2. overlap size = 3
  3. fasta:
  4. >Rosalind_0498
  5. AAATAAA
  6. >Rosalind_2391
  7. AAATTTT
  8. >Rosalind_2323
  9. TTTTCCC
  10. >Rosalind_0442
  11. AAATCCC
  12. >Rosalind_5013
  13. GGGTGGG
  14. output:
  15. Rosalind_0498 Rosalind_2391
  16. Rosalind_0498 Rosalind_0442
  17. Rosalind_2391 Rosalind_2323

思路:
fasta转换为dict(), {seq_name: (pre_seq, suf_seq), seq_name2: (pre_seq, suf_seq), ...}, 然后将序列名两两组合,构建一个函数is_have_overlap(comb ,mismatch = 0), 对每种组合判断,有overlap则输出

solution -1

  1. import itertools
  2. class OverlapGraphs:
  3. def __init__(self, fa_path, overlap_size = 3):
  4. self.fa_path = fa_path
  5. self.overlap_size = overlap_size
  6. def read_fa(self):
  7. with open(self.fa_path) as fd:
  8. for line in fd:
  9. yield line
  10. def get_fa_dict(self):
  11. seq_name_all = (seq_name.strip("\n").strip(">")
  12. for seq_name in self.read_fa() if seq_name.startswith(">"))
  13. seq_all = ((seq[0:self.overlap_size],
  14. seq.strip("\n")[-self.overlap_size:])
  15. for seq in self.read_fa() if not seq.startswith(">"))
  16. self.fa_dict = {seq_name: seq
  17. for seq_name, seq in zip(seq_name_all, seq_all)}
  18. def is_have_overlap(self, seq_name1, seq_name2):
  19. flag1 = (self.fa_dict[seq_name1][0] == self.fa_dict[seq_name2][1])
  20. flag2 = (self.fa_dict[seq_name1][1] == self.fa_dict[seq_name2][0])
  21. return flag1 or flag2
  22. def find_overlap(self):
  23. comb = itertools.combinations(list(self.fa_dict.keys()), 2)
  24. for seq_name1, seq_name2 in comb:
  25. if self.is_have_overlap(seq_name1, seq_name2):
  26. print(f"{seq_name1}\t{seq_name2}")
  27. def run(self):
  28. self.get_fa_dict()
  29. self.find_overlap()
  30. def main():
  31. test = OverlapGraphs("/root/py_test/fa.txt", 3)
  32. test.run()
  33. if __name__ == '__main__':
  34. main()