Problem
A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.
A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).
For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O_k in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Task:
input: a given FASTA file and a given overlap size
output: the names of each pair of sequences that overlap
input:
overlap size = 3
fasta:
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
output:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
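The sample can be checked directly against the definition of O_k. The following is only a minimal sketch: the helper names `has_edge` and `overlap_edges` are hypothetical, the strings are typed in by hand rather than read from a FASTA file, and treating strings shorter than k as having no edge is an assumption.

```python
from itertools import permutations


def has_edge(s, t, k):
    """Return True when the length-k suffix of s equals the length-k prefix of t."""
    # Assumption: k is positive, and strings shorter than k never form an edge.
    # s != t rules out directed loops, as the problem statement demands.
    return s != t and len(s) >= k and len(t) >= k and s[-k:] == t[:k]


def overlap_edges(seqs, k):
    """List the directed edges (tail_name, head_name) for a {name: string} dict."""
    # permutations yields every ordered pair of distinct names, so both
    # directions of each pair are tested separately.
    return [(a, b) for a, b in permutations(seqs, 2)
            if has_edge(seqs[a], seqs[b], k)]


sample = {
    "Rosalind_0498": "AAATAAA",
    "Rosalind_2391": "AAATTTT",
    "Rosalind_2323": "TTTTCCC",
    "Rosalind_0442": "AAATCCC",
    "Rosalind_5013": "GGGTGGG",
}

for tail, head in overlap_edges(sample, 3):
    print(tail, head)
```

Run on the sample, this prints the three edges listed in the expected output. Rosalind_5013 appears in no edge because its suffix GGG only matches its own prefix, which would be a directed loop.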
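Approach:
Convert the FASTA file into a dict(), {seq_name: (pre_seq, suf_seq), seq_name2: (pre_seq, suf_seq), ...}, then pair up the sequence names (in both orders, since the edges are directed), build a function is_have_overlap(comb, mismatch = 0), test each pair, and print the pairs that overlap.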
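The signature above carries a `mismatch` parameter that the write-up does not define. One possible reading, offered purely as an assumption, is the maximum number of positions at which the suffix and the prefix may differ (a Hamming distance). A sketch under that assumption, written as a free function that takes the dict explicitly; the helper `hamming` is hypothetical:

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_have_overlap(fa_dict, comb, mismatch=0):
    """Check for a directed edge comb[0] -> comb[1].

    fa_dict maps seq_name -> (pre_seq, suf_seq); comb is an ordered pair
    of sequence names. mismatch=0 reduces to an exact comparison.
    """
    tail, head = comb
    suf_seq = fa_dict[tail][1]  # suffix of the tail sequence
    pre_seq = fa_dict[head][0]  # prefix of the head sequence
    return hamming(suf_seq, pre_seq) <= mismatch


fa_dict = {
    "Rosalind_0498": ("AAA", "AAA"),
    "Rosalind_2391": ("AAA", "TTT"),
    "Rosalind_2323": ("TTT", "CCC"),
}
print(is_have_overlap(fa_dict, ("Rosalind_0498", "Rosalind_2391")))
print(is_have_overlap(fa_dict, ("Rosalind_2391", "Rosalind_0498")))
```

With the default `mismatch=0`, the first call reports an edge (AAA matches AAA) and the second does not (TTT against AAA), which shows why the order of the pair matters.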
Solution 1
import itertools


class OverlapGraphs:
    def __init__(self, fa_path, overlap_size=3):
        self.fa_path = fa_path
        self.overlap_size = overlap_size
        self.fa_dict = {}

    def read_fa(self):
        # Yield the FASTA file one line at a time
        with open(self.fa_path) as fd:
            for line in fd:
                yield line

    def get_fa_dict(self):
        # Collect full sequences first; a record may be wrapped over several lines
        seqs = {}
        seq_name = None
        for line in self.read_fa():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                seq_name = line[1:]
                seqs[seq_name] = ""
            elif seq_name is not None:
                seqs[seq_name] += line
        # Keep only (prefix, suffix) of length overlap_size for each sequence
        k = self.overlap_size
        self.fa_dict = {name: (seq[:k], seq[-k:]) for name, seq in seqs.items()}

    def is_have_overlap(self, seq_name1, seq_name2):
        # Directed edge seq_name1 -> seq_name2:
        # the suffix of seq_name1 must equal the prefix of seq_name2
        return self.fa_dict[seq_name1][1] == self.fa_dict[seq_name2][0]

    def find_overlap(self):
        # permutations gives every ordered pair of distinct names,
        # so each direction is tested and printed as tail, then head
        for seq_name1, seq_name2 in itertools.permutations(self.fa_dict, 2):
            if self.is_have_overlap(seq_name1, seq_name2):
                print(f"{seq_name1}\t{seq_name2}")

    def run(self):
        self.get_fa_dict()
        self.find_overlap()


def main():
    test = OverlapGraphs("/root/py_test/fa.txt", 3)
    test.run()


if __name__ == '__main__':
    main()
