12.Overlap Graphs - 《Rosalind and BioLeetCode - 生信版 Leetcode》

Problem
- solution -1

Problem

Problem
A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.
A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).
For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).

问题：
input: 对于给定的fasta文件，和给定的overlap size
output:每对具有overlap的序列名称

input: 
overlap size = 3
fasta:
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
output:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323

思路：
将fasta转换为dict(), {seq_name: (pre_seq, suf_seq), seq_name2: (pre_seq, suf_seq), ...}, 然后将序列名两两组合，构建一个函数is_have_overlap(comb ,mismatch = 0), 对每种组合判断，有overlap则输出

solution -1

import itertools
class OverlapGraphs:
    def __init__(self, fa_path, overlap_size = 3):
        self.fa_path = fa_path
        self.overlap_size = overlap_size
    def read_fa(self):
        with open(self.fa_path) as fd:
            for line in fd:
                yield line
    def get_fa_dict(self):
        seq_name_all = (seq_name.strip("\n").strip(">") 
                        for seq_name in self.read_fa() if seq_name.startswith(">"))
        seq_all = ((seq[0:self.overlap_size], 
                    seq.strip("\n")[-self.overlap_size:]) 
                    for seq in self.read_fa() if not seq.startswith(">"))
        self.fa_dict = {seq_name: seq 
                        for seq_name, seq in zip(seq_name_all, seq_all)}
    def is_have_overlap(self, seq_name1, seq_name2):
        flag1 = (self.fa_dict[seq_name1][0] == self.fa_dict[seq_name2][1])  
        flag2 = (self.fa_dict[seq_name1][1] == self.fa_dict[seq_name2][0])
        return flag1 or flag2
    def find_overlap(self):
        comb = itertools.combinations(list(self.fa_dict.keys()), 2)
        for seq_name1, seq_name2 in comb:
            if self.is_have_overlap(seq_name1, seq_name2):
                print(f"{seq_name1}\t{seq_name2}")
    def run(self):
        self.get_fa_dict()
        self.find_overlap()
def main():
    test = OverlapGraphs("/root/py_test/fa.txt", 3)
    test.run()
if __name__ == '__main__':
    main()