Dynamic Programming - Finding a Shared Motif - 《生物信息》 (Bioinformatics)

Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.
The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.

Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.
Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.
Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.
Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)
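
The two examples in the definition above can be checked by brute force. The following is a minimal sketch, not part of the original problem statement; the function name `all_longest_common_substrings` is hypothetical, and the approach is only meant for short strings.

```python
from typing import Set


def all_longest_common_substrings(s: str, t: str) -> Set[str]:
    """Collect every longest common substring of s and t by brute force."""
    # Try candidate lengths from longest to shortest; the first length
    # that yields any shared substring is the maximum one.
    for length in range(min(len(s), len(t)), 0, -1):
        found = {
            s[i:i + length]
            for i in range(len(s) - length + 1)
            if s[i:i + length] in t
        }
        if found:
            return found
    return set()


# The two examples from the problem statement.
print(all_longest_common_substrings("ACGTACGT", "AACCGTATA"))
print(all_longest_common_substrings("AACC", "CCAA"))
```

The second call returns a set with two elements, which illustrates that a longest common substring need not be unique.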

Sample Dataset

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

Sample Output

AC
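
The sample can be verified against the whole collection at once. The sketch below is an assumption-laden illustration rather than the method used later in this text: it enumerates substrings of the shortest string, longest first, and returns the first one found in every string. The name `shared_motif` is hypothetical.

```python
from typing import List


def shared_motif(seqs: List[str]) -> str:
    """Return one longest substring present in every string of seqs."""
    # Any common substring must occur in the shortest string,
    # so its substrings are the only candidates.
    shortest = min(seqs, key=len)
    n = len(shortest)
    for length in range(n, 0, -1):  # longest candidates first
        for start in range(n - length + 1):
            cand = shortest[start:start + length]
            if all(cand in s for s in seqs):
                return cand
    return ""


sample = ["GATTACA", "TAGACCA", "ATACA"]
motif = shared_motif(sample)
print(motif, len(motif))
```

On the sample data the result has length 2. It may differ from "AC" because several motifs of that length are shared by all three strings, and the problem accepts any of them.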

Solution 1: Pairwise longest common substring [Incorrect: fails when two strings have more than one LCS]

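This problem amounts to finding the longest common substring of k strings. The straightforward approach compares the first two strings and then compares the result with each subsequent string, for k − 1 pairwise comparisons in total. The program below uses dynamic programming for each pair, with a time complexity of [formula image missing: Finding a Shared Motif - 图3], where w is the length of the longest common substring. As the heading notes, this approach is incorrect when a pair has more than one longest common substring: only one of them is carried forward, and the one kept may not occur in the remaining strings even though another one does.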

from functools import reduce
from typing import List


class Solution:
    def readFasta(self, fileName: str) -> List[str]:
        """Parse a FASTA file and return its sequences in file order."""
        seqs = []
        with open(fileName) as f:
            seq = []
            for line in f:
                line = line.strip()  # drop surrounding whitespace
                if line.startswith('>'):  # header line (sequence ID)
                    if seq:  # previous sequence is complete
                        seqs.append(''.join(seq))
                        seq = []
                elif line:  # bases; a record may span several lines
                    seq.append(line)
            # flush the last sequence
            if seq:
                seqs.append(''.join(seq))
        return seqs

    def solve(self) -> str:
        def findLCS(s: str, t: str) -> str:
            """Return one longest common substring of s and t."""
            m, n = len(s), len(t)
            # dp[i][j]: length of the longest common prefix of s[i:] and t[j:]
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            max_len = 0
            start_idx = -1
            for i in range(m - 1, -1, -1):
                for j in range(n - 1, -1, -1):
                    if s[i] == t[j]:
                        dp[i][j] = dp[i + 1][j + 1] + 1
                    else:
                        dp[i][j] = 0
                    if dp[i][j] > max_len:
                        max_len = dp[i][j]
                        start_idx = i
            # print("lcs-{},{}:{}".format(s, t, s[start_idx:start_idx+max_len]))
            return s[start_idx:start_idx + max_len]

        seqs = self.readFasta('./rosalind_lcsm.txt')
        # Caveat: only one LCS per pair is kept, so ties can give a wrong answer
        return reduce(findLCS, seqs)


print(Solution().solve())
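
The failure mode named in the heading can be reproduced on a small input. The sketch below is a self-contained illustration under stated assumptions: `pairwise_lcs` and `reduce_lcs` are hypothetical names, the pairwise step is a compact rolling-row variant of the same dynamic programming idea rather than a copy of the program above, and the input strings are made up for the demonstration.

```python
from functools import reduce
from typing import List


def pairwise_lcs(s: str, t: str) -> str:
    """Return one longest common substring of s and t (first maximum found)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)  # previous DP row
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # cur[j]: length of the common suffix of s[:i] and t[:j]
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]


def reduce_lcs(seqs: List[str]) -> str:
    """Fold the pairwise step over the collection, as in Solution 1."""
    return reduce(pairwise_lcs, seqs)


# "AACC" and "CCAA" share both "AA" and "CC". The pairwise step keeps only
# one of them, so one of the two third strings below must break the fold.
for third in ("AA", "CC"):
    seqs = ["AACC", "CCAA", third]
    print(seqs, "->", repr(reduce_lcs(seqs)))
```

For each input the true answer is the third string itself, yet one of the two runs returns an empty string. Which run fails depends only on how the pairwise step breaks ties, so no tie-breaking rule rescues the approach.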