Dynamic Programming - Finding a Shared Spliced Motif - Bioinformatics

Locating Motifs Despite Introns

In “Finding a Shared Motif”, we discussed searching through a database containing multiple genetic strings to find a longest common substring of these strings, which served as a motif shared by the strings. However, as we saw in “RNA Splicing”, coding regions of DNA are often interspersed with introns that do not code for proteins.

We therefore need to locate shared motifs that are separated across exons, which means that the motifs are not required to be contiguous. To model this situation, we need to enlist subsequences.

Problem

A string u is a common subsequence of strings s and t if the symbols of u appear in order as a subsequence of both s and t. For example, “ACTG” is a common subsequence of “AACCTTGG” and “ACACTGTGA”.
Analogously to the definition of longest common substring, u is a longest common subsequence of s and t if there does not exist a longer common subsequence of the two strings. Continuing our above example, “ACCTTG” is a longest common subsequence of “AACCTTGG” and “ACACTGTGA”, as is “AACTGG”.
Given: Two DNA strings s and t (each having length at most 1 kbp) in FASTA format.
Return: A longest common subsequence of s and t. (If more than one solution exists, you may return any one.)
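
To make the definition concrete, here is a minimal sketch that checks whether a candidate string is a common subsequence of two strings. The helper names `is_subsequence` and `is_common_subsequence` are illustrative and not part of the problem statement.

```python
def is_subsequence(u: str, s: str) -> bool:
    """Return True if the symbols of u appear in s in order (gaps allowed)."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first match,
    # so each later symbol of u must be found after the previous one.
    return all(ch in it for ch in u)


def is_common_subsequence(u: str, s: str, t: str) -> bool:
    """Return True if u is a subsequence of both s and t."""
    return is_subsequence(u, s) and is_subsequence(u, t)


s, t = "AACCTTGG", "ACACTGTGA"
print(is_common_subsequence("ACTG", s, t))    # True
print(is_common_subsequence("ACCTTG", s, t))  # True
print(is_common_subsequence("AACTGG", s, t))  # True
print(is_common_subsequence("GGTT", s, t))    # False
```

This only verifies a candidate; it does not show that the candidate is a longest one.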

Sample Dataset

>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA

Sample Output

AACTGG

Solution

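This problem asks for a longest common subsequence. There may be more than one optimal solution, and any single one is an acceptable output.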
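
As a sketch of the standard recurrence behind the dynamic programming table, write $dp[i][j]$ for the length of a longest common subsequence of the first $i$ symbols of $s$ and the first $j$ symbols of $t$. The 1-based subscripts $s_i$ and $t_j$ are notation assumed here, not taken from the problem statement.

$$
dp[i][j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
dp[i-1][j-1] + 1 & \text{if } s_i = t_j, \\
\max\bigl(dp[i-1][j],\; dp[i][j-1]\bigr) & \text{otherwise.}
\end{cases}
$$

The answer length is $dp[m][n]$ with $m = |s|$ and $n = |t|$; an actual subsequence is recovered by walking back from $(m, n)$ along cells that attain these values.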
The code below fills the table with dynamic programming and then backtracks through it to collect all optimal solutions:

```python
import re
from typing import List


class Solution:
    def longestCommonSubsequence(self, text1: str, text2: str) -> List[str]:
        m, n = len(text1), len(text2)

        # 1. Fill in the dp table
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if text1[i-1] == text2[j-1]:
                    dp[i][j] = dp[i-1][j-1] + 1
                else:
                    dp[i][j] = max(dp[i-1][j], dp[i][j-1])

        # 2. Backtrack through the table and record every optimal path
        path = []
        res = []

        def backtrace(i: int, j: int) -> None:
            if dp[i][j] == 0:  # nothing left to match: one path is complete
                res.append("".join(reversed(path)))
                return
            if text1[i-1] == text2[j-1]:
                path.append(text1[i-1])
                backtrace(i - 1, j - 1)
                path.pop()
            else:
                # Both branches may be optimal, so the same string can appear twice
                if dp[i][j] == dp[i-1][j]:
                    backtrace(i - 1, j)
                if dp[i][j] == dp[i][j-1]:
                    backtrace(i, j - 1)

        backtrace(m, n)
        return res


seqs = """
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA
"""
# Split on the FASTA header lines; the first piece is just '\n' and is dropped
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))

for subseq in Solution().longestCommonSubsequence(s, t):
    print(subseq)  # prints every optimal solution, duplicates included
```

Output:

AACTTG
ACCTTG
AACTTG
ACCTTG
AACTGG
ACCTGG

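The backtracking above is far too expensive: it enumerates every optimal path, and on long input strings you would wait a thousand years for the result. Since the problem accepts any one solution, it is enough to recover a single optimal path.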
```python
import re
from typing import List


class Solution:
    def longestCommonSubsequence(self, text1: str, text2: str) -> List[str]:
        m, n = len(text1), len(text2)

        # 1. Fill in the dp table
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if text1[i-1] == text2[j-1]:
                    dp[i][j] = dp[i-1][j-1] + 1
                else:
                    dp[i][j] = max(dp[i-1][j], dp[i][j-1])

        # 2. Backtrack through the table, stopping at the first optimal path
        path = []
        res = []

        def backtrace(i: int, j: int) -> bool:
            if dp[i][j] == 0:  # nothing left to match: the path is complete
                res.append("".join(reversed(path)))
                return True  # one solution is enough
            if text1[i-1] == text2[j-1]:
                path.append(text1[i-1])
                if backtrace(i - 1, j - 1):
                    return True  # stop recursing once a solution is found
                path.pop()
            else:
                if dp[i][j] == dp[i-1][j]:
                    if backtrace(i - 1, j):
                        return True
                if dp[i][j] == dp[i][j-1]:
                    if backtrace(i, j - 1):
                        return True
            return False

        backtrace(m, n)
        return res


seqs = """
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA
"""
# Split on the FASTA header lines; the first piece is just '\n' and is dropped
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))

for subseq in Solution().longestCommonSubsequence(s, t):
    print(subseq)  # AACTTG (a valid answer, like the sample output AACTGG)
```
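
One caveat with the recursive `backtrace`: it uses one stack frame per step, so on inputs near the 1 kbp limit it may exceed CPython's default recursion limit of 1000. The following is a minimal sketch of the same single-path traceback written as a loop; the function name `lcs_iterative` is illustrative, and the tie-breaking (move up before moving left) is chosen to mirror the recursive version.

```python
from typing import List


def lcs_iterative(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t without recursion."""
    m, n = len(s), len(t)

    # Fill the same dp table as in the recursive versions.
    dp: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from (m, n); dp[i][j] > 0 guarantees i >= 1 and j >= 1.
    out: List[str] = []
    i, j = m, n
    while dp[i][j] > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])  # matched symbol belongs to the answer
            i -= 1
            j -= 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1  # an optimal path exists through the cell above
        else:
            j -= 1  # otherwise it must go through the cell to the left
    return "".join(reversed(out))


print(lcs_iterative("AACCTTGG", "ACACTGTGA"))  # AACTTG
```

The table still takes O(mn) time and memory; only the traceback changes, running in at most m + n loop iterations.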