入门题 - Finding a Spliced Motif - 《生物信息》

Motifs Are Rarely Contiguous
Problem
Sample Dataset
Sample Output
Extra Information
Solution

Motifs Are Rarely Contiguous

In “Finding a Motif in DNA”, we searched for occurrences of a motif as a substring of a larger database genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see “RNA Splicing”), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.

Problem

A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of T_A_TG_C_TAA_G_ATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.
Return: One collection of indices of s in which the symbols of t appear as a subsequence of ss. If multiple solutions exist, you may return any one.

Sample Dataset

Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA

Sample Output
3 8 10

Extra Information
For the mathematically inclined, we may equivalently say that t=t1t2⋯tm is a subsequence of s=s1s2⋯sn if the characters of t appear in the same order within ss. Even more formally, a subsequence of s is a string si1si2…sik, where 1≤i1<i2⋯<ik≤n.

Solution

其实本题就是求解子序列任意一种组合。例如在 ACGTACGTGACG 中找到子序列 GTA ，下图中只列举两个有效的。
子序列.drawio

from typing import List
import re
class Solution:
    def subsequence(self, s: str, t: str) -> List[int]:
        m, n = len(s), len(t)
        i = j = 0
        pos = []
        while i < m and j < n:
            if s[i] == t[j]:  # 找到一个 t[j]
                j += 1
                pos.append(i + 1)  # 注意：索引+1(从1开始计数)
            i += 1
        return pos if j == n else []
seqs = """
>Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA
""" 
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))  # 第一个是 `\n`
for pos in Solution().subsequence(s, t):
    print(pos, end=' ')

拓展：如何求出所有子序列组合呢？其实也不难，找到所有 t 字符在 s 中出现位置出现表。例如 t[0] = G 在 s 中出现位置为 [2, 6, 8, 11]；t[1] = T 在 s 中出现位置为 [3, 7]；t[2] = A 在 s 中出现位置为 [0, 4, 9]。

但是由于子序列要求字符相对位置不能改变，也就是原 t 中字符 Finding a Spliced Motif - 图1 在 s 中出现相对顺序 Finding a Spliced Motif - 图2 成立。因此所有选择组合就必须是一个递增序列。
枚举子序列所有组合.drawio

from typing import List
import re
class Solution:
    def subsequence(self, s: str, t: str) -> List[int]:
        m, n = len(s), len(t)
        pos = [[i for i in range(m) if t[j] == s[i]] for j in range(n)]
        res = []
        path = []
        def backtrace(idx: int) -> None:
            if idx == len(pos):
                res.append(tuple(path))
                return
            for j in pos[idx]:
                if not path or j > path[-1]:
                    path.append(j)
                    backtrace(idx + 1)
                    path.pop()
        backtrace(0)
        return res
seqs = """
>Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA
"""
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))  # 第一个是 `\n`
for subseq in Solution().subsequence(s, t):
    print(subseq)

输出：

(2, 3, 4)
(2, 3, 9)
(2, 7, 9)
(6, 7, 9)