Motifs Are Rarely Contiguous
In “Finding a Motif in DNA”, we searched for occurrences of a motif as a substring of a larger database genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see “RNA Splicing”), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.
Problem
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of T_A_TG_C_TAA_G_ATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.
Return: One collection of indices of s in which the symbols of t appear as a subsequence of ss. If multiple solutions exist, you may return any one.
Sample Dataset
Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTASample Output
Extra Information
For the mathematically inclined, we may equivalently say that t=t1t2⋯tm is a subsequence of s=s1s2⋯sn if the characters of t appear in the same order within ss. Even more formally, a subsequence of s is a string si1si2…sik, where 1≤i1<i2⋯<ik≤n.
Solution
其实本题就是求解子序列任意一种组合。例如在 ACGTACGTGACG 中找到子序列 GTA ,下图中只列举两个有效的。
子序列.drawio
from typing import Listimport reclass Solution:def subsequence(self, s: str, t: str) -> List[int]:m, n = len(s), len(t)i = j = 0pos = []while i < m and j < n:if s[i] == t[j]: # 找到一个 t[j]j += 1pos.append(i + 1) # 注意:索引+1(从1开始计数)i += 1return pos if j == n else []seqs = """>Rosalind_14ACGTACGTGACG>Rosalind_18GTA"""s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`for pos in Solution().subsequence(s, t):print(pos, end=' ')
拓展:如何求出所有子序列组合呢?其实也不难,找到所有 t 字符在 s 中出现位置出现表。例如 t[0] = G 在 s 中出现位置为 [2, 6, 8, 11];t[1] = T 在 s 中出现位置为 [3, 7];t[2] = A 在 s 中出现位置为 [0, 4, 9]。
但是由于子序列要求字符相对位置不能改变,也就是原 t 中字符 在
s 中出现相对顺序 成立。因此所有选择组合就必须是一个递增序列。
枚举子序列所有组合.drawio
from typing import Listimport reclass Solution:def subsequence(self, s: str, t: str) -> List[int]:m, n = len(s), len(t)pos = [[i for i in range(m) if t[j] == s[i]] for j in range(n)]res = []path = []def backtrace(idx: int) -> None:if idx == len(pos):res.append(tuple(path))returnfor j in pos[idx]:if not path or j > path[-1]:path.append(j)backtrace(idx + 1)path.pop()backtrace(0)return resseqs = """>Rosalind_14ACGTACGTGACG>Rosalind_18GTA"""s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`for subseq in Solution().subsequence(s, t):print(subseq)
输出:
(2, 3, 4)(2, 3, 9)(2, 7, 9)(6, 7, 9)
