Motifs Are Rarely Contiguous
In “Finding a Motif in DNA”, we searched for occurrences of a motif as a substring of a larger database genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see “RNA Splicing”), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.
Problem
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of T_A_TG_C_TAA_G_ATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.
Return: One collection of indices of s in which the symbols of t appear as a subsequence of ss. If multiple solutions exist, you may return any one.
Sample Dataset
Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTASample Output
Extra Information
For the mathematically inclined, we may equivalently say that t=t1t2⋯tm is a subsequence of s=s1s2⋯sn if the characters of t appear in the same order within ss. Even more formally, a subsequence of s is a string si1si2…sik, where 1≤i1<i2⋯<ik≤n.
Solution
其实本题就是求解子序列任意一种组合。例如在 ACGTACGTGACG
中找到子序列 GTA
,下图中只列举两个有效的。
子序列.drawio
from typing import List
import re
class Solution:
def subsequence(self, s: str, t: str) -> List[int]:
m, n = len(s), len(t)
i = j = 0
pos = []
while i < m and j < n:
if s[i] == t[j]: # 找到一个 t[j]
j += 1
pos.append(i + 1) # 注意:索引+1(从1开始计数)
i += 1
return pos if j == n else []
seqs = """
>Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA
"""
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`
for pos in Solution().subsequence(s, t):
print(pos, end=' ')
拓展:如何求出所有子序列组合呢?其实也不难,找到所有 t
字符在 s
中出现位置出现表。例如 t[0] = G
在 s
中出现位置为 [2, 6, 8, 11]
;t[1] = T
在 s
中出现位置为 [3, 7]
;t[2] = A
在 s
中出现位置为 [0, 4, 9]
。
但是由于子序列要求字符相对位置不能改变,也就是原 t
中字符 在 s
中出现相对顺序 成立。因此所有选择组合就必须是一个递增序列。
枚举子序列所有组合.drawio
from typing import List
import re
class Solution:
def subsequence(self, s: str, t: str) -> List[int]:
m, n = len(s), len(t)
pos = [[i for i in range(m) if t[j] == s[i]] for j in range(n)]
res = []
path = []
def backtrace(idx: int) -> None:
if idx == len(pos):
res.append(tuple(path))
return
for j in pos[idx]:
if not path or j > path[-1]:
path.append(j)
backtrace(idx + 1)
path.pop()
backtrace(0)
return res
seqs = """
>Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA
"""
s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`
for subseq in Solution().subsequence(s, t):
print(subseq)
输出:
(2, 3, 4)
(2, 3, 9)
(2, 7, 9)
(6, 7, 9)