Motifs Are Rarely Contiguous

In “Finding a Motif in DNA”, we searched for occurrences of a motif as a substring of a larger database genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see “RNA Splicing”), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.

Problem

A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of T_A_TG_C_TAA_G_ATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.
Return: One collection of indices of s in which the symbols of t appear as a subsequence of ss. If multiple solutions exist, you may return any one.

Sample Dataset

Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA

Sample Output

3 8 10

Extra Information

For the mathematically inclined, we may equivalently say that t=t1t2⋯tm is a subsequence of s=s1s2⋯sn if the characters of t appear in the same order within ss. Even more formally, a subsequence of s is a string si1si2…sik, where 1≤i1<i2⋯<ik≤n.

Solution

其实本题就是求解子序列任意一种组合。例如在 ACGTACGTGACG 中找到子序列 GTA ,下图中只列举两个有效的。
子序列.drawio

  1. from typing import List
  2. import re
  3. class Solution:
  4. def subsequence(self, s: str, t: str) -> List[int]:
  5. m, n = len(s), len(t)
  6. i = j = 0
  7. pos = []
  8. while i < m and j < n:
  9. if s[i] == t[j]: # 找到一个 t[j]
  10. j += 1
  11. pos.append(i + 1) # 注意:索引+1(从1开始计数)
  12. i += 1
  13. return pos if j == n else []
  14. seqs = """
  15. >Rosalind_14
  16. ACGTACGTGACG
  17. >Rosalind_18
  18. GTA
  19. """
  20. s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`
  21. for pos in Solution().subsequence(s, t):
  22. print(pos, end=' ')

拓展:如何求出所有子序列组合呢?其实也不难,找到所有 t 字符在 s 中出现位置出现表。例如 t[0] = Gs 中出现位置为 [2, 6, 8, 11]t[1] = Ts 中出现位置为 [3, 7]t[2] = As 中出现位置为 [0, 4, 9]

但是由于子序列要求字符相对位置不能改变,也就是原 t 中字符 Finding a Spliced Motif - 图1s 中出现相对顺序 Finding a Spliced Motif - 图2成立。因此所有选择组合就必须是一个递增序列
枚举子序列所有组合.drawio

  1. from typing import List
  2. import re
  3. class Solution:
  4. def subsequence(self, s: str, t: str) -> List[int]:
  5. m, n = len(s), len(t)
  6. pos = [[i for i in range(m) if t[j] == s[i]] for j in range(n)]
  7. res = []
  8. path = []
  9. def backtrace(idx: int) -> None:
  10. if idx == len(pos):
  11. res.append(tuple(path))
  12. return
  13. for j in pos[idx]:
  14. if not path or j > path[-1]:
  15. path.append(j)
  16. backtrace(idx + 1)
  17. path.pop()
  18. backtrace(0)
  19. return res
  20. seqs = """
  21. >Rosalind_14
  22. ACGTACGTGACG
  23. >Rosalind_18
  24. GTA
  25. """
  26. s, t = (seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', '')) # 第一个是 `\n`
  27. for subseq in Solution().subsequence(s, t):
  28. print(subseq)

输出:

  1. (2, 3, 4)
  2. (2, 3, 9)
  3. (2, 7, 9)
  4. (6, 7, 9)