Finding Disjoint Motifs in a Gene

Disjoint Motifs

In this problem, we will consider an algorithmic (but not particularly practical) variant of motif finding for multiple motifs. Say we have two motifs corresponding to the coding regions of genes, and we want to know whether these motifs can be found in genes occupying the same region of the genome. To prevent exons from coinciding, we further stipulate that the two motifs are nonoverlapping.
In this problem, we will ask whether two disjoint motifs can be located in a given string. We considered a similar problem in “Interleaving Two Motifs”, which asked us to find a shortest possible string containing two motifs; however, in that problem, the motifs were allowed to coincide.

Problem

Given three strings ss, tt, and uu, we say that tt and uu can be interwoven into ss if there is some substring of ss made up of tt and uu as disjoint subsequences.
For example, the strings “ACAGACAG” and “CCGCCG” can be interwoven into “GACCACGGTTGACCACGGTT”. However, they cannot be interwoven into “GACCACAAAAGGTTGACCACAAAAGGTT” because of the appearance of the four ‘A’s in the middle of the subsequences. Similarly, even though both “ACACGACACG” is a shortest common supersequence of ACAGACAG and CCGCCG, it is not possible to interweave these two strings into “ACACGACACG” because the two desired subsequences must be disjoint; see “Interleaving Two Motifs” for details on finding a shortest common supersequence of two strings.
Given: A text DNA string ss of length at most 10 kbp, followed by a collection of nn (n≤10n≤10) DNA strings of length at most 10 bp acting as patterns.
Return: An n×nn×n matrix MM for which Mj,k=1Mj,k=1 if the jjth and kkth pattern strings can be interwoven into ss and Mj,k=0Mj,k=0 otherwise.

Sample Dataset

GACCACGGTT
ACAG
GT
CCG

Sample Output

0 0 1
0 1 0
1 0 0

Citation

This problem follows Jones & Pevzner, An Introduction to Bioinformatics Algorithms, Problem 6.31

Solution:动态规划求交织序列

本题重点是求解s1串是否是s2,s3的交织序列。
97. 交错字符串

  1. import sys
  2. from typing import List
  3. stdin = sys.stdin
  4. DNA = stdin.readline().strip()
  5. motifs = [motif.strip() for motif in stdin if motif.strip()]
  6. print(DNA, motifs)
  7. def isInterleave(s1: str, s2: str, s3: str) -> bool:
  8. """s1与s2能否交织构成s3"""
  9. n, m1, m2 = len(s3), len(s1), len(s2)
  10. if m1 + m2 != n:
  11. return False
  12. dp = [[False] * (m2 + 1) for _ in range(m1 + 1)] # 是长度,不是索引
  13. # 状态初始化
  14. dp[0][0] = True # 两者长度都为 0 时为 True
  15. for j in range(1, m2 + 1): # len(s1) = 0,枚举 s2 的长度
  16. dp[0][j] = dp[0][j-1] and s3[j-1] == s2[j-1] # 注意索引
  17. for i in range(1, m1 + 1): # len(s2) = 0,枚举 s1 的长度
  18. dp[i][0] = dp[i-1][0] and s3[i-1] == s1[i-1] # 注意索引
  19. # 状态转移:直接从长度 >= 1 开始转移
  20. for i in range(1, m1 + 1):
  21. for j in range(1, m2 + 1):
  22. p = i + j - 1 # s3 对应字符的索引
  23. dp[i][j] = (s3[p] == s1[i-1] and dp[i-1][j]) or (s3[p] == s2[j-1] and dp[i][j-1])
  24. return dp[m1][m2]
  25. def disjointMotifs(DNA: str, motifs: List[str]) -> int:
  26. """枚举motif两两组合,检测它们是否为DNA某一等长子串的交织序列"""
  27. n = len(motifs)
  28. ans = [['0'] * n for _ in range(n)]
  29. from collections import Counter
  30. # 枚举两两序列对
  31. for i in range(n):
  32. for j in range(i, n):
  33. m = len(motifs[i]) + len(motifs[j])
  34. # 枚举 DNA 长度为 m 的子串 s 是否由 motifs[i]和motifs[j] 交织而成
  35. for k in range(len(DNA) - m + 1):
  36. if isInterleave(motifs[i], motifs[j], DNA[k:k+m]):
  37. ans[i][j] = ans[j][i] = '1'
  38. return ans
  39. for lst in disjointMotifs(DNA, motifs):
  40. print(' '.join(lst))

复杂度分析:设 DNA 串长度为 d,两两motif平均长度为 m1,m2

  • 时间复杂度Finding Disjoint Motifs in a Gene - 图1,主要是在 DNA 中枚举长度为 m1+m2 的子串进行判断是否可由两motif交织而成。(这部分可以先判断字符个数是否相同再进行动态规划)
  • 空间复杂度Finding Disjoint Motifs in a Gene - 图2