The Billion-Year War

The war between viruses and bacteria has been waged for over a billion years. Viruses called bacteriophages (or simply phages) require a bacterial host to propagate, and so they must somehow infiltrate the bacterium; such deception can only be achieved if the phage understands the genetic framework underlying the bacterium’s cellular functions. The phage’s goal is to insert DNA that will be replicated within the bacterium and lead to the reproduction of as many copies of the phage as possible, which sometimes also involves the bacterium’s demise.

To defend itself, the bacterium must either obfuscate its cellular functions so that the phage cannot infiltrate it, or better yet, go on the counterattack by calling in the air force. Specifically, the bacterium employs aerial scouts called restriction enzymes, which operate by cutting through viral DNA to cripple the phage. But what kind of DNA are restriction enzymes looking for?

The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule; both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA (see Figure 1.). The chance that both strands of phage DNA will be cut (thus crippling the phage) is greater if the target is located on both strands of phage DNA, as close to each other as possible. By extension, the best chance of disarming the phage occurs when the two target copies appear directly across from each other along the phage DNA, a phenomenon that occurs precisely when the target is equal to its own reverse complement. Eons of evolution have made sure that most restriction enzyme targets now have this form. Locating Restriction Sites - 图1
Figure 1. DNA cleaved by EcoRV restriction enzyme

Problem

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC. See Figure 2.** Locating Restriction Sites - 图2
Figure 2. Palindromic recognition site
Given: A DNA string of length at most 1 kbp in FASTA format.
Return: The position and length of every reverse palindrome in the string having length between 4 and 12. You may return these pairs in any order.

Sample Dataset

Rosalind_24
TCAATGCATGCGGGTCTATATGCAT

Sample Output

4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4

Extra Information

You may be curious how the bacterium prevents its own DNA from being cut by restriction enzymes. The short answer is that it locks itself from being cut through a chemical process called DNA methylation.

解题思路:

求解区间 Locating Restriction Sites - 图3 是否是反向回文序列,然后枚举起点 Locating Restriction Sites - 图4,求得长度为 Locating Restriction Sites - 图5(其中 k 必须是偶数,例如 CAATG 就不算!因为反向回文定义是和互补链呈镜像,也就是中心轴不能在碱基上!)

Solution 1: brute force

暴力求解区间区间 Locating Restriction Sites - 图6 是否构成反向回文,使用 python 直接一行搞定:Locating Restriction Sites - 图7

Solution 1: dynamic programming

使用动态规划快速求解区间 Locating Restriction Sites - 图8 是否构成反向回文。

  1. from typing import List
  2. class Solution:
  3. def reversePalindrome(self, dna: str) -> List[List[int]]:
  4. n = len(dna)
  5. bases = dict(zip('ATCG', 'TAGC'))
  6. # 1. 处理区间 [i,j] 是否是回文
  7. dp = [[True] * n for _ in range(n)]
  8. for i in range(n - 1, -1, -1):
  9. for j in range(i + 1, n):
  10. dp[i][j] = (bases[dna[i]] == dna[j]) and dp[i+1][j-1]
  11. # 2. 枚举起点 start, 长度 4 <= k <= 12
  12. res = []
  13. lo, hi = 4, 12
  14. for start in range(n - lo + 1):
  15. for k in range(lo, hi + 1, 2):
  16. end = start + k - 1
  17. if end < n and dp[start][end]:
  18. res.append([start + 1, k])
  19. return res
  20. seqs = """
  21. >Rosalind_24
  22. TCAATGCATGCGGGTCTATATGCAT
  23. """
  24. import re
  25. s = list(seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))[0] # 第一个是 `\n`
  26. for start, end in Solution().reversePalindrome(s):
  27. print(start, end)

Solution 3: 马拉车线性算法