Introduction to Genome Sequencing**

Recall from “Computing GC Content” that almost all humans share approximately 99.9% of the same nucleotides on the genome. Thus, if we know only a few complete genomes from the species, then we already possess an important key to unlocking the species genome.

Determining an organism’s complete genome (called genome sequencing) forms a central task of bioinformatics. Unfortunately, we still don’t possess the microscope technology to zoom into the nucleotide level and determine the sequence of a genome’s nucleotides, one at a time. However, researchers can apply chemical methods to generate and identify much smaller snippets of DNA, called reads.
After obtaining a large collection of reads from multiple copies of the same genome, the aim is to reconstruct the desired genome from these small pieces of DNA; this process is called fragment assembly (see Figure 1).
Genome Assembly as Shortest Superstring - 图1
Figure 1. Fragment Assembly works by blasting many copies of the same genome into smaller, identifiable reads, which are then used to computationally assemble one copy of the genome.

Problem

For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring.
By the assumption of parsimony, a shortest possible superstring over a collection of reads serves as a candidate chromosome.
Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).
The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.
Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

Sample Dataset

Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC

Sample Output

ATTAGACCTGCCGGAATAC

Extra Information

Although the goal of fragment assembly is to produce an entire genome, in practice it is only possible to construct several contiguous portions of each chromosome, called contigs. Furthermore, the assumption made above that reads all derive from the same strand is also practically unrealistic; in reality, researchers will not know the strand of DNA from which a given read has been sequenced.

Solution

主要讲基因组测序后如何组装,组装的过程大致是什么。题目是真实情况的简化,也就是给定数据输出结果一定唯一。但是在实际中多条 reads 的重叠部分不可能是唯一的,即不是题中限制多条 reads 重叠区域大于等于其长度一半。

算法流程:根据不同 readsoverlap 程度构建出最终的基因组。

  1. ATTAGACCTGCCGGAATAC
  2. ATTAGACCTG >Rosalind_56
  3. CCTGCCGGAA >Rosalind_57
  4. AGACCTGCCG >Rosalind_58
  5. GCCGGAATAC >Rosalind_59

将上面按照起点进行升序排序就更明白题意要求:

  1. ATTAGACCTGCCGGAATAC
  2. ATTAGACCTG >Rosalind_56
  3. AGACCTGCCG >Rosalind_58
  4. CCTGCCGGAA >Rosalind_57
  5. GCCGGAATAC >Rosalind_59

现在看是否更加清晰呢?但是我们如何构造出最终结果呢? :::tips 题目限制最终结果是唯一和其只能来源于不同 reads 重叠序列大于等于其自身长度一半的原因?

  1. 确定唯一性的目的是简化真实的基因组组装,也就是将题目简单化。
  2. 如何保证最终结果是唯一的?

    1. 通过限制不同 reads 之间 overlap 至少为其自身长度一半才可以构建最终基因组:可以减少很多可能。
    2. 题目输入序列之间不会出现大量 overlap :最终拼接结果中处于两段的 reads 是确定的,处于中间的 reads 会与其左侧 reads 和右侧 readsoverlap 大于等于其自身长度一半(除此之外不会与其它 readsoverlap 大于等于其自身长度一半)。这样就能保证最终拼接方向(从哪里延申)和唯一性。 :::
  3. 计算不同 reads 对之间的 overlap 长度。例如 lcs(56, 57), lcs(56, 58), lcs(56, 59) 等。由于题目保证最终答案是唯一的,且由 临近 (上述按起点排序) reads 最大overlap 拼接而成。因此我们只要找到起点/终点就可以顺/逆推得到最终答案。

  4. 如何确定起点/终点:首先对每个 reads 和其余 reads 进行最长公共子串处理
    1. 如果该 reads 和其它 reads 最长公共子序列长度大于等于自身长度一半的个数等于 **2** ,则其是中间 reads
    2. 如果该 reads 和其它 reads 最长公共子序列长度大于等于自身长度一半的个数等于 **1** ,则其是起点终点。根据最长 lcs 在该 reads 起始位置可以判断出是起点(endswith)还是终点(startswith)。

算法流程:

  1. 求解不同 reads 之间 lcs 的值的所有组合 Genome Assembly as Shortest Superstring - 图2
  2. 根据 Genome Assembly as Shortest Superstring - 图3Genome Assembly as Shortest Superstring - 图4 的个数确定起点终点
  3. 根据起点/终点进行延申,不同 reads 延申条件是 Genome Assembly as Shortest Superstring - 图5,依次递推到另一个终点结束。 ```python from typing import List, Tuple import re

class Solution: def extern(self, ri: int, reads: List[str], alignments: List[List[Tuple[int]]], visited: List[bool]) -> str: “””根据当前 alignments[ri] 比对信息,挑选出 overlap >= k / 2 进行向左/右延申””” readI = reads[ri] visited[ri] = True for rj in range(len(alignments[ri])): lcsLen, endi, endj = alignments[ri][rj] if lcsLen * 2 >= len(readI): if not visited[rj]: ret = self.extern(rj, reads, alignments, visited)

  1. # reads[rj] 在 reads[ri] 右侧
  2. # reads[ri]: ---*******
  3. # reads[rj]: *******---
  4. # 我们需要问号部分 *******????????
  5. if endi > endj:
  6. readI = readI + ret[endj + 1:]
  7. # reads[rj] 在 reads[ri] 左侧
  8. # reads[ri]: *******---
  9. # reads[rj]: ---*******
  10. # ????????******* 需要问号部分
  11. else:
  12. readI = ret[0:-endj] + readI
  13. return readI
  14. def shortestSuperstring(self, reads: List[str]) -> str:
  15. """根据不同 reads 的 overlap 程度拼接最终基因组"""
  16. k = len(reads[0])
  17. alignments = [[] for _ in range(len(reads))]
  18. for ri in range(len(reads)):
  19. for rj in range(len(reads)):
  20. if ri == rj:
  21. alignments[ri].append((0, -1, -1)) # 添加 seqs[ri] 与 seqs[rj] lcs 比对信息
  22. continue
  23. # 使用 dp 求解
  24. maxLcs, endi, endj = 0, 0, 0
  25. m, n = len(seqs[ri]), len(seqs[rj])
  26. dp = [[0] * (n + 1) for _ in range(m + 1)]
  27. for i in range(1, m + 1):
  28. for j in range(1, n + 1):
  29. if seqs[ri][i-1] == seqs[rj][j-1]:
  30. dp[i][j] = dp[i-1][j-1] + 1
  31. if dp[i][j] > maxLcs:
  32. maxLcs, endi, endj = dp[i][j], i - 1, j - 1
  33. # 添加 seqs[ri] 与 seqs[rj] lcs 比对信息
  34. alignments[ri].append((maxLcs, endi, endj))
  35. # 2. 寻找起点
  36. return self.extern(0, reads, alignments, [False] * len(reads))

fasta = “””

Rosalind_56 ATTAGACCTG Rosalind_57 CCTGCCGGAA Rosalind_58 AGACCTGCCG Rosalind_59 GCCGGAATAC “””

seqs = [mat.group(2).replace(‘\n’, ‘’) for mat in re.finditer(r’>(.)\n((?s:[^>]))’, fasta)] print(Solution().shortestSuperstring(seqs))

  1. 另一个算法超时:
  2. ```python
  3. from typing import List, Tuple
  4. import re
  5. class Solution:
  6. def longestCommonSubstr(self, s: str, t: str) -> Tuple[int]:
  7. """求解 s, t 的最长公共子串,返回 (maxLcs, endI, endJ)"""
  8. # 使用 dp 求解
  9. maxLcs, endi, endj = 0, 0, 0
  10. m, n = len(s), len(t)
  11. dp = [[0] * (n + 1) for _ in range(m + 1)]
  12. for i in range(1, m + 1):
  13. for j in range(1, n + 1):
  14. if s[i-1] == t[j-1]:
  15. dp[i][j] = dp[i-1][j-1] + 1
  16. if dp[i][j] > maxLcs:
  17. maxLcs, endi, endj = dp[i][j], i - 1, j - 1
  18. return (maxLcs, endi, endj)
  19. def shortestSuperstring(self, reads: List[str]) -> str:
  20. """根据不同 reads 的 overlap 程度拼接最终基因组"""
  21. n = len(reads)
  22. k = len(reads[0])
  23. genome = reads[0]
  24. novisited = [True] * n
  25. novisited[0] = False
  26. while any(novisited):
  27. maxLcs, endi, endj, rj = 0, 0, 0, -1
  28. for j in range(n):
  29. if novisited[j]: # 没被看
  30. _maxLcs, _endi, _endj = self.longestCommonSubstr(genome, reads[j])
  31. if _maxLcs > maxLcs:
  32. maxLcs, endi, endj, rj = _maxLcs, _endi, _endj, j
  33. novisited[rj] = False
  34. if endj == k - 1: # reads[rj] + genome
  35. genome = reads[rj][:-endj] + genome
  36. else: # genome ,reads[rj]
  37. genome = genome + reads[rj][endj+1:]
  38. return genome
  39. fasta = """
  40. >Rosalind_56
  41. ATTAGACCTG
  42. >Rosalind_57
  43. CCTGCCGGAA
  44. >Rosalind_58
  45. AGACCTGCCG
  46. >Rosalind_59
  47. GCCGGAATAC
  48. """
  49. seqs = [mat.group(2).replace('\n', '')
  50. for mat in re.finditer(r'>(.*)\n((?s:[^>]*))', fasta)]
  51. print(Solution().shortestSuperstring(seqs))

复杂度分析:设 reads 数量为 n ,单条 reads 长度为 kgenome 长度为 m (随着合并变长)

  • 时间复杂度
    • 最外层 while 合并消耗 Genome Assembly as Shortest Superstring - 图6 次;
    • 内层循环次数与剩余合并数量有关,设最外层循环次数为 i ,则内层循环走 n - 1 - i
    • lcs 计算要 Genome Assembly as Shortest Superstring - 图7,且随着 genome 长度变大,即 m 变大后每次执行 lcs 时间复杂度变高。
    • 综上总的时间复杂度为 Genome Assembly as Shortest Superstring - 图8
  • 空间复杂度Genome Assembly as Shortest Superstring - 图9,求解 lcs 问题的创建 dp 数组。

其实上面算法已经接近正确答案,但是问题出在重复进行多次 lcs,因为若当前 genome 和一个 reads 重叠长度大于等于 Genome Assembly as Shortest Superstring - 图10 就可以进行合并,但是上述算法依旧和其它 reads 继续做 lcs ,最后选择最优进行合并(这种适合在实践中进行,因为本题给的用例比较特殊)。

鉴于每次只需要找到某条 readsgenome前后缀重叠长度范围在 Genome Assembly as Shortest Superstring - 图11 即可,因此中间部分的匹配可以省略(尤其是 genome 长度不断变长后复杂度剧增时优化非常明显,因为最后 Genome Assembly as Shortest Superstring - 图12),则总时间复杂度为 Genome Assembly as Shortest Superstring - 图13

  1. from typing import List
  2. import re
  3. class Solution:
  4. def shortestSuperstring(self, reads: List[str]) -> str:
  5. """根据不同 reads 的 overlap 程度拼接最终基因组"""
  6. n = len(reads)
  7. genome = reads[0]
  8. visited = [False] * n
  9. visited[0] = True
  10. for _ in range(n - 1):
  11. for j in range(n):
  12. if not visited[j]: # 没被看
  13. visited[j] = True
  14. m = len(reads[j])
  15. for length in range(m // 2, m + 1):
  16. # genome ****---
  17. # reads[j] ---****
  18. if genome[:length] == reads[j][-length:]:
  19. genome = reads[j][0:-length] + genome
  20. break
  21. # genome ---****
  22. # reads[j] ****---
  23. if genome[-length:] == reads[j][:length]:
  24. genome = genome + reads[j][length:]
  25. break
  26. else: # 无 overlap 候选
  27. visited[j] = False
  28. return genome
  29. fasta = """
  30. >Rosalind_56
  31. ATTAGACCTG
  32. >Rosalind_57
  33. CCTGCCGGAA
  34. >Rosalind_58
  35. AGACCTGCCG
  36. >Rosalind_59
  37. GCCGGAATAC
  38. """
  39. seqs = list(seq.replace('\n', '') for seq in re.findall('[ATCG\n]+', fasta) if seq.replace('\n', ''))
  40. ret = Solution().shortestSuperstring(seqs)
  41. print(ret)

复杂度分析:设 reads 数量为 n ,单条 reads 长度为 kgenome 长度为 m (随着合并变长)

  • 时间复杂度
    • 最外层 while 合并消耗 Genome Assembly as Shortest Superstring - 图14 次;
    • 内层循环次数与剩余合并数量和当前 genome 是否“匹配”一个 reads 有关,设最外层循环次数为 i ,最坏情况下内层循环走 n - 1 - i 次,均摊为 Genome Assembly as Shortest Superstring - 图15,因为最坏情况下只可能遇到一次。
    • overlap 计算要 Genome Assembly as Shortest Superstring - 图16,且与 genome 长度无关。
    • 综上总的时间复杂度为 Genome Assembly as Shortest Superstring - 图17
  • 空间复杂度Genome Assembly as Shortest Superstring - 图18,不计算切片消耗。