Genome Sequencing Isn’t Perfect

In “Genome Assembly as Shortest Superstring”, we introduce the problem of assembling a genome from a collection of reads. Even though genome sequencing is a multi-billion dollar enterprise, sequencing machines that identify reads still produce errors a substantial percentage of the time. To make matters worse, these errors are unpredictable; it is difficult to determine if the machine has made an error, let alone where in the read the error has occurred. For this reason, error correction in reads is typically a vital first step in genome assembly.

Problem

As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:

  • s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
  • s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).

Return: A list of all corrections in the form “[old read]->[new read]”. (Each correction must be a single symbol substitution, and you may return the corrections in any order.)

Sample Dataset

Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC

Sample Output

TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA

Solution

```python from typing import List, Tuple, Set import re

class Solution: def reverseComplement(self, seq: str) -> str: table = dict(zip(‘ATCG’, ‘TAGC’)) return ‘’.join(table[base] for base in reversed(seq))

  1. def correctError(self, reads: List[str]) -> List[str]:
  2. """将错误的 reads 通过以下规则修改为正确 reads
  3. 正确 reads:其自身或者反向互补链在所有 reads 中至少出现两次
  4. 错误 reads:通过改变自身一个碱基(汉明距离变化为 1)得到新 read,
  5. 若其或其反向互补序列 在 正确 reads 中则成功纠正。
  6. """
  7. from collections import Counter
  8. # 1. 统计正确/错误 reads
  9. reads = Counter(reads)
  10. correctReads = set()
  11. incorrectReads = set()
  12. for read, cnt in reads.items():
  13. if cnt == 1 and self.reverseComplement(read) not in reads:
  14. incorrectReads.add(read)
  15. else:
  16. correctReads.add(read)
  17. # 2. 错误 read 在 正确 reads 中找到汉明距离为 1 就可以纠正
  18. correctInfo = []
  19. for icr in incorrectReads:
  20. seq = list(icr)
  21. for i in range(len(seq)):
  22. # 每次改变 seq[i] 上的碱基,查找新 read 或新 read 反向互补序列是否在正确 Reads 中
  23. flag = False
  24. originChar = seq[i]
  25. for base in 'ATCG':
  26. if base == originChar: continue
  27. seq[i] = base # 修改
  28. newseq = ''.join(seq)
  29. if newseq in correctReads or self.reverseComplement(newseq) in correctReads:
  30. correctInfo.append(f'{icr}->{newseq}')
  31. flag = True
  32. break
  33. # 当前位置 seq[i] 已经找到就不需要查看 seq[i+1,n-1] ,因此 break
  34. if flag: break
  35. seq[i] = originChar # 恢复
  36. return correctInfo

fasta = “””

Rosalind_52 TCATC Rosalind_44 TTCAT Rosalind_68 TCATC Rosalind_28 TGAAA Rosalind_95 GAGGA Rosalind_66 TTTCA Rosalind_33 ATCAA Rosalind_21 TTGAT Rosalind_18 TTTCC “”” reads = [mat.group(2).replace(‘\n’, ‘’) for mat in re.finditer(r’>(.)\n((?s:[^>]))’, fasta)] for ret in Solution().correctError(reads): print(ret) `` **复杂度分析**:设共有nreads,且read长度都为k,其中错误readsm` 条。

  • 时间复杂度:,一共分为两个步骤
    • 判断正确性:对 reads 使用哈希计数器进行正确性判断需要 ;
    • 纠错:枚举错误 reads 需要 且错误 read 逐位枚举 3 中可能情况后再查正确 reads 表需要 ,因此每条错误 read 纠正需要 时间。
  • 时间复杂度Error Correction in Reads - 图1,三个哈希表。