Genome Sequencing Isn’t Perfect
In “Genome Assembly as Shortest Superstring”, we introduce the problem of assembling a genome from a collection of reads. Even though genome sequencing is a multi-billion dollar enterprise, sequencing machines that identify reads still produce errors a substantial percentage of the time. To make matters worse, these errors are unpredictable; it is difficult to determine if the machine has made an error, let alone where in the read the error has occurred. For this reason, error correction in reads is typically a vital first step in genome assembly.
Problem
As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:
- s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
- s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).
Return: A list of all corrections in the form “[old read]->[new read]”. (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
Sample Dataset
Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCCSample Output
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCASolution
```python from typing import List, Tuple, Set import re
class Solution: def reverseComplement(self, seq: str) -> str: table = dict(zip(‘ATCG’, ‘TAGC’)) return ‘’.join(table[base] for base in reversed(seq))
def correctError(self, reads: List[str]) -> List[str]:
"""将错误的 reads 通过以下规则修改为正确 reads
正确 reads:其自身或者反向互补链在所有 reads 中至少出现两次
错误 reads:通过改变自身一个碱基(汉明距离变化为 1)得到新 read,
若其或其反向互补序列 在 正确 reads 中则成功纠正。
"""
from collections import Counter
# 1. 统计正确/错误 reads
reads = Counter(reads)
correctReads = set()
incorrectReads = set()
for read, cnt in reads.items():
if cnt == 1 and self.reverseComplement(read) not in reads:
incorrectReads.add(read)
else:
correctReads.add(read)
# 2. 错误 read 在 正确 reads 中找到汉明距离为 1 就可以纠正
correctInfo = []
for icr in incorrectReads:
seq = list(icr)
for i in range(len(seq)):
# 每次改变 seq[i] 上的碱基,查找新 read 或新 read 反向互补序列是否在正确 Reads 中
flag = False
originChar = seq[i]
for base in 'ATCG':
if base == originChar: continue
seq[i] = base # 修改
newseq = ''.join(seq)
if newseq in correctReads or self.reverseComplement(newseq) in correctReads:
correctInfo.append(f'{icr}->{newseq}')
flag = True
break
# 当前位置 seq[i] 已经找到就不需要查看 seq[i+1,n-1] ,因此 break
if flag: break
seq[i] = originChar # 恢复
return correctInfo
fasta = “””
Rosalind_52 TCATC Rosalind_44 TTCAT Rosalind_68 TCATC Rosalind_28 TGAAA Rosalind_95 GAGGA Rosalind_66 TTTCA Rosalind_33 ATCAA Rosalind_21 TTGAT Rosalind_18 TTTCC “”” reads = [mat.group(2).replace(‘\n’, ‘’) for mat in re.finditer(r’>(.)\n((?s:[^>]))’, fasta)] for ret in Solution().correctError(reads): print(ret)
`` **复杂度分析**:设共有
n条
reads,且
read长度都为
k,其中错误
reads有
m` 条。
- 时间复杂度:,一共分为两个步骤
- 判断正确性:对
reads
使用哈希计数器进行正确性判断需要 ; - 纠错:枚举错误
reads
需要 且错误read
逐位枚举3
中可能情况后再查正确reads
表需要 ,因此每条错误read
纠正需要 时间。
- 判断正确性:对
- 时间复杂度:,三个哈希表。