Genome Sequencing Isn’t Perfect
In “Genome Assembly as Shortest Superstring”, we introduce the problem of assembling a genome from a collection of reads. Even though genome sequencing is a multi-billion dollar enterprise, sequencing machines that identify reads still produce errors a substantial percentage of the time. To make matters worse, these errors are unpredictable; it is difficult to determine if the machine has made an error, let alone where in the read the error has occurred. For this reason, error correction in reads is typically a vital first step in genome assembly.
Problem
As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:
- s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
- s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).
Return: A list of all corrections in the form “[old read]->[new read]”. (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
Sample Dataset
Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCCSample Output
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCASolution
```python from typing import List, Tuple, Set import re
class Solution: def reverseComplement(self, seq: str) -> str: table = dict(zip(‘ATCG’, ‘TAGC’)) return ‘’.join(table[base] for base in reversed(seq))
def correctError(self, reads: List[str]) -> List[str]:"""将错误的 reads 通过以下规则修改为正确 reads正确 reads:其自身或者反向互补链在所有 reads 中至少出现两次错误 reads:通过改变自身一个碱基(汉明距离变化为 1)得到新 read,若其或其反向互补序列 在 正确 reads 中则成功纠正。"""from collections import Counter# 1. 统计正确/错误 readsreads = Counter(reads)correctReads = set()incorrectReads = set()for read, cnt in reads.items():if cnt == 1 and self.reverseComplement(read) not in reads:incorrectReads.add(read)else:correctReads.add(read)# 2. 错误 read 在 正确 reads 中找到汉明距离为 1 就可以纠正correctInfo = []for icr in incorrectReads:seq = list(icr)for i in range(len(seq)):# 每次改变 seq[i] 上的碱基,查找新 read 或新 read 反向互补序列是否在正确 Reads 中flag = FalseoriginChar = seq[i]for base in 'ATCG':if base == originChar: continueseq[i] = base # 修改newseq = ''.join(seq)if newseq in correctReads or self.reverseComplement(newseq) in correctReads:correctInfo.append(f'{icr}->{newseq}')flag = Truebreak# 当前位置 seq[i] 已经找到就不需要查看 seq[i+1,n-1] ,因此 breakif flag: breakseq[i] = originChar # 恢复return correctInfo
fasta = “””
Rosalind_52 TCATC Rosalind_44 TTCAT Rosalind_68 TCATC Rosalind_28 TGAAA Rosalind_95 GAGGA Rosalind_66 TTTCA Rosalind_33 ATCAA Rosalind_21 TTGAT Rosalind_18 TTTCC “”” reads = [mat.group(2).replace(‘\n’, ‘’) for mat in re.finditer(r’>(.)\n((?s:[^>]))’, fasta)] for ret in Solution().correctError(reads): print(ret)
`` **复杂度分析**:设共有n条reads,且read长度都为k,其中错误reads有m` 条。
- 时间复杂度:,一共分为两个步骤
- 判断正确性:对
reads使用哈希计数器进行正确性判断需要 ; - 纠错:枚举错误
reads需要 且错误read逐位枚举3中可能情况后再查正确reads表需要 ,因此每条错误read纠正需要 时间。
- 判断正确性:对
- 时间复杂度:
,三个哈希表。
