图论 - Error Correction in Reads - 《生物信息》

Genome Sequencing Isn’t Perfect
Problem
Sample Dataset
Sample Output
Solution

Genome Sequencing Isn’t Perfect

In “Genome Assembly as Shortest Superstring”, we introduce the problem of assembling a genome from a collection of reads. Even though genome sequencing is a multi-billion dollar enterprise, sequencing machines that identify reads still produce errors a substantial percentage of the time. To make matters worse, these errors are unpredictable; it is difficult to determine if the machine has made an error, let alone where in the read the error has occurred. For this reason, error correction in reads is typically a vital first step in genome assembly.

Problem

As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:

s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).

Return: A list of all corrections in the form “[old read]->[new read]”. (Each correction must be a single symbol substitution, and you may return the corrections in any order.)

Sample Dataset

Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC

Sample Output
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA

Solution
```python from typing import List, Tuple, Set import re

class Solution: def reverseComplement(self, seq: str) -> str: table = dict(zip(‘ATCG’, ‘TAGC’)) return ‘’.join(table[base] for base in reversed(seq))

def correctError(self, reads: List[str]) -> List[str]:
    """将错误的 reads 通过以下规则修改为正确 reads
    正确 reads：其自身或者反向互补链在所有 reads 中至少出现两次
    错误 reads：通过改变自身一个碱基(汉明距离变化为 1)得到新 read，
        若其或其反向互补序列 在 正确 reads 中则成功纠正。
    """
    from collections import Counter
    # 1. 统计正确/错误 reads
    reads = Counter(reads)
    correctReads = set()
    incorrectReads = set()
    for read, cnt in reads.items():
        if cnt == 1 and self.reverseComplement(read) not in reads:
            incorrectReads.add(read)
        else:
            correctReads.add(read)
    # 2. 错误 read 在 正确 reads 中找到汉明距离为 1 就可以纠正
    correctInfo = []
    for icr in incorrectReads:
        seq = list(icr)
        for i in range(len(seq)):
            # 每次改变 seq[i] 上的碱基，查找新 read 或新 read 反向互补序列是否在正确 Reads 中
            flag = False
            originChar = seq[i]
            for base in 'ATCG':
                if base == originChar: continue
                seq[i] = base  # 修改
                newseq = ''.join(seq)
                if newseq in correctReads or self.reverseComplement(newseq) in correctReads:
                    correctInfo.append(f'{icr}->{newseq}')
                    flag = True
                    break
            # 当前位置 seq[i] 已经找到就不需要查看 seq[i+1,n-1] ，因此 break
            if flag: break
            seq[i] = originChar  # 恢复
    return correctInfo

fasta = “””

Rosalind_52 TCATC Rosalind_44 TTCAT Rosalind_68 TCATC Rosalind_28 TGAAA Rosalind_95 GAGGA Rosalind_66 TTTCA Rosalind_33 ATCAA Rosalind_21 TTGAT Rosalind_18 TTTCC “”” reads = [mat.group(2).replace(‘\n’, ‘’) for mat in re.finditer(r’>(.)\n((?s:[^>]))’, fasta)] for ret in Solution().correctError(reads): print(ret) `` **复杂度分析**：设共有n条reads，且read长度都为k，其中错误reads有m` 条。

时间复杂度：，一共分为两个步骤
- 判断正确性：对 reads 使用哈希计数器进行正确性判断需要；
- 纠错：枚举错误 reads 需要且错误 read 逐位枚举 3 中可能情况后再查正确 reads 表需要，因此每条错误 read 纠正需要时间。
时间复杂度：，三个哈希表。