入门题 - Consensus and Profile - 《生物信息》

Finding a Most Likely Common Ancestor
Sample Dataset
Sample Output
Solution

Finding a Most Likely Common Ancestor

In “Counting Point Mutations”, we calculated the minimum number of symbol mismatches between two strings of equal length to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.

Sample Dataset

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

Solution

本题不得不涉及文件操作，也是生物信息学上必须掌握读取 fasta 格式文件的方式。以 > 作为一条序列的开始，那么也是一条序列的结束。最难处理的就是第一条序列和第二条序列。

第一条序列：因为以 > 作为序列终止，因此第一次遇到 > 需要判断 seq 是否为空，即第一条！
最后一条序列：由于没有后继的 > 作为终止，因此最后需要进一步筛选放入。

还有程序要尽量处理一些没用的空白符问题。

然后就是序列普问题，使用一个字典，一个键对应 len(seqs[0]) 长度的列表，然后遍历序列，分别统计每条序列对应位置上的碱基出现次数。最后再求出每个位置最大的就是最保守的序列。

from typing import List
class Solution:
    def readFasta(self, fileName: str) -> List[str]:
        seqs = []
        with open(fileName) as f:
            seq = []
            for line in f:
                line = line.strip()  # withespace
                if line.startswith('>'):  # ID
                    if seq:  # last sequence finished
                        seqs.append(''.join(seq))
                        seq = []
                elif line:  # bases
                    seq.append(line.strip())
            # last sequence
            if seq:
                seqs.append(''.join(seq))
            print(seqs)
        return seqs
    def calculateProfile(self) -> None:
        seqs = self.readFasta('./DNA.fasta')
        if not seqs or not seqs[0]: return
        k = len(seqs[0])
        d = {base: [0] * k for base in 'AGCT'}
        for seq in seqs:
            for idx, base in enumerate(seq):
                d[base][idx] += 1
        print(''.join((max(d, key=lambda x: d[x][i]) for i in range(k))))
        for base in 'ACGT':
            print(f'{base}:', ' '.join(map(str, d[base])))
Solution().calculateProfile()
'''
['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 
'''