Finding a Most Likely Common Ancestor

In “Counting Point Mutations”, we calculated the minimum number of symbol mismatches between two strings of equal length to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.
image.png

Sample Dataset

  1. >Rosalind_1
  2. ATCCAGCT
  3. >Rosalind_2
  4. GGGCAACT
  5. >Rosalind_3
  6. ATGGATCT
  7. >Rosalind_4
  8. AAGCAACC
  9. >Rosalind_5
  10. TTGGAACT
  11. >Rosalind_6
  12. ATGCCATT
  13. >Rosalind_7
  14. ATGGCACT

Sample Output

  1. ATGCAACT
  2. A: 5 1 0 0 5 5 0 0
  3. C: 0 0 1 4 2 0 6 1
  4. G: 1 1 6 3 0 1 0 0
  5. T: 1 5 0 0 0 1 1 6

Solution

本题不得不涉及文件操作,也是生物信息学上必须掌握读取 fasta 格式文件的方式。以 > 作为一条序列的开始,那么也是一条序列的结束。最难处理的就是第一条序列和第二条序列。

  1. 第一条序列:因为以 > 作为序列终止,因此第一次遇到 > 需要判断 seq 是否为空,即第一条!
  2. 最后一条序列:由于没有后继的 > 作为终止,因此最后需要进一步筛选放入。

还有程序要尽量处理一些没用的空白符问题。

然后就是序列普问题,使用一个字典,一个键对应 len(seqs[0]) 长度的列表,然后遍历序列,分别统计每条序列对应位置上的碱基出现次数。最后再求出每个位置最大的就是最保守的序列。

  1. from typing import List
  2. class Solution:
  3. def readFasta(self, fileName: str) -> List[str]:
  4. seqs = []
  5. with open(fileName) as f:
  6. seq = []
  7. for line in f:
  8. line = line.strip() # withespace
  9. if line.startswith('>'): # ID
  10. if seq: # last sequence finished
  11. seqs.append(''.join(seq))
  12. seq = []
  13. elif line: # bases
  14. seq.append(line.strip())
  15. # last sequence
  16. if seq:
  17. seqs.append(''.join(seq))
  18. print(seqs)
  19. return seqs
  20. def calculateProfile(self) -> None:
  21. seqs = self.readFasta('./DNA.fasta')
  22. if not seqs or not seqs[0]: return
  23. k = len(seqs[0])
  24. d = {base: [0] * k for base in 'AGCT'}
  25. for seq in seqs:
  26. for idx, base in enumerate(seq):
  27. d[base][idx] += 1
  28. print(''.join((max(d, key=lambda x: d[x][i]) for i in range(k))))
  29. for base in 'ACGT':
  30. print(f'{base}:', ' '.join(map(str, d[base])))
  31. Solution().calculateProfile()
  32. '''
  33. ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
  34. ATGCAACT
  35. A: 5 1 0 0 5 5 0 0
  36. C: 0 0 1 4 2 0 6 1
  37. G: 1 1 6 3 0 1 0 0
  38. T: 1 5 0 0 0 1 1
  39. '''