Introduction to Distance-Based Phylogeny

A number of different approaches are used to build phylogenies, each one featuring its own computational strengths and weaknesses. One of these measures is distance-based phylogeny, which constructs a tree from evolutionary distances calculated between pairs of taxa.
A wide assortment of different measures exist for quantifying this evolutionary distance. Once we have selected a distance function and used it to calculate the distance between every pair of taxa, we place these pairwise distance functions into a table.
In this problem, we will consider an evolutionary function based on Hamming distance. Recall from “Counting Point Mutations” that this function compares two homologous strands of DNA by counting the minimum possible number of point mutations that could have occurred on the evolutionary path between the two strands.

Problem

For two strings s1 and s2 of equal length, the p-distance between them, denoted dp(s1,s2), is the proportion of corresponding symbols that differ between s1 and s2.
For a general distance function dd on nn taxa s1,s2,…,sn (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a distance matrix D in which Di,j=d(si,sj).
Given: A collection of n (n≤10) DNA strings s1,…,sns1,…,sn of equal length (at most 1 kbp). Strings are given in FASTA format.
Return: The matrix D corresponding to the p-distance dp on the given strings. As always, note that your answer is allowed an absolute error of 0.001.

Sample Dataset

Rosalind_9499
TTTCCATTTA
>Rosalind_0942
GATTCATTTC
>Rosalind_6568
TTTCCATTTT
>Rosalind_1833
GTTCCATTTA

Sample Output

0.00000 0.40000 0.10000 0.10000
0.40000 0.00000 0.40000 0.30000
0.10000 0.40000 0.00000 0.20000
0.10000 0.30000 0.20000 0.00000

Solution

距离定义为等长两序列之间不同对的碱基个数再除一条序列长度的比值。

  1. from typing import List
  2. import re
  3. class Solution:
  4. def distanceMatrix(self, seqs: List[str]) -> List[List[float]]:
  5. n, k = len(seqs), len(seqs[0])
  6. matrix = [[0.00000] * n for _ in range(n)]
  7. for i in range(n):
  8. for j in range(i, n):
  9. matrix[i][j] = matrix[j][i] = sum(a != b for a, b in zip(seqs[i], seqs[j])) / k
  10. return matrix
  11. fasta = """
  12. >Rosalind_9499
  13. TTTCCATTTA
  14. >Rosalind_0942
  15. GATTCATTTC
  16. >Rosalind_6568
  17. TTTCCATTTT
  18. >Rosalind_1833
  19. GTTCCATTTA
  20. """
  21. seqs = list(seq.replace('\n', '') for seq in re.findall('[ATCG\n]+', fasta) if seq.replace('\n', ''))
  22. for lst in Solution().distanceMatrix(seqs):
  23. for p in lst:
  24. print(f'{p:.5f}', end=' ')
  25. print()