Point Mutations Include Insertions and Deletions

In “Counting Point Mutations”, we saw that Hamming distance gave us a preliminary notion of the evolutionary distance between two DNA strings by counting the minimum number of single nucleotide substitutions that could have occurred on the evolutionary path between the two strands.
However, in practice, homologous strands of DNA or protein are rarely the same length because point mutations also include the insertion or deletion of a single nucleotide (and single amino acids can be inserted or deleted from peptides). Thus, we need to incorporate these insertions and deletions into the calculation of the minimum number of point mutations between two strings. One of the simplest models charges a unit “cost” to any single-symbol insertion/deletion, then (in keeping with parsimony) requests the minimum cost over all transformations of one genetic string into another by point substitutions, insertions, and deletions.

Problem

Given two strings s and t (of possibly different lengths), the edit distance dE(s,t) is the minimum number of edit operations needed to transform s into t, where an edit operation is defined as the substitution, insertion, or deletion of a single symbol.
The latter two operations incorporate the case in which a contiguous interval is inserted into or deleted from a string; such an interval is called a gap. For the purposes of this problem, the insertion or deletion of a gap of length k still counts as k distinct edit operations.
Given: Two protein strings s and t in FASTA format (each of length at most 1000 aa).
Return: The edit distance dE(s,t).

Sample Dataset

Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLY

Sample Output

5

Solution

语雀内容

  1. from typing import List
  2. import re
  3. class Solution:
  4. def minDistance(self, word1: str, word2: str) -> int:
  5. n = len(word1)
  6. m = len(word2)
  7. # 有一个字符串为空串
  8. if n * m == 0:
  9. return n + m
  10. # DP 数组
  11. D = [ [0] * (m + 1) for _ in range(n + 1)]
  12. # 边界状态初始化
  13. for i in range(n + 1):
  14. D[i][0] = i
  15. for j in range(m + 1):
  16. D[0][j] = j
  17. # 计算所有 DP 值
  18. for i in range(1, n + 1):
  19. for j in range(1, m + 1):
  20. left = D[i - 1][j] + 1
  21. down = D[i][j - 1] + 1
  22. left_down = D[i - 1][j - 1]
  23. if word1[i - 1] != word2[j - 1]:
  24. left_down += 1
  25. D[i][j] = min(left, down, left_down)
  26. return D[n][m]
  27. fasta = """
  28. >Rosalind_6004
  29. NADSCIPKAAFMHCLSHFFAMSLPNNCKCINDDNVVMYAIHSSYTKLSPVHTTIMGWPKR
  30. MWQPCKWCEAEQAYDDRCTKIGMYCLPCTRQCWFLCCVEWMEHVNQRFVLIDQEERITTC
  31. IMDRTMCSCKSLSFYFMIQNSMHDDITYFFVLHYPHIRMSSRNAHMLGYTPINRLPKHSF
  32. SYIFHHKYYCFANCWMTLVIMGGLPKQWCLSYPFEYYHSNDQITYYVTPVNPWNMFMQTH
  33. MDKMDTKALTCDNHHQTSENWYNRPAHEPHEWLDCTWQYDYENICEDQYKNKVSVYPWAV
  34. AFGMMYRFDTDMSLYQYEKNGLESSWIYVVACKCWDMLDGWWFQVKFNTFDQRASWFIGG
  35. HQHHRDLGCTFQTNMDCAFDLPWVDLNESEFASLNFNNCTHIGFTENWWNGEVAKYAHDM
  36. PNNYVVAFLIVRFATQHTMYYPSSYNQKWSEAEGLFNPHPIVELYLMCGLLYMQPCTMES
  37. KHKKGNFMNMSPWHPWWSWGMWMYWTERHHSDEGAAIKITEKNNHYMYMCQNTESRLNEC
  38. LHTTDAYHLRYEMRENPLFYWFDCQDYCDSKNQLVDRLEDYDCMRPCSFTVASENACVPE
  39. KRGGVLFGAIHGLMPWPSIEGQYQRKWYKLCYMCIACMQSKLEIVPPYDGPVEKRMFQIC
  40. QPLELFEYKPKCDNGFWPHTKLNEFWNPDNNNIQFWGGWNFIKRNARTVMIYMNPMQKQG
  41. LDGYNSLVKPKNCNFMHGQYMPMSEIAMPQKPLNENWPQRKMYALDATDCGQWYFSTRNN
  42. HPWQFNRHFKVSTGWNKGCSLRFGLESDCERPCLEYFIFDRFHWTAGHVGDRLSYVDHRE
  43. YVNCPCRPATPWALALIVHWPAEDTWIRYYDSELDYHDNDE
  44. >Rosalind_7611
  45. NADCKTECIPKPAFMHKLSHFFAMSLPDNCKNQDNVVCYAIHSSYTNLSPVHTTIRGYCV
  46. PKRMWQPCAYDDRCTKIGMYCLPCIRQCWFLCHDQEERITTQPGTSWKTKMWRTECSCKS
  47. LSFIFMIVVNSMHDIVLLQITYFWVLHYPHIRMSRRNAHPIMAELGYTPVCSYIIHHKYY
  48. CFANCWMTLVIAGGLYKYQTVSRKWCLSYPVEPYHSNDQIKYYFTPVRPWNMFSQTHMDW
  49. WDTKALTIFHLMSDNHHQTVENWYPHEMWNDCPWQYDYENICLDQPENKNKPSVYPWATM
  50. DAFGMMDMSLYQYEKNGLESSWIYVVACKFWDMLLLNHWMFNQWDFQVKPNTFDQRASWF
  51. CGGHMPWYAIQHHRDLGCTFQWLNMDCAALRMQWCRQDLPWVDLNFNNCGVKWSPFHIGF
  52. TENWWNGEVAKYYTVQWVDAFLIVRFATQHTMYYPGRYWSELFNPHPICELYLMCQCQPC
  53. TDESKHKRGNFMNMSPDHCHAYFTGWQRRHQWRLIMAKQSDEGKWTTMIGDNAIKGFQYK
  54. CWTWTEKNNHYMYMCQNTESRLNECLKTTDAYHLWAMDYEMRENPLFYWFDSKNQLVDRL
  55. EDYVPCKFLGPHWKNVASENACVPEKRGYWVLFGAIHGLMPWPSIEGQYQRKWYKLCYMV
  56. CEIVDPYDGPVEKRMFQICQPLENEFWLKCHACNGFWPHTKLNEFWNPDWNNIQFWGRWN
  57. THVRIFRIIMRNARAHGNIYMNPMQSQGLDYYKRVKFMHGNYMPGSEIAMPQKWPQRKCM
  58. LYEEKHALMPTDTHWFYDVGQWYLCMARMKATRNNHPWQFNLSRHFKVSTSKGASLRFGL
  59. EIDCEPRHGIMDMCKEYFIFDRGDRLSYQDKSWRAKALIVYWEDMAIRYEISNKCINDE
  60. """
  61. seqs = [seq.replace('\n', '') for seq in re.split(r'>.*', fasta) if seq.replace('\n', '')]
  62. # print(seqs)
  63. print(Solution().minDistance(*seqs))