Point Mutations Include Insertions and Deletions
In “Counting Point Mutations”, we saw that Hamming distance gave us a preliminary notion of the evolutionary distance between two DNA strings by counting the minimum number of single nucleotide substitutions that could have occurred on the evolutionary path between the two strands.
However, in practice, homologous strands of DNA or protein are rarely the same length because point mutations also include the insertion or deletion of a single nucleotide (and single amino acids can be inserted or deleted from peptides). Thus, we need to incorporate these insertions and deletions into the calculation of the minimum number of point mutations between two strings. One of the simplest models charges a unit “cost” to any single-symbol insertion/deletion, then (in keeping with parsimony) requests the minimum cost over all transformations of one genetic string into another by point substitutions, insertions, and deletions.
Problem
Given two strings s and t (of possibly different lengths), the edit distance dE(s,t) is the minimum number of edit operations needed to transform s into t, where an edit operation is defined as the substitution, insertion, or deletion of a single symbol.
The latter two operations incorporate the case in which a contiguous interval is inserted into or deleted from a string; such an interval is called a gap. For the purposes of this problem, the insertion or deletion of a gap of length k still counts as k distinct edit operations.
Given: Two protein strings s and t in FASTA format (each of length at most 1000 aa).
Return: The edit distance dE(s,t).
Sample Dataset
Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLYSample Output
Solution
from typing import List
import re
class Solution:
def minDistance(self, word1: str, word2: str) -> int:
n = len(word1)
m = len(word2)
# 有一个字符串为空串
if n * m == 0:
return n + m
# DP 数组
D = [ [0] * (m + 1) for _ in range(n + 1)]
# 边界状态初始化
for i in range(n + 1):
D[i][0] = i
for j in range(m + 1):
D[0][j] = j
# 计算所有 DP 值
for i in range(1, n + 1):
for j in range(1, m + 1):
left = D[i - 1][j] + 1
down = D[i][j - 1] + 1
left_down = D[i - 1][j - 1]
if word1[i - 1] != word2[j - 1]:
left_down += 1
D[i][j] = min(left, down, left_down)
return D[n][m]
fasta = """
>Rosalind_6004
NADSCIPKAAFMHCLSHFFAMSLPNNCKCINDDNVVMYAIHSSYTKLSPVHTTIMGWPKR
MWQPCKWCEAEQAYDDRCTKIGMYCLPCTRQCWFLCCVEWMEHVNQRFVLIDQEERITTC
IMDRTMCSCKSLSFYFMIQNSMHDDITYFFVLHYPHIRMSSRNAHMLGYTPINRLPKHSF
SYIFHHKYYCFANCWMTLVIMGGLPKQWCLSYPFEYYHSNDQITYYVTPVNPWNMFMQTH
MDKMDTKALTCDNHHQTSENWYNRPAHEPHEWLDCTWQYDYENICEDQYKNKVSVYPWAV
AFGMMYRFDTDMSLYQYEKNGLESSWIYVVACKCWDMLDGWWFQVKFNTFDQRASWFIGG
HQHHRDLGCTFQTNMDCAFDLPWVDLNESEFASLNFNNCTHIGFTENWWNGEVAKYAHDM
PNNYVVAFLIVRFATQHTMYYPSSYNQKWSEAEGLFNPHPIVELYLMCGLLYMQPCTMES
KHKKGNFMNMSPWHPWWSWGMWMYWTERHHSDEGAAIKITEKNNHYMYMCQNTESRLNEC
LHTTDAYHLRYEMRENPLFYWFDCQDYCDSKNQLVDRLEDYDCMRPCSFTVASENACVPE
KRGGVLFGAIHGLMPWPSIEGQYQRKWYKLCYMCIACMQSKLEIVPPYDGPVEKRMFQIC
QPLELFEYKPKCDNGFWPHTKLNEFWNPDNNNIQFWGGWNFIKRNARTVMIYMNPMQKQG
LDGYNSLVKPKNCNFMHGQYMPMSEIAMPQKPLNENWPQRKMYALDATDCGQWYFSTRNN
HPWQFNRHFKVSTGWNKGCSLRFGLESDCERPCLEYFIFDRFHWTAGHVGDRLSYVDHRE
YVNCPCRPATPWALALIVHWPAEDTWIRYYDSELDYHDNDE
>Rosalind_7611
NADCKTECIPKPAFMHKLSHFFAMSLPDNCKNQDNVVCYAIHSSYTNLSPVHTTIRGYCV
PKRMWQPCAYDDRCTKIGMYCLPCIRQCWFLCHDQEERITTQPGTSWKTKMWRTECSCKS
LSFIFMIVVNSMHDIVLLQITYFWVLHYPHIRMSRRNAHPIMAELGYTPVCSYIIHHKYY
CFANCWMTLVIAGGLYKYQTVSRKWCLSYPVEPYHSNDQIKYYFTPVRPWNMFSQTHMDW
WDTKALTIFHLMSDNHHQTVENWYPHEMWNDCPWQYDYENICLDQPENKNKPSVYPWATM
DAFGMMDMSLYQYEKNGLESSWIYVVACKFWDMLLLNHWMFNQWDFQVKPNTFDQRASWF
CGGHMPWYAIQHHRDLGCTFQWLNMDCAALRMQWCRQDLPWVDLNFNNCGVKWSPFHIGF
TENWWNGEVAKYYTVQWVDAFLIVRFATQHTMYYPGRYWSELFNPHPICELYLMCQCQPC
TDESKHKRGNFMNMSPDHCHAYFTGWQRRHQWRLIMAKQSDEGKWTTMIGDNAIKGFQYK
CWTWTEKNNHYMYMCQNTESRLNECLKTTDAYHLWAMDYEMRENPLFYWFDSKNQLVDRL
EDYVPCKFLGPHWKNVASENACVPEKRGYWVLFGAIHGLMPWPSIEGQYQRKWYKLCYMV
CEIVDPYDGPVEKRMFQICQPLENEFWLKCHACNGFWPHTKLNEFWNPDWNNIQFWGRWN
THVRIFRIIMRNARAHGNIYMNPMQSQGLDYYKRVKFMHGNYMPGSEIAMPQKWPQRKCM
LYEEKHALMPTDTHWFYDVGQWYLCMARMKATRNNHPWQFNLSRHFKVSTSKGASLRFGL
EIDCEPRHGIMDMCKEYFIFDRGDRLSYQDKSWRAKALIVYWEDMAIRYEISNKCINDE
"""
seqs = [seq.replace('\n', '') for seq in re.split(r'>.*', fasta) if seq.replace('\n', '')]
# print(seqs)
print(Solution().minDistance(*seqs))