Penalizing Large Insertions and Deletions
In dealing with global alignment in “Global Alignment with Scoring Matrix”, we encountered a linear gap penalty, in which the insertion or deletion of a gap is penalized by some constant times the length of the gap. However, this model is not necessarily the most practical model, as one large rearrangement could have inserted or deleted a long gap in a single step to transform one genetic string into another.
Problem
In a constant gap penalty, every gap receives some predetermined constant penalty, regardless of its length. Thus, the insertion or deletion of 1000 contiguous symbols is penalized equally to that of a single symbol.
Given: Two protein strings ss and tt in FASTA format (each of length at most 1000 aa).
Return: The maximum alignment score between ss and tt. Use:
- The BLOSUM62 scoring matrix.
Constant gap penalty equal to 5.
Sample Dataset
Rosalind_79 PLEASANTLY >Rosalind_41 MEANLY
Sample Output
Solution:动态规划
本题就是连续插入或删除只算一次罚分,因此我们需要在原有的状态转移方程上多加一个状态记录当前是否是连续删除或插入。 :::tips 题目只允许插入
-
,也就是表面上只允许插入操作,何来删除操作呢?
s: A—TCG
t: GAAT—
我们只看s
串,规定其本身连续--
叫连续插入,而t
串中连续--
等价于s
串连续删除。谁插入谁下标暂时不变,另一个下标变化。 ::: 状态定义:dp[i][j][k]
表示s前i个字符和t前j个字符末尾状态为连续删除(k=2)或连续插入(k=1)或非indels(k=0)的最大得分。
状态转移方程:当前
s[i]
状态为非indels,即dp[i][j][0]
,因为不引入连续indels状态,因此只能从对角线格子(i-1,j-1)
转移而来,,但是(i-1,j-1)
有三种状态,因此我们取三个状态最大值再加上匹配函数- 当前
s[i]
状态为连续insert,即dp[i][j][1]
,因为引入连续状态,因此只能从上侧格子(i,j-1)
转移而来,但是(i,j-1)
有三种状态如何抉择与此时当前状态是连续insert起点与否有关?- 当前非连续insert起点,即上一次是连续insert状态,则这次删除不罚分,即
dp[i][j][1] = dp[i][j-1][1]
- 当前连续insert起点,即上一次是非连续insert状态(共有两种状态),则这次删除开始罚分,由于上一次非连续insert状态有连续delete或非连续indels,需要取这两最值,即
dp[i][j][1] = GAP + max(dp[i][j-1][0], dp[i][j-1][2])
- 当前非连续insert起点,即上一次是连续insert状态,则这次删除不罚分,即
- 当前
s[i]
状态为连续delete(t
串正进行连续insert
),即dp[i][j][2]
,因为引入连续delete状态,因此只能从左侧格子(i-1,j)
转移而来,但是(i-1,j)
有三种状态如何抉择与此时当前状态是连续delete起点与否有关?- 当前非连续delete起点,即上一次是连续delete状态,则这次删除不罚分,即
dp[i][j][2] = dp[i-1][j][2]
- 当前连续delete起点,即上一次是非连续delete状态(共有两种状态),则这次删除开始罚分,由于上一次非连续delete状态有连续insert或非连续indels,需要取这两最值,即
dp[i][j][2] = GAP + max(dp[i-1][j][0], dp[i-1][j][1])
- 当前非连续delete起点,即上一次是连续delete状态,则这次删除不罚分,即
状态初始化:当s, t
任一为空时的状态计算。
- 若
s = '', t = ''
,当前s[i]
只有非indels状态,因此dp[0][0][0] = 0, dp[0][0][1] = dp[0][0][2] = -∞
- 若
s = '', t != ''
,当前s[i]
只有连续insert状态,因此dp[0][j][0] = -∞, dp[0][j][1] = GAP, dp[0][j][2] = -∞
- 若
s != '', t = ''
,当前s[i]
只有连续delete状态,因此dp[i][0][0] = -∞, dp[i][0][1] = -∞, dp[i][0][2] = GAP
所求答案:根据状态转移方程定义,最终全场的三种结尾状态的最大值,即max(dp[-1][-1])
from typing import List
ch2idx = lambda ch: ord(ch) - ord('A')
BLOSUM62 = """
A C D E F G H I K L M N P Q R S T V W Y
A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2
C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3
E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2
F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3
G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3
H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2
I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1
K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2
L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7
"""
cost = [[0] * 26 for _ in range(26)]
head, *nxts = (line.split() for line in BLOSUM62.splitlines() if line.strip())
# print(head, nxts)
# 处理表头
for base2, *nums in nxts:
for base1, num in zip(head, nums):
cost[ch2idx(base1)][ch2idx(base2)] = int(num)
GAP = -5
class Solution:
def globalAligmment(self, s: str, t: str) -> int:
"""全局比对"""
m, n = len(s), len(t)
dp = [[[0, 0, 0] for _ in range(n + 1)] for _ in range(m + 1)]
dp[0][0][0], dp[0][0][1], dp[0][0][2] = 0, float('-inf'), float('-inf')
# 1. 状态初始化
for i in range(1, m + 1): # s != '', t = '', 对 s 而言是连续删除
dp[i][0][0], dp[i][0][1], dp[i][0][2] = float('-inf'), float('-inf'), GAP
for j in range(1, n + 1): # s = '', t != '', 对 s 而言是连续插入
dp[0][j][0], dp[0][j][1], dp[0][j][2] = float('-inf'), GAP, float('-inf')
# 2. 状态转移
for i in range(1, m + 1):
for j in range(1, n + 1):
dp[i][j][0] = max(
dp[i-1][j-1][0], dp[i-1][j-1][1], dp[i-1][j-1][2]
) + cost[ch2idx(s[i-1])][ch2idx(t[j-1])]
# s 连续插入
dp[i][j][1] = max(
dp[i][j-1][0] + GAP,
dp[i][j-1][1],
dp[i][j-1][2] + GAP,
)
# t连续插入=s连续删除
dp[i][j][2] = max(
dp[i-1][j][0] + GAP,
dp[i-1][j][1] + GAP,
dp[i-1][j][2],
)
def printPath(i: int, j: int, k: int, path1: List[str], path2: List[str]) -> None:
if i + j == 0:
print(''.join(reversed(path1)))
print(''.join(reversed(path2)))
print()
return
if 0 == k:
matchScore = cost[ch2idx(s[i-1])][ch2idx(t[j-1])]
for k in range(3):
if dp[i][j][0] == dp[i-1][j-1][k] + matchScore:
path1.append(s[i-1])
path2.append(t[j-1])
printPath(i - 1, j - 1, k, path1, path2)
path1.pop()
path2.pop()
elif 1 == k:
for k in range(3):
if dp[i][j][1] == dp[i][j-1][k] + (GAP if k != 1 else 0):
path1.append('-')
path2.append(t[j-1])
printPath(i, j - 1, k, path1, path2)
path1.pop()
path2.pop()
else:
for k in range(3):
if dp[i][j][2] == dp[i-1][j][k] + (GAP if k != 2 else 0):
path1.append(s[i-1])
path2.append('-')
printPath(i - 1, j, k, path1, path2)
path1.pop()
path2.pop()
ans = max(dp[m][n])
for k in range(3):
if dp[m][n][k] == ans:
printPath(m, n, k, [], [])
return max(dp[m][n])
fasta = """
>Rosalind_79
PLEASANTLY
>Rosalind_41
MEANLY
"""
"""
>Rosalind_79
PLSANTLY
>Rosalind_41
MEANLY
"""
import re
fasta = [seq.replace('\n', '') for seq in re.split(r'>.*', fasta) if seq.replace('\n', '')]
print(Solution().globalAligmment(*fasta))
PLEASANTLY
-ME—AN-LY
P | L | E | A | S | A | N | T | L | Y |
---|---|---|---|---|---|---|---|---|---|
- | M | E | - | - | A | N | - | L | Y |
-5 | 2 | 5 | -5 | 4 | 6 | -5 | 4 | 7 |
PLEASANTLY
-MEA—N-LY
P | L | E | A | S | A | N | T | L | Y |
---|---|---|---|---|---|---|---|---|---|
- | M | E | A | - | - | N | - | L | Y |
-5 | 2 | 5 | 4 | -5 | 6 | -5 | 4 | 7 |
PLEASANTLY
-MEAN—-LY
P | L | E | A | S | A | N | T | L | Y |
---|---|---|---|---|---|---|---|---|---|
- | M | E | A | N | - | - | - | L | Y |
-5 | 2 | 5 | 4 | 1 | -5 | 4 | 7 |