Generalizing the Alignment Score

The edit alignment score in “Edit Distance Alignment” counted the total number of edit operations implied by an alignment; we could equivalently think of this scoring function as assigning a cost of 1 to each such operation. Another common scoring function awards matched symbols with 1 and penalizes substituted/inserted/deleted symbols equally by assigning each one a score of 0, so that the maximum score of an alignment becomes the length of a longest common subsequence of ss and tt (see “Finding a Shared Spliced Motif”). In general, the alignment score is simply a scoring function that assigns costs to edit operations encoded by the alignment.
One natural way of adding complexity to alignment scoring functions is by changing the alignment score based on which symbols are substituted; many methods have been proposed for doing this. Another way to do so is to vary the penalty assigned to the insertion or deletion of symbols.
In general, alignment scores can be either maximized or minimized depending on how scores are established. The general problem of optimizing a particular alignment score is called global alignment.

Problem

To penalize symbol substitutions differently depending on which two symbols are involved in the substitution, we obtain a scoring matrix S in which Global Alignment with Scoring Matrix - 图1 represents the (negative) score assigned to a substitution of the ith symbol of our alphabet AA with the jth symbol of A.
A gap penalty is the component deducted from alignment score due to the presence of a gap. A gap penalty may be a function of the length of the gap; for example, a linear gap penalty is a constant g such that each inserted or deleted symbol is charged g; as a result, the cost of a gap of length L is equal to gL.
Given: Two protein strings s and t in FASTA format (each of length at most 1000 aa).
Return: The maximum alignment score between s and t. Use:

  • The BLOSUM62 scoring matrix.
  • Linear gap penalty equal to 5 (i.e., a cost of -5 is assessed for each gap symbol).

    Sample Dataset

    Rosalind_67
    PLEASANTLY
    >Rosalind_17
    MEANLY

    Sample Output

    8

    Solution

    本题就是 NW 算法,学过编辑距离就会。无非就是三个操作的成本增加而已。
    BLOSUM62, given in detail below, is a commonly used scoring matrix in alignment problems considering protein strings. It assigns different alignment scores to substituted amino acids depending on the particular acids substituted; scores were computed based on the relative frequency of one amino acid to be substituted for another in a collection of known alignments.
    The scoring matrix is “symmetric”, which in this case means that the score assigned to substituting symbol xx for symbol y is the same as that of substituting y for x.

    1. A C D E F G H I K L M N P Q R S T V W Y
    2. A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2
    3. C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
    4. D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3
    5. E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2
    6. F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3
    7. G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3
    8. H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2
    9. I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1
    10. K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2
    11. L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1
    12. M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1
    13. N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2
    14. P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3
    15. Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1
    16. R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2
    17. S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2
    18. T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2
    19. V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1
    20. W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2
    21. Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7

    ```python ch2idx = lambda ch: ord(ch) - ord(‘A’)

BLOSUM62 = “”” A C D E F G H I K L M N P Q R S T V W Y A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 “””

cost = [[0] 26 for _ in range(26)] head, nxts = (line.split() for line in BLOSUM62.splitlines() if line.strip())

print(head, nxts)

处理表头

for base2, *nums in nxts: for base1, num in zip(head, nums): cost[ch2idx(base1)][ch2idx(base2)] = int(num)

print(cost)

GAP = -5

class Solution: def globalAligmment(self, s: str, t: str) -> int: “””全局比对””” m, n = len(s), len(t) dp = [[0] * (n + 1) for _ in range(m + 1)]

  1. # 1. 状态初始化
  2. for i in range(1, m + 1):
  3. dp[i][0] = dp[i-1][0] + GAP
  4. for j in range(1, n + 1):
  5. dp[0][j] = dp[0][j-1] + GAP
  6. # 2. dp[i][j] = max(dp[i-1][j-1] + mismatch, dp[i][j-1] + GAP, dp[i-1][j] + GAP, dp[i-1][j-1] + match)
  7. for i in range(1, m + 1):
  8. for j in range(1, n + 1):
  9. dp[i][j] = max(
  10. dp[i-1][j-1] + cost[ch2idx(s[i-1])][ch2idx(t[j-1])], # cost 包含 match/mismatch 分数
  11. dp[i][j-1] + GAP,
  12. dp[i-1][j] + GAP,
  13. )

print(dp[i])

  1. return dp[m][n]

fasta = “””

Rosalind_8291 CGYGLPPYTGSIQWYHLLGEENIWLHRHMRQEKADTGNIFDKYTFGWCFTIVARLNMTGA LFLEMQYCMSAYNPGTEMQRGASFLTEDQTIDFIYDVKDGYVNYIFHDFTIHKSGETPEY LDVYRITALAGNVDVCMAGNTTYQDRYECEIKNLCPRHNTIKKPKSYSIMLMLFHFKFPR SVWCEFIGVNDEQCITECDVRAIMQDASMCRIRPWQVMGELQWGCTIRTDKNLEMKSSPY VPALHLGTYTMYHQNYWNKCLMLGNQYRERISATLNHHAMHGWHKIKHFCQYGRWMDMIE LEDPCVQMELRNMAWNLRKLDTKCWMATWKKWMNYFPGFIVPPKYVCTIIHYVRPGQFCN YLAMDQFECPSMKPMFAHIKVRFLCNKEWFYLPYRCRDSFINSVHVGQFHDIDKWQTIVK KESCMAAGNYWHFWLHTWPGAMSCMHLTIVQSLFHAAPYYSPWCVFAKIVVNDQWRQWHD DTPKTEHSVHGHCKVKFQMHEHCVETEHVIHPLRVHKHWTLGLNMQSASPQTIKGAQQRA AVDDCDPQGWQYTRVDTRATEPACYKAYGMGSYWTADNENSCDCHENYNIRDGSSDTGKP LKKEYWDEWARKTNVMGASGQQGAGMASGGGTTWQEEASVMTDIGPWVQEQRIWQDEWSS RTQPQVDWVFPHDEVTFADTDIKKTPNLHKNQMHLPTVTQCNASGPFIRATKFQLYQKNS ITTMRLWLIWCYEGMAMTGESGEVLCMPPP Rosalind_5240 CGYGLPPYTGSIQWYHNIWGHRHMEQEKADTSNIFDKYTFGWTIDDIYCRLNMTDALKLE MQMSELMGHWKPKYNPGTEMQRTPATIDHEMGRGAVIYDVPDQAFGPYYVNYIFHDFKIG KSNPGKMKSHAVGKCHAPYRITALAGAVLMAGNTTYQDRYCSVFLWGRKSYSIFFIPCMW KFKQSWCCFIGVNKWCQMCFEQCQDLRIRPWQVMGDIMWSRLGLQNTDVWRIKHRTDNLV YYSPALHLGYMQNYWNKCLMLGNQYRERITNVVTRNSATLNHNAMVGWHKIKFCQYGREW ELEDPCVQMELLRKLDYYWYNYFPGFIYICWYQPPKYVCTIIYVRPGQDCNYLAMDQKEC PSMKPMFHDIYSECSHLKVRFLCNKEFYLPYRDSPQKSCYINSSVVGFFHDRKMVGFTDK WQTIVKKEMAYLCMAAGTYWHFWLYHSPETWPGAMSCMHLTIVQSLFFTCAAPKQQYSPV CVFAKIPVNDQWRKCKVKFQYHYILPHSVHDVETEHVIHPLKCQVDKHMQSASPQTMFRK GAQQRAAVWARELRAQRWQYKRVDTYDPACKKAYSMYWTADNENICDNVSPANQMHENYN IRDGTGVTHQQCFTKREYWDEWARKTNVGGMASQMNGGMTDIRFPQRIWVMKPKQQPRVH PQPQVDWSFPHDLEWWEANVTFADIARFFDDVPIDKTPHLHKNQMHLPTVTQCAEWLECC AQGPFIRATKFQLYQKNNITIKQFMRLWLFPCYGGMAMTGRSGEVLCMPPCH “””

import re fasta = [seq.replace(‘\n’, ‘’) for seq in re.split(r’>.*’, fasta) if seq.replace(‘\n’, ‘’)]

print(Solution().globalAligmment(*fasta)) ```