Beware of Alignment Inference

In “Edit Distance Alignment”, we introduced the concept of an alignment of two genetic strings having differing lengths with respect to edit distance. This provided us with a way of visualizing a specific collection of symbol substitutions, insertions, and deletions that could have taken place on the evolutionary path between the two strings.
However, simply finding one optimal alignment and declaring that it represents a true evolutionary history is a dangerous idea because the actual evolutionary picture may be suboptimal. For that matter, the collection of all optimal alignments may be huge, and the characteristics of these alignments could differ widely.
In order to begin analyzing the collection of optimal alignments for a pair of strings, the first question we will ask is simple: just how many optimal alignments exist?

Problem

Recall from “Edit Distance Alignment” that if s′ and t′ are the augmented strings corresponding to an alignment of strings s and t, then the edit alignment score of s′ and t′ was given by the Hamming distance dH(s′,t′) (because s′ and t′ have the same length and already include gap symbols to denote insertions/deletions).
As a result, we obtain Counting Optimal Alignments - 图1, where the minimum is taken over all alignments of s and t. Strings s′ and t′ achieving this minimum correspond to an optimal alignment with respect to edit alignment score.
Given: Two protein strings s and t in FASTA format, each of length at most 1000 a.
Return: The total number of optimal alignments of s and t with respect to edit alignment score, modulo 134,217,727 (2-1).

Sample Dataset

Rosalind_78
PLEASANTLY
>Rosalind_33
MEANLY

Sample Output

4

Why Are We Counting Modulo 134,217,727?

As a simple example, say that we attempt to count some statistic modulo 10. If computing the statistic requires us to multiply a collection of integers, and at any point we multiply by a multiple of 10, then the statistic will automatically become a multiple of 10 and thus congruent to 0 modulo 10. A similar issue can arise when we count a huge number modulo any composite number; however, if we count modulo a large prime number p (i.e., one without any divisors other than itself), then problems can only ever arise if when counting our statistic, we multiply by a multiple of p.

Solution

其实在这题就是记录编辑距离的最优路径来源,也就是判断 dp[i][j] = min(dp[i-1][j-1] + (s[i-1] != t[j-1]), dp[i][j-1] + 1, dp[i-1][j] + 1);dp[i][j] 来自这三个最小值种个树。 ```cpp

include

include

include

include

include

using namespace std;

class Solution { public: int countingOptimalAlignments(string &s, string &t) { const int modulus = 134217727; int m = s.size(), n = t.size(); // dp[i][j][0] 表示 s[0,i-1] -> t[0,j-1] 的最小编辑次数;dp[i][j][1]表示最优路径条数 vector>> dp(m + 1, vector> (n + 1, {0, 1})); // 1. 状态初始化 for (int i = 1; i <= m; ++i) { // 删除 s[i-1] dp[i][0].first = dp[i-1][0].first + 1; } for (int j = 1; j <= n; ++j) { // 删除 t[j-1] dp[0][j].first = dp[0][j-1].first + 1; } // 2. 状态转移方程 for (int i = 1; i <= m; ++i) { for (int j = 1; j <= n; ++j) { // (1) 来自对角线 int diagonal = dp[i-1][j-1].first + (s[i-1] != t[j-1]); // (2) s[i-1] 插入 t[j-1] 元素(即 删除 t[j-1] 元素) int left = dp[i][j-1].first + 1; // (3) t[j-1] 插入 s[i-1] 元素(即 删除 s[i-1] 元素) int up = dp[i-1][j].first + 1; // 取最小来源 dp[i][j].first = min({diagonal, left, up}); // 计算路径来源 int cnt = 0; if (diagonal == dp[i][j].first) { cnt = (cnt + dp[i-1][j-1].second) % modulus; } if (up == dp[i][j].first) { cnt = (cnt + dp[i-1][j].second) % modulus; } if (left == dp[i][j].first) { cnt = (cnt + dp[i][j-1].second) % modulus; } dp[i][j].second = cnt; } } return dp[m][n].second; } };

vector readFasta() { vector seqs; string line, seq; while (getline(cin, line)) { if (0 == line.find(‘>’)) { if (!seq.empty()) seqs.push_back(seq); seq.clear(); } else { seq += line; } } if (!seq.empty()) seqs.push_back(seq); // 别忘记最后还有一个序列 return seqs; }

int main () { vector seqs = readFasta(); cout << Solution().countingOptimalAlignments(seqs[0], seqs[1]) << endl; return 0; } ```