Pitfalls of Reversing Translation

When researchers discover a new protein, they would like to infer the strand of mRNA from which this protein could have been translated, thus allowing them to locate genes associated with this protein on the genome. Unfortunately, although any RNA string can be translated into a unique protein string, reversing the process yields a huge number of possible RNA strings from a single protein string because most amino acids correspond to multiple RNA codons (see the RNA Codon Table). Because of memory considerations, most data formats that are built into languages have upper bounds on how large an integer can be: in some versions of Python, an “int” variable may be required to be no larger than 231−1231−1, or 2,147,483,647. As a result, to deal with very large numbers in Rosalind, we need to devise a system that allows us to manipulate large numbers without actually having to store large numbers.

Problem
For positive integers aa and nn, aa modulo nn (written amodnamodn in shorthand) is the remainder when aa is divided by nn. For example, 29mod11=729mod11=7 because 29=11×2+729=11×2+7.
Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that aa and bb are congruent modulo nn if amodn=bmodnamodn=bmodn; in this case, we use the notation a≡bmodna≡bmodn.
Two useful facts in modular arithmetic are that if a≡bmodna≡bmodn and c≡dmodnc≡dmodn, then a+c≡b+dmodna+c≡b+dmodn and a×c≡b×dmodna×c≡b×dmodn. To check your understanding of these rules, you may wish to verify these relationships for a=29a=29, b=73b=73, c=10c=10, d=32d=32, and n=11n=11.
As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.
Given: A protein string of length at most 1000 aa.
Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don’t neglect the importance of the stop codon in protein translation.)
Sample Dataset
MA
Sample Output
12
Hint
What does it mean intuitively to take a number modulo 1,000,000?

解题思路

其实本题就是问如何根据密码表,从蛋白质序列推导出所有的 RNA 序列(包含终止子)所有组合可能。由于结果可能很大,因此大部分编程语言整数都有范围,因此题目意思要让结果进行取 1000000 模的结果。

  1. UUU F CUU L AUU I GUU V
  2. UUC F CUC L AUC I GUC V
  3. UUA L CUA L AUA I GUA V
  4. UUG L CUG L AUG M GUG V
  5. UCU S CCU P ACU T GCU A
  6. UCC S CCC P ACC T GCC A
  7. UCA S CCA P ACA T GCA A
  8. UCG S CCG P ACG T GCG A
  9. UAU Y CAU H AAU N GAU D
  10. UAC Y CAC H AAC N GAC D
  11. UAA Stop CAA Q AAA K GAA E
  12. UAG Stop CAG Q AAG K GAG E
  13. UGU C CGU R AGU S GGU G
  14. UGC C CGC R AGC S GGC G
  15. UGA Stop CGA R AGA R GGA G
  16. UGG W CGG R AGG R GGG G

例如题目给定蛋白序列 MAMAUG 即启动子对应的蛋氨酸,然后接上 A (有 GCU, GCC, GCA, GCG 四种可能),再加上终止子(有 UAA, UAG, UGA 三种)。因此有 4 * 3 = 12 种。

出现这种原因就是氨基酸由多个密码子决定,即一对多的关系。 Inferring mRNA from Protein - 图1因此本题就是找到每个氨基酸对应有多少个密码子,最后累成即可,不要忘记终止密码子有 3 种可能。

  1. codonTable = """
  2. UUU F CUU L AUU I GUU V
  3. UUC F CUC L AUC I GUC V
  4. UUA L CUA L AUA I GUA V
  5. UUG L CUG L AUG M GUG V
  6. UCU S CCU P ACU T GCU A
  7. UCC S CCC P ACC T GCC A
  8. UCA S CCA P ACA T GCA A
  9. UCG S CCG P ACG T GCG A
  10. UAU Y CAU H AAU N GAU D
  11. UAC Y CAC H AAC N GAC D
  12. UAA Stop CAA Q AAA K GAA E
  13. UAG Stop CAG Q AAG K GAG E
  14. UGU C CGU R AGU S GGU G
  15. UGC C CGC R AGC S GGC G
  16. UGA Stop CGA R AGA R GGA G
  17. UGG W CGG R AGG R GGG G
  18. """
  19. # 1. 创建字符频率表
  20. cnt = {}
  21. for line in codonTable.splitlines():
  22. line = line.split()
  23. for i in range(0, len(line), 2):
  24. cnt[line[i+1]] = cnt.get(line[i+1], 0) + 1
  25. seqs = input()
  26. tot = 3 # Stop
  27. for base in seqs:
  28. tot = (tot * cnt[base]) % 1_000_000
  29. print(tot)

输入用例:

  1. MDVMIADTAIGKKPGDSRPDWLKIDQDHVCMLFPRRTWTDPVALLCRYCYMYWTKFACAFSHHDKSKAHPKHFPTQPWVWEANIIKQCKIGEAHLQDFLGNNFVDECTTYFHLGTEYAGRKWWDNWGIQGTDWVGSTGAALATCYEWPEMDCCNDHLKNQCKKVRDSCMCVYFKVTWKGWICIVDSYICLATVNYLKCPQVLFGKNCHDPFMSVGYKVQIIYRDIFWQVRRAWLTTQRVNIRWDPRPYSWQYVAEFSPWVKSGIMFTDQHEDTAVLQVYFARGEPFLFTNEECDMYIIDAPQSVNDQATEHRYLMFFREDPWFLAFLESMHNNMCDNMFWKEYKICTEDIEQCGRADAHFCKFWDYYHKRYVNCYIDPFDHYFAMVLEINEVNQEWWKYHERQYKCCFNHDDQRLTLCLCQWEQQNGCILSKLEMVKQPMVEVSRNMPVGDCQWIWVCRWWALGLSQRCRMQPQRHPIDSGSVGLILDCIGHECYEWKDVSALKKESHQEMADGVLFCFWNAEAADMNQPILHCPIQDQFKDGLYLWVKVYRNLKIPYGEEHSWKYHHFAAKWSHSIYGHSMQCDPEWVKIEMALMECELNETHGDFQLEQFASGYLYQECENVFNCLKHLDHMDLWNMVFFDMQGDMNWGQRERMEMSGYCWHTGHADKRWPVWNNMIMANMFCHLWKWLGFFYNQHLHHTNQEWTMQYEKHTHDAEFWQMWQCQAKQCMNHDMTAHRSSPKFWSWHSCGSQSMIAQYTYPYFFYNVYHFHQDTYDGTFQHLFNWSMWIYFNVWDEPTSTNWFPCKSMDFGQRNYRKMDDGEWFHIQDWLNESMRCADQQMIVMNLIDHGNHAAQPWVFDCKYKRYCYAPHWHGNVPTRAFFLWIRMVNPQPVGSRVDYDYLCVMRQLKFLPGAMLTGEIAVEWEDTICTFRVPTQIWFPLECRYGHNCYEQRYSWQCDNKCETHCEADRDKRYWKCPNEENVLKLYKSSSNFPQYA
  2. 938432