Identifying Unknown DNA Quickly

A quick method used by early computer software to determine the language of a given piece of text was to analyze the frequency with which each letter appeared in the text. This strategy was used because each language tends to exhibit its own letter frequencies, and as long as the text under consideration is long enough, software will correctly recognize the language quickly and with a very low error rate. See Figure 1 for a table compiling English letter frequencies.
You may ask: what in the world does this linguistic problem have to do with biology? Although two members of the same species will have different genomes, they still share the vast percentage of their DNA; notably, 99.9% of the 3.2 billion base pairs in a human genome are common to almost all humans (i.e., excluding people having major genetic defects). For this reason, biologists will speak of the human genome, meaning an average-case genome derived from a collection of individuals. Such an average case genome can be assembled for any species, a challenge that we will soon discuss.
The biological analog of identifying unknown text arises when researchers encounter a molecule of DNA from an unknown species. Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule. Thus, to analyze the symbol frequencies of DNA for comparison against a database, we compute the molecule’s GC-content, or the percentage of its bases that are either cytosine or guanine.
In practice, the GC-content of most eukaryotic genomes hovers around 50%. However, because genomes are so long, we may be able to distinguish species based on very small discrepancies in GC-content; furthermore, most prokaryotes have a GC-content significantly higher than 50%, so that GC-content can be used to quickly differentiate many prokaryotes and eukaryotes by using relatively small DNA samples.
Computing GC Content - 图1
Figure 1. The table above was computed from a large number of English words and shows for any letter the frequency with which it appears in those words. These frequencies can be used to reliably identify a piece of English text and differentiate it from that of another language. Taken from http://en.wikipedia.org/wiki/File:Englishletter_frequency(frequency).svg..svg.)

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.
In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

  1. >Rosalind_6404
  2. CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
  3. TCCCACTAATAATTCTGAGG
  4. >Rosalind_5959
  5. CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
  6. ATATCCATTTGTCAGCAGACACGC
  7. >Rosalind_0808
  8. CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
  9. TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

  1. Rosalind_0808
  2. 60.919540

解题思路

关键是如何用 C 处理按行输入问题,非常麻烦。

  1. #include <stdio.h>
  2. #include <string.h>
  3. #define N 30
  4. int main() {
  5. float maxContent = 0;
  6. char ch = getchar(), seqId[N], maxSeq[N];
  7. while (ch != EOF) {
  8. // 1. handle the seqID
  9. if ('>' == ch) {
  10. ch = getchar();
  11. int idx = 0;
  12. while ('\n' != ch && ch != EOF) {
  13. seqId[idx++] = ch;
  14. ch = getchar();
  15. }
  16. seqId[idx] = '\0';
  17. }
  18. // 2. handle the base(ch = '\n')
  19. int G = 0, C = 0, len = 0;
  20. while (ch != EOF && ch != '>') {
  21. G += ch == 'G';
  22. C += ch == 'C';
  23. len += ch != '\n';
  24. ch = getchar();
  25. }
  26. // 3. compare the max value
  27. if (((G + C) * 100.0) / len > maxContent) {
  28. strcpy(maxSeq, seqId);
  29. maxContent = ((G + C) * 100.0) / len;
  30. }
  31. }
  32. printf("%s\n%f\n", maxSeq, maxContent);
  33. return 0;
  34. }

测试用例集

  1. >Rosalind_0605
  2. ATGAATGTTTGACTACTGGTGCGCTCGGGACTCACTATGACATGAAAGGTAATACTTCAA
  3. AGGTTTCGAATGCTTACGTTTATCACCCAGGCCCAATACCAGAGGAATGACTCCAGGGCG
  4. CACGCTACCCAGAAGCGAGTCATGGACATTACTATAACCCAATGCGAGAAAGATTTACTG
  5. GTTGAGTTGAAAGGGTATGCCGGATACTCATCTTTGACGACTTTTGCTTGTGCTCTAGAA
  6. TTCTCGTGTTTGCTGGTAGGAGTGGACCAATTCATCTTTGCCTCGATAAGGAGGTTGCCC
  7. ACGGTGCCTTCATGCGGGCGCAAGCGTGCCCCCCTCTTTTATAAGCGCGCCTTCAAGTCG
  8. GTAGTGATTCCCCGGATTTAATCAGACACGTAACCTCATTTCACTTTGCGGTGATTTTGC
  9. TACAATAGCTCAGGTGCAATCAGCACAGTCTTAAGGCCCCAATAGGGAAGCCAGGTCTGG
  10. GCCCTTTGAGGCGGAAAGAATAGCCAATGCCAGATCGGTTATCAGGAACATTGTTTGCCG
  11. ATGACTGTCTAACAGAAGCGGTAGGGGAGATGATTTCCAGGTTCCTCGCGTTATCACGAG
  12. TACGAGTTCGGGGAACCTGTCTAGTCAGCAACCCGCAGTCACCAGATTGGGATGCGTGGC
  13. TCTGAACAGGACCCGGTGACGGCTGTTGGTGCTTCCCCTATTGGCGAGTGAGCATAGTAA
  14. GTCGTATAGGGTGCGTGTACTATTCAGGTAAGAATCGTTAATTTCTCCTATCCCCCGGAA
  15. CCCTGACTCTAAGCCAAATTCGACGCTAATCGCTAAAGCTATGTTTACACGGGATAGAGG
  16. TGCCTGCGAGTCCACTGAAGATCGAACACTAAAAGGGGGTAAGGCCATGTTTATGGCCTA
  17. ATCGGTATTAAAAAAGTA
  18. >Rosalind_2432
  19. GCATTAGGAGGTCTAGGCGACTCACGAGGTGGTTCTATACGCCACCTTATGAAAGGCAGT
  20. AGCAAGTCCTTCAAGTTCTATATGTGATGATGCAGACACAGCAGTTTAACGGATTAGCCA
  21. ACGAGCCTGTGGGAGGATGACTGAGGTCTCTCGCTAGTGGGCGTGACCACTTTATTAGGA
  22. GTCGTGCTCACTACGGATGAAATGTCCGTCATGATTGGGCAAGACCCACGGATTACAACA
  23. TATCTTTTTATCCCCGCACGCTACGAGCCCTTGAACCCCTAATCGCCACGCCAATCAATC
  24. TAGTGCGGCGAACCAATATGATAGCGAAGTGAACCGTGATCGGAGTCTATGTCGTCTTCT
  25. GTTTTTGGGACACGCACGAAAGTTCAACAAATCTTAGTCACTTGAGAATATGTTCGGTTC
  26. TATAGCACCCTAAGCCCGAAAGACACAGGAAGATCATGCCCCGTAATCACTGGGTACTAT
  27. CTAGGTCACAGGCCCCCATAGTACAAGAACTGCTAGGCGACCAGTGGAACACCCCTAAGG
  28. TATTGCACCGGGGGCCCCACTTAGCGCTTAAGCGTACATGAACCTGCTCGCGCATACCCA
  29. TTTGTCAGGGCGCGATAGCAAATACGTCGCTGGCTCGATTTTAGCGCGCTCGTCAACTAT
  30. ACATAGAGAATCTCAGCGAGGATGGGTAGACCTAACGACATCCTAGCCATCACATACCTT
  31. ACTCCGGGGTCTGTTGTCACCGACCGGCAGAGTATTACAGGACTCAATTACTTAGCATAT
  32. AGTTTGTGTAGCACACATAAATTCCCCTAACCTGGGCGCCACGGAAAGGTCCTATAGCGA
  33. GCATGATTACTTAAATCTTTGGAGGATTTGCTGTGCCGACCTATGGGAGAAGCAGTTTTT
  34. ACCGGAACATTGATGTAGCTGTTTGTCCCCTACACATCCCATTATATAAAAGTCCCATAA
  35. GGTACATATATCCAAAGTCAGATGCGAT
  36. >Rosalind_3691
  37. ATCGGCAGTTCGTAGGTTATTCATGGTGAACTAAGGGATTTGGGTCAATCTGCTACGACA
  38. CACTTCGCTAGAGCCTTTGAAGGGAATCTCACCGAGGCACTTGCCACTGTCCACGCACGG
  39. GGGTTTATATATCTGGCATGTCCTATTTTGTAGCGGCGGCTGCGTGGGACTCCAGTGCGA
  40. GTTCCTTCCCAGTATGATGAAAGCATACGACAAGCTGGGTATGCGAACGGGATGGTTGCG
  41. TCGGTATACGATATTACGTGACACGTGTTCGTCAATGCATTAAATCCACGTCTACTGTTA
  42. GACGGACCAATAAGTAGCCGGTCATAACGCCCAGAACTTTCAAAATCGTACGGGCAAGGC
  43. TTCACGAAGTGTCCCCCGATCTTTGCCCCATCTTAGGACGAGCAATGACATCCCAAGGAG
  44. GGTCTCGGCATACCATGACGTTATGGCGCCGATGACTAAAACCCCCTAATATCGACTGCG
  45. GAGTCGTCCTGTTTGTGTAGAATTTTATAATTTAGGTTGAAACAGCCTTTTGGCCTTCAT
  46. TGGAACCGTTTACCAGATATTCAGGCTTCAGCCAGCACTCAACCACGCCCGGACGTACGC
  47. CCGGACTCGCAAAGGATTTTTCATAAGGATATTGAATAAAAAGTGGCGAGGGAAATACTT
  48. TGCGTGGCTGCCGATGCCGAACGGCCGATATCTCAGCACCGAGATAAACAGGTGAAAGTA
  49. TATGCAGACGCAATCTCGTACGGAGATCATTGGATGATCCCGTATGCCTAACGTTACATT
  50. GAATACAACCGTTCCCGTTCAGGAAGACTGGCCAAACGGTGTTACTCAGTCACTGCCGCT
  51. ACTCTGTTGGCACTTTAGAAAATGGTTGCCAATCTTCGCAACACTTGATTTCGCCGCTGA
  52. TTAGCGAAACGCGGGAATTCACTACACTTCTCCGTGCGCTACCAATAAGAGTACAGGCGC
  53. CG
  54. >Rosalind_5892
  55. GCTACTTGCATCACCATGCAGCAACGCTACGTATAGGGCACATCCCACGTACGGAAGATT
  56. GCTTGCTTCCATCTCTTGGGCACGGACTGCACGATGCATTGTGCGACCTCAGAGCTCATC
  57. TTCCAGATTGTCTTCCGTTAGTGCCGCCCCAGTTTGGGTCCTGAGGCTCTATTGCGTTCC
  58. GCAGTTGCCTTACGACATCGAGTATTTTCGTCGCCTAGCGTCGCTGCCTAATCTCTAACT
  59. TTGGTTGAACGGCCGACGAGGTGATAGAACAAAGGATTTACTTCTCTACGTATTGCACGA
  60. TCGCTGCTGGTGCCAAAGGCAATGTAGAATAGTACCCTAACAAACAGAGGTAACACCTAG
  61. AGCGTCCGTTTCCCTAATTTCTGTTAATGCCACATTTTTCGGATTTTAGTGCATGAGCTT
  62. TCGTAATTTTAGAATTTGAGACCTATTTCAATACATCATTGAGAGCTTGCTCCGTTACTC
  63. GATCCCCACCAATGAAGCAAGATTCAAGGGCGGGAAACCCCTCCAAACCCTGGGAAACAG
  64. AAACGCTTTACTAACGCTCAACAGCACAAGTCCTAGCTGACGACGACGCGTGTCTTCTTG
  65. TTATATCTAAAAACCACGGATTATGTCTAGACTATGACAGCTTTTACGTGGCGACCGAAA
  66. GGACGACAGTTAACTTTACGGTCGGGGTAGTCAAACGCGTAACTAGCCGACGCACTGGTC
  67. ATTAAAGGGCGCCGAAACATATTGCTAATTTGGGTCTTAGTATATCATCACATGCCCTTG
  68. TTACCACGTTCCGCTTGCGCACTACACCGGTCCCCGAACACTTAAACACTCCCTCGAGGA
  69. TATGTAGTACCTGCCCCCATGGAAGGGGTGTGCCGCGTGTAGTGGCTCTTATCCTTTTTT
  70. A
  71. >Rosalind_2852
  72. ATTATGGGATAACCCAAAATGATCGATGTCGATCGAGTAAATAGCCATGTTTAGAGTGAG
  73. ATCTGCTCATATGTATTTTCAGTCCACGGGTATTTCTACTCCACAGCGTCGGAATCTTCC
  74. GGACCTACAGCGCGGCTCGGTGTGCGCACACGGATAATCCGGGTCAAATAACGCCCTCCG
  75. CACAACGAGAGGCAACTTAAAGGAGCTATGTAGGTGTATTAATCTACGAAGTCTACCACA
  76. CTAACACTACCTTAGGCACGTAGACTGCAGTGGGCAGCTGTTGGCCACATAATATAGGTT
  77. GGGAATGAACATGGCTGTCTACTATAAGATTCGACTTGGCCGACGTGGGGCTCGTATCGG
  78. ATCGAGCTTGTCCCCAAAATGCAGTTTATTCCGAGCGTTCTCCGGTGTGGACACTAACTG
  79. CCAAAGCCTACCACATAGGTCTGCAGGGAGGCATCTTCTCCTGTACAATTCGATTCGGTA
  80. GAAATGAGATGAGTGGAGTCAGCCATCGAGACTGAAGTTCTGTAAAGGGCTGAGTTCAGT
  81. TTATTAACCTAGCGACTGTACGTTCTAAGGATTCGTACGAGTATACCTCCTCCCCGTACG
  82. GGGGTTCATGTTGCGTGGCGTTAGCCCGTTATGATCAGAAGAAGAGCAAACCTTAGCGGG
  83. TATGACGGTCGTGATTGTATCGTGGAAATTTCTATGTGGAGGCTAGTGGATTTGACTGTG
  84. AGCAAGCGAACCCCAACGGGGACCGATTGGTCGTACAGCCTAAGTGGTAGTCTGATCAAG
  85. TGGCCAAAAGCAGATTTAGACCCGCCGACAGGTAGATGACACTAGTTTTTGTGAAGCCCT
  86. ACCGTACTGGCCGTCGGTAAGTGGGAGTTTCATCCCACCTC

编译运行:

  1. b12@PC:~/ROSALIND/Computing_GC_Content$ gcc -Wall ./GCcontent.c -o ./GCcontent
  2. b12@PC:~/ROSALIND/Computing_GC_Content$ ./GCcontent < ./dataset.fa
  3. Rosalind_3691
  4. 49.376301