Implement MotifEnumeration暴力

rosalind.info_problems_ba2a_.png

1、枚举k-mer所有可能+在允许d次错配条件下,某k-mer是否都在每条DNA中

由于引入汉明距离不超过d的条件下,k-mer串可能非某条DNA的子串。例如案例结果中ATA不是任一DNA串的子串。因此我们需要枚举长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图2的k-mer串的所有可能情况,然后将每个k-mer串去检查每条DNA中是否存与该k-mer串的汉明距离≤d的子串

  1. from typing import List
  2. def hammingDistance(s: str, t: str) -> int:
  3. dist = 0
  4. for ch1, ch2 in zip(s, t):
  5. if ch1 != ch2:
  6. dist += 1
  7. return dist
  8. def MotifEnumeration(DNAs: List[str], k: int, d: int) -> List[str]:
  9. from itertools import product
  10. ans = []
  11. for kmer in product('ACGT', repeat=k): # (笛卡尔乘积)所有k-mer串
  12. appearCnt = 0
  13. for dna in DNAs:
  14. # 检查当前dna串中是否存在与kmer的汉明距离≤d的子串
  15. for i in range(len(dna) - k + 1):
  16. # 可优化:不看完整串,当s[,i]与t[,i]的距离>d时直接pass
  17. if hammingDistance(dna[i:i+k], kmer) <= d:
  18. appearCnt += 1
  19. break
  20. # 可优化:如果当前dna串中不存在,可以认定kmer串非答案
  21. if appearCnt == len(DNAs):
  22. ans.append("".join(kmer)) # ('A', 'T', 'A')
  23. return ans
  24. s = """3 1
  25. ATTTGGC
  26. TGCCTTA
  27. CGGTATC
  28. GAAAATT
  29. """
  30. kd, *DNAs = s.splitlines()
  31. k, d = map(int, kd.split())
  32. print(*MotifEnumeration(DNAs, k, d))

复杂度分析:设总DNA串数为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图3,每条DNA平均长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图4

  • 时间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图5
  • 空间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图6

    2、书本上做法

    书上代码思路:
  1. 对每个DNA串的所有长度为k的子串搜索其邻居串放入集合,集合长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图7,时间复杂度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图8
  2. 对所有DNA产生的的集合取公共交集,时间复杂度不超过2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图9 ```python from FrequentWordsWithMismatch import neighbours

def motifEnumeration (dna, k, d): ‘’’ Function to generate all kmers that are present in all strands of dna with at most d mismatches ‘’’

  1. ## Create an empty set for each DNA strand and fill it with all k-mers
  2. ## with at most d mismatches
  3. strand_kmers = [set() for _ in range(len(dna))]
  4. for count,strand in enumerate(dna):
  5. for i in range(len(strand) - k + 1):
  6. kmer = strand[i:i+k]
  7. strand_kmers[count].update(neighbours(kmer,d))
  8. ## Now intersect all sets to end up with only the sequences that occur
  9. ## in all strands
  10. result = strand_kmers[0]
  11. for i in range(1,len(strand_kmers)):
  12. result.intersection_update(strand_kmers[i])
  13. return ' '.join(list(result))

代码来源:https://rosalind.info/problems/ba2a/solutions/#comment-9887

  1. 当![](https://cdn.nlark.com/yuque/__latex/56c1b0cb7a48ccf9520b0adb3c8cb2e8.svg#card=math&code=d&id=JkX3G)较大时,显然此法预处理将会变得非常慢。
  2. <a name="ZGYqg"></a>
  3. # [Find a Median String](https://rosalind.info/problems/ba2b/)逆向思维
  4. ![rosalind.info_problems_ba2b_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1652435932364-f3dfc49c-23e9-4e5b-8bde-da1e428c18da.png#clientId=u346c0bca-61ca-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=uefa5f7e9&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2b_.png&originHeight=1008&originWidth=825&originalType=binary&ratio=1&rotation=0&showTitle=false&size=81806&status=done&style=shadow&taskId=ub289f5a2-2abc-43da-8c85-41b82ddba9a&title=)
  5. <a name="JuyBj"></a>
  6. ## 1、正向暴力:枚举所有Motifs→Consensus
  7. ```python
  8. from typing import List
  9. def hammingDistance(s: str, t: str) -> int:
  10. dist = 0
  11. for ch1, ch2 in zip(s, t):
  12. if ch1 != ch2:
  13. dist += 1
  14. return dist
  15. def medianString(DNAs: List[str], k: int) -> List[str]:
  16. from itertools import product # 笛卡尔乘积
  17. from collections import Counter
  18. # 枚举每个DNA长度为k的所有子串
  19. dnaKmers = []
  20. for dna in DNAs:
  21. kmers = []
  22. for i in range(len(dna) - k + 1):
  23. kmers.append(dna[i:i+k])
  24. dnaKmers.append(kmers)
  25. # 枚举所有Motif
  26. ans, ansScore = set(), float("inf")
  27. tmpConsensus = [''] * k
  28. for motif in product(*dnaKmers): # 每个dna中选一个kmer出来构成Motif
  29. # 统计所有列小写字母(除去出现最多的字母后剩余字母)的个数
  30. # print(motif, type(motif))
  31. # motif: [kmer1, kmer2,..., kmert]
  32. score = 0
  33. for j, cols in enumerate(zip(*motif)): # cols: ('A',.., 'T',.., 'C',.., 'G')
  34. # print(cols)
  35. totCnt, maxCnt = 0, 0
  36. for base, cnt in Counter(cols).items():
  37. if maxCnt < cnt:
  38. maxCnt = cnt
  39. tmpConsensus[j] = base # 更新第j列保守碱基
  40. totCnt += cnt
  41. score += totCnt - maxCnt
  42. # 更新每个motif得分最小值
  43. if ansScore >= score:
  44. if ansScore > score:
  45. ans.clear()
  46. ansScore = score
  47. # tmpConsensus必须转字符串,否则引用修改问题
  48. ans.add("".join(tmpConsensus)) # 可能重复,集合去重
  49. return list(ans)
  50. s = """3
  51. AAATTGACGCAT
  52. GACGACCACGTT
  53. CGTCAGCGCCTG
  54. GCTGAGCACCGG
  55. AGTACGGGACAG
  56. """
  57. k, *DNAs = s.splitlines()
  58. print(*medianString(DNAs, int(k)))

复杂度分析:设总DNAs串数为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图10,每串DNA平均长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图11,pattern串长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图12

  • 时间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图13,枚举每个DNA长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图14的所有子串的时间复杂度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图15,从每条DNA序列中提取一条组成2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图16,因此2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图17总个数为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图18,对于每个2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图19需要2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图20时间检查。
  • 空间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图21,长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图22的所有pattern串可能结果。

    2、逆向暴力:枚举所有Consensus→Motifs

    书本上思路是逆向进行的,枚举Consensus串的所有可能,在所有串中找与DNAs的总汉明距离最小的串。 ```python from typing import List

def hammingDistance(s: str, t: str) -> int: dist = 0 for ch1, ch2 in zip(s, t): if ch1 != ch2: dist += 1 return dist

def medianString(DNAs: List[str], k: int) -> List[str]: from itertools import product # 笛卡尔乘积 ans, ansDist = [], float(“inf”) for pattern in product(‘ACGT’, repeat=k): # 所有k-mer串 totDist = 0 # d(Pattern, Dna) for dna in DNAs:

  1. # 获得当前dna串的所有长度为k的子串与kmer串的距离的最小值 d(Pattern, Text)
  2. minDist = float("inf")
  3. for i in range(len(dna) - k + 1):
  4. minDist = min(minDist, hammingDistance(dna[i:i+k], pattern))
  5. totDist += minDist
  6. # 更新答案:题目虽只输出一条,但是我们记录所有相等答案
  7. if totDist < ansDist:
  8. ans, ansDist = [pattern], totDist
  9. elif totDist == ansDist:
  10. ans.append(pattern)
  11. return ["".join(tup) for tup in ans]

s = “””3 AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTACGGGACAG “”” k, DNAs = s.splitlines() print(medianString(DNAs, int(k)))

  1. **复杂度分析**:设总DNAs串数为![](https://cdn.nlark.com/yuque/__latex/cead1760d9d5723460c4b8d4028f113a.svg#card=math&code=t&id=GdHOE),每串DNA平均长度为![](https://cdn.nlark.com/yuque/__latex/df378375e7693bdcf9535661c023c02e.svg#card=math&code=n&id=aVowC),pattern串长度为![](https://cdn.nlark.com/yuque/__latex/e9d33fd26af9a5f16b4cd386fabb7386.svg#card=math&code=k%20%5Cll%20n&id=lfeEB)
  2. - **时间复杂度**:![](https://cdn.nlark.com/yuque/__latex/c49d2f5846d7ec0ab7b8741821fc834e.svg#card=math&code=O%284%5Ek%5Ccdot%20t%5Ccdot%20nk%29&id=hrs9O),长度为![](https://cdn.nlark.com/yuque/__latex/df976ff7fcf17d60490267d18a1e3996.svg#card=math&code=k&id=wD01a)的所有pattern串可能共有![](https://cdn.nlark.com/yuque/__latex/08c8c2bbe447aeea7bbd83a1299a2ac3.svg#card=math&code=4%5Ek&id=Qmxkb)种,每一种查询![](https://cdn.nlark.com/yuque/__latex/26839e874721c6d453fe3d406af6ac2e.svg#card=math&code=d%28pattern%2C%20DNAs%29&id=pedXv)需要进行![](https://cdn.nlark.com/yuque/__latex/cead1760d9d5723460c4b8d4028f113a.svg#card=math&code=t&id=hI3Oa)次时间复杂度为![](https://cdn.nlark.com/yuque/__latex/399567e5a001ea94d6c4077e4ed5496e.svg#card=math&code=O%28nk%29&id=VLC5J)的![](https://cdn.nlark.com/yuque/__latex/b857e2ea2995ba5271ef0d8ec78bec48.svg#card=math&code=d%28pattern%2Ctext%29&id=hhqHO)查询。
  3. - **空间复杂度**:![](https://cdn.nlark.com/yuque/__latex/61e7991d57cd5eac1309aa4d9977af54.svg#card=math&code=O%284%5Ek%29&id=tK5Cg),长度为![](https://cdn.nlark.com/yuque/__latex/df976ff7fcf17d60490267d18a1e3996.svg#card=math&code=k&id=Rw1VE)的所有pattern串可能结果。
  4. <a name="HbTyy"></a>
  5. # [Find a Profile-most Probable k-mer in a String](https://rosalind.info/problems/ba2c/)概率
  6. ![rosalind.info_problems_ba2c_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1652444256544-e9525f19-3a08-4692-8e23-43b344ddc541.png#clientId=u346c0bca-61ca-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=u0dbc366f&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2c_.png&originHeight=845&originWidth=905&originalType=binary&ratio=1&rotation=0&showTitle=false&size=65101&status=done&style=shadow&taskId=u0202b1cd-5899-4382-9856-f8cac74e647&title=)
  7. <a name="YNEWM"></a>
  8. ## 1、统计text所有长度为k的子串的最大概率
  9. ```python
  10. from typing import List
  11. def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> List[str]:
  12. # print(text, k, profile)
  13. # 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
  14. base2row = dict(zip('ACGT', range(4)))
  15. ans, ansProbable = set(), 0
  16. for i in range(len(text) - k + 1):
  17. # 计算pattern的概率
  18. pattern = text[i:i+k]
  19. pr = 1
  20. for i, base in enumerate(pattern):
  21. pr *= profile[base2row[base]][i]
  22. # 更新最大概率的答案串
  23. if ansProbable <= pr:
  24. if ansProbable < pr:
  25. ans.clear()
  26. ansProbable = pr
  27. ans.add("".join(pattern))
  28. return list(ans)
  29. s = """ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT
  30. 5
  31. 0.2 0.2 0.3 0.2 0.3
  32. 0.4 0.3 0.1 0.5 0.1
  33. 0.3 0.3 0.5 0.2 0.4
  34. 0.1 0.2 0.1 0.1 0.2
  35. """
  36. text, k, *profile = s.splitlines()
  37. profile = [[float(v) for v in line.split()] for line in profile]
  38. print(*profileMostProbableKmer(text, int(k), profile))

复杂度分析:设总text长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图23,pattern串长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图24

  • 时间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图25,text中共有2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图26个长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图27的子串,计算每个长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图28的子串的概率为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图29
  • 空间复杂度2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图30,text中共有2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图31个长度为2、WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS? - 图32的子串。

    Implement GreedyMotifSearch

    rosalind.info_problems_ba2d_.png

    1、已有Motif串中构建Profile,然后找当前串中Profile-most Probable的k-mer子串作为Motif

    ```python from typing import List

def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:

  1. # print(text, k, profile)
  2. # 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
  3. base2row = dict(zip('ACGT', range(4)))
  4. ans, ansProbable = "", -1
  5. for i in range(len(text) - k + 1):
  6. # 计算pattern的概率
  7. pattern = text[i:i+k]
  8. pr = 1
  9. for i, base in enumerate(pattern):
  10. pr *= profile[base2row[base]][i]
  11. # 更新最大概率的答案串
  12. if ansProbable < pr:
  13. ans, ansProbable = pattern, pr
  14. return ans

def MotifsProfile(motifs: List[str]) -> List[float]: “”” Generating Profile from motifs, Profile[0][j]: the probable of A base in postion j Profile[1][j]: the probable of C base in postion j Profile[2][j]: the probable of G base in postion j Profile[3][j]: the probable of T base in postion j “”” from collections import Counter m, n = len(motifs), len(motifs[0]) base2row = dict(zip(‘ACGT’, range(4))) profile = [[0.0] n for _ in range(4)] for j, cols in enumerate(zip(motifs)): for base, cnt in Counter(cols).items(): profile[base2row[base]][j] = cnt / m return profile

def MotifsScore(motifs: List[str]) -> int: “”” SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column. “”” from collections import Counter score = 0 for cols in zip(*motifs): cnt = Counter(cols) upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基 for k, v in cnt.items(): if k != upper_case: score += v

  1. return score

def GreedyMotifSearch(dnas: List[str], k: int, t: int) -> List[str]: n = len(dnas[0]) # 每条DNA等长 bestMotifs, bestScore = [], float(“inf”) # 得分越小越好

  1. # 枚举第一条DNA的所有k-mer子串
  2. for i in range(n - k + 1):
  3. motifs = [dnas[0][i:i+k]]
  4. for j in range(1, t):
  5. # 从已有Motifs中生成Profile
  6. profile = MotifsProfile(motifs)
  7. # print(motifs, profile)
  8. # 第j条DNA的Profile-most probable串作为该条DNA候选Motif
  9. motifs.append(profileMostProbableKmer(dnas[j], k, profile))
  10. # 更新最小Motifs得分
  11. currScore = MotifsScore(motifs)
  12. if currScore < bestScore:
  13. bestMotifs, bestScore = motifs, currScore
  14. # print(bestScore)
  15. return bestMotifs

s = “””3 5 GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG “”” kt, *dnas = s.splitlines()

print(GreedyMotifSearch(dnas, map(int, kt.split())), sep=”\n”)

  1. <a name="J22gb"></a>
  2. ## Explanation
  3. ![rosalind.info_problems_ba2d_explanation_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653051143155-cb43879e-d92d-410a-b32b-02f078c920cc.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=ue7833333&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2d_explanation_.png&originHeight=376&originWidth=598&originalType=binary&ratio=1&rotation=0&showTitle=false&size=29409&status=done&style=shadow&taskId=ub82219c0-d6ef-45e9-aebd-5ef17dc6b13&title=)
  4. <a name="shUM9"></a>
  5. # [Implement GreedyMotifSearch with Pseudocounts](https://rosalind.info/problems/ba2e/)
  6. ![rosalind.info_problems_ba2e_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653053835239-0739bcc7-5b6d-48ef-9356-8e16b53918a7.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=ud09cde6d&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2e_.png&originHeight=845&originWidth=873&originalType=binary&ratio=1&rotation=0&showTitle=false&size=53772&status=done&style=shadow&taskId=u9ada3834-5819-463d-a312-4cc2510b123&title=)
  7. <a name="VZgjk"></a>
  8. ## 1、Laplace's Rule of Succession
  9. 根据拉普拉斯继承法则,让概率为0的事件变为较小值事件。<br />![image.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653054040668-acdd6757-bf8d-4aa8-8eaf-16099b56273e.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=687&id=ubc184d26&margin=%5Bobject%20Object%5D&name=image.png&originHeight=859&originWidth=1099&originalType=binary&ratio=1&rotation=0&showTitle=false&size=133755&status=done&style=shadow&taskId=u902bf25e-6066-4a08-9132-fffc5dabb5b&title=&width=879.2)
  10. ```python
  11. from typing import List
  12. def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:
  13. # print(text, k, profile)
  14. # 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
  15. base2row = dict(zip('ACGT', range(4)))
  16. ans, ansProbable = "", -1
  17. for i in range(len(text) - k + 1):
  18. # 计算pattern的概率
  19. pattern = text[i:i+k]
  20. pr = 1
  21. for i, base in enumerate(pattern):
  22. pr *= profile[base2row[base]][i]
  23. # 更新最大概率的答案串
  24. if ansProbable < pr:
  25. ans, ansProbable = pattern, pr
  26. return ans
  27. def MotifsProfile(motifs: List[str]) -> List[float]:
  28. """
  29. Generating Profile from motifs,
  30. Profile[0][j]: the probable of A base in postion j
  31. Profile[1][j]: the probable of C base in postion j
  32. Profile[2][j]: the probable of G base in postion j
  33. Profile[3][j]: the probable of T base in postion j
  34. """
  35. from collections import Counter
  36. m, n = len(motifs), len(motifs[0])
  37. base2row = dict(zip('ACGT', range(4)))
  38. profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
  39. for j, cols in enumerate(zip(*motifs)):
  40. for base, cnt in Counter(cols).items():
  41. # profile[base2row[base]][j] = cnt / m
  42. profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
  43. return profile
  44. def MotifsScore(motifs: List[str]) -> int:
  45. """
  46. SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
  47. """
  48. from collections import Counter
  49. score = 0
  50. for cols in zip(*motifs):
  51. cnt = Counter(cols)
  52. upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
  53. for k, v in cnt.items():
  54. if k != upper_case:
  55. score += v
  56. return score
  57. def GreedyMotifSearch(dnas: List[str], k: int, t: int) -> List[str]:
  58. n = len(dnas[0]) # 每条DNA等长
  59. bestMotifs, bestScore = [], float("inf") # 得分越小越好
  60. # 枚举第一条DNA的所有k-mer子串
  61. for i in range(n - k + 1):
  62. motifs = [dnas[0][i:i+k]]
  63. for j in range(1, t):
  64. # 从已有Motifs中生成Profile
  65. profile = MotifsProfile(motifs)
  66. # print(motifs, profile)
  67. # 第j条DNA的Profile-most probable串作为该条DNA候选Motif
  68. motifs.append(profileMostProbableKmer(dnas[j], k, profile))
  69. # 更新最小Motifs得分
  70. currScore = MotifsScore(motifs)
  71. if currScore < bestScore:
  72. # print(currScore, motifs)
  73. bestMotifs, bestScore = motifs, currScore
  74. # print(bestScore)
  75. return bestMotifs
  76. s = """3 5
  77. GGCGTTCAGGCA
  78. AAGAATCAGTCA
  79. CAAGGAGTTCGC
  80. CACGTCAATCAC
  81. CAATAATATTCG
  82. """
  83. kt, *dnas = s.splitlines()
  84. print(*GreedyMotifSearch(dnas, *map(int, kt.split())), sep="\n")

Run-ningGREEDYMOTIFSEARCHwith pseudocounts to solve the Subtle Motif Problem returns a collection of 15-mersMotifswithSCORE(Motifs)=41 andCONSENSUS(Motifs) =AAAAAtAgaGGGGtt. Thus, Laplace’s Rule of Succession has provided a great im-provement over the originalGREEDYMOTIFSEARCH, which returned the consensus stringgTtAAAtAgaGatGtGwith SCORE(Motifs)=58.
运行NingGreedyMotionSearchwith pseudocounts来解决微妙的Motif问题,返回15个mersMotifswithSCORE(Motif)=41和Consensus(Motif)=AAAAA Tagaggtt的集合。因此,拉普拉斯的继承规则大大改进了原始的GreedyMotionSearch,该搜索返回了一致的StringtTaaaTagagatgtg,分数(Motions)=58。

Implement RandomizedMotifSearch

rosalind.info_problems_ba2f_.png

1、1000次随机函数调用的最优解

随机函数就是书本上代码

  1. from typing import List
  2. from random import randint
  3. def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:
  4. # print(text, k, profile)
  5. # 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
  6. base2row = dict(zip('ACGT', range(4)))
  7. ans, ansProbable = "", -1
  8. for i in range(len(text) - k + 1):
  9. # 计算pattern的概率
  10. pattern = text[i:i+k]
  11. pr = 1
  12. for i, base in enumerate(pattern):
  13. pr *= profile[base2row[base]][i]
  14. # 更新最大概率的答案串
  15. if ansProbable < pr:
  16. ans, ansProbable = pattern, pr
  17. return ans
  18. def MotifsProfile(motifs: List[str]) -> List[float]:
  19. """
  20. Generating Profile from motifs,
  21. Profile[0][j]: the probable of A base in postion j
  22. Profile[1][j]: the probable of C base in postion j
  23. Profile[2][j]: the probable of G base in postion j
  24. Profile[3][j]: the probable of T base in postion j
  25. """
  26. from collections import Counter
  27. m, n = len(motifs), len(motifs[0])
  28. base2row = dict(zip('ACGT', range(4)))
  29. profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
  30. for j, cols in enumerate(zip(*motifs)):
  31. for base, cnt in Counter(cols).items():
  32. # profile[base2row[base]][j] = cnt / m
  33. profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
  34. return profile
  35. def MotifsScore(motifs: List[str]) -> int:
  36. """
  37. SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
  38. """
  39. from collections import Counter
  40. score = 0
  41. for cols in zip(*motifs):
  42. cnt = Counter(cols)
  43. upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
  44. for k, v in cnt.items():
  45. if k != upper_case:
  46. score += v
  47. return score
  48. def RandomizedMotifSearch(dnas: List[str], k: int, t: int) -> List[str]:
  49. n = len(dnas[0]) # 每条DNA等长
  50. # 随机生成第一组Motifs
  51. motifs = []
  52. for i in range(t):
  53. beg = randint(0, n - k - 1) # [0,n-k-1]
  54. motifs.append(dnas[i][beg:beg+k]) # 该DNA的随机k-mer子串
  55. bestMotifs, bestScore = motifs, MotifsScore(motifs) # 当前Motifs当作最优解
  56. # 一直迭代找第一个“山谷”
  57. while True:
  58. # 从已有Motifs中生成Profile
  59. profile = MotifsProfile(motifs)
  60. # 在t条DNA中找到Profile-most probable串并组成新的Motifs
  61. motifs = [profileMostProbableKmer(dna, k, profile) for dna in dnas]
  62. # 更新最小Motifs得分
  63. currScore = MotifsScore(motifs)
  64. if currScore < bestScore:
  65. # print(currScore, motifs)
  66. bestMotifs, bestScore = motifs, currScore
  67. else:
  68. return bestMotifs # 书本上让证明一定存在“山谷”,即不会死循环
  69. s = """8 5
  70. CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA
  71. GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
  72. TAGTACCGAGACCGAAAGAAGTATACAGGCGT
  73. TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
  74. AATCCACCAGCTCCACGTGCAATGTTGGCCTA
  75. """
  76. kt, *dnas = s.splitlines()
  77. # 题目要求运行RandomizedMotifSearch函数1000次,可以使用并行实现
  78. bestMotifs, bestScore = [], float("inf")
  79. for _ in range(1000):
  80. motifs = RandomizedMotifSearch(dnas, *map(int, kt.split()))
  81. currScore = MotifsScore(motifs)
  82. if currScore < bestScore:
  83. # print(currScore, motifs)
  84. bestMotifs, bestScore = motifs, currScore
  85. print(*bestMotifs, sep="\n")

Implement GibbsSampler概率

rosalind.info_problems_ba2g_.png

1、吉布斯生成器【不等概率】

Actually there is no real problem. The GibbsSampler can only find a local optimum, then stabilises there after 100 - 200 runs. So you simply have to run the GibbsSampler multiple times to find the best scoring one of the local optima. It looks like you do this for 20 runs. I guess that is a little low. You run 20 2000 times. So 200 200 should bring better results within the same time.

  1. from typing import List
  2. from random import randint, uniform
  3. def profileRandomKmer(text: str, k: int, profile: List[List[float]]) -> str:
  4. # 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
  5. # 生成非均匀分布(p1,p2,..,pk),根据几何概型得到每个kmer出现概率
  6. # 将所有概率同时除去C=sum(p)得到均匀分布:(p1/C,p2/C,..,pk/C)
  7. # 现在随机生成0~1内的数,看它落在均匀分布的哪个区间,即第几条kmer
  8. base2row = dict(zip('ACGT', range(4)))
  9. ans, ansProbable = "", -1
  10. probabilities = []
  11. for i in range(len(text) - k + 1):
  12. # 计算pattern的概率(由于存在浮点数,无法使用滑窗优化)
  13. probability = 1.0
  14. for j in range(k):
  15. probability *= profile[base2row[text[i+j]]][j]
  16. probabilities.append(probability)
  17. # 随机生成0~1的浮点数,判断其落在(p1/C,p2/C,..,pk/C)均匀分布的哪个位置,即第几条kmer
  18. x = uniform(0, sum(probabilities))
  19. currP = 0
  20. for i, p in enumerate(probabilities):
  21. if currP + p >= x: # currP+p/C >= x
  22. return text[i:i+k]
  23. currP += p
  24. def MotifsProfile(motifs: List[str]) -> List[float]:
  25. """
  26. Generating Profile from motifs,
  27. Profile[0][j]: the probable of A base in postion j
  28. Profile[1][j]: the probable of C base in postion j
  29. Profile[2][j]: the probable of G base in postion j
  30. Profile[3][j]: the probable of T base in postion j
  31. """
  32. from collections import Counter
  33. m, n = len(motifs), len(motifs[0])
  34. base2row = dict(zip('ACGT', range(4)))
  35. profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
  36. for j, cols in enumerate(zip(*motifs)):
  37. for base, cnt in Counter(cols).items():
  38. # profile[base2row[base]][j] = cnt / m
  39. profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
  40. return profile
  41. def MotifsScore(motifs: List[str]) -> int:
  42. """
  43. SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
  44. """
  45. from collections import Counter
  46. score = 0
  47. for cols in zip(*motifs):
  48. cnt = Counter(cols)
  49. upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
  50. for k, v in cnt.items():
  51. if k != upper_case:
  52. score += v
  53. return score
  54. def GibbsSampler(dnas: List[str], k: int, t: int, N: int) -> List[str]:
  55. n = len(dnas[0]) # 每条DNA等长
  56. # 随机生成第一组Motifs
  57. motifs = []
  58. for i in range(t):
  59. beg = randint(0, n - k - 1) # [0,n-k-1]
  60. motifs.append(dnas[i][beg:beg+k]) # 该DNA的随机k-mer子串
  61. bestMotifs, bestScore = motifs, MotifsScore(motifs) # 当前Motifs当作最优解
  62. # 进行N轮吉布斯抽样
  63. for _ in range(N):
  64. i = randint(0, t - 1)
  65. motifs.pop(i) # 删除第i条序列
  66. profile = MotifsProfile(motifs) # 更具另外t-1条生成Profile
  67. motifs.insert(i, profileRandomKmer(dnas[i], k, profile))
  68. currScore = MotifsScore(motifs)
  69. if currScore < bestScore:
  70. bestMotifs, bestScore = motifs, currScore
  71. return bestMotifs
  72. s = """8 5 100
  73. CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA
  74. GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
  75. TAGTACCGAGACCGAAAGAAGTATACAGGCGT
  76. TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
  77. AATCCACCAGCTCCACGTGCAATGTTGGCCTA
  78. """
  79. ktN, *dnas = s.splitlines()
  80. # 题目要求运行GibbsSampler函数20次,可以使用并行实现
  81. bestMotifs, bestScore = [], float("inf")
  82. for _ in range(50): # 但是20次对于案例依旧可能输出其它较大解
  83. motifs = GibbsSampler(dnas, *map(int, ktN.split()))
  84. currScore = MotifsScore(motifs)
  85. if currScore < bestScore:
  86. # print(currScore, motifs)
  87. bestMotifs, bestScore = motifs, currScore
  88. print(bestScore, *bestMotifs, sep="\n")

Explanation

rosalind.info_problems_ba2g_explanation_ (4).png

Implement DistanceBetweenPatternAndStrings

rosalind.info_problems_ba2h_.png

1、模拟

  1. from typing import List
  2. def hammingDistance(s: str, t: str) -> int:
  3. dist = 0
  4. for ch1, ch2 in zip(s, t):
  5. if ch1 != ch2:
  6. dist += 1
  7. return dist
  8. def DistanceBetweenPatternAndStrings(pattern: str, dnas: List[str]):
  9. k = len(pattern)
  10. totDist = 0
  11. for dna in dnas:
  12. minDist = float("inf")
  13. for i in range(len(dna) - k + 1):
  14. dist = hammingDistance(dna[i:i+k], pattern)
  15. if dist < minDist:
  16. minDist = dist
  17. totDist += minDist
  18. return totDist
  19. s = """AAA
  20. TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT
  21. """
  22. pattern, dnas = s.splitlines()
  23. print(DistanceBetweenPatternAndStrings(pattern, dnas.split()), sep="\n")