- Implement MotifEnumeration暴力">Implement MotifEnumeration暴力
- https://rosalind.info/problems/ba2a/solutions/#comment-9887">代码来源:https://rosalind.info/problems/ba2a/solutions/#comment-9887
- Implement GreedyMotifSearch">Implement GreedyMotifSearch
- Implement RandomizedMotifSearch">Implement RandomizedMotifSearch
- Implement GibbsSampler概率">Implement GibbsSampler概率
- Implement DistanceBetweenPatternAndStrings">Implement DistanceBetweenPatternAndStrings
Implement MotifEnumeration暴力
1、枚举k-mer所有可能+在允许d次错配条件下,某k-mer是否都在每条DNA中
由于引入汉明距离不超过d的条件下,k-mer串可能非某条DNA的子串。例如案例结果中ATA
不是任一DNA串的子串。因此我们需要枚举长度为的k-mer串的所有可能情况,然后将每个k-mer串去检查每条DNA中是否存与该k-mer串的汉明距离≤d的子串。
from typing import List
def hammingDistance(s: str, t: str) -> int:
dist = 0
for ch1, ch2 in zip(s, t):
if ch1 != ch2:
dist += 1
return dist
def MotifEnumeration(DNAs: List[str], k: int, d: int) -> List[str]:
from itertools import product
ans = []
for kmer in product('ACGT', repeat=k): # (笛卡尔乘积)所有k-mer串
appearCnt = 0
for dna in DNAs:
# 检查当前dna串中是否存在与kmer的汉明距离≤d的子串
for i in range(len(dna) - k + 1):
# 可优化:不看完整串,当s[,i]与t[,i]的距离>d时直接pass
if hammingDistance(dna[i:i+k], kmer) <= d:
appearCnt += 1
break
# 可优化:如果当前dna串中不存在,可以认定kmer串非答案
if appearCnt == len(DNAs):
ans.append("".join(kmer)) # ('A', 'T', 'A')
return ans
s = """3 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT
"""
kd, *DNAs = s.splitlines()
k, d = map(int, kd.split())
print(*MotifEnumeration(DNAs, k, d))
复杂度分析:设总DNA串数为,每条DNA平均长度为
- 对每个DNA串的所有长度为k的子串搜索其邻居串放入集合,集合长度为,时间复杂度为
- 对所有DNA产生的的集合取公共交集,时间复杂度不超过 ```python from FrequentWordsWithMismatch import neighbours
def motifEnumeration (dna, k, d): ‘’’ Function to generate all kmers that are present in all strands of dna with at most d mismatches ‘’’
## Create an empty set for each DNA strand and fill it with all k-mers
## with at most d mismatches
strand_kmers = [set() for _ in range(len(dna))]
for count,strand in enumerate(dna):
for i in range(len(strand) - k + 1):
kmer = strand[i:i+k]
strand_kmers[count].update(neighbours(kmer,d))
## Now intersect all sets to end up with only the sequences that occur
## in all strands
result = strand_kmers[0]
for i in range(1,len(strand_kmers)):
result.intersection_update(strand_kmers[i])
return ' '.join(list(result))
代码来源:https://rosalind.info/problems/ba2a/solutions/#comment-9887
当![](https://cdn.nlark.com/yuque/__latex/56c1b0cb7a48ccf9520b0adb3c8cb2e8.svg#card=math&code=d&id=JkX3G)较大时,显然此法预处理将会变得非常慢。
<a name="ZGYqg"></a>
# [Find a Median String](https://rosalind.info/problems/ba2b/)逆向思维
![rosalind.info_problems_ba2b_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1652435932364-f3dfc49c-23e9-4e5b-8bde-da1e428c18da.png#clientId=u346c0bca-61ca-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=uefa5f7e9&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2b_.png&originHeight=1008&originWidth=825&originalType=binary&ratio=1&rotation=0&showTitle=false&size=81806&status=done&style=shadow&taskId=ub289f5a2-2abc-43da-8c85-41b82ddba9a&title=)
<a name="JuyBj"></a>
## 1、正向暴力:枚举所有Motifs→Consensus
```python
from typing import List
def hammingDistance(s: str, t: str) -> int:
dist = 0
for ch1, ch2 in zip(s, t):
if ch1 != ch2:
dist += 1
return dist
def medianString(DNAs: List[str], k: int) -> List[str]:
from itertools import product # 笛卡尔乘积
from collections import Counter
# 枚举每个DNA长度为k的所有子串
dnaKmers = []
for dna in DNAs:
kmers = []
for i in range(len(dna) - k + 1):
kmers.append(dna[i:i+k])
dnaKmers.append(kmers)
# 枚举所有Motif
ans, ansScore = set(), float("inf")
tmpConsensus = [''] * k
for motif in product(*dnaKmers): # 每个dna中选一个kmer出来构成Motif
# 统计所有列小写字母(除去出现最多的字母后剩余字母)的个数
# print(motif, type(motif))
# motif: [kmer1, kmer2,..., kmert]
score = 0
for j, cols in enumerate(zip(*motif)): # cols: ('A',.., 'T',.., 'C',.., 'G')
# print(cols)
totCnt, maxCnt = 0, 0
for base, cnt in Counter(cols).items():
if maxCnt < cnt:
maxCnt = cnt
tmpConsensus[j] = base # 更新第j列保守碱基
totCnt += cnt
score += totCnt - maxCnt
# 更新每个motif得分最小值
if ansScore >= score:
if ansScore > score:
ans.clear()
ansScore = score
# tmpConsensus必须转字符串,否则引用修改问题
ans.add("".join(tmpConsensus)) # 可能重复,集合去重
return list(ans)
s = """3
AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTACGGGACAG
"""
k, *DNAs = s.splitlines()
print(*medianString(DNAs, int(k)))
复杂度分析:设总DNAs串数为,每串DNA平均长度为,pattern串长度为
- 时间复杂度:,枚举每个DNA长度为的所有子串的时间复杂度为,从每条DNA序列中提取一条组成,因此总个数为,对于每个需要时间检查。
- 空间复杂度:,长度为的所有pattern串可能结果。
2、逆向暴力:枚举所有Consensus→Motifs
书本上思路是逆向进行的,枚举Consensus串的所有可能,在所有串中找与DNAs的总汉明距离最小的串。 ```python from typing import List
def hammingDistance(s: str, t: str) -> int: dist = 0 for ch1, ch2 in zip(s, t): if ch1 != ch2: dist += 1 return dist
def medianString(DNAs: List[str], k: int) -> List[str]: from itertools import product # 笛卡尔乘积 ans, ansDist = [], float(“inf”) for pattern in product(‘ACGT’, repeat=k): # 所有k-mer串 totDist = 0 # d(Pattern, Dna) for dna in DNAs:
# 获得当前dna串的所有长度为k的子串与kmer串的距离的最小值 d(Pattern, Text)
minDist = float("inf")
for i in range(len(dna) - k + 1):
minDist = min(minDist, hammingDistance(dna[i:i+k], pattern))
totDist += minDist
# 更新答案:题目虽只输出一条,但是我们记录所有相等答案
if totDist < ansDist:
ans, ansDist = [pattern], totDist
elif totDist == ansDist:
ans.append(pattern)
return ["".join(tup) for tup in ans]
s = “””3 AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTACGGGACAG “”” k, DNAs = s.splitlines() print(medianString(DNAs, int(k)))
**复杂度分析**:设总DNAs串数为![](https://cdn.nlark.com/yuque/__latex/cead1760d9d5723460c4b8d4028f113a.svg#card=math&code=t&id=GdHOE),每串DNA平均长度为![](https://cdn.nlark.com/yuque/__latex/df378375e7693bdcf9535661c023c02e.svg#card=math&code=n&id=aVowC),pattern串长度为![](https://cdn.nlark.com/yuque/__latex/e9d33fd26af9a5f16b4cd386fabb7386.svg#card=math&code=k%20%5Cll%20n&id=lfeEB)
- **时间复杂度**:![](https://cdn.nlark.com/yuque/__latex/c49d2f5846d7ec0ab7b8741821fc834e.svg#card=math&code=O%284%5Ek%5Ccdot%20t%5Ccdot%20nk%29&id=hrs9O),长度为![](https://cdn.nlark.com/yuque/__latex/df976ff7fcf17d60490267d18a1e3996.svg#card=math&code=k&id=wD01a)的所有pattern串可能共有![](https://cdn.nlark.com/yuque/__latex/08c8c2bbe447aeea7bbd83a1299a2ac3.svg#card=math&code=4%5Ek&id=Qmxkb)种,每一种查询![](https://cdn.nlark.com/yuque/__latex/26839e874721c6d453fe3d406af6ac2e.svg#card=math&code=d%28pattern%2C%20DNAs%29&id=pedXv)需要进行![](https://cdn.nlark.com/yuque/__latex/cead1760d9d5723460c4b8d4028f113a.svg#card=math&code=t&id=hI3Oa)次时间复杂度为![](https://cdn.nlark.com/yuque/__latex/399567e5a001ea94d6c4077e4ed5496e.svg#card=math&code=O%28nk%29&id=VLC5J)的![](https://cdn.nlark.com/yuque/__latex/b857e2ea2995ba5271ef0d8ec78bec48.svg#card=math&code=d%28pattern%2Ctext%29&id=hhqHO)查询。
- **空间复杂度**:![](https://cdn.nlark.com/yuque/__latex/61e7991d57cd5eac1309aa4d9977af54.svg#card=math&code=O%284%5Ek%29&id=tK5Cg),长度为![](https://cdn.nlark.com/yuque/__latex/df976ff7fcf17d60490267d18a1e3996.svg#card=math&code=k&id=Rw1VE)的所有pattern串可能结果。
<a name="HbTyy"></a>
# [Find a Profile-most Probable k-mer in a String](https://rosalind.info/problems/ba2c/)概率
![rosalind.info_problems_ba2c_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1652444256544-e9525f19-3a08-4692-8e23-43b344ddc541.png#clientId=u346c0bca-61ca-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=u0dbc366f&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2c_.png&originHeight=845&originWidth=905&originalType=binary&ratio=1&rotation=0&showTitle=false&size=65101&status=done&style=shadow&taskId=u0202b1cd-5899-4382-9856-f8cac74e647&title=)
<a name="YNEWM"></a>
## 1、统计text所有长度为k的子串的最大概率
```python
from typing import List
def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> List[str]:
# print(text, k, profile)
# 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
base2row = dict(zip('ACGT', range(4)))
ans, ansProbable = set(), 0
for i in range(len(text) - k + 1):
# 计算pattern的概率
pattern = text[i:i+k]
pr = 1
for i, base in enumerate(pattern):
pr *= profile[base2row[base]][i]
# 更新最大概率的答案串
if ansProbable <= pr:
if ansProbable < pr:
ans.clear()
ansProbable = pr
ans.add("".join(pattern))
return list(ans)
s = """ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT
5
0.2 0.2 0.3 0.2 0.3
0.4 0.3 0.1 0.5 0.1
0.3 0.3 0.5 0.2 0.4
0.1 0.2 0.1 0.1 0.2
"""
text, k, *profile = s.splitlines()
profile = [[float(v) for v in line.split()] for line in profile]
print(*profileMostProbableKmer(text, int(k), profile))
复杂度分析:设总text长度为,pattern串长度为
- 时间复杂度:,text中共有个长度为的子串,计算每个长度为的子串的概率为
- 空间复杂度:,text中共有个长度为的子串。
Implement GreedyMotifSearch
1、已有Motif串中构建Profile,然后找当前串中Profile-most Probable的k-mer子串作为Motif
```python from typing import List
def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:
# print(text, k, profile)
# 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
base2row = dict(zip('ACGT', range(4)))
ans, ansProbable = "", -1
for i in range(len(text) - k + 1):
# 计算pattern的概率
pattern = text[i:i+k]
pr = 1
for i, base in enumerate(pattern):
pr *= profile[base2row[base]][i]
# 更新最大概率的答案串
if ansProbable < pr:
ans, ansProbable = pattern, pr
return ans
def MotifsProfile(motifs: List[str]) -> List[float]: “”” Generating Profile from motifs, Profile[0][j]: the probable of A base in postion j Profile[1][j]: the probable of C base in postion j Profile[2][j]: the probable of G base in postion j Profile[3][j]: the probable of T base in postion j “”” from collections import Counter m, n = len(motifs), len(motifs[0]) base2row = dict(zip(‘ACGT’, range(4))) profile = [[0.0] n for _ in range(4)] for j, cols in enumerate(zip(motifs)): for base, cnt in Counter(cols).items(): profile[base2row[base]][j] = cnt / m return profile
def MotifsScore(motifs: List[str]) -> int: “”” SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column. “”” from collections import Counter score = 0 for cols in zip(*motifs): cnt = Counter(cols) upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基 for k, v in cnt.items(): if k != upper_case: score += v
return score
def GreedyMotifSearch(dnas: List[str], k: int, t: int) -> List[str]: n = len(dnas[0]) # 每条DNA等长 bestMotifs, bestScore = [], float(“inf”) # 得分越小越好
# 枚举第一条DNA的所有k-mer子串
for i in range(n - k + 1):
motifs = [dnas[0][i:i+k]]
for j in range(1, t):
# 从已有Motifs中生成Profile
profile = MotifsProfile(motifs)
# print(motifs, profile)
# 第j条DNA的Profile-most probable串作为该条DNA候选Motif
motifs.append(profileMostProbableKmer(dnas[j], k, profile))
# 更新最小Motifs得分
currScore = MotifsScore(motifs)
if currScore < bestScore:
bestMotifs, bestScore = motifs, currScore
# print(bestScore)
return bestMotifs
s = “””3 5 GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG “”” kt, *dnas = s.splitlines()
print(GreedyMotifSearch(dnas, map(int, kt.split())), sep=”\n”)
<a name="J22gb"></a>
## Explanation
![rosalind.info_problems_ba2d_explanation_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653051143155-cb43879e-d92d-410a-b32b-02f078c920cc.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=ue7833333&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2d_explanation_.png&originHeight=376&originWidth=598&originalType=binary&ratio=1&rotation=0&showTitle=false&size=29409&status=done&style=shadow&taskId=ub82219c0-d6ef-45e9-aebd-5ef17dc6b13&title=)
<a name="shUM9"></a>
# [Implement GreedyMotifSearch with Pseudocounts](https://rosalind.info/problems/ba2e/)
![rosalind.info_problems_ba2e_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653053835239-0739bcc7-5b6d-48ef-9356-8e16b53918a7.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=ud09cde6d&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba2e_.png&originHeight=845&originWidth=873&originalType=binary&ratio=1&rotation=0&showTitle=false&size=53772&status=done&style=shadow&taskId=u9ada3834-5819-463d-a312-4cc2510b123&title=)
<a name="VZgjk"></a>
## 1、Laplace's Rule of Succession
根据拉普拉斯继承法则,让概率为0的事件变为较小值事件。<br />![image.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1653054040668-acdd6757-bf8d-4aa8-8eaf-16099b56273e.png#clientId=u3fd1400d-dc36-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=687&id=ubc184d26&margin=%5Bobject%20Object%5D&name=image.png&originHeight=859&originWidth=1099&originalType=binary&ratio=1&rotation=0&showTitle=false&size=133755&status=done&style=shadow&taskId=u902bf25e-6066-4a08-9132-fffc5dabb5b&title=&width=879.2)
```python
from typing import List
def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:
# print(text, k, profile)
# 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
base2row = dict(zip('ACGT', range(4)))
ans, ansProbable = "", -1
for i in range(len(text) - k + 1):
# 计算pattern的概率
pattern = text[i:i+k]
pr = 1
for i, base in enumerate(pattern):
pr *= profile[base2row[base]][i]
# 更新最大概率的答案串
if ansProbable < pr:
ans, ansProbable = pattern, pr
return ans
def MotifsProfile(motifs: List[str]) -> List[float]:
"""
Generating Profile from motifs,
Profile[0][j]: the probable of A base in postion j
Profile[1][j]: the probable of C base in postion j
Profile[2][j]: the probable of G base in postion j
Profile[3][j]: the probable of T base in postion j
"""
from collections import Counter
m, n = len(motifs), len(motifs[0])
base2row = dict(zip('ACGT', range(4)))
profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
for j, cols in enumerate(zip(*motifs)):
for base, cnt in Counter(cols).items():
# profile[base2row[base]][j] = cnt / m
profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
return profile
def MotifsScore(motifs: List[str]) -> int:
"""
SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
"""
from collections import Counter
score = 0
for cols in zip(*motifs):
cnt = Counter(cols)
upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
for k, v in cnt.items():
if k != upper_case:
score += v
return score
def GreedyMotifSearch(dnas: List[str], k: int, t: int) -> List[str]:
n = len(dnas[0]) # 每条DNA等长
bestMotifs, bestScore = [], float("inf") # 得分越小越好
# 枚举第一条DNA的所有k-mer子串
for i in range(n - k + 1):
motifs = [dnas[0][i:i+k]]
for j in range(1, t):
# 从已有Motifs中生成Profile
profile = MotifsProfile(motifs)
# print(motifs, profile)
# 第j条DNA的Profile-most probable串作为该条DNA候选Motif
motifs.append(profileMostProbableKmer(dnas[j], k, profile))
# 更新最小Motifs得分
currScore = MotifsScore(motifs)
if currScore < bestScore:
# print(currScore, motifs)
bestMotifs, bestScore = motifs, currScore
# print(bestScore)
return bestMotifs
s = """3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG
"""
kt, *dnas = s.splitlines()
print(*GreedyMotifSearch(dnas, *map(int, kt.split())), sep="\n")
Run-ningGREEDYMOTIFSEARCHwith pseudocounts to solve the Subtle Motif Problem returns a collection of 15-mersMotifswithSCORE(Motifs)=41 andCONSENSUS(Motifs) =AAAAAtAgaGGGGtt. Thus, Laplace’s Rule of Succession has provided a great im-provement over the originalGREEDYMOTIFSEARCH, which returned the consensus stringgTtAAAtAgaGatGtGwith SCORE(Motifs)=58.
运行NingGreedyMotionSearchwith pseudocounts来解决微妙的Motif问题,返回15个mersMotifswithSCORE(Motif)=41和Consensus(Motif)=AAAAA Tagaggtt的集合。因此,拉普拉斯的继承规则大大改进了原始的GreedyMotionSearch,该搜索返回了一致的StringtTaaaTagagatgtg,分数(Motions)=58。
Implement RandomizedMotifSearch
1、1000次随机函数调用的最优解
随机函数就是书本上代码
from typing import List
from random import randint
def profileMostProbableKmer(text: str, k: int, profile: List[List[float]]) -> str:
# print(text, k, profile)
# 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
base2row = dict(zip('ACGT', range(4)))
ans, ansProbable = "", -1
for i in range(len(text) - k + 1):
# 计算pattern的概率
pattern = text[i:i+k]
pr = 1
for i, base in enumerate(pattern):
pr *= profile[base2row[base]][i]
# 更新最大概率的答案串
if ansProbable < pr:
ans, ansProbable = pattern, pr
return ans
def MotifsProfile(motifs: List[str]) -> List[float]:
"""
Generating Profile from motifs,
Profile[0][j]: the probable of A base in postion j
Profile[1][j]: the probable of C base in postion j
Profile[2][j]: the probable of G base in postion j
Profile[3][j]: the probable of T base in postion j
"""
from collections import Counter
m, n = len(motifs), len(motifs[0])
base2row = dict(zip('ACGT', range(4)))
profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
for j, cols in enumerate(zip(*motifs)):
for base, cnt in Counter(cols).items():
# profile[base2row[base]][j] = cnt / m
profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
return profile
def MotifsScore(motifs: List[str]) -> int:
"""
SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
"""
from collections import Counter
score = 0
for cols in zip(*motifs):
cnt = Counter(cols)
upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
for k, v in cnt.items():
if k != upper_case:
score += v
return score
def RandomizedMotifSearch(dnas: List[str], k: int, t: int) -> List[str]:
n = len(dnas[0]) # 每条DNA等长
# 随机生成第一组Motifs
motifs = []
for i in range(t):
beg = randint(0, n - k - 1) # [0,n-k-1]
motifs.append(dnas[i][beg:beg+k]) # 该DNA的随机k-mer子串
bestMotifs, bestScore = motifs, MotifsScore(motifs) # 当前Motifs当作最优解
# 一直迭代找第一个“山谷”
while True:
# 从已有Motifs中生成Profile
profile = MotifsProfile(motifs)
# 在t条DNA中找到Profile-most probable串并组成新的Motifs
motifs = [profileMostProbableKmer(dna, k, profile) for dna in dnas]
# 更新最小Motifs得分
currScore = MotifsScore(motifs)
if currScore < bestScore:
# print(currScore, motifs)
bestMotifs, bestScore = motifs, currScore
else:
return bestMotifs # 书本上让证明一定存在“山谷”,即不会死循环
s = """8 5
CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
"""
kt, *dnas = s.splitlines()
# 题目要求运行RandomizedMotifSearch函数1000次,可以使用并行实现
bestMotifs, bestScore = [], float("inf")
for _ in range(1000):
motifs = RandomizedMotifSearch(dnas, *map(int, kt.split()))
currScore = MotifsScore(motifs)
if currScore < bestScore:
# print(currScore, motifs)
bestMotifs, bestScore = motifs, currScore
print(*bestMotifs, sep="\n")
Implement GibbsSampler概率
1、吉布斯生成器【不等概率】
Actually there is no real problem. The GibbsSampler can only find a local optimum, then stabilises there after 100 - 200 runs. So you simply have to run the GibbsSampler multiple times to find the best scoring one of the local optima. It looks like you do this for 20 runs. I guess that is a little low. You run 20 2000 times. So 200 200 should bring better results within the same time.
from typing import List
from random import randint, uniform
def profileRandomKmer(text: str, k: int, profile: List[List[float]]) -> str:
# 输入profile:4 * k,4行对应每列 ACGT 碱基的概率
# 生成非均匀分布(p1,p2,..,pk),根据几何概型得到每个kmer出现概率
# 将所有概率同时除去C=sum(p)得到均匀分布:(p1/C,p2/C,..,pk/C)
# 现在随机生成0~1内的数,看它落在均匀分布的哪个区间,即第几条kmer
base2row = dict(zip('ACGT', range(4)))
ans, ansProbable = "", -1
probabilities = []
for i in range(len(text) - k + 1):
# 计算pattern的概率(由于存在浮点数,无法使用滑窗优化)
probability = 1.0
for j in range(k):
probability *= profile[base2row[text[i+j]]][j]
probabilities.append(probability)
# 随机生成0~1的浮点数,判断其落在(p1/C,p2/C,..,pk/C)均匀分布的哪个位置,即第几条kmer
x = uniform(0, sum(probabilities))
currP = 0
for i, p in enumerate(probabilities):
if currP + p >= x: # currP+p/C >= x
return text[i:i+k]
currP += p
def MotifsProfile(motifs: List[str]) -> List[float]:
"""
Generating Profile from motifs,
Profile[0][j]: the probable of A base in postion j
Profile[1][j]: the probable of C base in postion j
Profile[2][j]: the probable of G base in postion j
Profile[3][j]: the probable of T base in postion j
"""
from collections import Counter
m, n = len(motifs), len(motifs[0])
base2row = dict(zip('ACGT', range(4)))
profile = [[1 / (m + m)] * n for _ in range(4)] # Laplace's Rule of Succession
for j, cols in enumerate(zip(*motifs)):
for base, cnt in Counter(cols).items():
# profile[base2row[base]][j] = cnt / m
profile[base2row[base]][j] = (cnt + 1) / (m + m) # Laplace's Rule of Succession
return profile
def MotifsScore(motifs: List[str]) -> int:
"""
SCORE(Motifs), the number of lower case letters in the motif matrix, column-by-column.
"""
from collections import Counter
score = 0
for cols in zip(*motifs):
cnt = Counter(cols)
upper_case = max(cnt, key=lambda x: cnt[x]) # 找到出现次数最多的碱基
for k, v in cnt.items():
if k != upper_case:
score += v
return score
def GibbsSampler(dnas: List[str], k: int, t: int, N: int) -> List[str]:
n = len(dnas[0]) # 每条DNA等长
# 随机生成第一组Motifs
motifs = []
for i in range(t):
beg = randint(0, n - k - 1) # [0,n-k-1]
motifs.append(dnas[i][beg:beg+k]) # 该DNA的随机k-mer子串
bestMotifs, bestScore = motifs, MotifsScore(motifs) # 当前Motifs当作最优解
# 进行N轮吉布斯抽样
for _ in range(N):
i = randint(0, t - 1)
motifs.pop(i) # 删除第i条序列
profile = MotifsProfile(motifs) # 更具另外t-1条生成Profile
motifs.insert(i, profileRandomKmer(dnas[i], k, profile))
currScore = MotifsScore(motifs)
if currScore < bestScore:
bestMotifs, bestScore = motifs, currScore
return bestMotifs
s = """8 5 100
CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
"""
ktN, *dnas = s.splitlines()
# 题目要求运行GibbsSampler函数20次,可以使用并行实现
bestMotifs, bestScore = [], float("inf")
for _ in range(50): # 但是20次对于案例依旧可能输出其它较大解
motifs = GibbsSampler(dnas, *map(int, ktN.split()))
currScore = MotifsScore(motifs)
if currScore < bestScore:
# print(currScore, motifs)
bestMotifs, bestScore = motifs, currScore
print(bestScore, *bestMotifs, sep="\n")
Explanation
Implement DistanceBetweenPatternAndStrings
1、模拟
from typing import List
def hammingDistance(s: str, t: str) -> int:
dist = 0
for ch1, ch2 in zip(s, t):
if ch1 != ch2:
dist += 1
return dist
def DistanceBetweenPatternAndStrings(pattern: str, dnas: List[str]):
k = len(pattern)
totDist = 0
for dna in dnas:
minDist = float("inf")
for i in range(len(dna) - k + 1):
dist = hammingDistance(dna[i:i+k], pattern)
if dist < minDist:
minDist = dist
totDist += minDist
return totDist
s = """AAA
TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT
"""
pattern, dnas = s.splitlines()
print(DistanceBetweenPatternAndStrings(pattern, dnas.split()), sep="\n")