Generate the k-mer Composition of a String

rosalind.info_problems_ba3a_.png

1、所有k-mer子串排序

  1. from typing import List
  2. def StringComposition(k: int, text: str) -> List[str]:
  3. kmers = []
  4. for i in range(len(text) - k + 1):
  5. kmers.append(text[i:i+k])
  6. kmers.sort()
  7. return kmers
  8. # return sorted(text[i:i+k] for i in range(len(text) - k + 1))
  9. s = """5
  10. CAATCCAAC
  11. """
  12. k, text = s.splitlines()
  13. print(*StringComposition(int(k), text), sep="\n")

复杂度分析:设 text 长度为 n

  • 时间复杂度3 - 图2,共生成3 - 图3条长度为3 - 图4的子串,排序消耗
  • 空间复杂度3 - 图5

    2、后缀数组

Reconstruct a String from its Genome Path

rosalind.info_problems_ba3b_.png

1、遍历图

  1. from typing import List
  2. def StringSpelled2GenomePath(patterns: List[str]) -> str:
  3. genome = patterns[0]
  4. for i in range(1, len(patterns)):
  5. genome += patterns[i][-1]
  6. return genome
  7. s = """ACCGA
  8. CCGAA
  9. CGAAG
  10. GAAGC
  11. AAGCT
  12. """
  13. patterns = s.splitlines()
  14. print(StringSpelled2GenomePath(patterns), sep="\n")

Construct the Overlap Graph of a Collection of k-mers

rosalind.info_problems_ba3c_.png

1、O(n^2)建有向图

本题所要求的两点之间有一条有向边3 - 图8的条件是3 - 图9。但是现在不清楚两个节点之间关系,因此我们需要枚举所有3 - 图10对,最暴力方式是双层循环3 - 图11 3 - 图12最后需要注意答案是以邻接表实现,也就是最终答案是以有向图的起点升序。

  1. from typing import List, Tuple
  2. def OverlapGraph(patterns: List[str]) -> List[Tuple[str]]:
  3. graph = []
  4. n = len(patterns)
  5. for i in range(n):
  6. for j in range(i + 1, n):
  7. u, v = patterns[i], patterns[j]
  8. # u -> v
  9. if u[1:] == v[:-1]:
  10. graph.append((u, v))
  11. # v -> u
  12. if v[1:] == u[:-1]:
  13. graph.append((v, u))
  14. # adjacency list
  15. graph.sort()
  16. return graph
  17. s = """ATGCG
  18. GCATG
  19. CATGC
  20. AGGCA
  21. GGCAT
  22. """
  23. patterns = s.splitlines()
  24. for u, v in OverlapGraph(patterns):
  25. print(f"{u} -> {v}")

复杂度分析:设 patterns 长度为 n,graph 长度为 n^2

  • 时间复杂度3 - 图13,对 n^2 个元素排序。
  • 空间复杂度3 - 图14,邻接表

    Construct the De Bruijn Graph of a String

    rosalind.info_problems_ba3d_.png

    1、从Text串获取k-mer串建De Bruijn图

    image.png 3 - 图17```python from typing import List, Tuple from collections import defaultdict

def deBruijnGraph(k: int, text: str) -> List[Tuple[str, List[str]]]: graph = defaultdict(list) # 边个数不变,因此list

  1. # 取k-mer前k-1个字符作节点,边是相邻k-mer的overlap
  2. for i in range(len(text) - k + 1): # 最后一个k-mer只有 k-1 个
  3. graph[text[i:i+k-1]].append(text[i+1:i+k])
  4. # adjacency list
  5. return [(node, sorted(graph[node])) for node in sorted(graph)]

s = “””4 AAGATTCTCTAC””” k, text = s.splitlines() for u, v in deBruijnGraph(int(k), text): print(f”{u} -> {‘,’.join(v)}”)

  1. <a name="dUgyG"></a>
  2. # [Construct the De Bruijn Graph of a Collection of k-mers](https://rosalind.info/problems/ba3e/)
  3. ![rosalind.info_problems_ba3e_.png](https://cdn.nlark.com/yuque/0/2022/png/1438957/1656247597474-dcbe91f4-4148-4786-8b88-01e7db3072de.png#clientId=u488605a0-ad03-4&crop=0&crop=0&crop=1&crop=1&from=drop&id=ub5e3f1c3&margin=%5Bobject%20Object%5D&name=rosalind.info_problems_ba3e_.png&originHeight=1023&originWidth=1175&originalType=binary&ratio=1&rotation=0&showTitle=false&size=85418&status=done&style=none&taskId=u9f0a0831-3e17-4c70-afd3-3d64315a20f&title=)
  4. <a name="AWCoN"></a>
  5. ## 1、从k-mers中建De Bruijn图
  6. 由于输入都是某文本串`text`的所有`k-mer`子串,他们顺序是乱的,但是现在要从它们之中构建出De Bruijn图<br />节点:所有串的前缀和后缀构成的无重复集合。<br />边:长度为`k`的子串`s`的前缀`s[0,n-2]`和后缀`s[1,n-1]`之间构成一条边。
  7. ![](https://cdn.nlark.com/yuque/__graphviz/7f147fc888fc36b07ab34541844300f1.svg#lake_card_v2=eyJ0eXBlIjoiZ3JhcGh2aXoiLCJjb2RlIjoiZGlncmFwaCB7XG5cdHJhbmtkaXI9TFJcblx0bm9kZSBbc2hhcGU9Y2lyY2xlXVxuXHRBR0cgLT4gR0dHXG5cdENBRyAtPiB7QUdHLEFHR31cblx0R0FHIC0-IEFHR1xuXHRHR0EgLT4gR0FHXG5cdEdHRyAtPiB7R0dBLEdHR31cbn0iLCJ1cmwiOiJodHRwczovL2Nkbi5ubGFyay5jb20veXVxdWUvX19ncmFwaHZpei83ZjE0N2ZjODg4ZmMzNmIwN2FiMzQ1NDE4NDQzMDBmMS5zdmciLCJpZCI6IkdxUUhmIiwibWFyZ2luIjp7InRvcCI6dHJ1ZSwiYm90dG9tIjp0cnVlfSwiY2FyZCI6ImRpYWdyYW0ifQ==)```python
  8. from typing import List, Tuple
  9. from collections import defaultdict
  10. def deBruijnGraphFromKmers(patterns: List[str]) -> List[Tuple[str, List[str]]]:
  11. graph = defaultdict(list)
  12. for pattern in patterns:
  13. graph[pattern[:-1]].append(pattern[1:])
  14. # adjacency list
  15. return [(node, sorted(graph[node])) for node in sorted(graph)]
  16. s = """GAGG
  17. CAGG
  18. GGGG
  19. GGGA
  20. CAGG
  21. AGGG
  22. GGAG"""
  23. patterns = s.splitlines()
  24. for u, v in deBruijnGraphFromKmers(patterns):
  25. print(f"{u} -> {','.join(v)}")

Find an Eulerian Cycle in a Graph

rosalind.info_problems_ba3f_.png 3 - 图196->8->7->9->6->5->4->2->1->0->3->2->6