Generalizing GC-Content
A length k substring of a genetic string is commonly called a k-mer. A genetic string of length n can be seen as composed of n−k+1 overlapping k-mers. The k-mer composition of a genetic string encodes the number of times that each possible k-mer occurs in the string. See Figure 1.
Figure 1. The 2-mer composition of TTGATTACCTTATTTGATCATTACACATTGTACGCTTGTGTCAAAATATCACATGTGCCT
The 1-mer composition is a generalization of the GC-content of a strand of DNA, and the 2-mer, 3-mer, and 4-mer compositions of a DNA string are also commonly known as its di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions.
The biological significance of k-mer composition is manyfold. GC-content is helpful not only in helping to identify a piece of unknown DNA (see “Computing GC Content”), but also because a genomic region having high GC-content compared to the rest of the genome signals that it may belong to an exon. Analyzing k-mer composition is vital to fragment assembly as well.
In “Computing GC Content”, we also drew an analogy between analyzing the frequency of characters and identifying the underlying language. For larger values of k, the k-mer composition offers a more robust fingerprint of a string’s identity because it offers an analysis on the scale of substrings (i.e., words) instead of that of single symbols. As a basis of comparison, in language analysis, the k-mer composition of a text can be used not only to pin down the language, but also often the author.
Problem
For a fixed positive integer k, order all possible k-mers taken from an underlying alphabet lexicographically.
Then the k-mer composition of a string s can be represented by an array A for which A[m] denotes the number of times that the mth k-mer (with respect to the lexicographic order) appears in s.
Given: A DNA string s in FASTA format (having length at most 100 kbp).
Return: The 4-mer composition of s.
Sample Dataset
Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT
TTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCA
AATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCG
GGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGA
CTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAGSample Output
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1
Solution1:暴力算法
首先理解什么是
K-mer
,即长度为k
的子串(注意是连续,不是子序列)。例如CTTCGAAAGT
的4-mer
个数:
C | T | T | C | G | A | A | A | G | T |
---|---|---|---|---|---|---|---|---|---|
C | T | T | C | ||||||
T | T | C | G | ||||||
T | C | G | A | ||||||
C | G | A | A | ||||||
G | A | A | A | ||||||
A | A | A | G | ||||||
A | A | G | T |
也即是说枚举 k-mer
起点 i
,对应终点 i+k-1
就行。
第二步:将所有的 k-mer
进行同类计数,然后按字典序升序排序返回结果。
from typing import List
import re
class Solution:
def generateKmer(self, bases: List[str], path: List[str], k: int) -> str:
if len(path) == k:
yield ''.join(path)
return
for base in bases:
path.append(base)
yield from self.generateKmer(bases, path, k)
path.pop()
def kmerLexicographically(self, DNA: str, k: int) -> List[int]:
# 1. 寻找 DNA 所有 kmer 并计数
from collections import Counter
freq = Counter(DNA[i:i+k] for i in range(0, len(DNA) - k + 1))
# 2. 按字典序生成 k-mer 所有组合
return [freq[kmer] for kmer in self.generateKmer(['A', 'C', 'G', 'T'], [], k)]
seqs = """
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT
TTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCA
AATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCG
GGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGA
CTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG
"""
DNA = list(seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))[0] # 第一个是 `\n`
k = 4
for idx in Solution().kmerLexicographically(DNA, k):
print(idx, end=' ')
复杂度分析:
- 时间复杂度:,其中枚举所有
k-mer
可能组合为 ;枚举DNA
所有长度为k
的子串然后哈希化为 。 - 空间复杂度:,其中枚举所有
k-mer
可能组合为 ;枚举DNA
所有长度为k
的子串为为 ,哈希计数器为 。Solution2: 二进制迭代优化
一、k-mer 枚举二进制优化:
- 四进制 :由于输入序列只有四个字符,我们可以使用 两个比特位 来表示四个字符,由于涉及到字典序因此以
00
表示A
;以01
表示C
;以10
表示G
;以11
表示T
。 - 枚举长度为
k
的所有字典序组合,只需要保证每层子树是按字典序枚举即可。 用二进制状态表示如下序列。
Kmer二进制字典序状态.drawio
:::tips
Q:为什么用二进制表示长度为 k
的子串?
A:本质上就是手动实现哈希函数,因为 python 中以 str
为字典键,它会先执行 hash(str)
获得一个哈希值。 k=4
的所有组合可能值为 0-255
( 00000000
-> 11111111
)
:::
二、DNA 枚举 k-mer:滚动哈希优化
长度为 k = 4
如何快速计算其哈希值呢?如果暴力的方式就是每次计算,这样的时间复杂度为 。
C | T | T | C | G | A | A | A | G | T |
---|---|---|---|---|---|---|---|---|---|
01 | 11 | 11 | 01 | ||||||
11 | 11 | 01 | 10 | ||||||
11 | 01 | 10 | 00 | ||||||
01 | 10 | 00 | 00 | ||||||
10 | 00 | 00 | 00 | ||||||
00 | 00 | 00 | 10 | ||||||
00 | 00 | 10 | 11 |
注意上面相邻 k-mer
之间的关系,两相邻 k-mer
之间重叠部分有 k-1
,因此我们可以想办法减少这部分的运算。
- 获取序列的哈希值,如上图中:
CTTC
的哈希值为TTCG
的哈希值为
- 滚动哈希优化:假设
CTTC
的哈希值 值已知,如何快速计算其相邻序列TTCG
的哈希值 呢?CTTC
的后3
位哈希值和为TTCG
的前3
位哈希值和为- 我们知道其是重叠片段,且易知
TTCG
的前3
位哈希值和就是CTTC
的后3
位哈希值和再乘上4
。即有
考虑由序列 CTTC
转变为 TTCG
后,我们需要对 做什么转变才能得到 呢?
CTTC
要删除最左边的C
字符哈希值,才能获得后3
位哈希值和为TTCG
不仅要使用CTTC
的后3
位哈希值和再乘上4
,还要要增加最右侧的G
字符哈希值,即
综上整理得到滚动哈希一般表达式:。其中 表示权值, 中 表示进制, 表示权,整个表达式表示位权。
from typing import List
import re
class Solution:
def kmerLexicographically(self, DNA: str, k: int) -> List[int]:
base = 4 # 4 进制
# 1. k-mer 所有组合,以二进制表示
cnt = [0] * pow(base, k) # 共有 4* k 个状态
# 2. 使用滚动哈希寻找 DNA 所有 kmer 并计数
bases = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
# (1) 初始化计算长度为 k 的子串哈希和
h1 = 0
for i in range(k):
h1 = h1 * base + bases[DNA[i]]
cnt[h1] += 1 # 别忘第一个
# (2) 滚动哈希
highPow = pow(base, k-1)
for j in range(k, len(DNA)): # 枚举终点
h1 = (h1 - bases[DNA[j-k]] * highPow) * base + bases[DNA[j]]
cnt[h1] += 1
return cnt
seqs = """
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT
TTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCA
AATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCG
GGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGA
CTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG
"""
DNA = list(seq.replace('\n', '') for seq in re.split(r'>.*', seqs) if seq.replace('\n', ''))[0] # 第一个是 `\n`
k = 4
ret = Solution().kmerLexicographically(DNA, k)
for idx in Solution().kmerLexicographically(DNA, k):
print(idx, end=' ')
复杂度分析:
- 时间复杂度:,其中枚举所有
k-mer
可能组合为 ;枚举所有长度为k
的子串耗费 ;一次长度为k
的子串哈希化为 ,其后使用滚动哈希策略可以在 时间内计算处新的哈希值。 - 空间复杂度:,其中枚举所有
k-mer
可能组合为 ;枚举DNA
所有长度为k
的子串哈希值为