Problem
A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.

Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on (see below).

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

    DNA Strings    A T C C A G C T
                   G G G C A A C T
                   A T G G A T C T
                   A A G C A A C C
                   T T G G A A C T
                   A T G C C A T T
                   A T G G C A C T

    Profile     A  5 1 0 0 5 5 0 0
                C  0 0 1 4 2 0 6 1
                G  1 1 6 3 0 1 0 0
                T  1 5 0 0 0 1 1 6

    Consensus      A T G C A A C T
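The remark about ties can be made concrete with a small sketch. It takes a profile matrix as a list of four rows (in the order A, C, G, T) and lists every consensus string it permits. The function name `all_consensus_strings` and the tied 4×2 profile are hypothetical, introduced only for illustration.

```python
from itertools import product

SYMBOLS = "ACGT"  # row order of the profile matrix: A, C, G, T


def all_consensus_strings(profile):
    """Return every consensus string allowed by a 4 x n profile matrix."""
    choices = []
    for column in zip(*profile):  # walk the matrix column by column
        best = max(column)
        # Keep every symbol that reaches the column maximum (ties included).
        choices.append([sym for sym, count in zip(SYMBOLS, column) if count == best])
    # One consensus string per combination of per-column choices.
    return ["".join(combo) for combo in product(*choices)]


# Profile matrix from the example above: no column has a tie.
profile = [
    [5, 1, 0, 0, 5, 5, 0, 0],  # A
    [0, 0, 1, 4, 2, 0, 6, 1],  # C
    [1, 1, 6, 3, 0, 1, 0, 0],  # G
    [1, 5, 0, 0, 0, 1, 1, 6],  # T
]
print(all_consensus_strings(profile))  # ['ATGCAACT']

# Hypothetical 4 x 2 profile where 'A' and 'C' tie in the first column.
tied = [[1, 0], [1, 0], [0, 2], [0, 0]]
print(all_consensus_strings(tied))  # ['AG', 'CG']
```

The number of consensus strings is the product of the number of tied symbols per column, so it can grow quickly; the problem only asks for one of them.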
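Task:

input: a FASTA file in which every sequence has the same length

output: the matrix of base counts at each position, plus the sequence formed by the most frequent base at each position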
#input:

    >Rosalind_1
    ATCCAGCT
    >Rosalind_2
    GGGCAACT
    >Rosalind_3
    ATGGATCT
    >Rosalind_4
    AAGCAACC
    >Rosalind_5
    TTGGAACT
    >Rosalind_6
    ATGCCATT
    >Rosalind_7
    ATGGCACT

#output:

    ATGCAACT
    A: 5 1 0 0 5 5 0 0
    C: 0 0 1 4 2 0 6 1
    G: 1 1 6 3 0 1 0 0
    T: 1 5 0 0 0 1 1 6
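To experiment without a file on disk, the sample input can be parsed straight from a string. The sketch below is one possible way to do that; `parse_fasta` is a hypothetical helper, and it additionally assumes that a record may span several lines, which the sample itself does not require.

```python
def parse_fasta(text):
    """Return the sequences of a FASTA string in file order.

    Lines belonging to the same record are joined, so multi-line
    records are handled as well (an assumption beyond the sample).
    """
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            sequences.append("")  # a header starts a new record
        elif sequences:
            sequences[-1] += line  # sequence data for the current record
    return sequences


SAMPLE = """\
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""

seqs = parse_fasta(SAMPLE)
print(len(seqs), seqs[0])  # 7 ATCCAGCT

# The task assumes equal lengths; this check makes the assumption explicit.
if len({len(s) for s in seqs}) != 1:
    raise ValueError("all sequences must have the same length")
```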
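Approach:
- First, build an empty `dict` for counting, structured as `{1:{}, 2:{}, 3:{}, ...}` (one inner dict per position).
- For each sequence read, iterate over it with `enumerate()` and update the counting dict along the way.
- Find the consensus sequence.
- Print the formatted output.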
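The steps above can be sketched with plain dictionaries before wrapping them in a class. This is a minimal illustration, not the full solution: `count_bases` and `find_consensus` are hypothetical names, and positions are numbered from 1 here only to match the `{1:{}, 2:{}, ...}` structure described in the list.

```python
def count_bases(sequences):
    """Build {position: {base: count}} with positions starting at 1."""
    counts = {}
    for seq in sequences:
        for loc, base in enumerate(seq, start=1):
            per_position = counts.setdefault(loc, {})  # create inner dict on first use
            per_position[base] = per_position.get(base, 0) + 1
    return counts


def find_consensus(counts):
    """Pick the most frequent base at each position, in position order."""
    # On a tie, max() returns the base that was inserted first at that position.
    return "".join(max(counts[loc], key=counts[loc].get) for loc in sorted(counts))


sequences = [
    "ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
    "TTGGAACT", "ATGCCATT", "ATGGCACT",
]
counts = count_bases(sequences)
print(counts[1])               # {'A': 5, 'G': 1, 'T': 1}
print(find_consensus(counts))  # ATGCAACT
```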
Solution 1
The output here does not follow the required format, since that is slightly more work.

```python
from collections import defaultdict


class Consensus:
    def __init__(self, fa_path):
        self.fa_path = fa_path
        # position -> base -> count; missing entries default to 0
        self.count_dict = defaultdict(lambda: defaultdict(int))

    def read_fa(self):
        # Yield sequence lines, skipping FASTA headers
        # (assumes each sequence sits on a single line).
        with open(self.fa_path) as fd:
            for line in fd:
                if not line.startswith(">"):
                    yield line.strip("\n")

    def get_loc_count_dict(self):
        for seq in self.read_fa():
            for loc, base in enumerate(seq):
                self.count_dict[loc][base] += 1

    def get_consensus_seq(self):
        self.consensus_seq = ""
        for _, count in self.count_dict.items():
            # Most frequent base at this position
            self.consensus_seq += max(count, key=lambda k: count[k])

    def out_result(self):
        # One line per position (0-based), then the consensus sequence
        for loc, count in self.count_dict.items():
            print(f"{loc}: ", end=" ")
            for base in "ATGC":
                print(f"{base}: {count[base]}", end=" ")
            print("\n")
        print(self.consensus_seq)

    def run(self):
        self.get_loc_count_dict()
        self.get_consensus_seq()
        self.out_result()


def main():
    fa = "./fa.fasta"
    test = Consensus(fa)
    test.run()


if __name__ == "__main__":
    main()
```
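Since the solution above skips the required output layout, here is one possible way to produce it. This is a sketch under stated assumptions rather than part of the original solution: `format_result` is a hypothetical function, it takes the sequences as an in-memory list, and it assumes they all have the same length (`zip` would otherwise silently truncate to the shortest).

```python
from collections import Counter


def format_result(sequences, symbols="ACGT"):
    """Return the consensus line followed by one 'X: counts' line per symbol."""
    # zip(*sequences) walks the sequences column by column.
    columns = [Counter(column) for column in zip(*sequences)]
    # Counter returns 0 for absent symbols; ties resolve in A, C, G, T order.
    consensus = "".join(max(symbols, key=lambda s: col[s]) for col in columns)
    lines = [consensus]
    for s in symbols:
        lines.append(f"{s}: " + " ".join(str(col[s]) for col in columns))
    return "\n".join(lines)


sequences = [
    "ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
    "TTGGAACT", "ATGCCATT", "ATGGCACT",
]
print(format_result(sequences))
```

For the sample sequences this prints the consensus `ATGCAACT` followed by the four profile rows, matching the expected output shown earlier.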
