10.Consensus and Profile - 《Rosalind and BioLeetCode - 生信版 Leetcode》

Problem
solution -1

Problem

Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.
Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).
A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
A T C C A G C T
G G G C A A C T
A T G G A T C T
DNA Strings    A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T
A   5 1 0 0 5 5 0 0
Profile    C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
Consensus    A T G C A A C T

问题：
input: 对于给定的一个fasta文件, 每条序列长度一致
output: 输出每个位置上的碱基分布的matrix，并输出由每个位置上最多数量碱基构成的序列

#input:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
#output:
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

思路：

首先构建一个初始化计数的空dict结构如下
```
{1:{}, 2:{}, 3:{}, ...}
```
每次读入利用enumerate()对每条序列迭代，同时更新计数的dict
找出Consensus序列
print格式化输出

可以利用生成器快速读取fasta

solution -1

就不按照要求的格式输出了略微麻烦 ```python from collections import defaultdict class Consensus: def init(self, fa_path):

  self.fa_path = fa_path
  self.count_dict = defaultdict(lambda: defaultdict(int))

def read_fa(self):

  with open(self.fa_path) as fd:
      for line in fd:
          if not line.startswith(">"):
              yield line.strip("\n")

def get_loc_count_dict(self):

  for seq in self.read_fa():
      for loc, base in enumerate(seq):
          self.count_dict[loc][base] += 1

def get_consensus_seq(self):

  self.consensus_seq = ""
  for _, count in self.count_dict.items():
      self.consensus_seq += max(count, key=lambda k: count[k])

def out_result(self):

  for loc, count in self.count_dict.items():
      print(f"{loc}: ", end = " ")
      for base in "ATGC":
          print(f"{base}: {count[base]}", end = " ")
      print("\n")
  print(self.consensus_seq)

def run(self):

  self.get_loc_count_dict()
  self.get_consensus_seq()
  self.out_result()

def main(): fa = “./fa.fasta” test = Consensus(fa) test.run()

if name == ‘main‘: main() ```