Problem

  1. Problem
  2. A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.
  3. Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).
  4. A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
  5. A T C C A G C T
  6. G G G C A A C T
  7. A T G G A T C T
  8. DNA Strings A A G C A A C C
  9. T T G G A A C T
  10. A T G C C A T T
  11. A T G G C A C T
  12. A 5 1 0 0 5 5 0 0
  13. Profile C 0 0 1 4 2 0 6 1
  14. G 1 1 6 3 0 1 0 0
  15. T 1 5 0 0 0 1 1 6
  16. Consensus A T G C A A C T

问题:
input: 对于给定的一个fasta文件, 每条序列长度一致
output: 输出每个位置上的碱基分布的matrix,并输出由每个位置上最多数量碱基构成的序列

  1. #input:
  2. >Rosalind_1
  3. ATCCAGCT
  4. >Rosalind_2
  5. GGGCAACT
  6. >Rosalind_3
  7. ATGGATCT
  8. >Rosalind_4
  9. AAGCAACC
  10. >Rosalind_5
  11. TTGGAACT
  12. >Rosalind_6
  13. ATGCCATT
  14. >Rosalind_7
  15. ATGGCACT
  16. #output:
  17. ATGCAACT
  18. A: 5 1 0 0 5 5 0 0
  19. C: 0 0 1 4 2 0 6 1
  20. G: 1 1 6 3 0 1 0 0
  21. T: 1 5 0 0 0 1 1 6

思路:

  • 首先构建一个初始化计数的空dict结构如下

    1. {1:{}, 2:{}, 3:{}, ...}
  • 每次读入利用enumerate()对每条序列迭代,同时更新计数的dict

  • 找出Consensus序列
  • print格式化输出
  • 可以利用生成器快速读取fasta

    solution -1

    就不按照要求的格式输出了 略微麻烦 ```python from collections import defaultdict class Consensus: def init(self, fa_path):

    1. self.fa_path = fa_path
    2. self.count_dict = defaultdict(lambda: defaultdict(int))

    def read_fa(self):

    1. with open(self.fa_path) as fd:
    2. for line in fd:
    3. if not line.startswith(">"):
    4. yield line.strip("\n")

    def get_loc_count_dict(self):

    1. for seq in self.read_fa():
    2. for loc, base in enumerate(seq):
    3. self.count_dict[loc][base] += 1

    def get_consensus_seq(self):

    1. self.consensus_seq = ""
    2. for _, count in self.count_dict.items():
    3. self.consensus_seq += max(count, key=lambda k: count[k])

    def out_result(self):

    1. for loc, count in self.count_dict.items():
    2. print(f"{loc}: ", end = " ")
    3. for base in "ATGC":
    4. print(f"{base}: {count[base]}", end = " ")
    5. print("\n")
    6. print(self.consensus_seq)

    def run(self):

    1. self.get_loc_count_dict()
    2. self.get_consensus_seq()
    3. self.out_result()

def main(): fa = “./fa.fasta” test = Consensus(fa) test.run()

if name == ‘main‘: main() ```