Biopython: 超快一键建树 - 《Protocol for Bioinformatics》

Align your Sequence
Trimming the gap
Build the tree
Visualizing it with ggtree
How to use


© Karobben

由於語法渲染問題而影響閱讀體驗，請移步博客閱讀～
本文GitPage地址

Align your Sequence

Set your arguments

INPUT  = ""
OUTPUT = ""
ARG_TREE = "" # NJ / UPGMA

Import libraries

import os
from Bio.Align.Applications import ClustalwCommandline
from Bio import AlignIO
from Bio import Phylo

Pre-align

cwline = ClustalwCommandline("clustalw2", infile= INPUT, outfile = OUTPUT + ".fasta",  output= "fasta" )
print(cwline)
stdout, stderr = cwline()

clustalw2 -infile=Base.fa -outfile=assignment2.fasta -output=fasta

Trimming the gap

Gap from the head and tail could have huge effects on the result of the tree. So, we should avoid the side effects from the gaps because of the length-difference.

Here, we just counting the gaps from the head and tail of each sequences and retain the largest number. On the other hand, the numbers’ of gap extension were printed out so you can customize your own.

## reading the alignments
align = AlignIO.read(OUTPUT + ".fasta", "fasta")

## Count the longest space on the head

Gap_head = 0
Gap_tail = 0
Gap_list_head = []
Gap_list_tail = []
for seq_tmp in align:
    Num = 0
    while seq_tmp[Num] == "-":
        Num +=1
    Gap_list_head += [Num]
    if Num > Gap_head:
        Gap_head = Num
## Count the longest space on the tail
for seq_tmp in align:
    Num = 0
    while seq_tmp[::-1][Num] == "-":
        Num +=1
    Gap_list_tail += [Num]
    if Num > Gap_tail:
        Gap_tail = Num

print("\nGap head list: ", Gap_list_head,
      "\nGap head tail: ", Gap_list_tail)
print("\nGap head: ", Gap_head,
      "\nGap head: ", Gap_tail)

## Count the longest space on the head


# Alignment objects can be manipulated

# slice alignment
align_slice = align[:, Gap_head : len(seq_tmp)-Gap_tail]
print ("Slice of alignment from position " + str(Gap_head) + " to " + str(Gap_tail) +"\n")
print(align_slice)

## wirte it out
F = open(OUTPUT + "_scliced.fa", 'w')
F2 = open(OUTPUT + "_gap.fa", 'w')
for Seq_tmp in align_slice:
    F.write(">" + Seq_tmp.id+"\n")
    F2.write(">" + Seq_tmp.id+"\n")
    Seq_seq = str(Seq_tmp.seq)
    F2.write(Seq_seq+"\n")
    Seq_seq = Seq_seq.replace("-","")
    F.write(Seq_seq+"\n")
F.close()
F2.close()

Gap head list:  [115, 115, 86, 86, 86, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 91, 170, 70, 115, 117, 0, 115, 115, 192]
Gap head tail:  [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 0, 0, 12]

Gap head:  192
Gap head:  13
Slice of alignment from position 192 to 13

Alignment with 28 rows and 167 columns

Build the tree

##
# Make Phylogenetic Tree

cwline = ClustalwCommandline("clustalw2", infile= OUTPUT + "_gap.fa",  clustering = ARG_TREE, bootstrap = 1000)
print(cwline)
stdout, stderr = cwline()


tree = Phylo.read(OUTPUT.split(".")[0]+"_gap.phb", "newick")
Phylo.draw_ascii(tree)

# select the 3 sequences in the top branch of phylo tree

align_branch = align_slice[0:3]
print(align_branch)

  , XP_009431258.3
  |
  | , XP_005584739.1
  | |
  |,| NP_001036203.1
  |||
  ||| XP_011717782.1
  ||
  |, XP_011852471.1
  ||
  || NP_001277233.1
  ||
  ,| XP_025230146.1
  ||
  |, AAU95547.1
  ||
  || XP_011803362.1
  ||
  || XP_010378328.1
  |
  | , XP_033039886.1
  |,|
  |||__ XP_037851661.1
  |||
  |||________ XP_026309478.1
  ||
  ||  ___ XP_011926036.1
  ||,|
  ||||_____ XP_017738318.1
  |||
  | |          _______________ XP_024205865.1
  | |_________|
  |           |_______________ XP_024205864.1
  |           |
  |           |_________________________________________ NP_175517.1
  |
  | XP_008962016.1
  |
  | XP_030866972.1
  |
  | XP_009250187.1
  |
  | XP_004041225.1
  |
  | XP_004041226.1
  |
  | XP_002827755.1
  |
  | XP_002827754.1
  _|
  , sp|P01241|SOMA_HUMAN
  |
  | XP_002827756.1
  |
  | XP_024090242.1

Visualizing it with ggtree

Reference: ggplot2 extensions; 2016

args <- commandArgs(trailingOnly = TRUE)
print("Input", str(args))

library("ggtree")


tree <- read.tree(args)
ggtree(tree, branch.length = 'none') +
  geom_text2(aes(subset=!isTip, label=node), hjust= 1.5, vjust = -0.5) +
  geom_tiplab() +
  xlim(NA, 15 + length(strsplit(tree$tip.label[1], "*")[[1]])/2.5)
ggsave(paste(args, ".png", sep=""))

### save the tree as png
PATH_lib = os.path.dirname(__file__)
CMD = "Rscript " + PATH_lib + "/lib/R/ggtree.R " + OUTPUT +  "_gap.phb"
print(CMD)
os.system(CMD)

Biopython: 超快一键建树 - 图2

How to use

The quickest way:

input file: sample.fasta
output file prefix: result
tree type: UPGMA

Seq2tree.py -i sample.fasta -o result -t UPGMA

results:

result.fasta: original aligned file (with gaps)
result_gap.fa: Trimmed file (with gaps)
result_scliced.Trimmed file (without gaps)
result_gap.phb: Tree file
result_gap.phb.png: Tree visualizing with ggtree—-

Enjoy~
本文由Python腳本GitHub/語雀自動更新

由於語法渲染問題而影響閱讀體驗，請移步博客閱讀～
本文GitPage地址

GitHub: Karobben
Blog:Karobben
BiliBili:史上最不正經的生物狗