Result 1:Performance of different filtering parameters in Samtools fastq:

(1)-f 3: (Output file size:312M)

Choose the flag = 0x1 (paired-end) and the flag = 0x2 (each segment properly aligned according to the aligner) to include the reads needed.

(2)-F 12: (Output file size:317M)

Choose the flag = 4(segment unmapped) and the flag = 8(next segment in the template unmapped) to exclude the reads not needed.

(3)-F 0x900: (Output file size:319M)

Recommended by the help document in order to obtain the only paired reads by excluding the flag = 0x100 (secondary alignment)and the flag = 0x800 (supplementary alignment)

(4)Parameter explainations:

-f: Include the reads with intended flags
-F: Filter the reads with intended flags
-G: Exclude the reads with intended flags
**

Flags explainations:**

Expressed in a hexadecimal or decimal way consists of the following values:

71D9883A-9F57-42B4-BCFC-F363F9A34D67.png

(5)Result: **The lowest assembly contigs could be obtained by extracting subreads with the parameter -f 3**
C8D11517-5FBA-48D5-8B55-A7900EA2058D.png

**

Result 2: Assess the assemblies with different subsampling coverage and SPAdes parameters (—meta/—isolate)

30X 100X 1000X 10000X 100000X

The third sample as an example:
13.Improved work 2: Different Samtools fastq and SPAdes parameters - 图4

The fifth sample as an example:**
13.Improved work 2: Different Samtools fastq and SPAdes parameters - 图5

Result:Reads are better assembled with 1000 ~ 10000 subsampling coverage. As for SPAdes parameters, assemblies by “—meta” performed better than “—isolate” with less assembled contigs.
**


Result 3: Package the pipeline :

(1)Goals:
Input the original paired-end fastq files、index files、subsampling fraction and assembling strategies, the output file will be the finally assembled fasta files with only single contigs and the whole process taking around 7 hours for each sample.

Usage:
Run the following shell from the command line:
#!/bin/bash
#SBATCH -e subfqall.err
#SBATCH -o subfqall.out
#SBATCH -J subfqall
#SBATCH --ntasks-per-node=2
#SBATCH -N 1#SBATCH -p sugon
#SBATCH -t 120:00:00

python ``ymy.py`` ``--filea`` ./P3-VERO-P3-1-vero_L4_1.fq.gz ``--fileb`` ./P3-VERO-P3-1-vero_L4_2.fq.gz ``--index`` ./2019-nCoV.fasta ``--fraction`` 0.0036 ``--startegy`` meta
(2)—help**
D7FFFFC8-48FC-4610-B9D8-025AE060AC6F.png
(3)Code:**

  1. #!/usr/bin/python
  2. #-*- coding:utf-8 -*-
  3. import argparse
  4. import os
  5. def _argparse():
  6. parser = argparse.ArgumentParser(description='Process spades from fastq')
  7. parser.add_argument('--filea', dest='file1', default='./P3-VERO-P3-1-vero_L4_1.fq.gz', help="The read1 fastq file")
  8. parser.add_argument('--fileb', dest='file2', default='./P3-VERO-P3-1-vero_L4_2.fq.gz', help="The read2 fastq file")
  9. parser.add_argument('--index', dest='fileindex', default='./2019-nCoV.fasta', help="The index")
  10. parser.add_argument('--fraction', dest='fra', default='0.00036', help="The subsample fraction")
  11. parser.add_argument('--startegy', dest='star', default='--meta', help="The startegy of spades")
  12. return parser.parse_args()
  13. def zuzhuang(x,y,z,w,t):
  14. '''
  15. os.system(f"fastp -i {x} -I {y} -o out1.fastq.gz -O out2.fastq.gz")
  16. os.system(f"bwa mem -t 20 {z} out1.fq.gz out2.fq.gz | samtools sort -@ 10 -O bam -o sorted.bam")
  17. os.system(f"samtools index -@ 8 sorted.bam sorted.bam.bai")
  18. os.system(f"sambamba view -t 10 -h -s {w} sorted.bam -o subsorted.bam")
  19. os.system(f"samtools fastq -s -f 3 -1 paired1.fq -2 paired2.fq subsorted.bam")
  20. os.system(f"spades.py -{t} -1 paired1.fq -2 paired2.fq -o outputfile")
  21. '''
  22. os.system("fastp -i {x} -I {y} -o out1.fastq.gz -O out2.fastq.gz".format(x=x, y=y))
  23. os.system("bwa mem -t 20 {z} out1.fq.gz out2.fq.gz | samtools sort -@ 10 -O bam -o sorted.bam".format(z = z))
  24. os.system("samtools index -@ 8 sorted.bam sorted.bam.bai")
  25. os.system("sambamba view -t 10 -h -s {w} sorted.bam -o subsorted.bam".format(w = w))
  26. os.system("samtools fastq -s -f 3 -1 paired1.fq -2 paired2.fq subsorted.bam")
  27. os.system("spades.py --{t} -1 paired1.fq -2 paired2.fq -o outputfile".format(t = t))
  28. #return outputfile
  29. print("all finished!")
  30. def main():
  31. parser = _argparse()
  32. zuzhuang(parser.file1, parser.file2, parser.fileindex, parser.fra, parser.star)
  33. if __name__ == '__main__':
  34. main()

(4)Performance on 11 samples :

Parameter selection:
Filtering parameter during subsampling : -f 3
Subsampling coverage selection: 1000-10000
SPAdes strategies: —meta,**

Assembled results of 11 samples:
Each sample only has one contig in the assemblie. Some indicators, as the coverage of reference genome, the miss-assembled rates and the error rate are more reliable than before.
EFF62EF0-AB95-468D-97B4-8FB12D3EFE66.png

Result 4: Summary:

Several tricky points:

  1. Ignoring the addition of “-s “ parameter during Samtools fastq can cause errors in subsequent assembly suggesting the reads name are not matched.
  2. Samtools fastq filtering with “-f 3” parameters can reduce the number of finally assembled contigs to one.
  3. No improvement is found using the corrected reads from the first SPAdes result files (To note that if you use “—isolate” ,there will be no corrected results. )
  4. The unpaired reads resulted from SPAdes is useful in Pilon and act as the unpaired bam files.