Since December 2019, the COVID-19 caused by a RNA coronavirus, SARS-CoV-2, has spread rapidly worldwide. Viruses can mutate and evolve rapidly during replication and spread. Therefore, studying the evolution and identifying the substitutions of SARS-COV-2 are essential for COVID-19 control. Previously, SARS-CoV-2 isolates from 11 patients were collected and the sequencing data were obtained by next-generation sequencing. In this workflow, the viral sequences were assembled and then compared with sequences in database to check its mutation. According to the phylogenetic analysis of SARS-CoV-2, the evolutionary history can be reconstructed to identify the virus infection source and transmission chain.

The Sequence File

These sequencing files were obtained from viral genome sequencing. These viruses were isolated from 11 COVID-19 patients. The sequence files we got are in FASTQ format. Then use the following code to view one of these files.

  1. zless -S P3-VERO-P3-1-vero_L4_1.fq.gz | head -n 8
  • A FASTQ file normally uses four lines for storing sequencing information per sequence.

    1. @A00821:275:HWMMWDSXX:4:1101:1298:1016 1:N:0:ATTACTCG+TAGATCGC
    2. GCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTT
    3. +
    4. FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    • This first line always starts with a @ character and is followed by a sequence identifier.
    • The second line is the raw sequence letters.
    • The last line encodes the quality values for the sequence in the second line. If p is the error probability, then: Viral Sequencing Reads Analysis_pipeline - 图1. Q scores are often represented as ASCII characters.

      Workflow

      1. Install software

      During the analysis process, we need many tools to analyze our data. These tools can be downloaded using Conda. Conda is a powerful package manager and environment manager. Conda quickly installs, runs and updates packages and their dependencies, as well, it easily creates, saves, loads, and switches between environments. ```shell

      ———————-Create a environment

      conda create -y -n covid python=3

      Activate covid environment

      conda activate covid

———————-Install software in ‘covid’ environment

conda install -y fastp trim-galore bwa samtools sambamba spades quast pilon mafft iqtree

  1. <a name="Os74W"></a>
  2. ### 2. Create and manage directories
  3. ```shell
  4. mkdir SARS_CoV_2
  5. cd SARS_CoV_2
  6. mkdir {raw_data,qc,clean_data,mapping,sort,subsample,fastq,genome,quast,Pilon,mafft}

3. Quality Control

Before further analysis, low-quality regions and adaptors should be trimmed and removed by Fastp or trim_galore from raw sequencing data. This step is important to avoid issues in downstream analyses such as variant analysis or assembly.

  1. cd ~/SARS_CoV_2/qc
  2. ##---------------fastp
  3. for file in $(ls ~/SARS_CoV_2/raw_data/*_1.fq.gz)
  4. do
  5. fullname=${file##*/}
  6. fastp -i ${file%1.*}1.fq.gz \
  7. -o ~/SARS_CoV_2/clean_data/${fullname%1.*}1.fq.gz \
  8. -I ${file%1.*}2.fq.gz \
  9. -O ~/SARS_CoV_2/clean_data/${fullname%1.*}1.fq.gz \
  10. -w 4 \
  11. --html ${fullname%_1.*}.html \
  12. --json ${fullname%_1.*}.json
  13. done
  14. or
  15. ##---------------trim_galore
  16. for file in $(ls ~/SARS_CoV_2/raw_data/*_1.fq.gz)
  17. do
  18. trim_galore --output_dir ~/SARS_CoV_2/clean_data \
  19. --phred33 --paired \
  20. --length 75 --quality 30 --stringency 5 \
  21. ${file%1.*}1.fq.gz ${file%1.*}2.fq.gz
  22. done

In this period, we chose Fastp to filter the raw data. Fastp per-cycle quality profiles, per-cycle base contents, adapter trimming results, and k-mer counts. These values will be merged after all reads are processed, and a reporter will generate reports in HTML and JSON formats. Fastp reports statistical values for pre-filtering and post-filtering data to facilitate comparisons of changes in data quality after filtering is complete.

  • HTML report

Viral Sequencing Reads Analysis_pipeline - 图2

4. Map sequence

Paired reads were aligned to the 29903nt Wuhan reference genome (NCBI NC_045512.2) using BWA.

  1. ##---------------Build index
  2. cd ~/SARS_CoV_2/mapping
  3. mkdir ref
  4. # Download the reference genome from the URL: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta, and rename the fasta file to 'Wuhan-Hu-1.fasta'
  5. bwa index Wuhan-Hu-1.fasta
  1. ##---------------Map sequence by bwa-mem
  2. cd ~/SARS_CoV_2/mapping
  3. for file in $(ls ~/SARS_CoV_2/clean_data/*_1.fq.gz)
  4. do
  5. fullname=${file##*/}
  6. INDEX=~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta
  7. bwa mem -t 4 -M $INDEX \
  8. ${file%_1.*}_1.fq.gz \
  9. ${file%_1.*}_2.fq.gz \
  10. > ~/SARS_CoV_2/mapping/${fullname%_1.*}.sam \
  11. 2> ~/SARS_CoV_2/mapping/${fullname%_1.*}.log
  12. done
  13. for file in $(ls ~/SARS_CoV_2/mapping/*.sam)
  14. do
  15. fullname=${file##*/}
  16. samtools view -b -F 12 $file > ${file%.sam*}.bam &&\
  17. samtools sort -@ 4 -l 9 ${file%.sam*}.bam -o ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam &&\
  18. samtools index ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam.bai
  19. done
  20. ##---------------Count the number of alignments
  21. cd ~/SARS_CoV_2/sort
  22. for file in $(ls ~/SARS_CoV_2/sort/*.bam)
  23. do
  24. fullname=${file##*/}
  25. samtools flagstat $file > ${fullname%.bam*}_flagstat.txt
  26. done
  27. ##---------------Calculate the sequencing depth
  28. for file in $(ls ~/SARS_CoV_2/sort/*.bam)
  29. do
  30. fullname=${file##*/}
  31. samtools depth -d 0 $file > ${fullname%.bam*}_depth.txt
  32. done

5. Assemble genome

SPAdes is an assembly toolkit mainly used for small genomes, so it’s a good choice for assembling viral genomes. SPAdes contains various assembly pipelines. In this step, we chose the pipeline, metaSPAdes. In the next part, we would compare the different effects between the metaSPAdes and others.

  1. ##---------------Subsample reads
  2. cd ~/SARS_CoV_2/subsample
  3. for file in $(ls ~/SARS_CoV_2/sort/*.bam)
  4. do
  5. fullname=${file##*/}
  6. sambamba view -h -s 0.001 $file -o ~/SARS_CoV_2/subsample/${fullname%-vero*}-0.001.bam
  7. done
  8. ##---------------Sort and build index
  9. for file in $(ls ~/SARS_CoV_2/subsample/*.bam)
  10. do
  11. samtools sort -@ 4 -l 9 $file -o ${file%.bam}_sort.bam &&\
  12. samtools index ${file%.bam}_sort.bam ${file%.bam}_sort.bam.bai
  13. done
  14. ##---------------Convert the BAM to FASTQ
  15. cd ~/SARS_CoV_2/fastq
  16. for file in $(ls ~/SARS_CoV_2/subsample/*_sort.bam)
  17. do
  18. fullname=${file##*/}
  19. samtools fastq $file > ~/SARS_CoV_2/fastq/${fullname%_sort.bam*}.fq
  20. done
  21. ##---------------Assemble
  22. cd ~/SARS_CoV_2/genome
  23. for file in $(ls ~/SARS_CoV_2/fastq/*.fq)
  24. do
  25. fullname=${file##*/}
  26. spades.py --meta --12 $file -o ~/SARS_CoV_2/genome/${fullname%.fq*}_meta.fasta
  27. done

6. Evaluate assembly

To evaluate the effect of assembly, we used MetaQUAST to compare the assembled genome with the reference genome(NCBI NC_045512.2).

  • N50

For genome assembly results, we usually evaluate them by N50 and other indicators. For an assembled sequence, the sequences are sorted from the largest to the smallest according to their length, and then the lengths are accumulated starting from the first sequence until the cumulative length exceeds 50% of the total length. At this point, the length of the last contig accumulated is the length of N50. The schematic diagram is as follows:
y5kk5zcfvy.jpeg

a. scaffolds.fasta/contigs.fasta

20210419162842.pngcbin

  1. python ~/miniconda3/envs/covid/bin/metaquast.py scaffolds.fasta contigs.fasta -r ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta -t 4 -o result_version

Viral Sequencing Reads Analysis_pipeline - 图5
The N50 is not shown in the screenshot, but according to the definition of N50, assembled genomes can be measured by the ratio of the largest alignment to the total aligned length. The higher the ratio, the better. scaffolds.fasta and contigs.fasta have no difference in the assembly results. We can pick any one of them for subsequent analysis.

b. Different pipelines in MAFFT

When we assemble a genome using MAFFT, there are three options(—meta/—sc/—iso) we can use.
Viral Sequencing Reads Analysis_pipeline - 图6
When using the metaSPAdes pipeline, the ratio of the largest alignment to the total aligned length is higher.

c. Different subsampling coverage

Viral Sequencing Reads Analysis_pipeline - 图7
Before assembling the genome, we subsampled our reads to get better results. Using different subsampling coverage makes a difference. For the first genome, 0.006 is a better fraction. However, for other genomes, it could be another one.

7. Improve assemblies

Pilon is a software tool that can be used to automatically improve draft assemblies by comparing the raw sequenced data with the assembled genome.

  1. ##---------------Build index
  2. cd ~/SARS_CoV_2/Pilon
  3. for i in {1..11}
  4. do
  5. mkdir P3-${i}
  6. bwa index -p P3-${i}/P3-${i}_meta ~/SARS_CoV_2/genome/*${i}*.fasta
  7. done
  8. ##---------------Align and sort
  9. for i in {1..11}
  10. do
  11. bwa mem -t 20 ~/SARS_CoV_2/Pilon/P3-${i}/P3-${i}_meta \
  12. ~/SARS_CoV_2/clean_data/*-${i}-*_1.fq.gz \
  13. ~/SARS_CoV_2/clean_data/*-${i}-*_2.fq.gz \
  14. | samtools sort -@ 10 -O bam -o ./P3-${i}/P3-${i}_meta.bam &&\
  15. samtools index -@ 10 ./P3-${i}/P3-${i}_meta.bam
  16. done
  17. ##---------------Mark duplication
  18. for i in {1..11}
  19. do
  20. sambamba markdup -t 20 --hash-table-size=1000000 --overflow-list-size=1000000 ./P3-${i}/P3-${i}_meta.bam ./P3-${i}/P3-${i}_markdup.bam
  21. done
  22. ##---------------Filter
  23. for i in {1..11}
  24. do
  25. samtools view -@ 10 -q 30 -b -h ./P3-${i}/P3-${i}_markdup.bam > ./P3-${i}/P3-${i}_filter.bam &&\
  26. samtools index -@ 10 ./P3-${i}/P3-${i}_filter.bam
  27. done
  28. ##---------------Improve assemblies
  29. for i in {1..11}
  30. do
  31. java -Xmx16G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
  32. --genome ./P3-${i}/P3-${i}_meta.fasta \
  33. --frags ./P3-${i}/P3-${i}_filter.bam \
  34. --changes --fix all,breaks\
  35. --output ./P3-${i}/pilon_polished_${i} \
  36. --vcf &> ./P3-${i}/pilon${i}.log
  37. done

8. Identify mutations

a. Find Variation Using Pilon

Input to Pilon consists of the NC_045512.2 genome in FASTA format along with the BAM files of sequencing reads aligned to the NC_045512.2 genome.

  1. ##---------------Identify mutations
  2. for i in {1..11}
  3. do
  4. java -Xmx64G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
  5. --genome ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta \
  6. --frags ~/SARS_CoV_2/subsample/*${i}*_sort.bam \
  7. --changes \
  8. --output ${i}.fasta \
  9. --fix all,breaks\
  10. done

· View **.changes** File
The .changes file contains a space-delimited record of every change made in the assembly as instructed by the --fix option.

  1. for i in {1..11}
  2. do
  3. echo ">${i}.fasta.changes" >> sum.txt
  4. cat ${i}.fasta.changes >>sum.txt
  5. done
  1. >1.fasta.changes
  2. NC_045512.2:11083 NC_045512.2_pilon:11083 G T
  3. NC_045512.2:23607 NC_045512.2_pilon:23607 G A
  4. NC_045512.2:26144 NC_045512.2_pilon:26144 G T
  5. >2.fasta.changes
  6. NC_045512.2:8782 NC_045512.2_pilon:8782 C T
  7. NC_045512.2:28144 NC_045512.2_pilon:28144 T C
  8. >3.fasta.changes
  9. NC_045512.2:16846 NC_045512.2_pilon:16846 G T
  10. >4.fasta.changes
  11. >5.fasta.changes
  12. >6.fasta.changes
  13. NC_045512.2:3885 NC_045512.2_pilon:3885 C A
  14. NC_045512.2:12778 NC_045512.2_pilon:12778 C T
  15. NC_045512.2:13270 NC_045512.2_pilon:13270 T C
  16. NC_045512.2:29449 NC_045512.2_pilon:29449 G T
  17. >7.fasta.changes
  18. NC_045512.2:1405 NC_045512.2_pilon:1405 T C
  19. NC_045512.2:9802 NC_045512.2_pilon:9802 G T
  20. NC_045512.2:19945 NC_045512.2_pilon:19945 A G
  21. NC_045512.2:25267 NC_045512.2_pilon:25267 C T
  22. NC_045512.2:27615 NC_045512.2_pilon:27615 C G
  23. >8.fasta.changes
  24. NC_045512.2:8782 NC_045512.2_pilon:8782 C T
  25. NC_045512.2:18060 NC_045512.2_pilon:18060 C T
  26. NC_045512.2:28144 NC_045512.2_pilon:28144 T C
  27. >9.fasta.changes
  28. NC_045512.2:22303 NC_045512.2_pilon:22303 T G
  29. >10.fasta.changes
  30. NC_045512.2:16114 NC_045512.2_pilon:16114 C T
  31. NC_045512.2:22205 NC_045512.2_pilon:22205 G C
  32. >11.fasta.changes
  33. NC_045512.2:22303 NC_045512.2_pilon:22303 T G
  34. NC_045512.2:27775 NC_045512.2_pilon:27775 T C
  35. NC_045512.2:27776 NC_045512.2_pilon:27776 T G
  36. NC_045512.2:27777 NC_045512.2_pilon:27777 G A

Through the changes file, we can find that there are 3 mutations in the first genome. More detailed information is stored in the 1.fasta.vcf file.

b. Find Variation Using MAFFT

  1. cat ~/SARS_CoV_2/Pilon/P3-*/pilon_polished*.fasta ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta > sequences12.fasta
  2. mafft --reorder --adjustdirection --auto --clustalout --thread 4 ~/SARS_CoV_2/mafft/sequences12.fasta > mut.aln

image.png

9. ML phylogenetic tree

a. Multiple sequence alignment

The assembled virus genomes was aligned with the previous genome using MAFFT. The previous genome sequences were obtained from the NCBI SARS-CoV-2 Resources(https://www.ncbi.nlm.nih.gov/sars-cov-2/) in fasta format.

  1. cd ~/SARS_CoV_2/mafft
  2. # Download 100 SARS-CoV-2 genome sequences randomly from SARS-CoV-2 Database.
  3. cat sequences.fasta ~/SARS_CoV_2/Pilon/P3-*/pilon_polished*.fasta ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta > sequences112.fasta
  4. ##---------------Multiple alignment
  5. mafft --reorder --adjustdirection --auto --clustalout --thread 4 ./sequences112.fasta > output.aln

Viral Sequencing Reads Analysis_pipeline - 图9

b. Construct an evolutionary tree

After alignment using MAFFT, the output.aln file can be used to construct an evolutionary tree using the GTR+I+G model in iqtree. Using their evolutionary tree, we can determine the chain of transmission in these 11 patients.

  1. iqtree -s./output.aln -m GTR+I+G

c. Visualize .treefile by tree viewer programs iTOL.

20210426214838.png
According to the evolutionary tree, the virus with genome ID MT066156.1 is most closely related to the 11 strains. MT066156.1 was isolated from the first case of COVID-19 in Italy. The patient came to Italy as tourist from Hubei province. They’re further away from the virus with genome ID NC045512.2. The 11 cases were indirectly infected by those people who linked to Huanan Seafood Market.