Viral Sequencing Reads Analysis_pipeline

The Sequence File
Workflow
- 1. Install software
———————-Create a environment
Activate covid environment

Since December 2019, the COVID-19 caused by a RNA coronavirus, SARS-CoV-2, has spread rapidly worldwide. Viruses can mutate and evolve rapidly during replication and spread. Therefore, studying the evolution and identifying the substitutions of SARS-COV-2 are essential for COVID-19 control. Previously, SARS-CoV-2 isolates from 11 patients were collected and the sequencing data were obtained by next-generation sequencing. In this workflow, the viral sequences were assembled and then compared with sequences in database to check its mutation. According to the phylogenetic analysis of SARS-CoV-2, the evolutionary history can be reconstructed to identify the virus infection source and transmission chain.

The Sequence File

These sequencing files were obtained from viral genome sequencing. These viruses were isolated from 11 COVID-19 patients. The sequence files we got are in FASTQ format. Then use the following code to view one of these files.

zless -S P3-VERO-P3-1-vero_L4_1.fq.gz | head -n 8

A FASTQ file normally uses four lines for storing sequencing information per sequence.
```
@A00821:275:HWMMWDSXX:4:1101:1298:1016 1:N:0:ATTACTCG+TAGATCGC  
GCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTT
+
FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
```
- This first line always starts with a @ character and is followed by a sequence identifier.
- The second line is the raw sequence letters.
- The last line encodes the quality values for the sequence in the second line. If p is the error probability, then: . Q scores are often represented as ASCII characters.
  Workflow
  1. Install software
  During the analysis process, we need many tools to analyze our data. These tools can be downloaded using Conda. Conda is a powerful package manager and environment manager. Conda quickly installs, runs and updates packages and their dependencies, as well, it easily creates, saves, loads, and switches between environments. ```shell
  ———————-Create a environment
  conda create -y -n covid python=3
  Activate covid environment
  conda activate covid

———————-Install software in ‘covid’ environment

conda install -y fastp trim-galore bwa samtools sambamba spades quast pilon mafft iqtree

<a name="Os74W"></a>
### 2. Create and manage directories
```shell
mkdir SARS_CoV_2
cd SARS_CoV_2
mkdir {raw_data,qc,clean_data,mapping,sort,subsample,fastq,genome,quast,Pilon,mafft}

3. Quality Control

Before further analysis, low-quality regions and adaptors should be trimmed and removed by Fastp or trim_galore from raw sequencing data. This step is important to avoid issues in downstream analyses such as variant analysis or assembly.

cd ~/SARS_CoV_2/qc
##---------------fastp
for file in $(ls ~/SARS_CoV_2/raw_data/*_1.fq.gz)
do
fullname=${file##*/}
fastp -i ${file%1.*}1.fq.gz \ 
  -o ~/SARS_CoV_2/clean_data/${fullname%1.*}1.fq.gz \
  -I ${file%1.*}2.fq.gz \
  -O ~/SARS_CoV_2/clean_data/${fullname%1.*}1.fq.gz \
  -w 4 \
  --html ${fullname%_1.*}.html \
  --json ${fullname%_1.*}.json
done
or
##---------------trim_galore
for file in $(ls ~/SARS_CoV_2/raw_data/*_1.fq.gz)
do
trim_galore --output_dir ~/SARS_CoV_2/clean_data \
    --phred33 --paired \
  --length 75 --quality 30 --stringency 5 \
  ${file%1.*}1.fq.gz ${file%1.*}2.fq.gz
done

In this period, we chose Fastp to filter the raw data. Fastp per-cycle quality profiles, per-cycle base contents, adapter trimming results, and k-mer counts. These values will be merged after all reads are processed, and a reporter will generate reports in HTML and JSON formats. Fastp reports statistical values for pre-filtering and post-filtering data to facilitate comparisons of changes in data quality after filtering is complete.

HTML report

Viral Sequencing Reads Analysis_pipeline - 图2

4. Map sequence

Paired reads were aligned to the 29903nt Wuhan reference genome (NCBI NC_045512.2) using BWA.

##---------------Build index
cd ~/SARS_CoV_2/mapping
mkdir ref
# Download the reference genome from the URL: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta, and rename the fasta file to 'Wuhan-Hu-1.fasta'
bwa index Wuhan-Hu-1.fasta

##---------------Map sequence by bwa-mem
cd ~/SARS_CoV_2/mapping
for file in $(ls ~/SARS_CoV_2/clean_data/*_1.fq.gz)
do
fullname=${file##*/}
INDEX=~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta
bwa mem -t 4 -M $INDEX \
    ${file%_1.*}_1.fq.gz \
  ${file%_1.*}_2.fq.gz \
  > ~/SARS_CoV_2/mapping/${fullname%_1.*}.sam \
  2> ~/SARS_CoV_2/mapping/${fullname%_1.*}.log
done
for file in $(ls ~/SARS_CoV_2/mapping/*.sam)
do
fullname=${file##*/}
samtools view -b -F 12 $file > ${file%.sam*}.bam &&\
samtools sort -@ 4 -l 9 ${file%.sam*}.bam -o ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam &&\
samtools index ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam  ~/SARS_CoV_2/sort/${fullname%.sam*}_sort.bam.bai
done
##---------------Count the number of alignments
cd ~/SARS_CoV_2/sort
for file in $(ls ~/SARS_CoV_2/sort/*.bam)
do
fullname=${file##*/}
samtools flagstat $file > ${fullname%.bam*}_flagstat.txt
done
##---------------Calculate the sequencing depth
for file in $(ls ~/SARS_CoV_2/sort/*.bam)
do
fullname=${file##*/}
samtools depth -d 0 $file > ${fullname%.bam*}_depth.txt
done

5. Assemble genome

SPAdes is an assembly toolkit mainly used for small genomes, so it’s a good choice for assembling viral genomes. SPAdes contains various assembly pipelines. In this step, we chose the pipeline, metaSPAdes. In the next part, we would compare the different effects between the metaSPAdes and others.

##---------------Subsample reads
cd ~/SARS_CoV_2/subsample
for file in $(ls ~/SARS_CoV_2/sort/*.bam)
do
fullname=${file##*/}
sambamba view -h -s 0.001 $file -o ~/SARS_CoV_2/subsample/${fullname%-vero*}-0.001.bam
done
##---------------Sort and build index
for file in $(ls ~/SARS_CoV_2/subsample/*.bam)
do
samtools sort -@ 4 -l 9 $file -o ${file%.bam}_sort.bam &&\
samtools index ${file%.bam}_sort.bam ${file%.bam}_sort.bam.bai
done
##---------------Convert the BAM to FASTQ
cd ~/SARS_CoV_2/fastq
for file in $(ls ~/SARS_CoV_2/subsample/*_sort.bam)
do
fullname=${file##*/}
samtools fastq $file > ~/SARS_CoV_2/fastq/${fullname%_sort.bam*}.fq
done
##---------------Assemble
cd ~/SARS_CoV_2/genome
for file in $(ls ~/SARS_CoV_2/fastq/*.fq)
do
fullname=${file##*/}
spades.py --meta --12 $file -o ~/SARS_CoV_2/genome/${fullname%.fq*}_meta.fasta
done

6. Evaluate assembly

To evaluate the effect of assembly, we used MetaQUAST to compare the assembled genome with the reference genome(NCBI NC_045512.2).

N50

For genome assembly results, we usually evaluate them by N50 and other indicators. For an assembled sequence, the sequences are sorted from the largest to the smallest according to their length, and then the lengths are accumulated starting from the first sequence until the cumulative length exceeds 50% of the total length. At this point, the length of the last contig accumulated is the length of N50. The schematic diagram is as follows:

a. `scaffolds.fasta`/`contigs.fasta`

cbin

python ~/miniconda3/envs/covid/bin/metaquast.py scaffolds.fasta contigs.fasta -r ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta -t 4 -o result_version

Viral Sequencing Reads Analysis_pipeline - 图5
The N50 is not shown in the screenshot, but according to the definition of N50, assembled genomes can be measured by the ratio of the largest alignment to the total aligned length. The higher the ratio, the better. scaffolds.fasta and contigs.fasta have no difference in the assembly results. We can pick any one of them for subsequent analysis.

b. Different pipelines in `MAFFT`

When we assemble a genome using MAFFT, there are three options(—meta/—sc/—iso) we can use.
Viral Sequencing Reads Analysis_pipeline - 图6
When using the metaSPAdes pipeline, the ratio of the largest alignment to the total aligned length is higher.

c. Different subsampling coverage

Viral Sequencing Reads Analysis_pipeline - 图7
Before assembling the genome, we subsampled our reads to get better results. Using different subsampling coverage makes a difference. For the first genome, 0.006 is a better fraction. However, for other genomes, it could be another one.

7. Improve assemblies

Pilon is a software tool that can be used to automatically improve draft assemblies by comparing the raw sequenced data with the assembled genome.

##---------------Build index
cd ~/SARS_CoV_2/Pilon
for i in {1..11}
do
mkdir P3-${i}
bwa index -p P3-${i}/P3-${i}_meta ~/SARS_CoV_2/genome/*${i}*.fasta
done
##---------------Align and sort
for i in {1..11}
do
bwa mem -t 20 ~/SARS_CoV_2/Pilon/P3-${i}/P3-${i}_meta \
  ~/SARS_CoV_2/clean_data/*-${i}-*_1.fq.gz \
  ~/SARS_CoV_2/clean_data/*-${i}-*_2.fq.gz \
  | samtools sort -@ 10 -O bam -o ./P3-${i}/P3-${i}_meta.bam &&\
  samtools index -@ 10 ./P3-${i}/P3-${i}_meta.bam
done
##---------------Mark duplication
for i in {1..11}
do
sambamba markdup -t 20 --hash-table-size=1000000 --overflow-list-size=1000000 ./P3-${i}/P3-${i}_meta.bam ./P3-${i}/P3-${i}_markdup.bam
done
##---------------Filter
for i in {1..11}
do
samtools view -@ 10 -q 30 -b -h ./P3-${i}/P3-${i}_markdup.bam > ./P3-${i}/P3-${i}_filter.bam &&\
samtools index -@ 10 ./P3-${i}/P3-${i}_filter.bam
done
##---------------Improve assemblies
for i in {1..11}
do
java -Xmx16G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
        --genome ./P3-${i}/P3-${i}_meta.fasta \
    --frags ./P3-${i}/P3-${i}_filter.bam \
    --changes --fix all,breaks\
    --output ./P3-${i}/pilon_polished_${i} \
    --vcf &> ./P3-${i}/pilon${i}.log
done

8. Identify mutations

a. Find Variation Using `Pilon`

Input to Pilon consists of the NC_045512.2 genome in FASTA format along with the BAM files of sequencing reads aligned to the NC_045512.2 genome.

##---------------Identify mutations
for i in {1..11}
do
java -Xmx64G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
  --genome ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta \
  --frags ~/SARS_CoV_2/subsample/*${i}*_sort.bam \
  --changes \
  --output ${i}.fasta \
  --fix all,breaks\
done

· View **.changes** File
The .changes file contains a space-delimited record of every change made in the assembly as instructed by the --fix option.

for i in {1..11}
do
    echo ">${i}.fasta.changes" >> sum.txt
  cat ${i}.fasta.changes >>sum.txt
done

>1.fasta.changes
NC_045512.2:11083    NC_045512.2_pilon:11083 G T
NC_045512.2:23607    NC_045512.2_pilon:23607 G A
NC_045512.2:26144    NC_045512.2_pilon:26144 G T
>2.fasta.changes
NC_045512.2:8782    NC_045512.2_pilon:8782 C T
NC_045512.2:28144    NC_045512.2_pilon:28144 T C
>3.fasta.changes
NC_045512.2:16846 NC_045512.2_pilon:16846 G T
>4.fasta.changes
>5.fasta.changes
>6.fasta.changes
NC_045512.2:3885 NC_045512.2_pilon:3885 C A
NC_045512.2:12778 NC_045512.2_pilon:12778 C T
NC_045512.2:13270 NC_045512.2_pilon:13270 T C
NC_045512.2:29449 NC_045512.2_pilon:29449 G T
>7.fasta.changes
NC_045512.2:1405 NC_045512.2_pilon:1405 T C
NC_045512.2:9802 NC_045512.2_pilon:9802 G T
NC_045512.2:19945 NC_045512.2_pilon:19945 A G
NC_045512.2:25267 NC_045512.2_pilon:25267 C T
NC_045512.2:27615 NC_045512.2_pilon:27615 C G
>8.fasta.changes
NC_045512.2:8782 NC_045512.2_pilon:8782 C T
NC_045512.2:18060 NC_045512.2_pilon:18060 C T
NC_045512.2:28144 NC_045512.2_pilon:28144 T C
>9.fasta.changes
NC_045512.2:22303 NC_045512.2_pilon:22303 T G
>10.fasta.changes
NC_045512.2:16114 NC_045512.2_pilon:16114 C T
NC_045512.2:22205 NC_045512.2_pilon:22205 G C
>11.fasta.changes
NC_045512.2:22303 NC_045512.2_pilon:22303 T G
NC_045512.2:27775 NC_045512.2_pilon:27775 T C
NC_045512.2:27776 NC_045512.2_pilon:27776 T G
NC_045512.2:27777 NC_045512.2_pilon:27777 G A

Through the changes file, we can find that there are 3 mutations in the first genome. More detailed information is stored in the 1.fasta.vcf file.

b. Find Variation Using `MAFFT`

cat ~/SARS_CoV_2/Pilon/P3-*/pilon_polished*.fasta ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta > sequences12.fasta
mafft --reorder --adjustdirection --auto --clustalout --thread 4 ~/SARS_CoV_2/mafft/sequences12.fasta > mut.aln

9. ML phylogenetic tree

a. Multiple sequence alignment

The assembled virus genomes was aligned with the previous genome using MAFFT. The previous genome sequences were obtained from the NCBI SARS-CoV-2 Resources(https://www.ncbi.nlm.nih.gov/sars-cov-2/) in fasta format.

cd ~/SARS_CoV_2/mafft
# Download 100 SARS-CoV-2 genome sequences randomly from SARS-CoV-2 Database.
cat sequences.fasta ~/SARS_CoV_2/Pilon/P3-*/pilon_polished*.fasta ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta > sequences112.fasta
##---------------Multiple alignment
mafft --reorder --adjustdirection --auto --clustalout --thread 4 ./sequences112.fasta > output.aln

Viral Sequencing Reads Analysis_pipeline - 图9

b. Construct an evolutionary tree

After alignment using MAFFT, the output.aln file can be used to construct an evolutionary tree using the GTR+I+G model in iqtree. Using their evolutionary tree, we can determine the chain of transmission in these 11 patients.

iqtree -s./output.aln -m GTR+I+G

c. Visualize `.treefile` by tree viewer programs iTOL.

According to the evolutionary tree, the virus with genome ID MT066156.1 is most closely related to the 11 strains. MT066156.1 was isolated from the first case of COVID-19 in Italy. The patient came to Italy as tourist from Hubei province. They’re further away from the virus with genome ID NC045512.2. The 11 cases were indirectly infected by those people who linked to Huanan Seafood Market.

The Sequence File

Workflow

1. Install software

———————-Create a environment

Activate covid environment