Pilon - Identify mutations - 《linux assignment》

Find Variation Using Pilon
- · View .changes File
- · view .vcf file
  - Example:
4. Visualize vcf file.

Find Variation Using `Pilon`

The input genome is the NC_045512.2 ref genome and the BAM file is a subset of alignment results with ref genome.

java -Xmx64G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
    --genome ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta \
    --frags ~/SARS_CoV_2/subsample/P3-VERO-P3-${i}-vero_L4_sort_0.1.bam \
    --threads 32 \
    --changes \
    --output ${i}.fasta \
    --fix all,breaks\
    --vcf &> pilon${i}.log

· View `.changes` File

The .changes file contains a space-delimited record of every change made in the assembly as instructed by the --fix option.

for i in {1..11}; do
    {
        echo ">${i}.fasta.changes" >> sum.txt
        cat ${i}.fasta.changes >>sum.txt
    }
done

>1.fasta.changes
NC_045512.2:11083 NC_045512.2_pilon:11083 G T
NC_045512.2:23607 NC_045512.2_pilon:23607 G A
NC_045512.2:26144 NC_045512.2_pilon:26144 G T
>2.fasta.changes
NC_045512.2:8782 NC_045512.2_pilon:8782 C T
NC_045512.2:28144 NC_045512.2_pilon:28144 T C
>3.fasta.changes
NC_045512.2:16846 NC_045512.2_pilon:16846 G T
>4.fasta.changes
>5.fasta.changes
>6.fasta.changes
NC_045512.2:3885 NC_045512.2_pilon:3885 C A
NC_045512.2:12778 NC_045512.2_pilon:12778 C T
NC_045512.2:13270 NC_045512.2_pilon:13270 T C
NC_045512.2:29449 NC_045512.2_pilon:29449 G T
>7.fasta.changes
NC_045512.2:1405 NC_045512.2_pilon:1405 T C
NC_045512.2:9802 NC_045512.2_pilon:9802 G T
NC_045512.2:19945 NC_045512.2_pilon:19945 A G
NC_045512.2:25267 NC_045512.2_pilon:25267 C T
NC_045512.2:27615 NC_045512.2_pilon:27615 C G
>8.fasta.changes
NC_045512.2:8782 NC_045512.2_pilon:8782 C T
NC_045512.2:18060 NC_045512.2_pilon:18060 C T
NC_045512.2:28144 NC_045512.2_pilon:28144 T C
>9.fasta.changes
NC_045512.2:22303 NC_045512.2_pilon:22303 T G
>10.fasta.changes
NC_045512.2:16114 NC_045512.2_pilon:16114 C T
NC_045512.2:22205 NC_045512.2_pilon:22205 G C
>11.fasta.changes
NC_045512.2:22303 NC_045512.2_pilon:22303 T G
NC_045512.2:27775 NC_045512.2_pilon:27775 T C
NC_045512.2:27776 NC_045512.2_pilon:27776 T G
NC_045512.2:27777 NC_045512.2_pilon:27777 G A

· view `.vcf` file

The Variant Call Format (VCF) specifies the format of a text file used for storing gene sequence variations. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
Official document: https://samtools.github.io/hts-specs/VCFv4.2.pdf

Example:

$ cat 1.fasta.vcf | head -n 50
##fileformat=VCFv4.1
##fileDate=20210428
##source="Pilon version 1.23 Mon Nov 26 16:04:05 2018 -0500"
##PILON="--genome /public/home/ykk/SARS_CoV_2/Pilon/wuhan/Wuhan-Hu-1.fasta --frags /public/home/ykk/SARS_CoV_2/sortedbam/P3-VERO-P3-1-vero_L4_sort.bam --changes --output 1.fasta --threads 32 --fix all,breaks --vcf"
##reference=file:/public/home/ykk/SARS_CoV_2/Pilon/wuhan/Wuhan-Hu-1.fasta
##contig=<ID=NC_045512.2,length=29903>
##FILTER=<ID=LowCov,Description="Low Coverage of good reads at location">
##FILTER=<ID=Amb,Description="Ambiguous evidence in haploid genome">
##FILTER=<ID=Del,Description="This base is in a deletion or change event from another record">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Valid read depth; some reads may have been filtered">
##INFO=<ID=TD,Number=1,Type=Integer,Description="Total read depth including bad pairs">
##INFO=<ID=PC,Number=1,Type=Integer,Description="Physical coverage of valid inserts across locus">
##INFO=<ID=BQ,Number=1,Type=Integer,Description="Mean base quality at locus">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Mean read mapping quality at locus">
##INFO=<ID=QD,Number=1,Type=Integer,Description="Variant confidence/quality by depth">
##INFO=<ID=BC,Number=4,Type=Integer,Description="Count of As, Cs, Gs, Ts at locus">
##INFO=<ID=QP,Number=4,Type=Integer,Description="Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of reads with insertion here">
##INFO=<ID=DC,Number=1,Type=Integer,Description="Number of reads with deletion here">
##INFO=<ID=XC,Number=1,Type=Integer,Description="Number of reads clipped here">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Fraction of evidence in support of alternate allele(s)">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=String,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise change from local reassembly (ALT contains Ns)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=String,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Approximate read depth; some reads may have been filtered">
##ALT=<ID=DUP,Description="Possible segmental duplication">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    SAMPLE
NC_045512.2    1    .    A    .    359    LowCov    DP=10;TD=2439;BQ=36;MQ=34;QD=35;BC=10,0,0,0;QP=100,0,0,0;PC=11;IC=0;DC=0;XC=29;AC=0;AF=0.00    GT0/0
NC_045512.2    2    .    T    .    370    LowCov    DP=10;TD=2473;BQ=37;MQ=34;QD=37;BC=0,0,0,10;QP=0,0,0,100;PC=11;IC=0;DC=0;XC=57;AC=0;AF=0.00    GT0/0
…………
NC_045512.2    11083    .    G    T    11822391    PASS    DP=453118;TD=466945;BQ=36;MQ=60;QD=30;BC=178,42,363,385402;QP=0,0,0,100;PC=410199;IC=10;DC=67133;XC=3036;AC=2;AF=1.00    GT1/1
…………
NC_045512.2    23607    .    G    A    78429249    PASS    DP=2706046;TD=2821970;BQ=35;MQ=60;QD=28;BC=2582038,3758,118877,1272;QP=95,0,4,0;PC=2977892;IC=1;DC=101;XC=29688;AC=2;AF=0.83    GT1/1
…………
NC_045512.2    26144    .    G    T    56706266    PASS    DP=1578759;TD=1610205;BQ=36;MQ=60;QD=35;BC=1041,234,2028,1575401;QP=0,0,0,100;PC=1396070;IC=82;DC=55;XC=3296;AC=2;AF=0.00    GT1/1
…………

GT1/1 indicates that the site in the sample is a mutation, and mutation type is the same as the column ALT‘s base type. The 1.fasta.vcf file above shows that there are 3 mutations in the first genome, which is consistent with the content of the 1.fasta.changes file.

4. Visualize `vcf` file.

Install DISCVR-seq, Picard and gatk4 toolkit.

https://github.com/broadinstitute/picard

git clone https://github.com/BimberLab/DISCVRSeq/
git clone https://github.com/broadinstitute/picard.git
conda install gatk4 -y

Build index for ref genome.

samtools faidx Wuhan-Hu-1.fasta
java -jar ~/software/picard.jar CreateSequenceDictionary R=Wuhan-Hu-1.fasta O=Wuhan-Hu-1.dict

Build index for vcf file using GATK4.

java -jar ~/miniconda3/envs/covid/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar IndexFeatureFile -I ~/SARS_CoV_2/Pilon/1.fasta.vcf

Visualize vcf file.

java -jar ~/software/DISCVRSeq-1.26.jar VariantQC -O 1.html -R Wuhan-Hu-1.fasta -V ~/SARS_CoV_2/Pilon/1.fasta.vcf

Identify mutations - 图1
Identify mutations - 图2
The report demonstrates that there are three mutation sites in the first genome. They are all point mutation.

Identify mutations in the viral genomes by comparing to the **NC_045512.2** ref genome.

Identify mutations

Find Variation Using Pilon

· View .changes File

· view .vcf file

Example:

4. Visualize vcf file.

Find Variation Using `Pilon`

· View `.changes` File

· view `.vcf` file

4. Visualize `vcf` file.