Find Variation Using Pilon

The input genome is the NC_045512.2 ref genome and the BAM file is a subset of alignment results with ref genome.

  1. java -Xmx64G -jar ~/miniconda3/envs/covid/share/pilon-1.23-3/pilon-1.23.jar \
  2. --genome ~/SARS_CoV_2/mapping/ref/Wuhan-Hu-1.fasta \
  3. --frags ~/SARS_CoV_2/subsample/P3-VERO-P3-${i}-vero_L4_sort_0.1.bam \
  4. --threads 32 \
  5. --changes \
  6. --output ${i}.fasta \
  7. --fix all,breaks\
  8. --vcf &> pilon${i}.log

· View .changes File

The .changes file contains a space-delimited record of every change made in the assembly as instructed by the --fix option.

  1. for i in {1..11}; do
  2. {
  3. echo ">${i}.fasta.changes" >> sum.txt
  4. cat ${i}.fasta.changes >>sum.txt
  5. }
  6. done
  1. >1.fasta.changes
  2. NC_045512.2:11083 NC_045512.2_pilon:11083 G T
  3. NC_045512.2:23607 NC_045512.2_pilon:23607 G A
  4. NC_045512.2:26144 NC_045512.2_pilon:26144 G T
  5. >2.fasta.changes
  6. NC_045512.2:8782 NC_045512.2_pilon:8782 C T
  7. NC_045512.2:28144 NC_045512.2_pilon:28144 T C
  8. >3.fasta.changes
  9. NC_045512.2:16846 NC_045512.2_pilon:16846 G T
  10. >4.fasta.changes
  11. >5.fasta.changes
  12. >6.fasta.changes
  13. NC_045512.2:3885 NC_045512.2_pilon:3885 C A
  14. NC_045512.2:12778 NC_045512.2_pilon:12778 C T
  15. NC_045512.2:13270 NC_045512.2_pilon:13270 T C
  16. NC_045512.2:29449 NC_045512.2_pilon:29449 G T
  17. >7.fasta.changes
  18. NC_045512.2:1405 NC_045512.2_pilon:1405 T C
  19. NC_045512.2:9802 NC_045512.2_pilon:9802 G T
  20. NC_045512.2:19945 NC_045512.2_pilon:19945 A G
  21. NC_045512.2:25267 NC_045512.2_pilon:25267 C T
  22. NC_045512.2:27615 NC_045512.2_pilon:27615 C G
  23. >8.fasta.changes
  24. NC_045512.2:8782 NC_045512.2_pilon:8782 C T
  25. NC_045512.2:18060 NC_045512.2_pilon:18060 C T
  26. NC_045512.2:28144 NC_045512.2_pilon:28144 T C
  27. >9.fasta.changes
  28. NC_045512.2:22303 NC_045512.2_pilon:22303 T G
  29. >10.fasta.changes
  30. NC_045512.2:16114 NC_045512.2_pilon:16114 C T
  31. NC_045512.2:22205 NC_045512.2_pilon:22205 G C
  32. >11.fasta.changes
  33. NC_045512.2:22303 NC_045512.2_pilon:22303 T G
  34. NC_045512.2:27775 NC_045512.2_pilon:27775 T C
  35. NC_045512.2:27776 NC_045512.2_pilon:27776 T G
  36. NC_045512.2:27777 NC_045512.2_pilon:27777 G A

· view .vcf file

The Variant Call Format (VCF) specifies the format of a text file used for storing gene sequence variations. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
Official document: https://samtools.github.io/hts-specs/VCFv4.2.pdf

Example:

  1. $ cat 1.fasta.vcf | head -n 50
  2. ##fileformat=VCFv4.1
  3. ##fileDate=20210428
  4. ##source="Pilon version 1.23 Mon Nov 26 16:04:05 2018 -0500"
  5. ##PILON="--genome /public/home/ykk/SARS_CoV_2/Pilon/wuhan/Wuhan-Hu-1.fasta --frags /public/home/ykk/SARS_CoV_2/sortedbam/P3-VERO-P3-1-vero_L4_sort.bam --changes --output 1.fasta --threads 32 --fix all,breaks --vcf"
  6. ##reference=file:/public/home/ykk/SARS_CoV_2/Pilon/wuhan/Wuhan-Hu-1.fasta
  7. ##contig=<ID=NC_045512.2,length=29903>
  8. ##FILTER=<ID=LowCov,Description="Low Coverage of good reads at location">
  9. ##FILTER=<ID=Amb,Description="Ambiguous evidence in haploid genome">
  10. ##FILTER=<ID=Del,Description="This base is in a deletion or change event from another record">
  11. ##INFO=<ID=DP,Number=1,Type=Integer,Description="Valid read depth; some reads may have been filtered">
  12. ##INFO=<ID=TD,Number=1,Type=Integer,Description="Total read depth including bad pairs">
  13. ##INFO=<ID=PC,Number=1,Type=Integer,Description="Physical coverage of valid inserts across locus">
  14. ##INFO=<ID=BQ,Number=1,Type=Integer,Description="Mean base quality at locus">
  15. ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Mean read mapping quality at locus">
  16. ##INFO=<ID=QD,Number=1,Type=Integer,Description="Variant confidence/quality by depth">
  17. ##INFO=<ID=BC,Number=4,Type=Integer,Description="Count of As, Cs, Gs, Ts at locus">
  18. ##INFO=<ID=QP,Number=4,Type=Integer,Description="Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus">
  19. ##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of reads with insertion here">
  20. ##INFO=<ID=DC,Number=1,Type=Integer,Description="Number of reads with deletion here">
  21. ##INFO=<ID=XC,Number=1,Type=Integer,Description="Number of reads clipped here">
  22. ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
  23. ##INFO=<ID=AF,Number=A,Type=Float,Description="Fraction of evidence in support of alternate allele(s)">
  24. ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
  25. ##INFO=<ID=SVLEN,Number=.,Type=String,Description="Difference in length between REF and ALT alleles">
  26. ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
  27. ##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise change from local reassembly (ALT contains Ns)">
  28. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  29. ##FORMAT=<ID=AD,Number=.,Type=String,Description="Allelic depths for the ref and alt alleles in the order listed">
  30. ##FORMAT=<ID=DP,Number=1,Type=String,Description="Approximate read depth; some reads may have been filtered">
  31. ##ALT=<ID=DUP,Description="Possible segmental duplication">
  32. #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
  33. NC_045512.2 1 . A . 359 LowCov DP=10;TD=2439;BQ=36;MQ=34;QD=35;BC=10,0,0,0;QP=100,0,0,0;PC=11;IC=0;DC=0;XC=29;AC=0;AF=0.00 GT0/0
  34. NC_045512.2 2 . T . 370 LowCov DP=10;TD=2473;BQ=37;MQ=34;QD=37;BC=0,0,0,10;QP=0,0,0,100;PC=11;IC=0;DC=0;XC=57;AC=0;AF=0.00 GT0/0
  35. …………
  36. NC_045512.2 11083 . G T 11822391 PASS DP=453118;TD=466945;BQ=36;MQ=60;QD=30;BC=178,42,363,385402;QP=0,0,0,100;PC=410199;IC=10;DC=67133;XC=3036;AC=2;AF=1.00 GT1/1
  37. …………
  38. NC_045512.2 23607 . G A 78429249 PASS DP=2706046;TD=2821970;BQ=35;MQ=60;QD=28;BC=2582038,3758,118877,1272;QP=95,0,4,0;PC=2977892;IC=1;DC=101;XC=29688;AC=2;AF=0.83 GT1/1
  39. …………
  40. NC_045512.2 26144 . G T 56706266 PASS DP=1578759;TD=1610205;BQ=36;MQ=60;QD=35;BC=1041,234,2028,1575401;QP=0,0,0,100;PC=1396070;IC=82;DC=55;XC=3296;AC=2;AF=0.00 GT1/1
  41. …………

GT1/1 indicates that the site in the sample is a mutation, and mutation type is the same as the column ALT‘s base type. The 1.fasta.vcf file above shows that there are 3 mutations in the first genome, which is consistent with the content of the 1.fasta.changes file.

4. Visualize vcf file.

  1. git clone https://github.com/BimberLab/DISCVRSeq/
  2. git clone https://github.com/broadinstitute/picard.git
  3. conda install gatk4 -y
  • Build index for ref genome.

    1. samtools faidx Wuhan-Hu-1.fasta
    2. java -jar ~/software/picard.jar CreateSequenceDictionary R=Wuhan-Hu-1.fasta O=Wuhan-Hu-1.dict
  • Build index for vcf file using GATK4.

    1. java -jar ~/miniconda3/envs/covid/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar IndexFeatureFile -I ~/SARS_CoV_2/Pilon/1.fasta.vcf
  • Visualize vcf file.

    1. java -jar ~/software/DISCVRSeq-1.26.jar VariantQC -O 1.html -R Wuhan-Hu-1.fasta -V ~/SARS_CoV_2/Pilon/1.fasta.vcf

Identify mutations - 图1
Identify mutations - 图2
The report demonstrates that there are three mutation sites in the first genome. They are all point mutation.

  • Identify mutations in the viral genomes by comparing to the **NC_045512.2** ref genome.

image.png