Samtools is a set of utilities that manipulate alignments in the SAM (Sequence Alignment/Map), BAM, and CRAM formats. It converts between the formats, does sorting, merging and indexing, and can retrieve reads in any regions swiftly.

1. Commands & options:

1.1 samtools view

View and convert SAM/BAM/CRAM files.

  1. -b # Output in the BAM format.
  2. -h # Include the header in the output.
  3. -S # Input is in SAM format.
  4. -F <INT> # Do not output alignments with any bits set in INT present in the FLAG field.
  5. -@ <INT> # Number of threads.
  6. -c # count reads and print the total number

1.2 samtools sort

The file size can be compressed after sorting. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.

  1. -l <INT>
  2. # Set the desired compression level for the final output file, ranging from 0 (uncompressed) or 1 (fastest but minimal compression) to 9 (best compression but slowest to write), similarly to gzip(1)'s compression level setting.
  3. -n # Sort by read names (i.e., the QNAME field) rather than by chromosomal coordinates.
  4. -m <INT> # Set maximum memory per thread; suffix K/M/G recognized [768M].
  5. -t <TAG> # Sort first by the value in the alignment tag TAG, then by position or name (if also using -n).
  6. -o <FILE> # output filename
  7. -O <FORMAT> # output file format

The -n and -t options are incompatible with samtools index.

1.3 smtools index

Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access.

1.4 samtools flagstat

Counts the number of alignments for each FLAG type.

1.5 samtools depth

Compute the read depth at each position or region.

  1. -d/-m <int> # maximum coverage depth [default = 8000]. If 0, depth is set to the maximum integer value, effectively removing any depth limit.

1.6 samtools fastq/fasta

Convert a SAM/BAM/CRAM file to FASTQ or FASTA.

1.7 samtools markdup

Mark duplicate alignments in a coordinate sorted file.

  1. -r # Remove duplicate reads
  2. -l <INT> # Expected maximum read length of INT bases. [300]

2. Pipeline

2.1

  1. $ samtools view -b -F 12 ~/SARS_CoV_2/mapping/P3-VERO-P3-1-vero_L4.sam > ~/SARS_CoV_2/mapping/P3-VERO-P3-1-vero_L4.bam

-F 12: Don’t output unmapped reads. 4+8=12. View the meaning of the flag value: SAM/BAM.
Samtools - 图1

2.2

  1. $ samtools sort -@ 4 -l 9 ~/SARS_CoV_2/mapping/P3-VERO-P3-1-vero_L4.bam -o ~/SARS_CoV_2/sort/P3-VERO-P3-1-vero_L4_sort.bam
  2. $ samtools index ~/SARS_CoV_2/sort/P3-VERO-P3-1-vero_L4_sort.bam ~/SARS_CoV_2/sortedbam/P3-VERO-P3-1-vero_L4_sort.bam.bai

Samtools - 图2

2.3

  1. $ samtools flagstat ~/SARS_CoV_2/sort/P3-VERO-P3-1-vero_L4_sort.bam > flagstat-1.txt
  1. 473417027 + 0 in total (QC-passed reads + QC-failed reads)
  2. 3432805 + 0 secondary
  3. 0 + 0 supplementary
  4. 0 + 0 duplicates
  5. 473417027 + 0 mapped (100.00% : N/A)
  6. 469984222 + 0 paired in sequencing
  7. 234992111 + 0 read1
  8. 234992111 + 0 read2
  9. 462394014 + 0 properly paired (98.39% : N/A)
  10. 469984222 + 0 with itself and mate mapped
  11. 0 + 0 singletons (0.00% : N/A)
  12. 0 + 0 with mate mapped to a different chr
  13. 0 + 0 with mate mapped to a different chr (mapQ>=5)

2.4

  1. $ samtools depth -d 0 ~/SARS_CoV_2/sort/P3-VERO-P3-1-vero_L4_sort.bam > depth3-1.txt
  1. NC_045512.2 101 8004
  2. NC_045512.2 102 8005
  3. NC_045512.2 103 7996
  4. NC_045512.2 104 7998
  5. NC_045512.2 105 7999
  6. NC_045512.2 106 8000
  7. NC_045512.2 107 8001
  8. NC_045512.2 108 8002
  9. NC_045512.2 109 8003
  10. NC_045512.2 110 8004
  11. NC_045512.2 111 8004
  12. NC_045512.2 112 8004
  13. NC_045512.2 113 8003
  14. NC_045512.2 114 7994
  15. NC_045512.2 115 7993
  16. NC_045512.2 116 7994
  17. NC_045512.2 117 7992
  18. NC_045512.2 118 7992
  19. NC_045512.2 119 7992
  20. NC_045512.2 120 7975
  • If -d options is not be set, the result as follows:

    1. $ samtools depth ~/SARS_CoV_2/sort/P3-VERO-P3-1-vero_L4_sort.bam > depth1-1.txt
    1. NC_045512.2 101 1706050
    2. NC_045512.2 102 1708068
    3. NC_045512.2 103 1710832
    4. NC_045512.2 104 1711865
    5. NC_045512.2 105 1716288
    6. NC_045512.2 106 1724816
    7. NC_045512.2 107 1729485
    8. NC_045512.2 108 1737588
    9. NC_045512.2 109 1775730
    10. NC_045512.2 110 1778939
    11. NC_045512.2 111 1780036
    12. NC_045512.2 112 1780479
    13. NC_045512.2 113 1787108
    14. NC_045512.2 114 1793678
    15. NC_045512.2 115 1795985
    16. NC_045512.2 116 1797876
    17. NC_045512.2 117 1803406
    18. NC_045512.2 118 1806767
    19. NC_045512.2 119 1818003
    20. NC_045512.2 120 1816881
  • Calculate the average sequencing depth.

    1. $ samtools view P3-VERO-P3-1-vero_L4_sort.bam | awk '{print $10}' | grep -o '[ACGTN]' | wc -l
    2. # Divide the result by the base number of the reference sequence.

    More informations:

    http://www.htslib.org/doc/samtools.html