1. FASTQ files

FASTQ files are usually gzip-compressed and saved with the extension _*.fq.gz_.

  • View the FASTQ file

    $ zless -S P3-VERO-P3-1-vero_L4_1.fq.gz | head -n 4
    @A00821:275:HWMMWDSXX:4:1101:1298:1016 1:N:0:ATTACTCG+TAGATCGC
    GCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTT
    +
    FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • Each entry in a FASTQ file consists of four lines:

    • A sequence identifier with information about the sequencing run and the cluster. This line always starts with @.

      ATTACTCG+TAGATCGC # index sequence
    • The sequence (the base calls; A, C, T, G and N).

    • A separator, which is simply a plus (+) sign.
    • The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
      • Relationship Between Sequencing Quality Score and Base Call Accuracy.

Filtering data (Fastp) - Figure 1

Quality Score    Probability of Incorrect Base Call    Inferred Base Call Accuracy
10 (Q10)         1 in 10                               90%
20 (Q20)         1 in 100                              99%
30 (Q30)         1 in 1000                             99.9%
      • **Higher Q scores** indicate a smaller probability of error.
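The table follows from the Phred definition Q = -10 · log10(P), where P is the probability that the base call is wrong. Under Phred+33, the score is the ASCII code of the quality character minus 33. The following is a minimal sketch of that decoding; the helper names `phred_score` and `error_prob` are illustrative and not part of fastp or any other tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Decode one Phred+33 quality character into its numeric score.
phred_score() {
    local ascii
    # printf with a leading single quote yields the character's ASCII code.
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Convert a Phred score Q into the error probability P = 10^(-Q/10).
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# The two characters that occur in the quality line shown above.
for c in F : ; do
    q=$(phred_score "$c")
    echo "$c -> Q$q, P(error) = $(error_prob "$q")"
done
```

In the example read, `F` decodes to Q37 and `:` to Q25, so nearly every base call in that read is above the Q30 row of the table.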

2. Fastp

Fastp is a tool designed to provide fast, all-in-one preprocessing for FASTQ files.

Fastp performs comprehensive quality profiling of the data both before and after filtering, and it removes low-quality reads and bases. Most of fastp's functions need very few parameters. Some features are turned on by default and can be turned off with parameters.

2.1 Common Options:

    -i <read1 input file name>
    -o <read1 output file name>
    -I <read2 input file name>
    -O <read2 output file name>
    # Fastp supports the input and output of *.gz.
    -h <html format report name>
    -j <json format report name>
    -w <int> # --thread

Fastp supports both single-end (SE) and paired-end (PE) input/output.

    # paired-end
    fastp -i ~/SARS_CoV_2/raw_data/P3-VERO-P3-1-vero_L4_1.fq.gz \
        -o ~/SARS_CoV_2/clean_data/P3-VERO-P3-1-vero_L4_1.fq.gz \
        -I ~/SARS_CoV_2/raw_data/P3-VERO-P3-1-vero_L4_2.fq.gz \
        -O ~/SARS_CoV_2/clean_data/P3-VERO-P3-1-vero_L4_2.fq.gz \
        -w 4 \
        --html 1.html \
        --json 1.json
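For single-end data, only `-i` and `-o` are given. The sketch below is an illustration under assumptions: the read-1 file merely stands in for a single-end library, and the report names `se.html` and `se.json` are made up. The command is printed first and executed only if fastp is installed and the input file exists.

```shell
#!/usr/bin/env bash
# Single-end sketch: without -I/-O, fastp treats the input as SE data.
# Paths and report names are illustrative assumptions.
sample=P3-VERO-P3-1-vero_L4_1.fq.gz
in="$HOME/SARS_CoV_2/raw_data/$sample"
out="$HOME/SARS_CoV_2/clean_data/$sample"

# Store the command in the positional parameters so it can be inspected.
set -- fastp -i "$in" -o "$out" -w 4 --html se.html --json se.json

# Print the command for review.
echo "$@"

# Run it only when fastp is available and the input file is present.
if command -v fastp >/dev/null 2>&1 && [ -f "$in" ]; then
    "$@"
fi
```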

2.2 Result

  • .fq.gz file

    • Raw data:

      @A00821:275:HWMMWDSXX:4:1101:1298:1016 1:N:0:ATTACTCG+TAGATCGC
      GCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTT
      +
      FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @A00821:275:HWMMWDSXX:4:1101:1479:1016 1:N:0:ATTACTCG+TAGATCGC
      GCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCAGATCGGAAGAGCA
      +
      FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFF
    • Clean data:

      @A00821:275:HWMMWDSXX:4:1101:1298:1016 1:N:0:ATTACTCG+TAGATCGC
      GCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTT
      +
      FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @A00821:275:HWMMWDSXX:4:1101:1479:1016 1:N:0:ATTACTCG+TAGATCGC
      GCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCC
      +
      FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFFFFFFFF:FFFFFFFFFFFFF

      Fastp trimmed the adapter sequence, AGATCGGAAGAGCA, from the 3' end of the second read.
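The trimming can be checked directly on the second read from the raw and clean data. This is a minimal sketch that uses only shell parameter expansion; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Second read, copied from the raw and clean data in this section.
raw=GCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCAGATCGGAAGAGCA
clean=GCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCC
adapter=AGATCGGAAGAGCA

# ${raw%"$adapter"} removes the adapter only if it is a suffix of the raw read.
trimmed=${raw%"$adapter"}

echo "raw length:   ${#raw}"
echo "clean length: ${#clean}"

if [ "$trimmed" = "$clean" ]; then
    echo "clean read = raw read minus the 3' adapter"
else
    echo "clean read differs from the adapter-stripped raw read"
fi
```

If the read had been left untouched, `trimmed` would be shorter than `clean` and the comparison would fail.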

  • **.html** report

    • The summary shows statistics before and after filtering.

Filtering data (Fastp) - Figure 2

    • Adapters

Adapter sequences can be detected automatically for both PE and SE data.
Filtering data (Fastp) - Figure 3
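For paired-end data, the fastp documentation describes adapter trimming by overlap analysis as the default and offers `--detect_adapter_for_pe` to additionally detect adapters from the read sequences. The sketch below adds that flag to a paired-end run. The paths are assumptions modelled on the example above, and the command is executed only if fastp is installed and both inputs exist.

```shell
#!/usr/bin/env bash
# Paired-end sketch with sequence-based adapter detection enabled.
# Directory layout and file names are illustrative assumptions.
raw_dir="$HOME/SARS_CoV_2/raw_data"
clean_dir="$HOME/SARS_CoV_2/clean_data"
prefix=P3-VERO-P3-1-vero_L4

set -- fastp \
    -i "$raw_dir/${prefix}_1.fq.gz" -I "$raw_dir/${prefix}_2.fq.gz" \
    -o "$clean_dir/${prefix}_1.fq.gz" -O "$clean_dir/${prefix}_2.fq.gz" \
    --detect_adapter_for_pe

# Print the command for review.
echo "$@"

# Run it only when fastp is available and both input files are present.
if command -v fastp >/dev/null 2>&1 \
    && [ -f "$raw_dir/${prefix}_1.fq.gz" ] \
    && [ -f "$raw_dir/${prefix}_2.fq.gz" ]; then
    "$@"
fi
```

When the adapter is already known, fastp also accepts it explicitly through `--adapter_sequence` (read 1) and `--adapter_sequence_r2` (read 2).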

3. More Information:

https://github.com/OpenGene/fastp
https://blog.csdn.net/twocanis/article/details/109681242