05_Sequencing_and_FastQ


By Yan Li

PhD in Bioinformatics, University of Liverpool

Workflow


graph TB
    A(DNA extraction)
    A --> B
    B(Sequencing)
    B --> C
    C(Trimming and QC)
    C --> D
    D(Assembly)
    D --> E
    E(Annotation or Other analysis)

Sequencing: an overview


Gene Sequencing Development | Key Players | Key Technology | Key Product | Product Release Year
Sanger Sequencing | ABI | Chain Termination Method | ABI 3730 | 1987
Next Generation Sequencing (NGS) | Illumina | Sequencing by Synthesis | HiSeq, MiSeq | 2006
Third Generation Sequencing | Pacific Biosciences | SMRT Technology | PacBio RS, PacBio RS II | 2013
Third Generation Sequencing | Oxford Nanopore Technologies | Nanopore Technology | MinION | 2014


Sequencing

Sanger Sequencing


  • Fred Sanger
  • 1977
  • Uses extension-terminating dideoxynucleotides
  • Then
    • 1st human genome
    • 13 years (1990 - 2003)
    • $2.7 billion

Illumina


  • Pro
    • High throughput, low cost
  • Con
    • Short reads make it hard to reconstruct complex genome features (e.g. repeats, low-coverage regions, structural variation) during assembly; paired-end reads with a known insert size partially overcome this
    • Long run times
    • Expensive infrastructure

PacBio


  • Long reads, so better assembly
  • Lower first-pass accuracy
  • Faster because sequencing in real time
  • Directly detect base modifications

Fastq file


@IL7_1788:5:1:59:769/1
GTGGTCAGTGATTTGCAGGAGGGCACCGGGCCCGTAGATTGCGGCGGCTGGTTAGTGGATGTGTGCGATGCGTTAACCGATCACGCCAGTGAATTTATTGA
+
GGAGAGGG<GGIGIIGIIGGAGGGGGGGGG<AGGGGGGGGGGGGGGGGGGIGGGGGGG<GGGGGIIG<<GAG.AAGGIIIIIGGGAGGGGIGGAGGGGIAG
@IL7_1788:5:1:150:908/18
CCACGCCACAGACCGCTATCAGTCGTCCTTCGCGTATCGCACCCTTAATGTCTTTCATCAGCTGCTTATGGTGGGCAGTTTCATAATACCCGGCCTGTTCA
+
GGGGGGGIIIIIIIGIIGGGIIIGGGGGGGGGGGIGIGGGIIIIIIGGIIIIIIGGGGGIIGIGIIIIGIGGIIIGIIIIIIIGGGGIGGGGGIGGGGGGI
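
Each record in the example above spans four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The following is a minimal sketch of reading that layout in Python. The function names `read_fastq` and `phred_scores` are hypothetical, the record in the example is made up, and the Phred offset of 33 is an assumption; older Illumina pipelines used an offset of 64, so check which one your data uses.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) ends the iteration
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Tiny made-up record for illustration, not taken from a real run
example = io.StringIO("@read1/1\nGATTACA\n+\nIIIG<.A\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, phred_scores(qual))
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base call is wrong.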

Sequence header

Sequence header | Meaning
@IL7_1788 | instrument name (unique)
5 | flowcell lane
1 | tile number within flow cell
59 | x-coordinate of cluster within tile
769 | y-coordinate of cluster within tile
/1 | member of a pair (/1 or /2)
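
The header fields listed above can be pulled apart with plain string splitting. This is a sketch only: `ReadHeader` and `parse_header` are hypothetical names, and the code assumes the colon-separated layout with a trailing `/1` or `/2` shown in this example. Newer Illumina headers use a different layout and would need a different parser.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    pair: int


def parse_header(header: str) -> ReadHeader:
    """Split an '@instrument:lane:tile:x:y/pair' header into its fields."""
    # Everything after the last '/' is the pair member (1 or 2)
    name, _, pair = header.lstrip("@").rpartition("/")
    instrument, lane, tile, x, y = name.split(":")
    return ReadHeader(instrument, int(lane), int(tile), int(x), int(y), int(pair))


parsed = parse_header("@IL7_1788:5:1:59:769/1")
print(parsed)
```

A header that does not match this layout raises a `ValueError`, which is usually preferable to silently returning wrong coordinates.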

Trimming

  • Remove all adaptors, sequencing primer sites and indices
  • Quality-based trimming: cut low-quality bases from the reads
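
The two trimming steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm used by trimmomatic or seqtk: the adapter match is exact (real tools tolerate mismatches and partial overlaps), the quality step only trims the 3' end, and the adapter string, the threshold of 20 and the Phred offset of 33 are all assumptions chosen for the example.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter (hypothetical helper)."""
    pos = sequence.find(adapter)
    if pos == -1:
        return sequence, quality  # adapter not found, read unchanged
    return sequence[:pos], quality[:pos]


def trim_low_quality_tail(sequence, quality, min_phred=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# Made-up read; the adapter string "AGATCGG" is illustrative only
seq, qual = "ACGTACGTAGATCGG", "IIIIII##IIIIIII"
seq, qual = trim_adapter(seq, qual, "AGATCGG")
seq, qual = trim_low_quality_tail(seq, qual)
print(seq, qual)
```

Adapter removal runs first here so that the quality step sees the real end of the insert. The sequence and quality strings are always cut at the same position, which keeps the record valid FASTQ.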

https://sg.idtdna.com/pages/products/next-generation-sequencing/adapters

Workshop


We will do

  • View the raw reads file
  • Trim the raw reads: trimmomatic and seqtk
  • Quality assessment: fastqc
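
To give a feel for what the quality assessment step looks at, here is a minimal sketch that averages Phred scores at each read position, before or after trimming. It is not how fastqc is implemented; the function name `mean_quality_per_position` is hypothetical, the quality strings are made up, and a Phred offset of 33 is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position across all reads."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Made-up quality strings of different lengths, as trimmed reads would have
quals = ["II5", "I5", "I+5"]
means = mean_quality_per_position(quals)
print([round(m, 1) for m in means])
```

A drop in the mean towards the end of the reads is the typical pattern that motivates 3' quality trimming.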