07_Genome_assembly


By Yan Li

PhD in Bioinformatics, University of Liverpool

Introduction


  • Assembly: from short reads to long contigs
  • Two types of genome assembly:
    • de novo
    • re-sequencing (reference-guided)
  • We will focus on de novo assembly of a bacterial genome

Why


  • Make a reference genome (when we do not already have one)
  • Look at genome structure
  • Put features into context
  • Compare with other genomes

Technology choice


  • Different assemblers suit different sequencing platforms
  • PacBio & Nanopore
    • Flye
  • Illumina
    • SPAdes
    • Velvet

Factors that dictate assembly quality


  • Read length and coverage
  • Sequencing data quality
  • Genome complexity

Coverage / Depth


coverage

  • Usually expressed as 30x, 100x, etc.
  • Low coverage leaves some genome regions with no reads
  • Short read length may make repeat regions impossible to resolve

coverage = \frac{N(number\space of\space reads) \times L(read\space length)}{G(genome\space size)}
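
A minimal sketch of the coverage calculation, assuming the conventional definition: total sequenced bases (N × L) divided by the genome size. The function name and the example numbers are illustrative, not taken from the workshop data.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size (all in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 1,000,000 reads of 150 bp from a 5 Mbp bacterial genome.
depth = expected_coverage(1_000_000, 150, 5_000_000)
print(f"{depth:.0f}x")  # 30x
```

This is an average only: individual regions can still have far fewer reads than the mean, which is why low average coverage tends to leave gaps.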


Terminology


Term       Definition
assembly   reconstructing a genome sequence from raw reads
read       a fragment of the genome sequenced by a sequencer
coverage   the average number of reads that align to known reference bases
contig     a contiguous sequence built from overlapping reads
scaffold   a set of ordered, oriented contigs separated by gaps of known length
graph      represents relationships using nodes and edges

assembly

Graph theory: the Seven Bridges of Königsberg


seven-bridge

Can we visit each part of the city by crossing each bridge once?

Graph theory: the Seven Bridges of Königsberg


graph-theory

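  • Eulerian path = a walk that uses every edge of the graph exactly once
  • In this problem it is impossible: an Eulerian path needs zero or two nodes of odd degree, and all four land masses have an odd number of bridges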
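
The degree argument can be checked with a short sketch. The node labels A to D are an assumption made here for illustration (A is the central island, B and C the two river banks, D the eastern land mass); the check also assumes the graph is connected.

```python
from collections import Counter

# The seven bridges as edges between four land masses (labels are illustrative).
BRIDGES = [
    ("A", "B"), ("A", "B"),  # two bridges from the island to one bank
    ("A", "C"), ("A", "C"),  # two bridges from the island to the other bank
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_eulerian_path(edges):
    """A connected undirected graph has an Eulerian path iff 0 or 2 nodes have odd degree."""
    return len(odd_degree_nodes(edges)) in (0, 2)


print(odd_degree_nodes(BRIDGES))   # ['A', 'B', 'C', 'D']
print(has_eulerian_path(BRIDGES))  # False
```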

de Bruijn graph


  • Reads are broken into k-mers (substrings of length k)
  • A de Bruijn graph is constructed from the k-mers: two k-mers are connected if the last k-1 bases of one match the first k-1 bases of the other
  • The genome is derived from an Eulerian path through the graph
  • Example assemblers: SPAdes, Velvet, ABySS
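
The steps above can be sketched on a toy example. This is a simplified illustration, not how SPAdes, Velvet or ABySS are implemented: it uses the common formulation in which (k-1)-mers are nodes and each distinct k-mer is an edge, and it assumes error-free, single-stranded reads with no repeated (k-1)-mers. All function names and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """All substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in sorted({km for read in reads for km in kmers(read, k)}):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg = Counter(t for targets in graph.values() for t in targets)
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    remaining = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the path: first node, then the last base of each later node."""
    path = eulerian_path(build_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


reads = ["ATGCGTT", "GCGTTAC", "TTACCA"]  # toy overlapping reads
print(assemble(reads, 4))  # ATGCGTTACCA
```

With repeats longer than k-1 the graph admits several Eulerian paths, so the reconstruction is no longer unique; that is the ambiguity real assemblers have to resolve.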

de Bruijn graph assembler


assembler

Graph features


graph features

Use k-mer frequency to resolve these graph features:

  • remove low-depth k-mers
  • clip tips, merge bubbles, remove links
  • resolve small repeats using long k-mers
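
As an illustration of the first step only (removing low-depth k-mers), here is a minimal sketch. The threshold `min_count` and the toy reads are assumptions; real assemblers choose cutoffs from the k-mer depth distribution and also handle reverse complements.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_low_depth(counts, min_count):
    """Keep only k-mers seen at least min_count times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Three correct copies of a read plus one copy with a sequencing error (C -> A).
reads = ["ATGCGT", "ATGCGT", "ATGCGT", "ATGAGT"]
kept = drop_low_depth(count_kmers(reads, 4), min_count=2)
print(sorted(kept))  # ['ATGC', 'GCGT', 'TGCG']
```

K-mers created by a sequencing error are typically rare, so they fall below the threshold while true genomic k-mers are kept.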

K-mer size


  • avoid using an even k-mer size
    • with an even k, a k-mer can equal its own reverse complement (a palindrome, e.g. GTAC)
    • this makes the strand of that k-mer ambiguous in the graph
    • palindromic k-mers are avoided with an odd k
  • increasing the k-mer size can resolve ambiguities
    • a larger k reduces the number of edges and the number of possible paths
    • however, a larger k is also more sensitive to sequencing errors
    • a larger k needs more RAM
  • try several k-mer sizes to get the best assembly
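
The point about odd k can be verified by brute force. This sketch enumerates every k-mer over A, C, G, T and counts those equal to their own reverse complement; the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(kmer):
    """True if the k-mer reads the same on both strands."""
    return kmer == reverse_complement(kmer)


def count_palindromic(k):
    """Number of k-mers (out of 4**k) that equal their own reverse complement."""
    return sum(is_palindromic("".join(p)) for p in product("ACGT", repeat=k))


for k in (3, 4, 5):
    print(k, count_palindromic(k))
# 3 0
# 4 16
# 5 0
```

For odd k the middle base would have to be its own complement, which no base is, so the count is always zero.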

Assembly quality


We assess quality by looking at the assembly contiguity, completeness and correctness

Contiguity


  • Ideally, we want a single complete chromosome
  • We measure contiguity using:
    • contig number
    • contig length (average, median and maximum)
    • N statistics (e.g. N50)
  • N50 is a length-weighted median of contig lengths (not a simple average)
    • 50% of the entire assembly is contained in contigs with length >= the N50 value
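
A minimal sketch of the N50 calculation as defined above: sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembly. The contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths, total 300
print(n50(lengths))                  # 70
print(sum(lengths) / len(lengths))   # mean is about 42.9, noticeably lower
```

Because long contigs dominate the running total, N50 is far less affected by many tiny contigs than the mean is.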

Completeness and correctness


  • Completeness = assembled genome size / estimated genome size
  • Correctness is a measure of the number of errors in the assembly
    • feature compression (e.g. collapsed repeats)
    • improper contig scaffolding
    • introduced SNPs/InDels

Workshop


We will do: