07_Genome_assembly

By Yan Li

PhD in Bioinformatics, University of Liverpool

Introduction

coverage

\text{coverage} = \frac{N\ (\text{number of reads}) \times L\ (\text{read length})}{G\ (\text{genome size})}
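As a quick worked example of the coverage calculation, here is a minimal Python sketch. The function name `coverage` and the read and genome numbers are hypothetical, chosen only for illustration; the formula assumes the usual definition of expected depth as total sequenced bases divided by genome size.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected average depth: (N * L) / G, with lengths in bases."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 2 million reads of 150 bp from a 5 Mb genome.
depth = coverage(num_reads=2_000_000, read_length=150, genome_size=5_000_000)
print(f"{depth:.0f}x")  # prints "60x"
```

This is an average only: real depth varies along the genome, so some regions will be covered far more or far less than this figure suggests.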


Term	Definition
assembly	reconstructing a genome sequence from raw reads
read	a fragment of the genome sequenced by a sequencer
coverage	the average number of reads that align to each base of a known reference
contig	a contiguous sequence built from overlapping reads
scaffold	an ordered, oriented set of contigs separated by gaps of approximately known length
graph	represents relationships using nodes and edges

assembly

seven-bridge

Can we visit each part of the city by crossing each bridge exactly once?

graph-theory

Reads are broken into k-mers (substrings of length k)
A de Bruijn graph is constructed from the k-mers: two k-mers are connected if the last k-1 bases of one match the first k-1 bases of the other
The genome is derived by following an Eulerian path through the graph, i.e. a path that uses every edge exactly once
Example assemblers: SPAdes, Velvet, ABySS
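The steps above can be sketched in a few lines of Python. This is a toy illustration, not how SPAdes, Velvet or ABySS are implemented. It assumes one common formulation in which nodes are (k-1)-mers and each distinct k-mer is an edge, and it assumes error-free, single-stranded reads with no repeated (k-1)-mers, so that an Eulerian path exists. The function names and the example reads are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    edges = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Join a path of overlapping (k-1)-mers back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads from the toy genome ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]
graph = de_bruijn(reads, k=4)
print(spell(eulerian_path(graph)))  # prints "ATGGCGTGCA"
```

Real assemblers also have to handle both strands, sequencing errors and repeats, which is where the graph stops being a single clean path.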

assembler

graph features

Use k-mer frequency to resolve these graph features:

avoid using an even k-mer size
- an even k allows k-mers that are their own reverse complement (palindromes, e.g. ACGT)
- this affects the strand specificity of the graph
- palindromic k-mers are avoided with an odd k
increasing k-mer size can resolve ambiguities
- a higher k-mer size can reduce the number of edges and the number of possible paths
- however, a higher k-mer size is also more sensitive to sequencing errors
- a higher k-mer size means more RAM is needed
try several k-mer sizes to get the best assembly
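The point about odd k-mer sizes can be checked by brute force. The sketch below (function names are illustrative, not from any assembler) counts the k-mers that equal their own reverse complement. With an odd k the middle base would have to be its own complement, which is impossible, so the count is zero.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_palindromes(k):
    """Count k-mers that are identical to their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


for k in (3, 4, 5, 6):
    print(k, count_palindromes(k))
```

Enumerating all 4^k k-mers is only practical for small k, which is enough to see the pattern: even sizes give palindromic k-mers, odd sizes give none.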

We assess quality by looking at the assembly contiguity, completeness and correctness

Ideally, we want a single complete chromosome
We measure contiguity using:
- contig number
- contig length (average, median and maximum)
- N statistics (e.g. N50)
N50 is a length-weighted median of contig lengths, not a simple average
- at least 50% of the entire assembly is contained in contigs with length >= the N50 value
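A minimal sketch of the N50 calculation, following the definition above: sort contigs from longest to shortest and return the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum of the
    longest-first contigs first reaches half the assembly size."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (kb); the assembly totals 290 kb.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # prints 70: 80 + 70 = 150 >= 145
```

For these contigs the mean length is about 48 kb, which shows why N50 should not be read as an average: it is weighted towards the longer contigs.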

Completeness = assembled genome size / estimated genome size
Correctness is a measure of the number of errors in the assembly
- feature compressions (e.g. collapsed repeats)
- improper contig scaffolding
- introduced SNPs/InDels

We will do: