08_Identify_variations


By Yan Li

PhD in Bioinformatics, University of Liverpool

Why look at variation


  • allows us to study evolution
    • phylogenetics
    • population genomics
  • lets us make genotype / phenotype associations
    • genome wide association studies (GWAS)
    • human disease, agriculture, genetic engineering

The personal gene test service


[Figures: wegene_1 to wegene_4]

What is variation


  • Single Nucleotide Polymorphisms (SNPs)
    • a genetic "typo" of one nucleotide (e.g. A > G)
  • INsertion / DELetions (INDELs)
    • a string of one or more nucleotides that has been added/removed from a location in a genome (typically 1-100bp)
  • Structural Variants (SVs)
    • a region of DNA that has been inverted / translocated / duplicated (typically >100bp)
  • Mobile Genetic Elements (MGEs)
    • insertion / replication of retrotransposons, transposons, integrons etc.
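
The size ranges above can be turned into a rough classifier. This is a minimal sketch, assuming a variant is given as a reference allele and an alternate allele (as in a VCF record) and using the 100 bp cut-off quoted in the list; the function name and the "other" category are illustrative, and real callers use more refined definitions.

```python
def classify_variant(ref: str, alt: str, sv_min_len: int = 100) -> str:
    """Classify a variant by comparing the lengths of its REF and ALT alleles.

    sv_min_len is the size above which a change is treated as a structural
    variant; 100 bp follows the rule of thumb in the list above.
    """
    if len(ref) == 1 and len(alt) == 1:
        # One base swapped for another, e.g. A > G
        return "SNP" if ref != alt else "no change"

    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same length but several bases differ: not covered by the list above
        return "other"
    if size > sv_min_len:
        return "SV"
    # Bases were added to, or removed from, the reference
    return "insertion" if len(alt) > len(ref) else "deletion"


print(classify_variant("A", "G"))            # single-base substitution
print(classify_variant("A", "ATT"))          # two bases inserted
print(classify_variant("ACGT", "A"))         # three bases deleted
print(classify_variant("A", "A" + "T" * 150))  # long insertion
```

Mobile genetic elements cannot be recognised from allele lengths alone, so they are left out of this sketch.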

Workflow - GATK best practice


gatk

Refer to the GATK website.

Short-read alignment


  • bwa (Burrows-Wheeler Aligner)
  • Bowtie
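
As an illustration of the alignment step with bwa, the sketch below builds the two usual command lines: `bwa index` for the reference and `bwa mem` for paired-end reads. The file names, the thread count and the read-group string are hypothetical placeholders; the commands are only assembled and printed here, not executed.

```python
import shlex


def bwa_commands(reference, reads_1, reads_2, read_group, threads=4):
    """Return (index_cmd, mem_cmd) as argument lists for subprocess."""
    # Build the BWT index of the reference (needed once per reference)
    index_cmd = ["bwa", "index", reference]
    # Align paired-end reads; bwa mem writes SAM to standard output
    mem_cmd = [
        "bwa", "mem",
        "-t", str(threads),   # number of threads
        "-R", read_group,     # read-group header line added to the SAM output
        reference, reads_1, reads_2,
    ]
    return index_cmd, mem_cmd


# Hypothetical inputs; bwa expects a literal backslash-t in the -R string
index_cmd, mem_cmd = bwa_commands(
    "ref.fa", "sample_1.fastq.gz", "sample_2.fastq.gz",
    r"@RG\tID:foo\tPL:illumina\tSM:sample1",
)
print(shlex.join(index_cmd))
print(shlex.join(mem_cmd))
```

To run them, each list can be passed to `subprocess.run`, redirecting the standard output of the `bwa mem` call to a `.sam` file.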

Sam/Bam files


  • A SAM file is a TAB-delimited, line-oriented text file with a header section and an alignment section
    • Header section: each line starts with "@" and contains metadata
    • Alignment section: each line describes the alignment of one read
    • Optional tags (e.g. NM, MD, RG) are described in the SAM tag specification
  • A BAM file is the compressed binary form of a SAM file
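
The two sections can be separated with a few lines of code. The sketch below is a minimal, assumption-laden parser: it treats every line starting with "@" as header, splits alignment lines on TABs into the 11 mandatory SAM columns, and keeps any remaining columns as raw tag strings. The toy input is invented for illustration; for real data a dedicated library such as pysam is the safer choice.

```python
# Names of the 11 mandatory columns of a SAM alignment line
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam(text):
    """Split SAM text into a list of header lines and a list of records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)           # metadata line (@HD, @SQ, @RG, ...)
            continue
        cols = line.split("\t")
        record = dict(zip(SAM_FIELDS, cols[:11]))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        record["TAGS"] = cols[11:]        # optional TAG:TYPE:VALUE fields
        alignments.append(record)
    return header, alignments


# Invented toy input, not taken from a real alignment
toy_sam = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0",
])
header, alignments = parse_sam(toy_sam)
print(len(header), "header lines")
print(alignments[0]["RNAME"], alignments[0]["POS"], alignments[0]["CIGAR"])
```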

An example of a SAM file


@HD     VN:1.6  SO:coordinate
@SQ     SN:NC_003197.2  LN:4857450
@SQ     SN:NC_003277.2  LN:93933
@RG     ID:foo  PL:illumina     SM:SRR1056117
SRR10561173.6336        2209    NC_003197.2     1       60      57H44M  =       549     649     AGAGATTACGTCGGGTTGCAAGAGATCTTGACAGGGGGAATTGG    .G...G...GAA.<A.<A.GGGAGAGA.<..G...<GAA.GGAG    SA:Z:NC_003197.2,4857394,+,57M44S,60,0; MC:Z:101M       MD:Z:12T14A16   PG:Z:MarkDuplicates     RG:Z:foo        NM:i:2  AS:i:34 XS:i:0
SRR10561173.114060      163     NC_003197.2     1       60      45S56M  =       178     275     GAAAAAAAACTAACAAAATAACGTGCTGTAATTTTTAAAATAATAAGAGATTACGTCTGGTTGCACGAGATCATGACAGGGGGAATTGGTTGAAAATAAAT   GGAA<<<A.<<AGGGGAAGAGG.<<G.<<<G.GGAA..<GGA..<...AA.<AGA<<AG.GGGAG.<.<.<.<<GG.GAGG...G.<G.....<<.<.GGG   SA:Z:NC_003197.2,4857406,+,45M56S,60,0; MC:Z:98M        MD:Z:20A35      PG:Z:MarkDuplicates     RG:Z:foo        NM:i:1  AS:i:51 XS:i:0
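
Two columns of the records above are compact encodings: the FLAG (2209 and 163) is a bit field, and the CIGAR string (57H44M and 45S56M) describes how the read lines up with the reference. The sketch below decodes both; the bit meanings and CIGAR operators follow the SAM specification, while the short names in `FLAG_BITS` and the function names are chosen here for illustration.

```python
import re

# FLAG bits as defined in the SAM specification (names are shorthand)
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered (operators M, D, N, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


for flag, cigar in [(2209, "57H44M"), (163, "45S56M")]:
    print(flag, decode_flag(flag))
    print(cigar, parse_cigar(cigar), "reference span:", reference_span(cigar))
```

Hard-clipped bases (H) are not stored in the sequence column, which is why the first record carries only 44 bases of sequence, whereas soft-clipped bases (S) remain in it.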

Call variants


  • bcftools
  • FreeBayes
  • GATK
    • GATK v3.8 and v4.0 differ in their command-line interface.
    • v4.0 bundles the Picard tools
  • VarScan2
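
To make the callers more concrete, the sketch below assembles typical command lines for bcftools and for GATK HaplotypeCaller in both its v3 and v4 invocation styles. All file names and the jar path are hypothetical placeholders, only commonly used options are shown, and nothing is executed.

```python
def bcftools_commands(reference, bam, out_vcf):
    """Return (mpileup_cmd, call_cmd); the first is piped into the second."""
    mpileup_cmd = ["bcftools", "mpileup", "-f", reference, bam]
    # -m: multiallelic caller, -v: variant sites only, -O z: compressed VCF
    call_cmd = ["bcftools", "call", "-m", "-v", "-O", "z", "-o", out_vcf]
    return mpileup_cmd, call_cmd


def gatk4_haplotypecaller(reference, bam, out_vcf):
    """GATK v4 style: a single 'gatk' launcher followed by the tool name."""
    return ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]


def gatk3_haplotypecaller(jar, reference, bam, out_vcf):
    """GATK v3 style: the tool is selected with -T on the jar file."""
    return ["java", "-jar", jar, "-T", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-o", out_vcf]


# Hypothetical file names
mpileup_cmd, call_cmd = bcftools_commands("ref.fa", "sample.bam", "sample.vcf.gz")
gatk4_cmd = gatk4_haplotypecaller("ref.fa", "sample.bam", "sample.vcf.gz")
gatk3_cmd = gatk3_haplotypecaller("GenomeAnalysisTK.jar", "ref.fa",
                                  "sample.bam", "sample.vcf")
print(" ".join(mpileup_cmd), "|", " ".join(call_cmd))
print(" ".join(gatk4_cmd))
print(" ".join(gatk3_cmd))
```

Note the change from `-T HaplotypeCaller` and `-o` in v3 to a tool name after `gatk` and `-O` in v4.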

VCF files


vcf files
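
A VCF data line has eight fixed TAB-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by a FORMAT column and one column per sample. The sketch below parses one such line into a dictionary; the example record is invented for illustration, and for real files a library such as pysam or cyvcf2 is more robust.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one VCF data line (not a '#' header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, cols[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")   # several ALT alleles are allowed

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # Per-sample columns are keyed by the FORMAT column
    if len(cols) > 9:
        keys = cols[8].split(":")
        record["SAMPLES"] = [dict(zip(keys, s.split(":"))) for s in cols[9:]]
    return record


# Invented record, for illustration only
example = "NC_003197.2\t1234\t.\tA\tG\t50\tPASS\tDP=30;AF=0.5;DB\tGT:DP\t0/1:30"
variant = parse_vcf_line(example)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
print(variant["INFO"])
print(variant["SAMPLES"])
```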

Annotate SNPs


  • SnpEff
  • ANNOVAR
  • VAAST 2