2014-09-14

=EMBO Tunis 2014=

From sequencing data to knowledge

== 00 Programs used ==

===sequence pre-processing===

* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=std SRA_toolkit] ver current

* [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ FastQC] ver 0.11.2

* [http://www.usadellab.org/cms/index.php?page=trimmomatic Trimmomatic ] ver 0.32

* [http://sourceforge.net/projects/tagdust/ TagDust] ver 2.13

* [http://www.cs.helsinki.fi/u/lmsalmel/coral/ Coral] ver 1.4

=== general tools ===

* [http://hannonlab.cshl.edu/fastx_toolkit/ fastx_toolkit] ver 0.0.13

* [http://samtools.sourceforge.net/ Samtools classic] ver 0.1.19

* [http://www.htslib.org/ samtools/HTSlib] ver 1.0

* [http://picard.sourceforge.net/ Picard ] ver 1.119

=== mappers ===

* [http://bio-bwa.sourceforge.net/ BWA] ver 0.7.10

* [http://last.cbrc.jp/ LAST] ver 475

* [http://www.well.ox.ac.uk/~gerton/software/Stampy/ Stampy] stampy-1.0.23r2059.tgz (optional)

=== Spliced leader mappings ===

* [https://github.com/indraniel/fqgrep fqgrep] Github version plus

* [https://github.com/laurikari/tre/ TRE_library] ver 0.80

=== viewers===

* [http://www.broadinstitute.org/igv/home IGV] ver 2.3.34

* [http://www.broadinstitute.org/igv/download igvtools] ver 2.3.32

=== quantification ===

* [http://ngsutils.org/NGSUtils NGSUtils] ver 0.5.6

* [https://pypi.python.org/pypi/HTSeq HTSeq] ver 0.6.1 (requires NumPy)

=== SNP discovery ===

* [https://www.broadinstitute.org/gatk/ GATK] ver 3.2-2

==01 Data files used ==

===FASTQ files ===

====L.amazonensis RNA-Seq ====

* http://www.ebi.ac.uk/ena/data/view/SRP016502

====L.mexicana genomic DNA ====

* http://www.ebi.ac.uk/ena/data/view/ERX280624

==== (extra set) L.enriettii genomic DNA ====

* http://www.ebi.ac.uk/ena/data/view/SRR835620

== Stuff to read / compare ==

!!! Important: introduce zero- and 1-based positioning of file formats !!!

===File formats ===

* http://biobits.org/samtools_primer.html (file formats)

* http://www.slideshare.net/lindenb/ngsformats

* http://wiki.bits.vib.be/index.php/Next-generation_sequencing

=== VCF ===

* http://www.1000genomes.org/node/101 (VCF)

* http://en.wikipedia.org/wiki/Variant_Call_Format

* http://vcftools.sourceforge.net/ (VCFTools)

=== BED ===

* http://genome.ucsc.edu/FAQ/FAQformat.html#format1

* http://www.broadinstitute.org/igv/BED

* http://www.ensembl.org/info/website/upload/bed.html

* http://bedtools.readthedocs.org/en/latest/ BEDTOOLS

=== GFF / GTF ===

* http://www.ensembl.org/info/website/upload/gff.html

* http://www.sequenceontology.org/gff3.shtml
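The formats above differ in how they count positions: BED uses zero-based, half-open intervals, while GFF/GTF (like SAM and VCF) use 1-based, closed intervals. A minimal sketch of the conversion, assuming a tab-separated BED input with at least three columns and no track or comment lines (the function name bed_to_one_based is made up for this illustration):

<pre>
# Convert BED intervals (0-based start, end not included) read from stdin
# to 1-based closed coordinates as used by GFF/GTF.
# Only the start changes (start + 1); the end stays the same.
bed_to_one_based() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }'
}

# the first 100 bases of "chr1" are 0-100 in BED and 1-100 in GFF
printf 'chr1\t0\t100\n' | bed_to_one_based
</pre>

Mixing up the two conventions shifts every feature by one base, which is easy to miss by eye.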

==Genomes and annotations ==

* L.mexicana

http://tritrypdb.org/common/downloads/release-8.0/LmexicanaMHOMGT2001U1103/fasta/data/TriTrypDB-8.0_LmexicanaMHOMGT2001U1103_Genome.fasta

http://tritrypdb.org/common/downloads/release-8.0/LmexicanaMHOMGT2001U1103/gff/data/TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff

* L.amazonensis

http://tritrypdb.org/common/downloads/release-8.0/LamazonensisMHOMBR71973M2269/fasta/data/TriTrypDB-8.0_LamazonensisMHOMBR71973M2269_Genome.fasta

* L.enriettii

http://tritrypdb.org/common/downloads/release-8.0/LenriettiiLEM3045/fasta/data/TriTrypDB-8.0_LenriettiiLEM3045_Genome.fasta

* L.major

http://tritrypdb.org/common/downloads/release-8.0/LmajorFriedlin/fasta/data/TriTrypDB-8.0_LmajorFriedlin_Genome.fasta

==NGS file formats overview==

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

* text based (FASTQ, SAM, VCF, GTF/GFF, BED)

* binary (BAM, BCF, SFF(454 sequencer data, not covered here))

In principle, we can view and manipulate text-based formats without special tools, but we need dedicated programs to access and view binary formats. To make things a bit more complicated, text-based files are often compressed to save space, which makes them de facto binary, although they remain easy to read with standard Unix tools. Also, while we can read by eye the values in a few columns over tens of rows, we still need dedicated programs to make sense of millions of rows and of encoded columns.
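As an illustration of reading a compressed text format with standard Unix tools, here is a minimal sketch that prints the first FASTQ record of a file whether or not it is gzip-compressed. The function name peek_fastq and the file name in the usage line are placeholders.

<pre>
# Print the first record (4 lines) of a FASTQ file, gzip-compressed or not.
# gzip -cd writes the decompressed content to standard output and leaves
# the file on disk untouched.
peek_fastq() {
    case "$1" in
        *.gz) gzip -cd "$1" | head -n 4 ;;
        *)    head -n 4 "$1" ;;
    esac
}

# usage (my_reads.fastq.gz is a placeholder file name):
# peek_fastq my_reads.fastq.gz
</pre>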

On top of these data and result files, some programs require a companion file (often called an index) for faster access. See the BAM and BAI formats.
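A rough sketch of which index file usually accompanies which data file. The helper index_name_for is hypothetical and only encodes common naming conventions; tabix (part of HTSlib) is mentioned as an assumption, since it is not otherwise used on this page.

<pre>
# Print the conventional name of the index file that accompanies a data file.
#   BAM             -> .bai  (created with: samtools index file.bam)
#   FASTA           -> .fai  (created with: samtools faidx ref.fa)
#   bgzip-ed VCF    -> .tbi  (created with: tabix -p vcf file.vcf.gz)
index_name_for() {
    case "$1" in
        *.bam)         echo "$1.bai" ;;
        *.fa|*.fasta)  echo "$1.fai" ;;
        *.vcf.gz)      echo "$1.tbi" ;;
        *)             echo "no index convention known for $1" >&2; return 1 ;;
    esac
}

index_name_for reads.sorted.bam
index_name_for ref_gen.fa
</pre>

Note that some tools name the index differently, so this is a convention to check against, not a rule.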

!!! find expand index files !!!

* FASTQ

* SAM

* BAM

* VCF

* GTF/GFF

* BED

==FASTQ==

===Format and quality checks===

Back in the 1990s, when all sequencing was done with the Sanger method, a big breakthrough in genome assembly came when the individual bases in the reads (A, C, T, G) were assigned quality values. In short, some parts of a sequence contain bases with a lower probability of having been called correctly. It therefore makes sense that matches between high-quality bases receive a higher score, be it during assembly or mapping, than matches involving, for example, the ends of reads with multiple doubtful or unreliable calls. Next Generation Sequencing borrowed this concept. While we can hardly read the individual bases by eye in some flowgrams, the Illumina/454/etc. software can still calculate base qualities. The FASTQ format (files usually have the suffix .fq or .fastq) uses 4 lines per sequence:

# sequence name (should be unique in the file)

# sequence string itself with ACTG and N

# extra line starting with a "+" sign, which in the past also repeated the sequence name

# string of quality values (one letter/character per base) where each letter is translated in a number by the downstream programs

Here is how it looks:

<pre>

@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1

CAGCAAGTTGATCTCTCACCCAGAGAGAAGTGTTTCATGCTAAGTGGCAGTTTCTGGTGCAGAACAGTTCTGCAATGAGGGAGGAGGCAGAAAACATAAGTGTGTAATAAGGCAACCTGC

+

IHIIHDHIIIHIIIIIIHIIIDIIHGGIIIEIIIIIIIIIIIIGGGHIIIHIIIIIIBBIEDGGFHHEIHGIGEGHEBCHDBFC>CBCCECEEAAAAEEE:B@B@BBB;B;@;@BAE@A@

</pre>
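Since every record takes exactly 4 lines, the number of reads in a file is simply its line count divided by 4. A minimal sketch, assuming an uncompressed FASTQ file in which sequences and qualities are not wrapped over several lines (count_fastq_reads is a made-up name):

<pre>
# Count the reads in an uncompressed FASTQ file: 4 lines per record.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# usage (my_reads.fastq is a placeholder file name):
# count_fastq_reads my_reads.fastq
</pre>

The same arithmetic explains why taking the first 400000 lines of a file gives 100K reads.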

Unfortunately, Solexa/Illumina did not follow the quality encoding used for Sanger sequencing, so there have been a few iterations of the standard, with quality encodings that use different characters.

For the inquisitive:

http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember is that we must know which quality encoding our data uses, because mappers require this information. Getting it wrong makes our mappings either impossible (some mappers quit when they encounter an invalid quality value) or at best unreliable.

There are two main quality encodings: Sanger and the older Illumina one.

Two other terms, offset 33 and offset 64, are also used to describe quality encodings:

* offset 33 == Sanger / Illumina 1.9

* offset 64 == Illumina 1.3+ to Illumina 1.7
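The offset is the number subtracted from the ASCII code of a quality character to obtain the Phred score Q (where the probability of a wrong base call is 10^(-Q/10)). A small sketch of that translation; phred_score is a name chosen for this example only:

<pre>
# Convert one quality character ($1) to its Phred score for a given offset ($2).
phred_score() {
    # printf with a leading single quote prints the ASCII code of the character
    code=$(printf '%d' "'$1")
    echo $(( code - $2 ))
}

phred_score I 33    # "I" has ASCII code 73, so this prints 40
phred_score h 64    # "h" has ASCII code 104, so this also prints 40
</pre>

The same score is therefore written as "I" in offset 33 files and as "h" in offset 64 files, which is why reading a file with the wrong offset shifts every quality by 31.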

If we do not have direct information from the sequencing facility about which version of the Illumina software was used, we can still find it out from the FASTQ files themselves. Instead of going by eye, we use the program FastQC. For the best results and a full report we need to use the whole FASTQ file as input, but for a quick and dirty recognition of the quality encoding the first 100K reads are enough:

<pre>

head -400000 my_reads.fastq > 100K_head_my_reads.fastq

fastqc 100K_head_my_reads.fastq

#we got here 100K_head_my_reads.fastq_fastqc/ directory

grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt

#output:

Encoding Sanger / Illumina 1.9

</pre>

'''CAVEAT: all this works only on unfiltered FASTQ files.''' Once you remove the lower-quality bases, or the reads containing them, guessing which encoding is present in your files becomes problematic.

Here is a bash script wrapping an awk one-liner that detects the quality encoding in both gzip-compressed and uncompressed FASTQ files.

<pre>

#!/bin/bash

file=$1

if [[ $file ]]; then

command="cat"

if [[ $file =~ .*\.gz ]];then

command="zcat"

fi

command="$command $file | "

fi

command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offset 64\";exit 0}}}'"

eval $command

</pre>

===Types of data===

* read length

From 35 bp in some old Illumina reads to 250+ bp on the MiSeq. The current sweet spot is between 70 and 120 bp.

* single vs paired

Either just one end of the insert is sequenced (single), or sequencing is done from both ends (paired). Single reads are cheaper and faster to produce, but paired reads allow for more accurate mapping and for the detection of large insertions/deletions in the genome.

Most of the time forward and reverse reads facing each other end-to-end are

* insert length

With the standard protocol, the inserts are anywhere between 200 and 500 bp. Sometimes, especially for de novo sequencing, insert sizes can be smaller (160-180 bp), so that with 100 bp reads the ends of the two reads overlap. This can improve the genome assembly (e.g. when using the Allpaths-LG assembler, which requires such reads). Also, with some mappers (LAST), using longer reads used to give better mappings (covering regions not unique enough for shorter reads) than mapping the two ends as single reads. With paired-end mappings the effects are modest.

Program for combining overlapping reads:

FLASH: http://ccb.jhu.edu/software/FLASH/
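Whether the two reads of a pair overlap follows from simple arithmetic on read length and insert size. A sketch, with pair_overlap being a name invented for this example:

<pre>
# Expected overlap (in bp) between the two reads of a pair:
#   overlap = 2 * read_length - insert_size
# A negative value is the size of the unsequenced gap between the reads.
pair_overlap() {
    echo $(( 2 * $1 - $2 ))
}

pair_overlap 100 180    # prints 20: the reads share 20 bp and can be combined
pair_overlap 100 300    # prints -100: a 100 bp gap, nothing to combine
</pre>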

For improving the assembly or improving the detection of larger genome rearrangements there are other libraries with various insert sizes, such as 2.5-3kb or 5kb and more. Often sequencing yields from such libs are lower than from the conventional ones.

* stranded vs unstranded (RNASeq only)

We can obtain reads just from a given strand using special Illumina wet lab kits. This is of a great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

===quality checking (FastQC)===

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command line or GUI mode.

* good quality report:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html

* bad quality FastQC report

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html

===trimming & filtering ===

Depending on the application, we can try to improve the quality of our data set by removing bad-quality reads, clipping the last few problematic bases, or searching for sequencing artifacts, such as Illumina adapters.

All this makes a lot of sense for de novo sequencing, where genome assemblies can be improved by cleaning up the data. It has a lower priority for mapping, especially when we have high coverage: bad-quality reads etc. will simply be discarded by the mapper.

You can read more about quality trimming for genome assembly in the two blog posts by Nick Loman:

http://pathogenomics.bham.ac.uk/blog/2013/04/adaptor-trim-or-die-experiences-with-nextera-libraries/

====Trimmomatic====

http://www.usadellab.org/cms/index.php?page=trimmomatic

From the manual:

Paired End:

<pre>

java -jar trimmomatic-0.30.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

</pre>

This will perform the following:

Remove adapters

Remove leading low quality or N bases (below quality 3)

Remove trailing low quality or N bases (below quality 3)

Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15

Drop reads shorter than 36 bases
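To make the sliding window step more concrete, here is a simplified illustration of the idea in awk. It is not Trimmomatic's own code, and Trimmomatic may place the cut at a slightly different base. The function name sliding_window_keep is hypothetical.

<pre>
# Simplified SLIDINGWINDOW illustration: read one quality string from stdin,
# slide a window of $1 bases from the start of the read and print how many
# bases precede the first window whose mean quality drops below $2.
# $3 is the quality offset (33 if not given).
sliding_window_keep() {
    awk -v w="$1" -v t="$2" -v off="${3:-33}" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    {
        n = length($0); keep = n
        for (s = 1; s + w - 1 <= n; s++) {
            sum = 0
            for (j = s; j < s + w; j++) sum += ord[substr($0, j, 1)] - off
            if (sum / w < t) { keep = s - 1; break }
        }
        print keep
    }'
}

# eight good bases (I = Q40) followed by four poor ones (# = Q2):
# the first window with a mean below 15 starts at base 8, so 7 is printed
echo 'IIIIIIII####' | sliding_window_keep 4 15
</pre>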

Single End:

<pre>

java -jar trimmomatic-0.30.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

</pre>

This will perform the same steps, using the single-ended adapter file

====Tagdust (for simple unpaired reads)====

Tagdust is a program for removing Illumina adapter sequences from the reads that contain them. Reads carrying 6-8 bases that do not come from the genome are impossible to map with typical mappers, which often allow just 2 mismatches. Tagdust works in unpaired mode, so when using paired reads we have to "mix and match" the two outputs to allow for paired mappings.

<pre>

tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq

</pre>

===Error correction===

For some applications, like de novo genome assembly, one can correct sequencing errors in the reads by comparing them with other reads of almost identical sequence. One program that does this, and is relatively easy to install and get running, is Coral.

====Coral====

web site: http://www.cs.helsinki.fi/u/lmsalmel/coral/

version: 1.4

Correcting individual Illumina files requires a machine with a lot of RAM (run it on a machine with 96 GB RAM)

<pre>

#Illumina reads

./coral -fq input.fq -o output.fq -illumina

#454 reads

./coral -fq input.454.fq -o output.454.fq -454

</pre>

===source of published FASTQ data: Short Read Archive vs ENA===

While we will often have our data sequenced in house or provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely that we can do something new and useful with already published data, even if only by treating it as a control for our pipeline. Also, doing exactly the same thing, say assembling genes from RNASeq data, but with newer versions of the software and/or more data, will likely improve on the results of previous studies.

There are two main places to get such data sets:

* NCBI Short Read Archive / Taxonomy Browser:

** SRA: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj

<pre>

go there

put mouse RNASeq

417 public access sets

Click on it,

it looks like we got just: RNA (348)

</pre>

** Taxonomy Browser: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

<pre>

Go there

put Bos taurus

see on the right table SRA Experiments 636

on the left:

Source

DNA (171)

RNA (454)

metagenomic (13)

</pre>

* European Nucleotide Archive

http://www.ebi.ac.uk/ena/

<pre>

Go there:

put RNAseq

see Experiment (5)

put RNA-seq

see Experiment (109)

</pre>

Which one to use? ENA is easier, as you get gzipped FASTQ files rather than SRA archives, which require extra and sometimes painful processing (at one stage the funding for the SRA programs was cut). But the NCBI tools may have a better interface at times, so you can search for interesting data sets at NCBI, note the names of the experiments, and download the fastq.gz files from ENA.

==SAM and BAM file formats==

The SAM file format stores the results of mapping reads to a genome. It starts with a header describing the format version, the sorting order (SO) of the reads, and the genomic sequences to which the reads were mapped. A minimal version looks like this:

<pre>

@HD VN:1.0 SO:unsorted

@SQ SN:1 LN:171001

@PG ID:bowtie2 PN:bowtie2 VN:2.1.0

</pre>

A SAM file can contain both mapped and unmapped reads (we are mostly interested in the mapped ones). Here is an example:

<pre>

SRR197978.9007654 177 1 189 0 50M 12 19732327 0 CAGATTTCAGTAGTCTAAACAAAAACGTACTCACTACACGAGAACGACAG 5936A><<4=<=5=;=;?@<?BA@4A@B<AAB9BB;??B?=;<B@A@BCB XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:58 XM:i:3 XO:i:0 XG:i:0 MD:Z:0A0G0T47

SRR197978.9474929 69 1 238 0 * = 238 0 GAGAAAAGGCTGCTGTACCGCACATCCGCCTTGGCCGTACAGCAGAGAAC B9B@BBB;@@;::@@>@<@5&+2;702?;?.3:6=9A5-124=4677:7+

SRR197978.9474929 137 1 238 0 50M = 238 0 GTTAGCTGCCTCCACGGTGGCGTGGCCGCTGCTGATGGCCTTGAACCGGC B;B9=?>AA;?==;?>;(2A;=/=<1357,91760.:4041=;(6535;% XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:62 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

</pre>

In short, it is a complex format in which each line holds detailed information about a mapped read: its quality, mapped position(s), strand, etc. The exact description (together with BAM and examples) takes 15 pages: http://samtools.sourceforge.net/SAMv1.pdf
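One of the encoded columns is the second one, the FLAG, which packs several yes/no properties of a read into a single number. A minimal sketch decoding the most common bits as defined in the SAM specification (decode_sam_flag and the label names are made up for this example):

<pre>
# Decode the most common bits of a SAM FLAG value (column 2 of a SAM line).
decode_sam_flag() {
    flag=$1
    out=""
    [ $(( flag & 1 ))   -ne 0 ] && out="$out paired"
    [ $(( flag & 4 ))   -ne 0 ] && out="$out unmapped"
    [ $(( flag & 8 ))   -ne 0 ] && out="$out mate_unmapped"
    [ $(( flag & 16 ))  -ne 0 ] && out="$out reverse_strand"
    [ $(( flag & 32 ))  -ne 0 ] && out="$out mate_reverse_strand"
    [ $(( flag & 64 ))  -ne 0 ] && out="$out first_in_pair"
    [ $(( flag & 128 )) -ne 0 ] && out="$out second_in_pair"
    # strip the leading space before printing
    echo "${out# }"
}

# the three flags from the example lines above
decode_sam_flag 177   # paired reverse_strand mate_reverse_strand second_in_pair
decode_sam_flag 69    # paired unmapped first_in_pair
decode_sam_flag 137   # paired mate_unmapped second_in_pair
</pre>

So in the example, read SRR197978.9474929 appears twice: once as the unmapped first read of the pair (flag 69) and once as its mapped mate (flag 137).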

There are multiple tools to process and extract information from SAM files and their compressed binary form, BAM, so it is better to learn how to use them than to decipher the format and access it with often slow scripts.

==Mapping Illumina reads to the genome ==

=== basic mapping steps===

* indexing

Before we can use the genome for mapping, we have to transform it into a mapper-specific format that allows for much faster searches and lower memory usage. This is often called indexing, but, to make things worse, indexing a FASTA file with samtools is not the same as indexing it with bwa, bowtie, etc.

* mapping

This is often the longest step, with options specific for each mapper

* postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often require sorted and indexed BAM files. So we need to convert the mapper output (often SAM, but sometimes a different format, such as MAF for LAST or MAP for GEM) into such BAM files.

===bwa===

BWA is the default mapper used by the state-of-the-art GATK SNP calling pipeline. Some mappers may be better on some statistics, or equal but faster, yet BWA is still a safe choice for mapping. The main problem of BWA is the mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to it (taking the insert size into account), even if that mapping is very doubtful. This may not be a problem for GATK, since mapping qualities and flags are taken into account, but one should keep it in mind when analysing the mapping results on one's own.

Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

* Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)

* sequences from 70bp up to 1Mbp:

There are two algorithms for these: BWA-SW (Smith-Waterman) and BWA-MEM (seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW)).

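Please note that bwa index offers two algorithms for constructing the genome index: IS, which is simple but does not work for genomes larger than 2 GB, and bwtsw, which also handles large genomes. The name bwtsw refers to the BWT-SW program from which the method was taken, not to the BWA-SW mapping algorithm.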

<pre>

#creating genome index

bwa index -p ref.bwa_is ref.fa

#mapping single end reads using MEM algorithm

bwa mem ref.bwa_is reads.fq > reads.bwa_mem.sam

#mapping paired end reads using MEM algorithm

bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single end reads using the backtrack algorithm (aln + samse)

bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai

bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads

bwa aln ref.bwa_is short_read_1.fq > short_read_1.bwa_aln.sai

bwa aln ref.bwa_is short_read_2.fq > short_read_2.bwa_aln.sai

bwa sampe ref.bwa_is short_read_1.bwa_aln.sai short_read_2.bwa_aln.sai short_read_1.fq short_read_2.fq > short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm

bwa index -p ref.bwa_sw -a bwtsw ref.fa

bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

</pre>

The mode currently recommended for mapping by both the BWA manual and GATK, the leading SNP calling software, is MEM.
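The choice between the algorithms above can be reduced to a rule of thumb on read length. A rough sketch, assuming the 70 bp boundary quoted in the overview (suggest_bwa_algorithm is a hypothetical helper, not part of BWA):

<pre>
# Suggest a BWA sub-command from the read length in bp:
# MEM from about 70 bp upwards, the older backtrack algorithm
# (aln followed by samse/sampe) for shorter reads.
suggest_bwa_algorithm() {
    if [ "$1" -ge 70 ]; then
        echo "mem"
    else
        echo "aln"
    fi
}

suggest_bwa_algorithm 36     # prints aln
suggest_bwa_algorithm 100    # prints mem
</pre>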

To create usable BAM files we can process SAM files using Picard's SortSam

<pre>

java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.bam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

</pre>

For subsequent processing of the mapping files with GATK (SNP calling), it is easier to add the necessary information at the mapping stage than to run an extra step using Picard. What GATK requires is the so-called read group info. We will cover it later, but at this stage it is good to know that bwa can be run with an extra parameter that saves us one step.

<pre>

#example read group info to be passed to bwa on the command line:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1

#the mapping step, with the read group string from above passed via the -R option.

#different samples should have different read group info:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' ref_gen.bwa_is chicken_genomic_short_1.fq chicken_genomic_short_2.fq > chicken_genomic_12_vs_refgen.bwa_mem.rg.sam

</pre>
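When mapping many samples it helps to build the read group string from variables instead of typing it each time. A minimal sketch, assuming Illumina data throughout (make_read_group is a name made up for this example; the fields are the same ones used above):

<pre>
# Build a read group string for bwa mem -R.
# $1 = read group ID, $2 = sample, $3 = library, $4 = platform unit
make_read_group() {
    # \\t in the printf format prints a literal backslash followed by "t",
    # the same form as in the command line above
    printf '@RG\\tID:%s\\tSM:%s\\tPL:illumina\\tLB:%s\\tPU:%s\n' "$1" "$2" "$3" "$4"
}

make_read_group group1 sample1 lib1 unit1
# the result can then be passed on as: bwa mem -M -R "$(make_read_group ...)" ...
</pre>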

==== (optional) Stampy ====

Stampy is a quite slow but at times more accurate mapper, allowing for an improvement over simple BWA mappings. The basic usage is as follows:

<pre>

#creating two special index files

stampy.py --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa

stampy.py -g ens73_toygenome -H ens73_toy

#remapping reads already mapped with BWA (preferred option)

stampy.py -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam > ggal_test_1_vs_ref_gen.stampy.sam

</pre>

===last===

web site: http://last.cbrc.jp/

current version: 475 (Sep 2014)

This is a less popular but sometimes quite useful mapper that reports unique mappings only. It can handle a large number of mismatches: it simply removes the non-matching parts of the read, as long as what is left is sufficient to secure a unique mapping.

It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently).

Standard usage:

<pre>

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai

samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches

lastdb -m1111110 ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info

lastal -Q1 -e120 ref_genome.lastdb input_reads.fastq | last-map-probs.py -s150 > input_reads_vs_ref_genome.last.maf

#convert from MAF to SAM format

maf-convert.py sam input_reads_vs_ref_genome.last.maf > input_reads_vs_ref_genome.last.sam

#convert SAM to BAM inserting header

samtools view -but ref_genome.fa.fai input_reads_vs_ref_genome.last.sam -o input_reads_vs_ref_genome.last.unsorted.bam

#sort BAM

samtools sort input_reads_vs_ref_genome.last.unsorted.bam input_reads_vs_ref_genome.last.sorted

#create BAM index (input_reads_vs_ref_genome.last.sorted.bam.bai)

samtools index input_reads_vs_ref_genome.last.sorted.bam

</pre>

===Quick and dirty genome 2 genome comparison using LAST===

* Comparing 2-3 Leishmania genomes

==Viewing mapping results with IGV==

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genome-2-genome alignment (not a typical usage).

The order of steps:

* start IGV (it needs to be started with the amount of RAM available to the program specified). How much is needed depends on the coverage and on the number of BAM files open at the same time. In short: the more RAM assigned, the faster the scrolling.

* select exactly the same version of the genome, with contigs named the same way as in your BAM files

* open BAM files (need to be sorted and indexed), plus any annotation you may need.

To work with IGV:

<pre>

#assuming that igv.sh is on the PATH like in vagrant:

igv.sh

</pre>

We have a large number of genomes available through IGV pull down menus, but we may need to create our own genome for viewing (top menu):

Genomes > Create .genome file

We need the FASTA genome reference file and its index (ref_gen.fa, ref_gen.fa.fai). The latter (the FASTA index) is created as follows:

<pre>

samtools faidx ref_gen.fa

</pre>

In the IGV genome creation dialog we have to enter a unique name for our genome, then the FASTA file and, if we have one, the genome annotation file.

So now we will be ready to load some BAM files mapped to our ref_gen sequence.

There are multiple options for displaying the reads. The important things to look at are the mismatches in the reads, the coverage track, and the paired display.

===SNP discovery (GATK)===

!!! Update to 1.119 !!!

====Prerequisites ====

* reference genome

* reference genome dictionary

<pre>

java -jar ~/soft/picard_1.118/CreateSequenceDictionary.jar R=Lmex_genome.fa O=Lmex_genome.dict

</pre>

* BAM file preferably bwa mapped

CAVEAT:

GATK requires meta information (read groups) in BAM files to work. This information is often missing after the mapping step. We have to add it to the BAM file using Picard:

<pre>

java -jar ~/soft/picard_current/AddOrReplaceReadGroups.jar \

I=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam \

O=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.grp.bam \

RGID=ERR307343 RGPL=Illumina RGLB=ERR307343_lib RGPU=ERR307343_pu RGSM=ERR307343_sm

</pre>

===GATK pipeline===

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site: http://www.broadinstitute.org/gatk/index.php

This is the step by step procedure to follow their best practice.

'''Caveat: GATK requires that you have more than one sequence in your reference genome'''. If not, it reports a strange error about wrong IUPAC (sequence character).
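A quick way to check this before starting is to count the header lines in the reference FASTA file. A minimal sketch (count_fasta_sequences is a name chosen for this example):

<pre>
# Count the sequences in a FASTA file: each one starts with a ">" header line.
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# usage (ref_gen.fa as in the commands below):
# count_fasta_sequences ref_gen.fa
</pre>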

<pre>

#Create index and dictionary for reference

samtools faidx ref_gen.fa

java -jar ~/soft/picard-tools-1.101/CreateSequenceDictionary.jar R=ref_gen.fa O=ref_gen.dict

</pre>

It is essential that read group info is included in our BAM files (we assume that these have already been sorted and indexed). If we did not add it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from Picard:

<pre>

#mysample in the read group info should be replaced by some mnemonic describing the

#experiment/sample. A shortened file name or an SRA file prefix, like SRR197978, are good choices.

java -jar ~/soft/picard-tools-1.101/AddOrReplaceReadGroups.jar \

I=chicken_genomic_12_vs_refgen.bwa_mem.bam \

O=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam \

RGLB=mysampleLB RGPL=Illumina RGPU=mysamplePU RGSM=mysampleSM \

VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE SO=coordinate

</pre>

At this stage we have mapped reads with read group info in a BAM file. The next step is to mark duplicate reads (roughly, PCR artifacts) in this file. With Picard utilities we can almost always use CREATE_INDEX=true, so we do not need to run a separate indexing step.

<pre>

java -jar ~/soft/picard-tools-1.101/MarkDuplicates.jar I=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam O=chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam METRICS_FILE=metrics.file CREATE_INDEX=true

</pre>

After marking the duplicated reads, the next step is to realign reads around indels. This is done in two steps. At this stage it also becomes more and more cumbersome to execute these steps as commands typed on the command line. The solution is to cut and paste them into script files, then change the script permissions and execute the scripts instead.

<pre>

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T RealignerTargetCreator \

-R ref_gen.fa \

-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \

-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T IndelRealigner \

-R ref_gen.fa \

-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \

-targetIntervals chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list \

-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam

</pre>
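The advice about script files can be sketched as follows. The file name run_step.sh and the echo line are placeholders; in practice the body of the script would hold the java commands shown above.

<pre>
# write the commands into a script file (run_step.sh is a placeholder name)
cat > run_step.sh <<'EOF'
#!/bin/bash
# stop at the first failing command instead of carrying on with bad input
set -e
echo "realignment commands go here"
EOF

# make the script executable, then run it
chmod +x run_step.sh
./run_step.sh
</pre>

Keeping each step in its own script also leaves a record of the exact parameters used.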

The recommended Best Practices step here is to run base recalibration, meaning that the base qualities are re-estimated after taking the mapping results into account. It adds several steps, and while it may be worthwhile, Illumina has become better at estimating the base qualities of its reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complexity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.

Finally:

<pre>

java -Xmx240G -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref_gen.fa -I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam -o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf

</pre>
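As a first sanity check of the resulting VCF file we can count the variant records. A minimal sketch, assuming an uncompressed VCF (count_vcf_records is a made-up name):

<pre>
# Count the variant records in an uncompressed VCF file:
# header lines start with "#", every other line is one record.
count_vcf_records() {
    grep -vc '^#' "$1"
}

# usage (calls.vcf is a placeholder file name):
# count_vcf_records calls.vcf
</pre>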

===Quantifications of mapped reads===

* Gene quantifications (DNA & RNA levels)

===Finding gene ends by mapping post-splice leader and polyA sequences===

=== Mapping Illumina reads using LAST===

===Viewing mappings and SNPs===
