
=Before we start=

==Teacher==

Darek Kedra (darek dot kedra at gmail dot com)

I work at the CRG (Centre for Genomic Regulation) in Barcelona, Spain, in the group of Cedric Notredame (author of the T-Coffee aligner).

Our work place looks like this:

[http://pasteur.crg.es/portal/pls/portal/docs/1/7362.GIF CRG image]

My current research topics include RNASeq, genome assembly and annotation, SNP calling, lncRNA discovery. As for the parasites I am working with sequences from L.donovani, L.major and T.brucei.

==Organization etc.==

* let's turn off / silence the cell phones

* some exposure to command line interface / Unix is assumed (cut and paste of commands will be OK)

* in case of Unix command line problems, check the page from the other course [http://openwetware.org/wiki/Wikiomics:WinterSchool_day1#Introduction_to_Linux_and_the_command_line here]

* this page will stay online after the course "forever", and we will improve it over time (old versions will be accessible through the History link at the top)

* you are welcome to register as a wiki user and comment on the Talk page / fix any errors / ask for enhancements

* in the boxed parts of the page, lines starting with # are comments; the lines below them are examples of commands. After typing the equivalents of these in your terminal, press Enter

=Why Next Generation Sequencing=

Typical use cases of NGS data include:

# de novo genome assembly

# mapping reads to a known genome

## whole genome random reads (DNA)

## targeted resequencing: regions selected by long range PCR/hybridisation

## exome mapping: DNA fragments are preselected using microarrays/beads with attached exonic sequences, so only fragments hybridising to them are selected

## RNASeq: just RNA, but specific protocols may target only the 5- or 3-prime ends of transcripts, or miRNA with small size selection. RNASeq is also the application in which stranded libraries are used, e.g. to differentiate between sense and antisense transcripts

## ChIP-Seq: chromatin immunoprecipitation combined with sequencing of DNA fragments bound by histones/transcription factors.

## bisulfite sequencing (non-methylated cytosines converted to uracil, read as Ts on the non-methylated sequences)

# metagenomics: sequencing populations of microorganisms to investigate the diversity of species/strains

=NGS file formats overview=

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

* text based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG)

* binary (BAM, BCF, SFF (454 sequencer data))

In principle, we can view and manipulate text-based formats without special tools, but we need such tools to access and view binary formats. To complicate things a bit, the text-based formats are often compressed to save space, making them de facto binary, yet still easy to read by eye using standard Unix tools. Also, even though one can read values in several columns and from tens of rows, we still need dedicated programs to make sense of millions of rows and, e.g., encoded columns.
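
For example, a gzipped FASTQ file can be inspected on the fly with standard Unix tools (a minimal sketch; the file name is just a placeholder):

<pre>
#peek at the first two records of a gzipped FASTQ without decompressing it to disk
zcat my_reads.fastq.gz | head -8

#count the number of lines (number of reads = lines / 4)
zcat my_reads.fastq.gz | wc -l
</pre>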

On top of these data/results files, some programs require a companion file (often called an index) for faster access. See e.g.

* FASTA (.fa & .fai)

* BAM and BAI formats (suffixes .bam & .bai),

* VCF (.vcf & .vcf.idx).
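
For instance, the first two kinds of index can be created with samtools (file names here are placeholders):

<pre>
#create genome.fa.fai
samtools faidx genome.fa

#create mapped_reads.sorted.bam.bai (the BAM must be coordinate-sorted first)
samtools index mapped_reads.sorted.bam
</pre>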

The important thing for all files containing positions on chromosomes (mappings, features) is their numbering. To make things complicated, we have formats where the numbering starts with 1 and others where it starts with 0.

* 1-based formats: GFF/GTF, SAM/BAM, WIG

* 0-based formats: BED, CN (copy number data, not covered)

Programs usually deduce the correct numbering scheme automatically, but whenever you are converting from one format to another, watch out for this "feature".
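
As an illustration of the 1-based vs 0-based issue, here is a minimal awk sketch (hypothetical file names) converting GFF intervals (1-based, end-inclusive) to BED (0-based, end-exclusive):

<pre>
#shift the start by one, keep the end: GFF column 4 minus 1 becomes the BED start
awk 'BEGIN{OFS="\t"} !/^#/ {print $1, $4-1, $5}' features.gff > features.bed
</pre>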

More info on file formats used by IGV:

http://www.broadinstitute.org/igv/?q=book/export/html/16

==FASTA (just a few tricks)==

<pre>

#count number of sequences in multiple fasta

grep -c "^>" multi_fasta.fa

#list sequence names in fasta, display screen by screen with "q" to escape:

grep "^>" | less

</pre>

While mature NGS programs will most likely handle complex FASTA sequence names correctly, you may sometimes have problems with various scripts. To fix this and keep just the first, hopefully unique, single-word sequence name:

<pre>

#!/usr/bin/env python

"""
name: fix_fasta_header.py

usage: ./fix_fasta_header.py input_fasta.fa > output_fasta.fa

CAVEAT: it does not check if the resulting sequence names are unique
"""

import sys

fn = sys.argv[1]

for line in open(fn):
    if line[0] == ">":
        # keep only the first whitespace-separated word of the header
        print line.split()[0]
    else:
        # sequence lines keep their own newline, hence the trailing comma
        print line,

</pre>

For reformatting sequences so that all sequence lines have 70 columns, you can use the fastareformat program from the exonerate package: https://github.com/nathanweeks/exonerate

<pre>

fastareformat input.fa > output.fa

</pre>

For extracting sequences from your FASTA file you can use the pyfasta library:

<pre>

#single contig/chromosome

pyfasta extract --header --fasta Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa

#list of sequences in the desired order

#input seqids.txt file (just two lines with sequence names):

LmxM.01

LmxM.00

#command

pyfasta extract --header --fasta Lmex_genome.sort.fa --file seqids.txt > LmxM.two_contigs.fa

</pre>

==FASTQ==

===Format and quality encoding checks===

Already in the 1990s, when all sequencing was done using the Sanger method, a big breakthrough in genome assembly came when the individual bases in the reads (ACTG) were assigned quality values. In short, some parts of sequences had multiple bases with a lower probability of being called right, so it makes sense that matches between high-quality bases are given a higher score, be it during assembly or mapping, than e.g. ends of reads with multiple doubtful / unreliable calls. This concept was borrowed by Next Generation Sequencing. While we can hardly read the individual bases in flowgrams by eye, it is still possible for the Illumina/454/etc. software to calculate base qualities. The FASTQ format (files usually have the suffix .fq or .fastq) nowadays contains 4 lines per sequence:

# sequence name (should be unique in the file)

# sequence string itself with ACTG and N

# extra line starting with a "+" sign, which in the past contained the repeated sequence name

# string of quality values (one character per base), where each character is translated into a number by downstream programs

Here is how it looks:

<pre>

@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1

CAGCAAGTTGATCTCTCACCCAGAGAGAAGTGTTTCATGCTAAGTGGCACTGGTGCAGAACAGTTCTGCAATGG

+

IHIIHDHIIIHIIIIIIHIIIDIIHGGIIIEIIIIIIIIIIIIGGGHIIIHIIIIIIBBIEDGGFHHEIHGIGE

</pre>
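
Each quality character simply encodes the Phred score as its ASCII code minus the offset. A quick sanity check for a single character, assuming offset 33 (Sanger) encoding:

<pre>
#'I' has ASCII code 73, so with offset 33 the Phred quality is 40
printf 'I' | od -An -tu1 | awk '{print $1 - 33}'
</pre>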

'''CAVEAT''': When planning to do SNP calling using GATK, do not modify the part of the name describing the location on the sequencing lane:

<pre>

HWUSI-EAS1696_0025_FC:3:1:2892:17869

</pre>

This part is used for detecting optical duplicates.

Unfortunately Solexa/Illumina did not follow the same quality encoding as people doing Sanger sequencing, so there are a few iterations of the standard, with quality encodings containing different characters.

For the inquisitive:

http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember is that we must know which quality encoding we have in our data, because mappers require this information, and getting it wrong will make our mappings either impossible (some mappers may quit when encountering a wrong quality value) or at best unreliable.

There are two main quality encodings: Sanger and old Illumina. Two other terms, offset 33 and offset 64, are also used to describe them:

* offset 33 == Sanger / Illumina 1.9

* offset 64 == Illumina 1.3+ to Illumina 1.7

Therefore, if we do not have direct information from the sequencing facility about which version of the Illumina software was used, we can still find out by investigating the FASTQ files themselves. Instead of going by eye, we use the program FastQC. For the best results / a full report we need to use the whole FASTQ file as input, but for quick and dirty quality encoding recognition, 100K reads are enough:

<pre>

head -400000 my_reads.fastq > 100K_head_my_reads.fastq

fastqc 100K_head_my_reads.fastq

#this creates the 100K_head_my_reads.fastq_fastqc/ directory

grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt

#output:

Encoding Sanger / Illumina 1.9

</pre>

'''CAVEAT''': all this works only on unfiltered FASTQ files. Once you remove the lower quality bases/reads containing them, guessing which encoding format is present in your files is problematic.

Here is a bash script containing an awk one-liner to detect the quality encoding in both gzipped and uncompressed FASTQ files.

<pre>

#!/bin/bash

file=$1

if [[ $file ]]; then
    command="cat"
    if [[ $file =~ .*\.gz ]]; then
        command="zcat"
    fi
    command="$command $file | "
fi

command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offset 64\";exit 0}}}'"

eval $command

</pre>

===Types of data===

* read length

from 35 bp in some old Illumina reads to 250+ bp in MiSeq. The current sweet spot is between 70 and 150 bp.

* single vs paired

Either just one side of the insert is sequenced, or sequencing is done from both ends. Single reads are cheaper and faster to produce, but paired reads allow for more accurate mapping and for detection of large insertions/deletions in the genome.

Most of the time, forward and reverse reads facing each other end-to-end are used.

* insert length

With the standard protocol, the inserts are anywhere between 200 and 500 bp. Sometimes, especially for de novo sequencing, insert sizes can be smaller (160-180 bp) with 100 bp long reads, allowing for overlap between the ends of the reads. This can improve the genome assembly (e.g. the Allpaths-LG assembler requires such reads). Also, with some mappers (LAST), using the combined longer reads used to give better mappings (covering regions not unique enough for shorter reads) than 2x single-end mapping; with paired-end mappings the effects are modest.

Program for combining overlapping reads:

FLASH: http://ccb.jhu.edu/software/FLASH/

To run it:

<pre>

flash SRR611712_1.fastq SRR611712_2.fastq -o SRR611712_12.flash

</pre>
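
Assuming FLASH's default output naming, this should produce SRR611712_12.flash.extendedFrags.fastq with the merged reads, plus .notCombined_1 / .notCombined_2 files with the read pairs that could not be joined.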

For improving the assembly or the detection of larger genome rearrangements, there are other library types with various insert sizes, such as 2.5-3 kb, 5 kb and more. Often the sequencing yields from such libraries are lower than from conventional ones.

* stranded vs unstranded (RNASeq only)

We can obtain reads from just a given strand using special Illumina wet-lab kits. This is of great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

===quality checking (FastQC)===

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze the average quality, over-represented sequences, the number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command-line or GUI mode.
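
A minimal command-line run (one report per input file, written next to the inputs; the FASTQ names below are placeholders):

<pre>
#command-line mode
fastqc SRR611712_1.fastq SRR611712_2.fastq

#GUI mode: start fastqc without arguments and load files via the File menu
fastqc
</pre>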

* good quality report:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

* bad quality FastQC report

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

===trimming & filtering===

Depending on the application, we can try to improve the quality of our data set by removing bad quality reads, clipping the last few problematic bases, or searching for sequencing artifacts, such as Illumina adapters.

All this makes much sense for de novo sequencing, where genome assemblies can be improved by data clean-up. It has a low priority for mapping, especially when we have high coverage: bad quality reads etc. will simply be discarded by the mapper.

You can read more about quality trimming for genome assembly in this blog post by Nick Loman:

http://pathogenomics.bham.ac.uk/blog/2013/04/adaptor-trim-or-die-experiences-with-nextera-libraries/

====Trimmomatic====

web: http://www.usadellab.org/cms/index.php?page=trimmomatic

version (Sep 2014): 0.32

From the manual:

Paired End:

<pre>

java -jar /path/to/trimmomatic-0.32.jar PE -phred33 \

input_forward.fq.gz input_reverse.fq.gz \

output_forward_paired.fq.gz output_forward_unpaired.fq.gz \

output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

</pre>

This will perform the following:

Remove adapters

Remove leading low quality or N bases (below quality 3)

Remove trailing low quality or N bases (below quality 3)

Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15

Drop reads shorter than 36 bases

Single End:

<pre>

java -jar /path/to/trimmomatic-0.32.jar SE -phred33 \

input.fq.gz output.fq.gz \

ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

</pre>

This will perform the same steps, using the single-ended adapter file.

====Tagdust (for simple unpaired reads)====

Tagdust is a program for removing Illumina adapter sequences from the reads containing them. Reads containing 6-8 bases that do not come from the genome will be impossible to map with typical mappers, which often have just a 2-mismatch limit. Tagdust works in unpaired mode, so when using paired reads we have to "mix and match" the two outputs to allow for paired mappings.

<pre>

tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq

</pre>

===Error correction===

For some applications, like de novo genome assembly, one can correct sequencing errors in the reads by comparing them with other reads of almost identical sequence. One program which performs this and is relatively easy to install and run is Coral.

====Coral====

web site: http://www.cs.helsinki.fi/u/lmsalmel/coral/

version: 1.4

It requires a machine with plenty of RAM for correcting individual Illumina files (we ran it with 96GB of RAM).

<pre>

#Illumina reads

./coral -fq input.fq -o output.fq -illumina

#454 reads

./coral -fq input.454.fq -o output.454.fq -454

#correcting 454 reads with Illumina reads:
#Coral cannot use more than one input file, so one has to combine the
#Illumina and 454 reads into one FASTQ file, noting the number of
#Illumina reads used. To prepare such a file:

cat 10Millions_illumina_reads.fq > input_4_coral.fq

cat some_number_of_454_reads.fq >> input_4_coral.fq

#run coral with the command (this corrects just the 454 reads, not the Illumina ones):

coral -454 -fq input_4_coral.fq -o input_4_coral.corrected.fq -i 10000000 -j 10000000

#In real life you have to count the number of reads in your Illumina FASTQ file, e.g. (assuming you do not have wrapped sequence/qualities, FASTQ 1 read = 4 lines):

wc -l illumina_reads.fq | awk '{print $1/4}'

#if in doubt, use fastqc to get the numbers

</pre>

===Public FASTQ data: Short Read Archive vs ENA===

While we will often have our data sequenced in house or provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if just treating it as a control for our pipeline. Also, doing exactly the same thing, say assembling genes from RNASeq data, but with newer versions of the software and/or more data, will likely improve on the results of previous studies.

There are two main places to get such data sets:

* NCBI Short Read Archive / Taxonomy Browser:

** SRA: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj

* EBI European Nucleotide Archive (ENA): http://www.ebi.ac.uk/ena

For both of them it is a good idea to install the Aspera Connect program to reduce download times:

* http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in/

Click on Resources, download and install the web plugin.

<pre>

* go there

* type in: Leishmania major RNA

* we get "SRA Experiments 6"

* click on "6"

* one of them is: "Transcriptome Analysis of Leishmania major strain Friedlin", ERX190352

</pre>

*Taxonomy Browser http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

<pre>

* go there

* type in: Leishmania major

* click on the checkbox SRA Experiments

* click on Display button

* we get 58 public experiments, 7 of which are RNA

* click on RNA (7) on the left

* note

"Whole Genome Sequencing of Leishmania major strain Friedlin"

Accession: SRX203187

CAVEAT: not everybody submits the right descriptions of their experiments. In case of doubt, download and map.

</pre>

Extracting FASTQ from an SRA archive is easy with the sra-toolkit installed on your computer.

You get it here: https://github.com/ncbi/sratoolkit

<pre>

fastq-dump SRR1016916.sra

#Result:

SRR1016916.fastq

</pre>
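
For paired-end runs it is worth splitting the mates into separate files with the --split-files option of fastq-dump:

<pre>
fastq-dump --split-files SRR1016916.sra

#Result (for a paired-end run):
#SRR1016916_1.fastq
#SRR1016916_2.fastq
</pre>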

You can also get multiple SRA files in one step, without 20 clicks at NCBI, with configured Aspera, by writing a shell script (one line per file), preferably generated by a (Python/Perl/Ruby) script from a list of files to get:

<pre>

ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \

anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012154/ ./

ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \

anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012155/ ./

</pre>
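
A minimal bash sketch of generating such a script from a list of study accessions (the studies.txt input and get_sra.sh output are hypothetical; note that the first six characters of an SRP accession form its parent directory on the FTP site):

<pre>
#one ascp line per SRP accession listed in studies.txt
while read srp; do
    echo "ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/${srp:0:6}/${srp}/ ./"
done < studies.txt > get_sra.sh
</pre>

Running "bash get_sra.sh" then fetches all the listed studies.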

Which one to use? ENA may be easier, as you get gzipped FASTQ files directly. But the NCBI tools may have a better interface at times, so you can search for an interesting data set at NCBI, then store the names of the experiments and download the fastq.gz files from ENA.

==SAM and BAM file formats==

The SAM file format stores information about the results of mapping reads to a genome. It starts with a header describing the format version, the sorting order (SO) of the reads, and the genomic sequences to which the reads were mapped. Numbering starts from base 1 and the format is line oriented (one read alignment per line).

The minimal version looks like this:

<pre>

@HD VN:1.0 SO:unsorted

@SQ SN:1 LN:171001

@PG ID:bowtie2 PN:bowtie2 VN:2.1.0

</pre>

It can contain both mapped and unmapped reads (we are mostly interested in the mapped ones). Here is an example:

<pre>

#unmapped reads:

SRR594632.1 4 * 0 0 * * 0 0 NTGAGGACAAAGCTCCGCTGCTGCTCGCTTTCCCCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

SRR594632.2 4 * 0 0 * * 0 0 NTGAGCGAACTATTGATATTCGGCATAGAACAAAAA BQKKHNMKOR___T_TTRWWVTVTVVTTOV___QQQ

SRR594632.3 4 * 0 0 * * 0 0 NTGTTATTCAATCTGAGCATCGTTATCTTGACCATG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


#mapped reads

SRR594632.20 0 LmxM.20 3139684 25 36M * 0 0 NTGAAAATAAAGAACATCTGAACATTTCTCTGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0A0A34

SRR594632.21 16 LmxM.22 312482 25 36M * 0 0 ATACAAGGCACAAAAATAAAGCAGTCGAGAGAGCAN BBBBBBBBBBBBBBBB__________RRRRRLLLQB XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:34T0G0

SRR594632.22 16 LmxM.01 33973 25 36M * 0 0 AACACTGTGCGTGTGCGCTTGTCTAAGATGAGCCAN _WTRWWTOTVTTVPTT__________PPLOPJJJKB XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:34T0A0

</pre>

In short, it is a complex format where each line holds detailed information about the mapped read: its mapping quality, mapped position(s), strand, etc. The exact description (with BAM and examples) takes 15 pages: http://samtools.sourceforge.net/SAMv1.pdf

Use BAM instead of SAM files (speed of access, size, compatibility with tools). There are multiple tools to process and extract information from SAM and its compressed form, BAM. A few used on a daily basis:

* samtools

* Picard

* bamtools

* bedtools

Typical tasks:

<pre>

#view the BAM file

samtools view your.bam | less

#view the header of BAM file

samtools view -H your.bam | less

#sort the BAM/SAM file

java -jar /path/to/SortSam.jar I=mapped_reads.unsorted.sam O=mapped_reads.sorted.bam \

SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

#merge (sorted) BAM files:

java -jar /path/to/MergeSamFiles.jar \

I=mapped_reads.set1.bam \

I=mapped_reads.set2.bam \

I=mapped_reads.set3.bam \

O=mapped_reads.merged.bam \

SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true ASSUME_SORTED=true

#extracting as BAM reads mapped to certain chromosome (LmxM.01):

bamtools filter -in ERR307343_12.Lmex.bwa_mem.Lmex.bwa.sorted.bam \

-out chr_1_ERR307343_12.Lmex.bwa_mem.Lmex.bam -region LmxM.01

</pre>
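
Two more everyday one-liners worth knowing (standard samtools commands; file names are placeholders):

<pre>
#quick overview of mapping statistics
samtools flagstat mapped_reads.sorted.bam

#index a coordinate-sorted BAM (required by viewers such as IGV)
samtools index mapped_reads.sorted.bam
</pre>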

==GTF/GFF==

The most commonly used sequence annotation format with few flavors.

GTF Fields:

<pre>

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.

source - name of the program that generated this feature, or the data source (database or project name)

feature - feature type name, e.g. Gene, Variation, Similarity

start - Start position of the feature, with sequence numbering starting at 1.

end - End position of the feature, with sequence numbering starting at 1.

score - A floating point value.

strand - defined as + (forward) or - (reverse).

frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.

attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

</pre>

Sample GFF output from Ensembl export:

<pre>

X Ensembl Repeat 2419108 2419128 42 . . hid=trf; hstart=1; hend=21

X Ensembl Repeat 2419108 2419410 2502 - . hid=AluSx; hstart=1; hend=303

X Ensembl Repeat 2419108 2419128 0 . . hid=dust; hstart=2419108; hend=2419128

X Ensembl Pred.trans. 2416676 2418760 450.19 - 2 genscan=GENSCAN00000019335

X Ensembl Variation 2413425 2413425 . + .

X Ensembl Variation 2413805 2413805 . + .

</pre>

And from TriTrypDB (gff3)

<pre>

LmjF.01 TriTrypDB CDS 12073 12642 . - 0 ID=cds_LmjF.01.0040-1;Name=cds;description=.;size=570;Parent=rna_LmjF.01.0040-1

LmjF.01 TriTrypDB exon 12073 12642 . - . ID=exon_LmjF.01.0040-1;Name=exon;description=exon;size=570;Parent=rna_LmjF.01.0040-1

LmjF.01 TriTrypDB gene 15025 17022 . - . ID=LmjF.01.0050;Name=LmjF.01.0050;description=carboxylase%2C+putative;size=1998;web_id=LmjF.01.0050;locus_tag=LmjF.01.0050;size=1998;Alias=321438056,389592315,LmjF1.0050,LmjF01.0050,LmjF.01.0050,LmjF01.0050:pep,LmjF01.0050:mRNA

</pre>

The detailed description of GFF3:

http://www.sequenceontology.org/gff3.shtml

'''IGV-centered CAVEAT''':

GFF files downloaded from various sources often contain more annotations than a single IGV track can handle. Typical offenders are chromosome/scaffold lines, which can make the gene annotations invisible.

Since authors are quite inventive in naming their feature types, and key words like "chromosome" may also appear in column 9, you can:

<pre>

#see what is there:

grep -v "^#" TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff | awk '{print $3}' | sort | uniq -c | sort -n

#result for TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff:

1 random_sequence

16 rRNA

34 chromosome

83 tRNA

714 transcript

8250 mRNA

8336 CDS

9063 gene

9149 exon

908081

</pre>

As you can see, we can try to remove just the 34+1 lines (chromosome + random_sequence) and check that we get 35 fewer entries when running the same command on the result. Just remember that filtering with grep -v "^#" also removes the header, which we may want to keep.
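
A minimal sketch of such filtering with awk, this time keeping the header lines (the output file name is arbitrary):

<pre>
#keep comment/header lines, drop chromosome and random_sequence features
awk '/^#/ || ($3 != "chromosome" && $3 != "random_sequence")' \
    TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff > TriTrypDB-8.0_Lmex.filtered.gff
</pre>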

==BED==

Another text annotation file format from UCSC, compatible with its Genome Browser:

http://genome.ucsc.edu/cgi-bin/hgGateway

Full format description:

http://genome.ucsc.edu/FAQ/FAQformat.html#format1

'''CAVEAT''': the first base in a contig/chromosome is numbered 0

* Essential 3 columns

1: '''chrom''' - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2: '''chromStart''' - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3: '''chromEnd''' - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

<pre>

#just regular genomic intervals, e.g. for GC calculations

LmxM.01 0 10000

LmxM.01 10000 20000

LmxM.01 20000 30000

LmxM.01 30000 40000

</pre>
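
Because BED intervals are half-open, a feature length is simply chromEnd minus chromStart, which makes interval arithmetic easy. A one-liner to check (intervals.bed stands for a file like the example above):

<pre>
#print each interval with its length appended (here always 10000)
awk '{print $0, $3 - $2}' intervals.bed
</pre>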

* Often used next 3 columns

4: '''name''' - Defines the name of the BED line.

5: '''score''' - A score between 0 and 1000.

6: '''strand''' - Defines the strand - either '+' or '-'.

<pre>

#intervals with the name, score and strand columns filled in

LmxM.01 210 720 Rep1 100 -

LmxM.01 800 1234 ABC|cds 1000 +

LmxM.01 2100 3030 DEF|gen 100 +

</pre>

* Last 6 columns for extended features

7: thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.

8: thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9: itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line.

10: blockCount - The number of blocks (exons) in the BED line.

11: blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12: blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

==VCF==

Stands for Variant Call Format. Text file format for storing information about SNPs, insertions and deletions.

It is rather complex, see detailed description:

http://www.1000genomes.org/node/101

Obtaining variants listed in this form is a multistep but well standardised procedure. See the GATK section below.

Here are the examples from our data:

<pre>

#simple SNP

LmxM.01 17218 . G A 1818.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=2.761;DP=55

#insertion

LmxM.01 38177 . C CTG 367.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.082;DP=34

#deletion

LmxM.01 99318 . GCACACACA G 1774.73 . AC=2;AF=1.00;AN=2;DP=30

</pre>

For using it with IGV, we need a VCF index, created by GATK.

==WIG==

A line-oriented format for storing and displaying numerical information along genomic intervals. The generic form of WIG:

<pre>

a line defining the type of this section, the chromosome, start position, step and more

number1

number2

</pre>

The intervals can be regular, as when we compute some measure across the whole genome. This is the '''fixedStep''' type:

<pre>

fixedStep chrom=LmxM.01 start=0 step=10000 span=10000

0.572100

0.612700

0.637300

0.628000

0.637200

0.643700

0.642800

</pre>

The fixedStep WIG above comes from GC calculations in fixed windows along the genome.

The procedure is described here (we start from "create a N-base wide step file"):

http://wiki.bits.vib.be/index.php/Create_a_GC_content_track

When the calculated numbers different from 0 are sparse across our genomic intervals, it is simpler to use '''variableStep''':

<pre>

variableStep chrom=LmxM.17 span=41

1 0.142857

variableStep chrom=LmxM.17 span=249

43 0.142857

variableStep chrom=LmxM.17 span=4

292 0.166667

variableStep chrom=LmxM.17 span=7

296 0.142857

</pre>

This variableStep example comes from the gem-mappability program from the GEM mapper suite.

<pre>

#index the genome (fix FASTA headers first to one word sequence names)

gem-indexer -i Lmex_genome.fa -o Lmex_genome

#calculate mapability for 100bp fragments

gem-mappability -I Lmex_genome.gem -o Lmex_genome -l 100

#convert to WIG

gem-2-wig -I Lmex_genome.gem -i Lmex_genome.mappability -o Lmex_genome.mappability

#Result (loadable in IGV):

Lmex_genome.mappability.wig

</pre>
