One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.
When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?
1. What marts and datasets are available?
The function listMarts() returns a data frame with mart names and descriptions.
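A minimal sketch of that call, assuming the biomaRt package is installed from Bioconductor and the default Ensembl host is reachable. The variable name `marts` is just for illustration, and the marts you see will depend on the current Ensembl release.

```r
library(biomaRt)

# Ask the default BioMart host which marts it offers.
# The result is an ordinary data frame, one row per mart.
marts <- listMarts()
marts
```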
The first four marts in that data frame are what you would see if you visited Ensembl BioMart on the Web and clicked the “choose database” drop-down box.
Given a mart object, listDatasets() returns the available datasets. We get a mart object using useMart().
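Something like the following should work, assuming the short mart name "ensembl" is accepted by your version of biomaRt (if not, use the name reported by listMarts()). The variable names are illustrative.

```r
library(biomaRt)

# Connect to the Ensembl genes mart without choosing a dataset yet
ensembl <- useMart("ensembl")

# One row per dataset (species); human is expected to appear
# as "hsapiens_gene_ensembl"
datasets <- listDatasets(ensembl)
head(datasets)
```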
2. How do I specify a different genome build version?
Assuming that you want human genes in the most recent Ensembl build, you supply mart and dataset information from the previous step to useMart() like this:
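A sketch of that step. The dataset name "hsapiens_gene_ensembl" is assumed to match what listDatasets() reports for human; check it against your own listing.

```r
library(biomaRt)

# Current Ensembl genes mart, restricted to the human dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
mart
```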
What if you want an older genome build? First, visit the Ensembl Archives page for more information. Next, use the URL information there to supply a host = argument. For example, to see what marts are available for Ensembl version 72 (genome build GRCh37.p11):
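A hedged sketch: Ensembl archive hosts are named by release date, and version 72 corresponds to June 2013, so the hostname below is an assumption based on that naming scheme. Ensembl retires old archives over time, so this particular host may no longer respond; substitute whichever hostname the Archives page currently lists.

```r
library(biomaRt)

# Assumed archive hostname for Ensembl 72 (June 2013);
# confirm it on the Ensembl Archives page before relying on it
archive.host <- "jun2013.archive.ensembl.org"

marts72 <- listMarts(host = archive.host)
marts72
```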
Available datasets can be found as before with the addition of the host = argument to useMart(). To create the version 72 mart object:
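Roughly like this, under the same assumption about the archive hostname. The mart name "ENSEMBL_MART_ENSEMBL" is what I would expect listMarts() to report for the genes mart on an archive host; use whatever name your listing actually shows.

```r
library(biomaRt)

# Assumed archive hostname for Ensembl 72; may have been retired
archive.host <- "jun2013.archive.ensembl.org"

# Human genes as annotated in that archived release
mart72 <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                  dataset = "hsapiens_gene_ensembl",
                  host    = archive.host)

# Datasets available in the archived mart
head(listDatasets(mart72))
```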
3. What filters and attributes can I use?
Let’s review the BioMart terminology.
Attributes are the identifiers that you want to retrieve. For example HGNC gene ID, chromosome name, Ensembl transcript ID.
Filters are the identifiers that you supply in a query. Some but not all of the filter names may be the same as the attribute names.
Values are the filter identifiers themselves. For example, the values of the filter “HGNC symbol” could be 3 genes: “TP53”, “SRY” and “KIAA1199”.
But how do you know what attributes and filters are available? If you guessed listAttributes() and listFilters(), you’d be right.
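Both functions take a mart object with a dataset selected and return data frames. A minimal sketch, assuming the current human genes dataset; the variable names are illustrative.

```r
library(biomaRt)

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Everything that can be retrieved from this dataset
attrs <- listAttributes(mart)
head(attrs)

# Everything that can be used to restrict a query
filts <- listFilters(mart)
head(filts)
```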
You can search for specific attributes by running grep() on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:
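One way to do that, assuming the Affymetrix attribute names contain the string "affy" (which has been the Ensembl convention, but check the output):

```r
library(biomaRt)

mart  <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
attrs <- listAttributes(mart)

# Keep only the rows whose attribute name mentions "affy"
affy.attrs <- attrs[grep("affy", attrs$name), ]
affy.attrs
```

The same approach works on the data frame returned by listFilters(), and you can grep the description column instead of the name if you are not sure what the internal name looks like.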
4. A worked example
As an example, let’s find SNPs on the Y chromosome which have starts that are located within exons. First, we create two mart objects for the genes and SNP datasets. We query the first mart for exons on chromosome Y and the second for SNPs on the same chromosome. Finally we’ll employ the incredibly-useful findOverlaps() function from the GenomicRanges package to get the exonic SNPs.
There are various ways to run BioMart queries; I find the simplest is to use getBM().
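Here is a sketch of the whole procedure. Treat the specifics as assumptions to verify with listMarts(), listDatasets(), listAttributes() and listFilters(): the SNP mart name "ENSEMBL_MART_SNP", the dataset "hsapiens_snp", and the attribute and filter names in both queries. The SNP query returns a lot of rows and can be slow or time out.

```r
library(biomaRt)
library(GenomicRanges)

# Two marts: one for gene structures, one for variation (names assumed)
mart.genes <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                      dataset = "hsapiens_gene_ensembl")
mart.snps  <- useMart(biomart = "ENSEMBL_MART_SNP",
                      dataset = "hsapiens_snp")

# Exons on chromosome Y: ID, start and end coordinates
exons <- getBM(attributes = c("ensembl_exon_id",
                              "exon_chrom_start",
                              "exon_chrom_end"),
               filters    = "chromosome_name",
               values     = "Y",
               mart       = mart.genes)

# SNPs on chromosome Y: ID and start coordinate
snps <- getBM(attributes = c("refsnp_id", "chrom_start"),
              filters    = "chr_name",
              values     = "Y",
              mart       = mart.snps)

# Convert both result sets to genomic ranges
exon.ranges <- GRanges(seqnames = rep("Y", nrow(exons)),
                       ranges   = IRanges(start = exons$exon_chrom_start,
                                          end   = exons$exon_chrom_end))

# Each SNP is represented by its start position only (width 1)
snp.ranges <- GRanges(seqnames = rep("Y", nrow(snps)),
                      ranges   = IRanges(start = snps$chrom_start,
                                         width = 1))

# SNPs (query) whose start falls inside any exon (subject)
hits        <- findOverlaps(snp.ranges, exon.ranges)
exonic.snps <- unique(snps$refsnp_id[queryHits(hits)])
length(exonic.snps)
```

A SNP that falls in several overlapping exons produces several hits, which is why the IDs are passed through unique() at the end.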