R-bloggers.com

Annotables: R data package for annotating/converting Gene IDs

2015-11-13

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

I work with gene lists on a nearly daily basis: lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, and so on. Often I need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this; I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.

This package has basic annotation information from Ensembl release 82 for:

Human (grch38)

Mouse (grcm38)

Rat (rnor6)

Chicken (galgal4)

Worm (wbcel235)

Fly (bdgp6)

Each table contains:

ensgene: Ensembl gene ID

entrez: Entrez gene ID

symbol: Gene symbol

chr: Chromosome

start: Start

end: End

strand: Strand

biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.

description: Full gene name/description.

Additionally, there are tables for human and mouse (grch38_gt and grcm38_gt, respectively) that link Ensembl gene IDs to Ensembl transcript IDs.

Usage

The package isn’t on CRAN, so you’ll need devtools to install it.

It isn’t necessary to load dplyr, but the tables are tbl_df and will print nicely if you have dplyr loaded.
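The installation and loading steps might look like the sketch below. The GitHub repository path (stephenturner/annotables) is an assumption that is not stated in the text above, so adjust it if the package lives elsewhere.

```r
# Install devtools first if it is missing.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the package is hosted on GitHub under this repository path.
devtools::install_github("stephenturner/annotables")

# dplyr is optional, but the tables print more compactly when it is loaded.
library(dplyr)
library(annotables)
```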

Printing grch38 shows the human genes table (note the description column gets cut off because the table becomes too wide to print nicely).

Printing grch38_gt shows the human genes-to-transcripts table.

Tables are tbl_df, pipe-able with dplyr:

| ensgene | symbol | chr | start | end |
|---|---|---|---|---|
| ENSG00000158014 | SLC30A2 | 1 | 26037252 | 26046133 |
| ENSG00000173673 | HES3 | 1 | 6244192 | 6245578 |
| ENSG00000243749 | ZMYM6NB | 1 | 34981535 | 34985353 |
| ENSG00000189410 | SH2D5 | 1 | 20719732 | 20732837 |
| ENSG00000116863 | ADPRHL2 | 1 | 36088875 | 36093932 |
| ENSG00000188643 | S100A16 | 1 | 153606886 | 153613145 |

Table continues below

| description |
|---|
| solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013] |
| hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226] |
| ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021] |
| SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819] |
| ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304] |
| S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441] |
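A table like the one above could come from a short dplyr pipeline. The sketch below is not necessarily the exact code behind that table: it assumes a filter on biotype and chromosome, and it runs on a small stand-in tibble so it works without the package installed. Two rows are copied from the table above (their biotype is assumed to be protein_coding); the HYPO rows are hypothetical and exist only to be filtered out.

```r
library(dplyr)

# Stand-in for annotables::grch38, limited to the columns used here.
# Rows 1-2 come from the table in this post; rows 3-4 are hypothetical.
grch38 <- tibble(
  ensgene = c("ENSG00000158014", "ENSG00000173673",
              "ENSG_HYPOTHETICAL_1", "ENSG_HYPOTHETICAL_2"),
  symbol  = c("SLC30A2", "HES3", "HYPO1", "HYPO2"),
  chr     = c("1", "1", "2", "1"),
  start   = c(26037252L, 6244192L, 1000L, 2000L),
  end     = c(26046133L, 6245578L, 1500L, 2500L),
  biotype = c("protein_coding", "protein_coding",
              "protein_coding", "pseudogene"),
  description = c(
    "solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013]",
    "hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226]",
    "hypothetical gene on another chromosome",
    "hypothetical pseudogene on chromosome 1"
  )
)

# Keep protein-coding genes on chromosome 1, then pick the columns to show.
chr1_coding <- grch38 %>%
  filter(biotype == "protein_coding", chr == "1") %>%
  select(ensgene, symbol, chr, start, end, description) %>%
  head()

print(chr1_coding)
```

With the package installed, dropping the stand-in definition and running the same pipeline on the real grch38 table should work unchanged.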

Example with RNA-seq data

Here’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:

Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):

| gene | estimate | p.adjusted | symbol |
|---|---|---|---|
| ENSG00000152583 | -4.316 | 4.753e-134 | SPARCL1 |
| ENSG00000165995 | -3.189 | 1.44e-133 | CACNB2 |
| ENSG00000101347 | -3.618 | 6.619e-125 | SAMHD1 |
| ENSG00000120129 | -2.871 | 6.619e-125 | DUSP1 |
| ENSG00000189221 | -3.231 | 9.468e-119 | MAOA |
| ENSG00000211445 | -3.553 | 3.94e-107 | GPX3 |
| ENSG00000157214 | -1.949 | 8.74e-102 | STEAP2 |
| ENSG00000162614 | -2.003 | 3.052e-98 | NEXN |
| ENSG00000125148 | -2.167 | 1.783e-92 | MT2A |
| ENSG00000154734 | -2.286 | 4.522e-86 | ADAMTS1 |
| ENSG00000139132 | -2.181 | 2.501e-83 | FGD4 |
| ENSG00000162493 | -1.858 | 4.215e-83 | PDPN |
| ENSG00000162692 | 3.453 | 3.563e-82 | VCAM1 |
| ENSG00000179094 | -3.044 | 1.199e-81 | PER1 |
| ENSG00000134243 | -2.149 | 2.73e-81 | SORT1 |
| ENSG00000163884 | -4.079 | 1.073e-80 | KLF15 |
| ENSG00000178695 | 2.446 | 6.275e-75 | KCTD12 |
| ENSG00000146250 | 2.64 | 1.143e-69 | PRSS35 |
| ENSG00000198624 | -2.784 | 1.707e-69 | CCDC69 |
| ENSG00000148848 | 1.783 | 1.762e-69 | ADAM12 |

Table continues below

| description |
|---|
| SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220] |
| calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402] |
| SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925] |
| dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064] |
| monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833] |
| glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555] |
| STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885] |
| nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557] |
| metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406] |
| ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217] |
| FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125] |
| podoplanin [Source:HGNC Symbol;Acc:HGNC:29602] |
| vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663] |
| period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845] |
| sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186] |
| Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536] |
| potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678] |
| protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387] |
| coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487] |
| ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190] |
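The annotation step behind a table like this is a join from the tidy results onto the gene table. Here is a minimal sketch under a few assumptions: res_tidy is a stand-in built from the first three rows of the table above (rounded values, deliberately shuffled) rather than a real DESeq2 run, grch38 is a three-row stand-in for the package table, and the output path is a hypothetical temporary file.

```r
library(dplyr)

# Stand-in for the tidy DESeq2 results (values copied from the table above).
res_tidy <- tibble(
  gene       = c("ENSG00000101347", "ENSG00000152583", "ENSG00000165995"),
  estimate   = c(-3.618, -4.316, -3.189),
  p.adjusted = c(6.619e-125, 4.753e-134, 1.44e-133)
)

# Stand-in for annotables::grch38, limited to the columns used here.
grch38 <- tibble(
  ensgene = c("ENSG00000152583", "ENSG00000165995", "ENSG00000101347"),
  symbol  = c("SPARCL1", "CACNB2", "SAMHD1"),
  description = c(
    "SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]",
    "calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]",
    "SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925]"
  )
)

# Sort by adjusted p-value, keep the top hits, and attach symbol and description.
top_hits <- res_tidy %>%
  arrange(p.adjusted) %>%
  head(20) %>%
  inner_join(grch38, by = c("gene" = "ensgene")) %>%
  select(gene, estimate, p.adjusted, symbol, description)

# Writing to a file keeps every column on a single row per gene.
out_file <- file.path(tempdir(), "top_hits.csv")  # hypothetical output path
write.csv(top_hits, out_file, row.names = FALSE)
```

The named vector in by = c("gene" = "ensgene") is what lets the join match the results column gene against the annotation column ensgene.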

Explore!

This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, get some help, for example with ?grch38 once the package is loaded.

Let’s join the transcript table to the gene table.
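As a rough sketch, that join could look like the following. Everything in the stand-in tables is hypothetical (the GENE_ and TX_ identifiers are placeholders, not real Ensembl IDs), and the transcript column name enstxp is an assumption about the package's schema.

```r
library(dplyr)

# Hypothetical stand-in for annotables::grch38.
grch38 <- tibble(
  ensgene = c("GENE_A", "GENE_B", "GENE_C"),
  symbol  = c("SYM_A", "SYM_B", "SYM_C"),
  biotype = c("protein_coding", "protein_coding", "pseudogene")
)

# Hypothetical stand-in for annotables::grch38_gt (gene-to-transcript links).
# Assumption: the transcript ID column is called enstxp.
grch38_gt <- tibble(
  ensgene = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  enstxp  = c("TX_A1", "TX_A2", "TX_A3", "TX_B1", "TX_C1", "TX_C2")
)

# One row per transcript, with the gene-level annotation repeated on each row.
gt <- grch38_gt %>%
  inner_join(grch38, by = "ensgene")

print(gt)
```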

Now, let’s filter to get only protein-coding genes, group by the Ensembl gene ID, and summarize to count how many transcripts are in each gene. Then inner join that result back to the original gene list so we can select only the gene, number of transcripts, symbol, and description; mutate the description column so that it isn’t so wide that it breaks the display; arrange the returned data in descending order by the number of transcripts per gene; take the head to get the top 10 results; and, optionally, pipe that to further utilities to output a nice HTML table.

| ensgene | ntxps | symbol | description |
|---|---|---|---|
| ENSG00000165795 | 77 | NDRG2 | NDRG family member 2 |
| ENSG00000205336 | 77 | ADGRG1 | adhesion G protein-c |
| ENSG00000196628 | 75 | TCF4 | transcription factor |
| ENSG00000161249 | 68 | DMKN | dermokine [Source:HG |
| ENSG00000154556 | 64 | SORBS2 | sorbin and SH3 domai |
| ENSG00000166444 | 62 | ST5 | suppression of tumor |
| ENSG00000204580 | 58 | DDR1 | discoidin domain rec |
| ENSG00000087460 | 57 | GNAS | GNAS complex locus [ |
| ENSG00000169398 | 57 | | |
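The transcripts-per-gene steps described above can be sketched as one dplyr chain. This is an illustration on hypothetical stand-in data (placeholder GENE_ and TX_ identifiers, and an assumed enstxp column name), so the counts it produces are unrelated to the real numbers in the table.

```r
library(dplyr)

# Hypothetical stand-in for annotables::grch38.
grch38 <- tibble(
  ensgene = c("GENE_A", "GENE_B", "GENE_C"),
  symbol  = c("SYM_A", "SYM_B", "SYM_C"),
  biotype = c("protein_coding", "protein_coding", "pseudogene"),
  description = c(
    "hypothetical protein-coding gene A with a long name",
    "hypothetical protein-coding gene B with a long name",
    "hypothetical pseudogene C with a long name"
  )
)

# Hypothetical stand-in for annotables::grch38_gt (assumed column: enstxp).
grch38_gt <- tibble(
  ensgene = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  enstxp  = c("TX_A1", "TX_A2", "TX_A3", "TX_B1", "TX_C1", "TX_C2")
)

# Transcript-level table with gene annotation attached.
gt <- grch38_gt %>%
  inner_join(grch38, by = "ensgene")

top_by_txps <- gt %>%
  filter(biotype == "protein_coding") %>%            # protein-coding genes only
  group_by(ensgene) %>%                              # one group per gene
  summarize(ntxps = n_distinct(enstxp)) %>%          # transcripts per gene
  inner_join(grch38, by = "ensgene") %>%             # bring annotation back
  select(ensgene, ntxps, symbol, description) %>%
  mutate(description = substr(description, 1, 20)) %>%  # truncate for display
  arrange(desc(ntxps)) %>%                           # most transcripts first
  head(10)

print(top_by_txps)
```

The summarize step drops every column except the grouping key and the count, which is why the gene table has to be joined back in before selecting symbol and description.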