% filter(biotype=="protein_coding" & chr=="1") %>% select(ensgene, symbol, chr, start, end, description) %>% head %>% pander::pandoc.table(split.table=100, justify="llllll", style="rmarkdown")ensgenesymbolchrstartendENSG00000158014SLC30A212603725226046133ENSG00000173673HES3162441926245578ENSG00000243749ZMYM6NB13498153534985353ENSG00000189410SH2D512071973220732837ENSG00000116863ADPRHL213608887536093932ENSG00000188643S100A161153606886153613145Table: Table continues belowdescriptionsolute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013]hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226]ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021]SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819]ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304]S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441]Example with RNA-seq dataHere’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:# Load libraries (install with Bioconductor if you don't have them)library(DESeq2)library(airway)# Load the data and do the RNA-seq data analysisdata(airway)airway = DESeqDataSet(airway, design = ~cell + dex)airway = DESeq(airway)res = results(airway)# tidy results with biobroomlibrary(biobroom)res_tidy = tidy.DESeqResults(res)head(res_tidy)## Source: local data frame [6 x 7]## ## gene baseMean estimate stderror statistic## (chr) (dbl) (dbl) (dbl) (dbl)## 1 ENSG00000000003 708.6021697 0.37424998 0.09873107 3.7906000## 2 ENSG00000000005 0.0000000 NA NA NA## 3 ENSG00000000419 520.2979006 -0.20215550 0.10929899 -1.8495642## 4 ENSG00000000457 237.1630368 -0.03624826 0.13684258 -0.2648902## 5 ENSG00000000460 57.9326331 0.08523370 0.24654400 0.3457140## 6 ENSG00000000938 0.3180984 0.11555962 0.14630523 0.7898530## Variables not shown: p.value (dbl), p.adjusted (dbl)Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):res_tidy %>% arrange(p.adjusted) %>% head(20) %>% inner_join(grch38, by=c("gene"="ensgene")) %>% select(gene, estimate, p.adjusted, symbol, description) %>% pander::pandoc.table(split.table=100, justify="lrrll", style="rmarkdown")geneestimatep.adjustedsymbolENSG00000152583-4.3164.753e-134SPARCL1ENSG00000165995-3.1891.44e-133CACNB2ENSG00000101347-3.6186.619e-125SAMHD1ENSG00000120129-2.8716.619e-125DUSP1ENSG00000189221-3.2319.468e-119MAOAENSG00000211445-3.5533.94e-107GPX3ENSG00000157214-1.9498.74e-102STEAP2ENSG00000162614-2.0033.052e-98NEXNENSG00000125148-2.1671.783e-92MT2AENSG00000154734-2.2864.522e-86ADAMTS1ENSG00000139132-2.1812.501e-83FGD4ENSG00000162493-1.8584.215e-83PDPNENSG000001626923.4533.563e-82VCAM1ENSG00000179094-3.0441.199e-81PER1ENSG00000134243-2.1492.73e-81SORT1ENSG00000163884-4.0791.073e-80KLF15ENSG000001786952.4466.275e-75KCTD12ENSG000001462502.641.143e-69PRSS35ENSG00000198624-2.7841.707e-69CCDC69ENSG000001488481.7831.762e-69ADAM12Table: Table continues belowdescriptionSPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925]dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064]monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833]glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555]STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885]nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557]metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406]ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217]FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125]podoplanin [Source:HGNC Symbol;Acc:HGNC:29602]vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663]period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845]sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536]potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678]protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387]coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487]ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190]Explore!This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, tet some help.ls("package:annotables")?grch38Let’s join the transcript table to the gene table. gt = grch38_gt %>% inner_join(grch38, by="ensgene")Now, let’s filter to get only protein-coding genes, group by the ensembl gene ID, summarize to count how many transcripts are in each gene, inner join that result back to the original gene list, so we can select out only the gene, number of transcripts, symbol, and description, mutate the description column so that it isn’t so wide that it’ll break the display, arrange the returned data descending by the number of transcripts per gene, head to get the top 10 results, and optionally, pipe that to further utilities to output a nice HTML table.gt %>% filter(biotype=="protein_coding") %>% group_by(ensgene) %>% summarize(ntxps=n_distinct(enstxp)) %>% inner_join(grch38, by="ensgene") %>% select(ensgene, ntxps, symbol, description) %>% mutate(description=substr(description, 1, 20)) %>% arrange(desc(ntxps)) %>% head(10) %>% pander::pandoc.table(split.table=100, justify="lrll", style="rmarkdown")ensgenentxpssymboldescriptionENSG0000016579577NDRG2NDRG family member 2ENSG0000020533677ADGRG1adhesion G protein-cENSG0000019662875TCF4transcription factorENSG0000016124968DMKNdermokine [Source:HGENSG0000015455664SORBS2sorbin and SH3 domaiENSG0000016644462ST5suppression of tumorENSG0000020458058DDR1discoidin domain recENSG0000008746057GNASGNAS complex locus [ENSG0000016939857PTK2protein tyrosine kinENSG0000010452956EEF1Deukaryotic translatiLet’s look up DMKN (dermkine) in Ensembl. Search Ensembl for ENSG00000161249, or use this direct link. You can browse the table or graphic to see the splicing complexity in this gene. Or, let’s do something different. Let’s group the data by what type of gene it is (e.g., protein coding, pseudogene, etc), get the number of genes in each category, and plot the top 20. library(ggplot2)grch38 %>% group_by(biotype) %>% summarize(n=n_distinct(ensgene)) %>% arrange(desc(n)) %>% head(20) %>% ggplot(aes(reorder(biotype, n), n)) + geom_bar(stat="identity") + xlab("Type") + theme_bw() + coord_flip()Annotables: R data package for annotating/converting Gene IDsGetting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License." />