Liorpachter.wordpress.com

Response to: “GTEx is throwing away 90% of their data”

2013-10-31

A blog post by Lior Pachter at http://liorpachter.wordpress.com/2013/10/21/gtex/ suggested that GTEx uses a suboptimal transcript quantification method. Using simulated data, he showed that the quality score obtained by the method GTEx uses can be reached by a different method with 10-fold less data. This result led to the sensational subject line: “GTEx is throwing away 90% of their data”. We take such concerns very seriously and intend to be transparent to the community regarding our decisions. We stand by the processes and decisions we have taken to date given the data and time constraints, but of course we always remain open to community suggestions for improvement. We have answered the questions raised by the blog post below. We thank Lior Pachter for hosting this on his blog.

- Is GTEx LITERALLY throwing away 90% of the data?

Absolutely not! The GTEx project is not throwing away any data at all. In fact, it provides all data, in both raw and processed forms, to the scientific community at the dbGaP site (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v3.p1) or on the GTEx portal (http://www.broadinstitute.org/gtex/). GTEx is running multiple tools to perform various analytical steps. The initial analysis of the GTEx pilot project data is completed and frozen, and the consortium is in the process of writing a manuscript describing the results. For transcript abundance quantification, which comprises a small fraction of the overall analyses of the pilot data, the consortium used FluxCapacitor and provided the outputs on the GTEx portal. Specifically, these results were used for analyses of splice- and isoform-QTLs and for assessing tissue-specificity of loss-of-function variants. Tools used in the pilot phase of GTEx will be re-evaluated, based on benchmarking results and large-scale production considerations (see below), and alternative methods, including for transcript quantification, may be employed in the future.

- Is the FluxCapacitor (FC) a published method?

Yes, although not as a standalone methodological paper but as part of a large study where RNA-seq data were used to perform eQTL analysis (Montgomery et al., Nature 2010). That paper was peer reviewed and includes a 1.5-page description of the methodology in the methods section and a supplementary figure (Supplementary Figure 23) describing the method’s outline (http://www.nature.com/nature/journal/v464/n7289/full/nature08903.html). In addition, there is extensive documentation on the web at http://sammeth.net/confluence/display/FLUX/Home supporting the publication and offering download of the software. Admittedly, the description of the method in the paper is not as detailed as in some standalone methods papers. We thank Lior for pointing this out and note that the documentation has been updated and will continue to be improved by the authors of FC.

- Why did GTEx decide to use the FluxCapacitor method in its pilot analyses?

In order to understand the decision to use FluxCapacitor (FC) in GTEx, one first needs to understand how such decisions are made in GTEx (see below). Initially we used Cufflinks (CL), which is the most commonly used tool in the field for quantifying isoforms. However, when using it at large scale (1000s of samples) we hit technical problems of large memory use and long compute times. We attempted to overcome these difficulties: we investigated the possibility of parallelizing CL and contacted the CL developers for help. However, the developers advised us that CL could not be parallelized at that point. Due to project timelines, we started investigating alternative methods. The Guigo group had already produced transcript quantifications with FC on GTEx data, demonstrated biologically meaningful, high-quality results (http://www.gtexportal.org/home?page=transcriptQuant), and provided the tool and the support needed to test it and install it as part of the GTEx production analysis pipeline. Our current experience with FC is that it scales well and can be used in a production setting.

- Are exon level or gene level read counts appropriate for the purposes of GTEx and Geuvadis eQTL analyses?

Another criticism raised in the blog post concerned the use of exon-level and gene-level read counts, as opposed to transcript abundance levels, for calculating eQTLs in both the GTEx pilot project and the recently published GEUVADIS paper (Lappalainen et al. Nature, 2013). The eQTL analyses for both projects mainly used exon-level and gene-level expression values, but they did not use simple read counts; rather, they used carefully normalized values after correcting for multiple covariates (as described in Lappalainen et al.). Transcript (or isoform) abundance levels, which are the subject of Lior Pachter’s simulations, require more sophisticated estimation methods (such as FC, CL or others), since one needs to deconvolute the contribution of each isoform to the exon-level signals (as different isoforms often share exons). We deliberately did not use isoform-abundance levels for the major part of the analysis since this is still a very active area of research and new tools continue to be developed. The analyses in Lappalainen et al. using transcript quantifications, such as population differences, were replicated with other approaches with highly consistent results (Fig. 1c, Fig. S14).
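To illustrate why isoform abundances need deconvolution while exon-level values do not, here is a minimal toy sketch. It assumes a hypothetical gene with two isoforms, idealized noise-free per-base coverage, and a made-up function name `deconvolve`; it is not the algorithm used by FC, CL or any other tool, which fit statistical models to the observed reads.

```python
# Hypothetical gene layout (an assumption for illustration only):
#   isoform A contains exon 1 and exon 2
#   isoform B contains exon 1 only
# With idealized, noise-free per-base coverage:
#   exon1_cov = a + b   (exon 1 is shared by both isoforms)
#   exon2_cov = a       (exon 2 is unique to isoform A)

def deconvolve(exon1_cov, exon2_cov):
    """Split exon-level coverage into per-isoform abundances (a, b)."""
    a = exon2_cov                     # only isoform A covers exon 2
    b = max(exon1_cov - a, 0.0)       # the remainder of exon 1 is isoform B
    return a, b

# Exon-level values are read directly from the data ...
exon1_cov, exon2_cov = 30.0, 10.0
# ... whereas isoform-level values have to be inferred from them.
abundance_a, abundance_b = deconvolve(exon1_cov, exon2_cov)
print(abundance_a, abundance_b)
```

In real data the shared-exon signal is noisy and genes have many isoforms, so the inferred abundances carry estimation error that the exon-level values do not.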

Our results suggest that exon-level and gene-level quantifications are much more robust in cis-eQTL analysis. In the GEUVADIS paper we discovered ~7800 genes with exon-level eQTLs, and ~3800 genes with gene-level eQTLs. Initial tests for eQTL discovery using transcript quantifications from FC found <1000 genes with transcript-level eQTLs, and led to our decision to use the much more powerful exon and gene quantifications in that paper. Given the relatively small difference in correlation coefficients described in Lior Pachter’s post between FC and CL (~83% vs. ~92% for 20M fragments), and our previous experience mentioned above, we find it very unlikely that CL or any other transcript quantification method would discover a number of genes with transcript eQTLs anywhere near the 3800 or the 7800 figure. This suggests that transcript quantification methods do not capture biological variation as well as more straightforward exon and gene quantifications.

The issue raised in the blog post concerns estimating the relative expression levels of isoforms within a single sample, measured using the Spearman correlation metric. However, for single-gene eQTL analysis, which is the main goal of the GTEx pilot analysis, one searches for significant correlations between genotypes and expression levels of a single gene when compared across subjects. This correlation is robust against shifts (i.e., adding or subtracting a constant value) and changes in scale (applying a constant factor) of the expression levels. Therefore, normalizing to the alignable territory of the gene (which will introduce the same factor across samples), and ignoring ambiguously mapped reads (which likely introduces a constant shift) will have little or no effect on the resulting eQTLs. We are constantly evaluating further quantification methods, but we believe that the eQTL results calculated as part of the GTEx pilot and the GEUVADIS paper are solid and robust.
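The invariance argument can be checked numerically. The sketch below uses simulated data, not GTEx data; the sample size, effect size, scale factor and shift are arbitrary assumptions, and the scale factor is taken to be positive and identical for every sample. It computes the Pearson correlation between genotype and expression before and after rescaling and shifting the expression values.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)

# Simulated genotype dosages (0, 1 or 2 alternate alleles) for 200 subjects.
genotype = [random.choice([0, 1, 2]) for _ in range(200)]

# Simulated expression of one gene: additive genotype effect plus noise.
expression = [5.0 + 1.5 * g + random.gauss(0.0, 1.0) for g in genotype]

# Apply the same positive scale factor and the same shift to every sample,
# mimicking a per-gene normalization factor and a constant offset.
transformed = [0.8 * e - 2.0 for e in expression]

r_original = pearson(genotype, expression)
r_transformed = pearson(genotype, transformed)
print(r_original, r_transformed)  # equal up to floating-point rounding
```

The invariance holds only when the factor and shift are the same for every sample; a factor or offset that varies from subject to subject would change the correlation.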

- How does GTEx decide on methodologies to be used?

GTEx strives to provide to the community the highest quality raw and derived data as well as the highest quality scientific results. GTEx also operates with clear and rigid timelines and deliverables. Therefore, we prefer to use methods that already have been vetted by the community and can be used in a large-scale production setting. In this particular case the experience with FC as part of the GEUVADIS project was part of the consideration.

Systematic benchmarking of tools is very important and we encourage the community to conduct such benchmarks. Proper benchmarking is not a simple task: one needs to carefully define the benchmarking metrics (which depend on the particular downstream use of the data), there are often insufficient ground truth data, and simulations can be deceiving since they don’t reflect real biology or experimental data. Therefore, whenever there are no published benchmarks that use metrics that are relevant to the GTEx project, we have to perform them within the project (prioritized by the impact on the results). One example is a comparison of different alignment methods (including TopHat and GEM), which was recently presented at the ASHG conference in Boston.

Although isoform-level quantification is only a minor part of our current analyses, we continue to evaluate several methods (including FC and CL) and welcome any constructive input on this evaluation from the community. The results obtained by Pachter’s simulations suggest that CL outperforms FC (based on the Spearman correlation metric) but this seems to be at odds with our experience (perhaps due to unrealistic assumptions in the simulations). Clearly, further benchmarking is required to better understand the differences between tools and their effect on the final results (http://www.gtexportal.org/home?page=transcriptQuant).

- Is GTEx open to feedback and what are the mechanisms in place?

GTEx welcomes and encourages feedback and interaction with outside investigators to improve the analysis and data production. We offer several mechanisms for interaction: (i) the GTEx datasets are available for download (some require applying for access to protect donor privacy) and have already been shared with over 100 different research groups that carry out their own analyses of the data; (ii) we held a widely attended international community meeting in June and plan to hold such meetings yearly, giving external groups the opportunity to share their results; and (iii) we welcome e-mails to GTEXInfo@mail.nih.gov and comments in the GTEx portal (http://www.broadinstitute.org/gtex/). Finally, we are interested in facilitating systematic benchmarking of tools, and investigators who are interested in participating in such benchmarks or in defining the evaluation metrics are welcome to contact us.

To summarize, GTEx strives to produce, and make publicly available in a timely manner, the best possible data and analysis results, within data release and practical limitations. By no means do we feel that all our analyses are the best possible in all aspects, or that we will perform all the different types of analysis one can do with these data. We are open to constructive feedback regarding the tools we use and the analyses we perform. Finally, all data are available to any investigator who desires to perform novel analyses with their own methods, and we anticipate that much improved and innovative analyses of the data will emerge with time.

Manolis Dermitzakis, Gad Getz, Kristin Ardlie, Roderic Guigo for the GTEx consortium
