2015-05-08

I finally got a chance to more thoroughly read Mark Stalzer and Chris
Mentzel's arXiv preprint, "A Preliminary Review of Influential Works
in Data-Driven Discovery". This is a short review paper that discusses
the concepts highlighted by the 1,000+ "influential works" lists
submitted to the Moore Foundation's Data-Driven Discovery (DDD)
Investigator Competition. (Note: I was one of the awardees.)

The core of this arXiv preprint is the section on "Clusters of
Influential Works", in which Stalzer & Mentzel walk in detail through
the eight concept clusters that emerged from their analysis of the
submissions. This is a fascinating section that should be at the top
of everyone's reading list. The topics covered are, in the order
presented in the paper, as follows:

Foundational theory, including Bayes' Theorem, information theory, and
Metropolis sampling;

Astronomy, and specifically the Sloan Digital Sky Survey;

Genomics, focused around the Human Genome Project and methods for
searching and analyzing sequencing data;

Classical statistical methods, including the lasso, bootstrap methods,
boosting, expectation-maximization, random forests, false discovery rate,
and "isomap" (which I'd never heard of!);

Machine learning, including support vector machines, artificial neural
networks (and presumably deep learning?), logistic belief networks,
and hidden Markov models;

The Google! Including PageRank, MapReduce, and "the overall anatomy"
of how Google does things; specific implementations included Hadoop,
BigTable, and Cloud Dataflow;

General tools, programming languages, and computational methods,
including Numerical Recipes, the R language, the IPython Notebook
(Project Jupyter), the Visual Display of Quantitative Information,
and SQL databases;

Centrality of the Scientific Method (as opposed to specific tools or
concepts). Here the discussion focused on the Fourth Paradigm book,
which lays out the expansion of the scientific method from empirical
observation to theory to simulation to "big data science"; I thought
the point that computers are used for both theory and observation was
well made. This section is particularly worth reading, in my opinion.

This collection of concepts is simply delightful - Stalzer and Mentzel
provide both a summary of the concepts and a fantastic curated set of
high-level references.

Since I don't know many of these areas that well (I've heard of most
of the subtopics, but I'm certainly not expert in ... any of them?
yikes), I evaluated the depth of their discussion by looking at the
areas I was most familiar with - genomics and tools/languages/methods.
My sense from this was that they covered the highlights of tools
better than the highlights of genomics, but this may well be because
genomics is a much larger and broader field at the moment.

Data-Driven Discovery vs Data Science

One interesting question that comes up frequently is what the
connection and overlap is between data-driven discovery, data science,
big data, data analysis, computational science, etc. This paper
provides a lot of food for thought and helps me draw some
distinctions. For example, it's clear that computational science
includes or at least overlaps with all of the concepts above, but
computational science also includes things like modeling that I don't
think clearly fit with the "data-driven discovery" theme. Similarly,
in my experience "data science" encompasses tools and methods, along
with intelligent application of them to specific problems, but
practically speaking does not often integrate with theory and
prediction. Likewise, "big data", in the sense of methods and
approaches designed to scale to the analysis and integration of large
data sets, is clearly one important aspect of data-driven discovery -
but only in the sense that in many cases more data seems to be better.

Ever since the "cage match" round of the Moore DDD competition, where
we discussed these issues in breakout groups, I've been working
towards the internal conclusion that data-driven discovery is the
exploration and acceleration of science through development of new
data science theory, methods, and tools. This paper certainly helps
nail that down by summarizing the components of "data driven
discovery" in the eyes of its practitioners.

Is this a framework for a class or graduate training theme?

I think a lot about research training, in several forms. I do a lot
of short-course peer instruction (e.g. Data Carpentry, Software
Carpentry, and my DIB efforts); I've been talking with people about
graduate courses and graduate curricula, with particular emphasis on
data science (e.g. the Data Science Initiative at UC Davis); and, most
generally, I'm interested in the question "what should graduate
students know if they want to work in data-driven discovery?"

From the training perspective, this paper lays out the central
concepts that could be touched on either in a survey course or in
an entire graduate program; while my sense is that a PhD would
require coupling to a specific domain, I could certainly imagine a
Master's program or a dual degree program that touched on the
theory and practice of data driven discovery.

For one example, I would love to run a survey course on these topics,
perhaps in the area of biology. Such a course could go through
each of the subsections above, and discuss them in relation to
biology - for example, how Bayes' Theorem is used in medicine,
or how concepts from the Sloan Digital Sky Survey could be applied
to genomics, or where Google-style infrastructure could be used
to support research.
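
As a minimal sketch of the kind of worked example such a course could
open with - Bayes' Theorem applied to a hypothetical diagnostic test,
with purely illustrative numbers that are not from the paper:

    # Bayes' Theorem for a hypothetical diagnostic test:
    #   P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
    prevalence = 0.01       # P(disease); illustrative 1% prevalence
    sensitivity = 0.95      # P(positive | disease)
    false_positive = 0.05   # P(positive | no disease)

    # total probability of a positive test (law of total probability)
    p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

    # posterior probability of disease given a positive test
    posterior = sensitivity * prevalence / p_positive
    print(posterior)        # ~0.16: most positive results are false positives

Even a ten-line example like this makes the pedagogical point: a 95%
sensitive test for a rare condition still yields mostly false
positives, which is exactly the kind of intuition such a survey course
should build.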

There's more than enough meat in there to have a whole graduate
program, though. One or two courses could integrate theory and tools,
another course could focus on practical application in a specific
domain, a third course could talk about general practice and computing
tools, and a fourth course could discuss infrastructure and scaling.

The missing bits - "open science" and "training"

Something that I think was missing from the paper was an in-depth
perspective on the role that open source, open data, and open science
can play. While these concepts were directly touched on in a few of
the subsections - most of the tools described were open source, for
example, and Michael Nielsen's excellent book "Reinventing Discovery"
was mentioned briefly in the context of network effects in scientific
communication and access - I felt that "open science" was an
unacknowledged undercurrent throughout.

It's clear that progress in science has always relied on sharing
ideas, concepts, methods, theory, and data. What I think is not yet
as clear to many is the extent to which practical, efficient, and
widely available implementations of methods have become important in
the computer age. And, for data-driven discovery, an increasingly
critical aspect is the infrastructure to support data sharing,
collaboration, and the application of these methods to large data
sets. These two themes -- the sharing of implementations and the
importance of infrastructure -- cut across many of the subsections in
this paper, including the specific domains of astronomy and human
genomics, as well as the Google infrastructure and
languages/tools/implementation subsections. I think the paper could
usefully add a section on this.

Interestingly, the Moore Foundation DDD competition implicitly
acknowledged this importance by enriching for open scientists in
their selection of the awardees -- a surprising fraction of the
Investigators are active in open science, including myself and Ethan
White, and virtually all the Investigators are openly distributing
their research methodology. In that sense, open science is a notable
omission from the paper.

It's also interesting to note that training is missing from the
paper. If you believe data-driven discovery is part of the future of
science, then training is important because there's a general lack of
researchers and institutions that cover these topics. I'd guess that
virtually no single researcher is well versed in a majority of the
topics, especially since many of the topics are entire scientific
super-fields and the rest are vast technical domains. In academic
research we're kind of used to the idea that we have to work in
collaboration (practice may be different...), but here academia
really fails to cover the entire data-driven discovery spectrum,
because universities generally place little emphasis on the expert
use of tools and infrastructure.

So I think that investment in training is where the opportunities lie
for universities that want to lead in data-driven discovery, and this
is the main chance for funders that want to enable the network effect.

Training in open science, tools, and infrastructure as competitive advantages

Forward-thinking universities that are in it for the long game and
interested in building a reputation in data-driven discovery might
consider the following ideas:

scientists trained in open science, tool use, and how to use
existing infrastructure are more likely to be able to quickly take
advantage of new data and methods.

scientists trained in open science are more likely to produce
results that can be built on.

scientists trained in open science are more likely to produce useful
data sets.

scientists trained in open science and tool building are more likely
to produce useful tools.

funding agencies are increasingly interested in maximizing impact by
requiring open source, open data, and open access.

All of these should lead to more publications, more important
publications, a better reputation, and more funding.

In sum, I think investments in training in the most ignored bits of
data-driven discovery (open science, computational tool use and
development, and scalable infrastructure use and development) should
be a competitive advantage for institutions. And, as with most
competitive advantages, those who ignore it will be at a significant
disadvantage. This is also an opportunity for foundations to drive
progress through targeted investments, although (since they are much
more nimble than universities) they are already doing this to some extent.

In the end, what I like most about this paper is that it outlines and
summarizes the concepts in which we need to invest in order to advance
science through data-driven discovery. I think it's an important
contribution and I look forward to its further development and
ultimate publication!

--titus
