2015-12-16

At MLOSS 2015 we brainstormed about reproducible science, discussing why we
care about software in computer science. Here is a summary blending notes
from the discussions with my own opinions.

“Without engineering, science is not more than philosophy”

the community

How do we enable better science? Why do we write software in science?
These are the questions that we were interested in.

Improving the reproducibility of our scientific studies makes us more
efficient at doing good science in the long run: even inside a lab, new
research efforts build upon previous work.

Forms of reproducible science: reproduction, replication, & reuse

The classic concepts of reproducible science
are:

Reproducibility: being able to rerun an experiment as it was run,
for instance by reanalysing data.

Replicability: being able to redo an experiment from scratch.

The reproducible science movement argues that sharing the source code of
experiments is necessary for reproduction.

For reproduction, fields like computer science (development of methods)
and biology (challenging data acquisition) have very different
constraints, with the complexity allocated differently between data and
code.

“Machine learning people use hugely complex algorithms on trivially
simple datasets. Biology does trivially simple algorithms on hugely
complex datasets.”

an MLOSS 2015 attendee

We felt that computer science needed an additional notion, complementing
replication and reproduction:

Reusability: applying the process to a new yet similar question.
For instance, for a paper contributing a data-analysis method, applying
that method to new data.

Reusability is more valuable than reproducibility.

Reproducibility without reusability in method development may hinder the
advancement of science, as it pushes people to all do the same
things, eg always running experiments on the same data.

Reusability enables results that the original investigator did not have in
mind. It implies that the experimental protocol extends further than the
exact scope of the question initially asked. For software development, it
is also harder to achieve, as it requires more robustness and flexibility.

Finally, sharing source code is not enough: the code must also be readable.

Roadblocks to reproducible science

Manpower

Reusability, readability, and support of released code all take a lot of
time, even though this is seldom acknowledged in talks about reproducible
science. Given fixed manpower, it is impossible to achieve reusability and
high quality for everything.

Computing power

Some numerical experiments or complex data analyses require weeks of
cluster time to run. These will be much harder to reproduce. Moreover,
rerunning an analysis from scratch on a regular basis is a good recipe for
a robust path from data to results; the more computing power is a limiting
resource, the less often this can be done, and the more likely it is that a
glitch goes undetected.

Data availability

No access, or restricted access, to data is a showstopper for
reproducibility. Data-sharing requirements are becoming common, from
funding agencies or journals. However, privacy concerns or confidential
information get in the way of making data public, for instance in medical
research or microeconomics. Often, these concerns serve as a pretext
for people who actually do not want to relinquish control [1].

[1]

A related post by Deevy Bishop: Who’s afraid of open data

Incentives problem

Fancy new results are what matter for success in academia. “High impact”
journals such as Nature or Science accept papers that amaze and impress,
often with subpar inspection of the materials and methods [2]. The rate of
publication in many leading groups is incompatible with the consolidation
efforts required for strong reproducibility.

On the other hand, it is hard to tell beforehand whether a new idea is a
good one. Hence giving imagination free rein to foster impossible and
improbable ideas is a good path to innovation. The underlying questions
are: What are the best community rules for the advancement of knowledge?
What do we want from the way science moves forward? Rapid publication of
many incremental ideas, eg at a conference, gives food for thought,
possibly at the expense of reproducibility.

[2]

“Science, Nature and Cell, had a higher rate of retractions” –
Wikipedia: Invalid science

How to improve the situation

Docker, containers, and virtual machines

Docker, and other container or virtual machine technologies, enable
shipping a software environment. This diminishes the challenges of building
software and setting up an analysis. Virtual machines are used as a way to
avoid software packaging issues, which seems to me like a plaster on a
wooden leg.

Containers give easy reproduction, at the cost of hard
replication and reuse.

Indeed, an analysis that lives in a box can be reproduced, but can it be
understood, modified, or applied to new data? New science is likely going
to come from modifying this analysis, or combining it with other tools,
or new data. If these other tools live in a different virtual machine,
the combination will be challenging.

In addition, people are using containers as an excuse to avoid tackling
the need for proper documentation of requirements and of the process to set
them up. They sometimes even try to justify binary blobs [3]. This is
wrong. An analysis should be runnable without requiring the stars to
align, and it should be understandable.

[3]

See also Titus Brown’s post: The post-apocalyptic world of binary
containers
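Proper documentation of the environment need not be heavy. As a minimal
sketch (the package versions and script name below are hypothetical), a few
explicit, pinned commands are enough for a reader to follow, and a container
can then be built from the same recipe rather than replacing it:

    # Hypothetical setup and run steps, spelled out rather than
    # hidden inside an opaque image
    pip install numpy==1.10.1 scipy==0.16.0 scikit-learn==0.17
    python run_analysis.py --data data/raw/ --output results/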

Version control: wear your seatbelt

Version control
is like a time machine: if used with regular commits, it enables rolling
back to any point in time. For my work, it has always been a crucial aspect
of reproducing what I or my students did a while ago. I often meet
researchers who feel they lack the time to learn it. I really cannot support
this position. http://try.github.io is an easy way to learn version
control.

Hint: use a “tag” to pinpoint a position in the history that you might
want to come back to, such as the state used to make a figure or the one
corresponding to the publication of an article.
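For instance, with git (the tag name below is hypothetical):

    # Tag the state of the code used to produce a figure
    git tag -a figure-2 -m "Code state used to generate figure 2"
    # Later, come back to exactly that state to redo the figure
    git checkout figure-2
    # Share the tag along with the code
    git push origin figure-2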

Software libraries, curated and maintained

Consolidating an analysis pipeline, a standard visualization, or any
computational aspect of a paper into a software library is a sure way to
make the paper more reproducible. It will also make the steps reusable
and replication easier. If continued effort is put into the library,
chances are that computational efficiency will improve over time, thus
helping in the long run with the challenge of computing power.

Tough choices: not every variant of an analysis can be forever
reproducible.

Maintaining the library will ensure that results are still reproducible
on new hardware, or with the evolution of the general software stack (a new
Python or Matlab release, for instance). Documentation and curated
examples will lower the bar for reuse and facilitate replication of the
original scientific results.

To avoid feature creep and technical debt, a library calls for focused
efforts on selecting the most important operations.

Datasets, serving as model experiments, tractable and open

Sometimes researchers create a toy dataset, with a well-posed question, that
is curated and open, small enough to be tractable yet large enough to be
relevant to the application field. This is an invaluable service to the
field. One example is the Netflix prize in machine learning,
which led to a standard dataset. Unfortunately, that dataset was taken
down some years later due to privacy concerns. But it has been
replaced, eg by the MovieLens dataset. For computer vision, a
series of datasets (Caltech101, CIFAR, ImageNet…) has led to continuous
progress in the field. In bioinformatics, standard datasets are regularly
created, for instance by the DREAM challenges.

These reference open datasets serve as benchmarks and therefore foster
competition. They also define a canonical experiment, helping a wider
scientific community understand the questions being asked. Ultimately,
they result in better software tools to solve the problem at hand, as
this problem becomes a standard example and application for the tools.

Sage Bionetworks, for
instance, is a non-profit that collects and makes biomedical data
available. These people believe, as I do, that such data will lead to
better medical care.

Changing incentives: setting the right goals

Making sustainable, quality scientific work that facilitates reproduction
needs to bring a clearly-visible benefit to researchers, young and senior.
Such contributions should help them get jobs and grants.

Scientific evaluation is still largely based on an unsophisticated
publication count. We need high-quality journals to accept publications
about data, software, and replication of prior work. These need to be
strictly reviewed, to establish high standards for such contributions.
This change is happening. GigaScience, amongst other venues, publishes
data. The MLOSS (machine learning open source software) track of the JMLR
(Journal of Machine Learning Research) publishes software, with a tough
review of the project's software quality.

Researchers should cite the software they use.

Yet software is still often under-cited: many will use a piece of software
implementing a method and only cite the original paper that proposed the
method. Another remaining challenge: how to give credit for continuing
development and maintenance.

Fast-paced science is probably useful even if fragile. But the difference
between a quick proof of concept and solid, reproducible and reusable
work needs to be acknowledged. It is important to select for publication
not only impressive results, but also sound reusable material and
methods. The latter are the foundation of future scientific developments,
but high-impact journals tend to focus on the former.

Related posts:

Software for reproducible science: let’s not have a misunderstanding

MLOSS 2015: wising up to building open-source machine learning
