2017-02-20

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I’m blogging weekly(ish) about the work done on Dask
and related projects during the previous week. This log covers work done
between 2017-02-01 and 2017-02-20. Nothing here is ready for production. This
blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

- Profiling experiments with Dask-GLM
- Subsequent graph optimizations, both non-linear fusion and avoiding repeatedly creating new graphs
- TensorFlow and Keras experiments
- XGBoost experiments
- Dask tutorial refactor
- Google Cloud Storage support
- Cleanup of Dask + SKLearn project

Dask-GLM and iterative algorithms

Dask-GLM is currently just a bunch of solvers, like Newton, Gradient Descent, BFGS, Proximal Gradient Descent, and ADMM. These are useful for solving problems like logistic regression, among several others. The mathematical side of this work is mostly done by Chris White
and Hussain Sultan at Capital One.

We’ve also been using this project to see how Dask can scale out machine
learning algorithms. To this end we ran a few benchmarks here:
https://github.com/dask/dask-glm/issues/26 . These just generate and solve
some random problems, but at larger scales.
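
The benchmark problems are along these lines (a minimal sketch, not the actual benchmark code; the sizes and chunking below are made up):

```python
import numpy as np
import dask.array as da

# Build a random dense regression-style problem as a chunked Dask array
N, p = 10**7, 10                                   # illustrative sizes
X = da.random.random((N, p), chunks=(N // 100, p))
beta = np.random.random(p)
y = X.dot(beta) + da.random.normal(0, 1, size=N, chunks=N // 100)
```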

What we found is that some algorithms, like ADMM, perform beautifully, while
for others, like gradient descent, scheduler overhead can become a substantial
bottleneck at scale. This is mostly because the actual in-memory
NumPy operations are so fast; any sluggishness on Dask’s part becomes very
apparent. Here is a profile of gradient descent:

Notice all the white space. This is time when Dask is figuring out what to do
between iterations. We’re now working to reduce this overhead so that the
colored parts of this graph squeeze closer together. This will result in
general overhead improvements throughout the project.

Graph Optimizations - Aggressive Fusion

We’re approaching this in two ways:

1. More aggressively fuse tasks together so that there are fewer blocks for the scheduler to think about
2. Avoid repeated work when generating very similar graphs

In the first case, Dask already does standard task fusion. For example, suppose you
have the following two tasks (sketched below as a raw Dask graph, with hypothetical functions inc and double):
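
```python
def inc(i):          # hypothetical example functions
    return i + 1

def double(i):
    return 2 * i

# two chained tasks, written as a raw Dask graph
dsk = {'x': (inc, 1),
       'y': (double, 'x')}   # 'y' depends only on 'x'
```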

Dask (along with every other compiler-like project since the 1980s) already
turns this into the following:
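
```python
# the fused equivalent: one task for the scheduler instead of two
dsk = {'y': (double, (inc, 1))}
```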

What’s tricky with a lot of these mathematical or optimization algorithms,
though, is that they are mostly, but not entirely, linear. Consider the
following example (again a sketch, with a hypothetical function f standing in for real work):
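
```python
from math import exp

def f(a):
    return a + 1          # hypothetical stand-in for real work

x = f(1)
y = exp(x)                # one branch
z = 1 / x                 # a second, independent branch
w = y + z                 # the branches recombine, forming a diamond
```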

Visualized as a node-link diagram, this graph looks like a diamond:

Graphs like this generally don’t get fused together, because we could compute
both exp(x) and 1/x in parallel. However, when we’re bound by scheduling
overhead and have plenty of parallel work to do, we’d prefer to fuse
these into a single task, even though we lose some potential parallelism.
There is a tradeoff here, and we’d like to be able to exchange some parallelism
(of which we have a lot) for less overhead.

The pull request for this is dask/dask #1979 by Erik
Welch (Erik has written and maintained most of
Dask’s graph optimizations).

Graph Optimizations - Structural Sharing

Additionally, we no longer make copies of graphs in dask.array. Every
collection, like a dask.array or dask.dataframe, holds onto a Python dictionary
containing all of the tasks needed to construct it. When we
perform an operation on a dask.array we get a new dask.array with a new
dictionary pointing to a new graph. The new graph generally has all of the
tasks of the old graph, plus a few more. As a result, we frequently make
copies of the underlying task graph.

Normally this doesn’t matter (copying graphs is usually cheap) but it can
become very expensive for large arrays when you’re doing many mathematical
operations.

Now we keep dask graphs in a custom mapping (dict-like object) that shares
subgraphs with other arrays. As a result, we rarely make unnecessary copies
and some algorithms incur far less overhead. Work done in
dask/dask #1985.
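
The idea looks roughly like the following (a minimal sketch; the class name and details here are hypothetical, see dask/dask #1985 for the real implementation):

```python
from collections.abc import Mapping

class SharedGraph(Mapping):
    """A dict-like view over several task sub-dicts, held by reference."""
    def __init__(self, *dicts):
        self.dicts = list(dicts)        # shared, never copied

    def __getitem__(self, key):
        for d in self.dicts:
            if key in d:
                return d[key]
        raise KeyError(key)

    def __iter__(self):
        seen = set()
        for d in self.dicts:
            for key in d:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)
```

A new array can then bundle the old sub-dicts together with one small dict of new tasks, which costs time proportional to the number of new tasks rather than the size of the whole graph.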

TensorFlow and Keras experiments

Two weeks ago I gave a talk with Stan Seibert
(Numba developer) on Deep Learning (Stan’s bit) and Dask (my bit). As part of
that talk I decided to launch TensorFlow from Dask and feed TensorFlow from a
distributed Dask array. See this
blogpost for
more information.

That experiment was nice in that it showed how easy it is to deploy and
interact with other distributed services from Dask. However, from a deep
learning perspective it was immature. Fortunately, it succeeded in attracting
the attention of other potential developers (the true goal of all blogposts)
and now Brett Naul is using Dask to manage his GPU
workloads with Keras. Brett contributed
code to help Dask move around
Keras models. He seems to particularly value Dask’s ability to manage
resources to help
him fully saturate the GPUs on his workstation.
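
For reference, worker resources in dask.distributed work roughly like this (the scheduler address and GPU count below are made up):

```python
from dask.distributed import Client

# Workers declare abstract resources when they start, e.g.:
#     dask-worker scheduler-address:8786 --resources "GPU=2"
client = Client('scheduler-address:8786')      # hypothetical address

def train_on_gpu(data):
    """Stand-in for a Keras training step."""
    return len(data)

data = list(range(1000))                       # placeholder data
# this task only runs on a worker with a free GPU slot
future = client.submit(train_on_gpu, data, resources={'GPU': 1})
result = future.result()
```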

XGBoost experiments

After deploying TensorFlow we asked what it would take to do the same for
XGBoost, another very popular (though very different) machine learning library.
The conversation for that is here: dmlc/xgboost
#2032, with prototype code here:
mrocklin/dask-xgboost. As with
TensorFlow, the integration is relatively straightforward (if perhaps a bit
simpler in this case). The challenge for me is that I have little concrete
experience with the applications that these libraries were designed to solve.
Feedback and collaboration from open source developers who use these libraries
in production is welcome.
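
Usage in the prototype looks roughly like the following (a sketch from memory; the function names, signatures, and file paths here are assumptions, so check the repository for the real interface):

```python
from dask.distributed import Client
import dask.dataframe as dd
import dask_xgboost as dxgb                    # prototype from mrocklin/dask-xgboost

client = Client('scheduler-address:8786')      # hypothetical address
df = dd.read_csv('train-*.csv')                # hypothetical training data
labels = df['label']
data = df.drop('label', axis=1)

params = {'objective': 'binary:logistic'}      # ordinary XGBoost parameters
bst = dxgb.train(client, params, data, labels)    # train workers next to the data
predictions = dxgb.predict(client, bst, data)
```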

Dask tutorial refactor

The dask/dask-tutorial project on
GitHub was originally written for PyData Seattle in July 2015 (roughly 19 months
ago). Dask has evolved substantially since then, but this is still our only
educational material. Fortunately, Martin
Durant is doing a pretty
serious rewrite, both correcting parts that are no longer modern API and
adding new material on distributed computing and debugging.

Google Cloud Storage

Dask developers (mostly Martin) maintain libraries to help Python users connect
to distributed file systems like HDFS (with
hdfs3), S3 (with
s3fs), and Azure Data Lake (with
adlfs), which
subsequently become usable from Dask. Martin has been working on support for
Google Cloud Storage (with gcsfs),
another small project that uses the same API.
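
Usage mirrors the other file system libraries; a small sketch (the project and bucket names below are made up):

```python
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')    # hypothetical project
fs.ls('my-bucket')                                # list a bucket's contents
with fs.open('my-bucket/2017/data.csv', 'rb') as f:
    header = f.readline()
```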

Cleanup of Dask+SKLearn project

Last year Jim Crist published
three
great
blogposts about using Dask
with SKLearn. The result was a small library,
dask-learn, with a variety of
features, some incredibly useful, like a cluster-ready Pipeline and
GridSearchCV, others less so. Because of the experimental nature of this work
we had labeled the library “not ready for use”, which drew some curious
responses from potential users.

Jim is now busy dusting off the project, removing less-useful parts and
generally reducing scope to strictly model-parallel algorithms.
