2016-10-28



Select, few

Recently, there's been a lot of interest in DSLs to address the so-called
N+1 selects problem,
which describes the very common situation where you

1. do one database query to get a list of things, say user IDs; then
2. for each, call some sort of processing function that happens to do another query, say for each user's name.

It can be even worse than that, as in this example, where (for some reason)
we want to append the last name of every person in a group to the first name
of their best friend. (Maybe they're getting married.)
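In Clojure, the naive version might look something like the following sketch, where getGroupIds, getBff, getFirstName, and getLastName are hypothetical functions that each perform one query:

    (defn mangleNames [group]
      (map (fn [id]
             (str (getFirstName (getBff id)) ; first name of the best friend,
                  (getLastName id)))         ; followed by this person's last name
           (getGroupIds group)))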

To the database, this looks like a call to select ids in the group, followed by
interleaved calls to getBff, getFirstName, getLastName. So it's sort of
the 3N+1 selects problem.

Code like this can be easily optimized in different ways, depending on
the capabilities of the database. E.g., maybe there are efficient plural versions of
all the get functions:
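If so, the bulk rewrite might look like this sketch, with hypothetical plural functions getBffs, getFirstNames, and getLastNames:

    (defn mangleNames-bulk [group]
      (let [ids (getGroupIds group)]
        (map str
             (getFirstNames (getBffs ids)) ; one query per step,
             (getLastNames ids))))         ; rather than one per id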

Or, if the database supports joins directly, we could write a special bespoke
function to do the whole thing for us.

It's also possible that it's ok to continue doing repeated queries for
single values, as long as similar queries are performed at
approximately the same time2. We still have to rewrite the
code so the queries aren't alternating and blocked on each other:
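Rearranged, the sketch becomes something like this, with each wave of similar queries issued together:

    (defn mangleNames-waves [group]
      (let [ids    (getGroupIds group)
            lasts  (mapv getLastName ids)    ; a wave of getLastNames,
            bffs   (mapv getBff ids)         ; then a wave of getBffs,
            firsts (mapv getFirstName bffs)] ; then a wave of getFirstNames
        (map str firsts lasts)))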

Now a wave of getLastNames goes out, followed by getBffs, and then getFirstNames; assuming that some facility is grouping each batch into a single query, we have a 3+1=4 problem.

This is not awful, but... no actually it is. Why should we have to
contort our code around the best practices of whatever database we
happen to be using? Why should we have to refactor our code just
because we switch to a new database that wants the batching to be done
slightly differently? (And you also might wonder why we wait to do the
getBffs until the getLastNames are done.)

Don't you wish you could just sprinkle some pixie dust on the code we
started with, and it would somehow all be ok?

Haskell programmers, delicate dreamers that they are, prominently
bemoaned the lack of pixie dust.
And then they made some.

Haxl arose

In 2014, Simon Marlow (of Parallel and Concurrent Programming in Haskell
fame) and collaborators at Facebook published
There is no Fork: an Abstraction for Efficient, Concurrent, and Concise Data Access.
(There's also a nice talk, and Facebook even open-sourced the
code.)

There's a Haxl-inspired package for Clojure, and at least three for Scala, though the last of these, Stitch, is a private, internal project at Twitter.

At the risk of over-simplifying, I'm going to divide their idea into
two pieces and largely talk about only one of them.

Piece I

One of the pieces
is about how to cause code written using standard Haskell language
features and library abstractions (do notation, applicative
functors - plus a little textual code transformation) to induce some
kind of batched I/O behavior. For example, using their Fetch type
might look like

and the Scala implementations are along the lines of

(These might not be exactly right, but since I didn't tag this as a Haskell or Scala post, that's ok, yes?) There was, of course, concurrency before Haxl, but, they note:

All these previous formulations of concurrency monads used some kind
of fork operation to explicitly indicate when to create a new thread
of control. In contrast, in this paper there is no fork. The
concurrency will be implicit in the structure of the computations we
write.

They achieve this by introducing an applicative functor:

This is the key piece of our design: when computations in Fetch
are composed using the <*> operator, both arguments of <*> can
be explored to search for Blocked computations, which creates the
possibility that a computation may be blocked on multiple things
simultaneously. This is in contrast to the monadic bind operator,
>>=, which does not admit exploration of both arguments, because
the right hand side cannot be evaluated without the result from the
left.

This means that there's actually slightly more than the usual do magic going on. In this case, the getBff and getLastName fetches ought to be completely independent, so they actually preprocess the second do block, using a compiler extension, into a form that combines the independent fetches with <*> instead of sequencing them with >>=.

There is much
handwringing over
the need to use an applicative implementation that differs from the
default ap defined for monads. Whether or not the controversy
interests you, it's not of much significance in Clojure, where
multi-argument map-like functions abound, and typological conundra
do not keep us up at night.

Moreover, I will argue that the need to use
applicatives in the first place is somewhat artificial. Once we admit
that we're willing to do some sort of dependency-aware code transformation (as with do),
we can avoid applicatives by doing just a little more of it.

Piece II

The language-agnostic piece is that code that looks like it's doing normal stuff like looping and fetching is actually constructing a dependency graph. Instead of returning results directly, functions return a mutable Result object that contains either

1. an actual result value,
2. a list of other Results that we're waiting on, or
3. a description of a query that could fulfill the Result.

The actual query functions are memoized to return a Result object that starts out as
a Result(query="whatever"), but at some point in the future can have a value poked into it.

Non-query function calls are jiggered so that, if any of their arguments is a non-value Result, they will themselves return a Result(waiting=...), listing the Results on which they depend.
Eventually, we end up
with a tree of nested Results,
the leaves of which are queries to fulfill. We scoop up all the
leaves, figure out how to fulfill the queries and cram the results
back into their Result object, and repeat the program from the beginning.

On the second run, we get a new batch of leaf queries to fulfill, and we repeat until there are no
blocks, and a result can be computed.

The really important thing here is that, once we write our program
using Haxl, the "scoop" and "figure" work is somewhat decomplected
from program logic. I say somewhat, because in writing with the Haxl
DSL in the first place, we've already made concessions to the fact
that batching is in some way important.

Running mangleNames under a Haxl-like system

It's interesting to trace through our example program, noting what Result trees might be produced in each pass. Actually, the trees I'm noting were produced by a quick-and-dirty Clojure implementation that will be explained in the next section, which you might want to skip ahead to if you'd rather read Clojure code than imagine it. (Or if you'll be annoyed when you discover that the next section is really a superset of this one.)

Our haxlified mangleNames on the first pass might return something
like,

indicating that we're blocked on a result that's blocked on our first query.

Assuming that our system knows how to run the query, the results get shoved back into the Result, giving us a tree that looks like this:

Now we run the program again, and the memoized groups query returns the now-fulfilled Result[:value [321 123]] object, so we block a little further down. The new tree is blocked on a different group of queries:

Now we scoop up these two similar queries, let our database combine them with whatever clever magic it prefers, and inject the answers into the Result objects so that the tree looks like

and run again, now yielding a batch of similar queries for first names,

whose results we stuff back into their Result holders:

And we run the program a fourth time, finally getting back

A shoddy implementation in Clojure

It was instructive for me to hack this up in Clojure, even though there's already a more professional implementation called Muse.

First, note that the Haskell Haxl Result has an IORef in the middle.

They don't actually avoid mutation altogether, and neither will we.

The possibly blocking object contains a volatile that will hold [:value final-result], [:query some-query], or [:waiting list-of-results]. I overrode print-method to make the output a little more presentable here. You probably wouldn't do that in real life.
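A minimal sketch of such a type, with a print-method override along the lines described (the original may differ in detail):

    (deftype Result [state]) ; state: a volatile holding one of the tagged vectors

    (defn result [contents]
      (->Result (volatile! contents)))

    ;; print a Result as, e.g., Result[:query [:bff 123]]
    (defmethod print-method Result [r ^java.io.Writer w]
      (.write w (str "Result" (pr-str @(.state ^Result r)))))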

A basic utility function tries to extract results from a potential Result,
returning the extracted value if possible:
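A guess at its shape (the name unblock is my invention):

    (defn unblock [x]
      (if (instance? Result x)
        (let [[tag v] @(.state ^Result x)]
          (if (= tag :value) v x)) ; unwrap if fulfilled, else keep the Result
        x))                        ; not a Result at all: pass through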

The next most fundamental operation is to apply a function to a collection...

or what we hope is a collection, as it might itself be a Result...

or contain unfulfilled Results...

in which case, we note our dependency on them,

but if the collection is complete, we actually evaluate the function,

Our queries are just placeholders:
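Perhaps along these lines, with memoization doing the sharing between passes (query-fn is invented):

    (defn query-fn [table]
      ;; memoized, so a later pass gets back the same Result object,
      ;; which by then may have had a value poked into it
      (memoize (fn [& args] (result [:query (into [table] args)]))))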

And we write some helper methods to invoke functions whose arguments
might be Results or which might return them:
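For instance (again with invented names):

    (defn block-call [f & args]
      (let [args    (map unblock args)
            blocked (filter #(instance? Result %) args)]
        (if (seq blocked)
          (result [:waiting blocked])  ; record what we're waiting on
          (unblock (apply f args)))))  ; all arguments ready: call f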

For demo purposes, it will be useful to extract all the query leaf nodes in a tree:
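A sketch, reusing the tagged representation above:

    (defn leaf-queries [r]
      (when (instance? Result r)
        (let [[tag x] @(.state ^Result r)]
          (case tag
            :query   [r]                     ; a leaf: an unfulfilled query
            :waiting (mapcat leaf-queries x) ; recurse into dependencies
            :value   nil))))                 ; fulfilled: nothing to collect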

Testing out the shoddy implementation in Clojure

As promised, this is going to look just like the pretend example in the section before last.

We define some query functions,

in terms of which we write the program, using our block- helpers to indicate that the
map or function application might be blocking:
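Using the invented helpers above, the query functions and the program might be:

    (def getGroup     (query-fn :group))
    (def getBff       (query-fn :bff))
    (def getFirstName (query-fn :first-name))
    (def getLastName  (query-fn :last-name))

    (defn mangleNames [group]
      (block-map (fn [id]
                   (block-call str
                               (block-call getFirstName (getBff id))
                               (getLastName id)))
                 (getGroup group)))

so that (mangleNames :everyone) is the expression we rerun on every pass.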

If we were doing the queries sequentially, we would be alternating calls to getBff and getFirstName, which would be difficult to optimize. Let's see what our haxly conniving gets us:

As expected, we're blocked on the first query, which we pretend returns [321 123]:
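One crude way to fake fulfillment, leaning on the memoization to hand us back the very same Result (fulfill! is another invented helper):

    (defn fulfill! [r v]
      (vreset! (.state ^Result r) [:value v]))

    (fulfill! (getGroup :everyone) [321 123])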

Now run the exact same function again,

This time we get all the queries depending on the user ids we have, and we
fake fulfillment:

and again

and again,

Exactly what problem have we solved?

Or at least what problem would we have solved if we hadn't typed in the query responses manually?

Actually, I think we've solved three problems that are only incidentally related.

Problem 1: We've managed to batch similar queries together. That's
nice, but, as noted before, the database might have done this anyway
had we simply sent them at approximately the same time.

Problem 2: We've desequentialized individual, interleaved queries,
so they can be written to reflect the logical use of their results
rather than the preferred timing of execution.

Problem 3: We did this without explicitly forking.

If you're a web developer, you might think that this is like an elixir for gout,
as a binge on blocking IO is a bit of a luxury to begin with.
Had we been forced from the beginning to write non-blocking code - say, in JavaScript -
the sequentiality problem would likely never have arisen.1 In fact, it might
have required willful effort to force the queries to run sequentially and interleaved.

To prove the point, let's temporarily drop the "there is no fork" part of our
adventure. We'll mock asynchronous queries as returning a core.async channel
that prints to stdout so we can see what it's doing and then, after a short delay,
just munges some strings:
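A sketch of such a mock (the strings and delay are invented):

    (require '[clojure.core.async :as a])

    (defn aquery [table id]
      (a/go
        (println "query:" table id)    ; so we can watch the waves go out
        (a/<! (a/timeout 100))         ; pretend the database is thinking
        (str (name table) "-of-" id))) ; munge a string as the "answer"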

The simplest async version of our program just launches the queries and dereferences
them immediately, using async/map to turn a sequence of channels into a single channel:
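Possibly along these lines (a/map turns the sequence of channels into one channel of vectors):

    (defn amangle-names [ids]
      (a/map vector
             (mapv (fn [id]
                     (a/go (str (a/<! (aquery :first-name (a/<! (aquery :bff id))))
                                (a/<! (aquery :last-name id)))))
                   ids)))

    ;; e.g. (a/<!! (amangle-names [321 123]))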

With some minor reformatting, the output shows that all the getBff queries were sent without waiting before the getName queries, but with a lot less fuss and mutation, so maybe we were in nirvana all along...

Not so fast

First, while it might be second nature to ClojureScript (or whatever-script) developers, this (a/map vector (map #(go (<! (aquery ...))) ...)) business is already a bit removed from plain old (map aquery ...). Without thinking about it too much, we complected our code beyond its functional intent to also express an opinion about how asynchronicity should be juggled.

Furthermore, it's still not as asynchronous as it should be. There's no reason for
the getBffs and getLastNames to go out in different waves, but we introduced an
artificial dependency in the order of arguments to str. To remove the dependency,
we need to contort a bit further, explicitly launching as many of our queries as we
can, before we need them:
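For example (still a sketch):

    (defn amangle-names-2 [ids]
      (a/map vector
             (mapv (fn [id]
                     (let [bff-ch  (aquery :bff id)        ; both queries launched
                           last-ch (aquery :last-name id)] ; before we wait on either
                       (a/go (str (a/<! (aquery :first-name (a/<! bff-ch)))
                                  (a/<! last-ch)))))
                   ids)))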

Now we have only two pauses.

So even with a certain amount of the magic taken care of by asynchronous constructs,
we still have to shovel the code around to get it to behave. Not only have we not
rid ourselves of fork, but we've made deploying it correctly a central part of the
coding challenge.

A tree is a tree

Our naive implementation

has an implicit tree structure, defined by its nested parentheses.

Then, when we ran

we got a Result tree, whose edges were defined by references to
child Results in the :waiting list. Admittedly, we didn't get it all at once,
but in 3 successive passes:

After we translated

to

we got an extremely similar tree, with edges defined by asynchronous waits.
(The <n> below are supposed to indicate the nth invocation of the closure.)

In this case, there are no formal batches, but the vertical position
of the node in the diagram indicates roughly when the query actually
occurs. In practice, they may be jittered up and down a bit, and the
order in which they are executed is no longer deterministic, but if
we, like Haxl, assume that queries are referentially transparent, the
programs will return identical results.

Another difference, as noted previously, is that with no explicit batching, the core.async approach relies on someone else to recognize a run of similar queries, and if that someone lives on the server, this will cause increased network chatter. At the same time, the Haxl style has the ironic side effect of separating computation and querying into distinct, sequential phases, in the name of desequentializing the queries within each phase.

In both cases, similar functional expressions were converted into
similar dependency trees, with queries occurring in similar temporal
waves. And in both cases, the waves are ordered in the same way, with the
deepest dependencies coming first, that is, in topological sort order.

We infer that the mechanics of converting a functional
expression into parallelized, asynchronous form (which so far, we've
done manually) ought to be very similar to the creation of Result
trees in a Haxl approach, and that both of them are essentially topological
sorts.

That one can achieve batching and parallelism through asynchronous code transformation
is not a novel observation. There's a 2014 paper from
Ramachandra, Chavan, Guravannavar, and Sudarshan at IIT actually called
Program Transformations for Asynchronous and Batched Query Submission.
They present a set of formal transformation rules for general procedural programs,
and refer to an implementation for Java in their DBridge
optimizer.
Additionally, about 17 minutes into Jake Donham's
Stitch talk,
he describes a possible batching strategy as "accumulating calls until you hit some threshold or something like that,"
which seems to imply that these calls are arriving asynchronously.

It's worth noting that, while the IIT optimizer allows you not just to write batching-oblivious code but to write exactly the same code as you would if queries were instantaneous, this is not an essential feature of compile-time desequentializing. It's a convenience, one step beyond the main goal of letting code express functional intent in a logical way, but a less important one.

(It may have occurred to the careful reader that I am not carefully
distinguishing concurrency and parallelism. This is true but uninteresting.)

Gifts from the Gods

What is best in life - according to the original script of Conan the Barbarian -
is to discover that there's some nominally difficult thing that might
actually be easy, because you can slough off nearly all the work onto other people. Once
we generalize "doing all our queries into batches" to "doing everything in parallel", we can
rely on a few great tools in the Clojure tool-shed.

Gift I: Homoiconicity

The first set of tools comes from the fact that algorithms expressed in Lisp are vastly easier to transform than those in most languages. Haxl derivatives create a Result structure at runtime; DBridge parses Java code into data structures to which transformation rules are then applied. Clojure programs are already a data structure, so we get to skip a step and transform the code directly.

The transformation, as noted, is really a topological sort. Given

we simply want

to become
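Illustratively (the original forms were along these lines; the transformed expression is assumed to sit inside an enclosing a/go, so the takes are legal):

    (f (g x) (h y))

    ;; becomes something like

    (let [c1 (a/go (g x))   ; launch both argument computations,
          c2 (a/go (h y))]  ; neither waiting on the other
      (f (a/<! c1) (a/<! c2)))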

It's clear how this parallelizes independent argument forms; we can also see that
this depth-first recursive transformation has the effect of
hoisting nested arguments earlier in the evaluation sequence.
For a boring function of a function,

we get

which becomes
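Again illustratively:

    (f (g x))

    ;; transforms, depth first, into

    (let [c1 (a/go (let [c2 (a/go x)]
                     (g (a/<! c2))))]
      (f (a/<! c1)))

    ;; so the innermost form is launched first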

In building up the new structure, we traverse the call graph exactly once, so
the whole procedure is O(N), where N is the total number of arguments of all
functions in the program.
It seems that the Haxl procedure requires
re-generating the graph each time leaf nodes are extracted, making it
O(N log N). This reflects the fact that
the Haxl ordering is stronger than topological: not only
is every query run before its result is required (which is of course a necessary
ordering), but no
query is run before the previous batch completes (which is really not
necessary).

Once you get past a few ritual incantations, it is almost trivial to produce a macro
that does our transformation:

(defmacro parallelize-func-stupid [form]
  (if-not (seq? form)
    form ;; Atoms pass through untouched
    (let [[f & args] form ;; Extract function and arguments
          cs (repeatedly (count args) gensym) ;; Generate some channel names
          pargs (map (fn [arg] ;; Recursively transform the args
                       `(parallelize-func-stupid ~arg))
                     args)
          bs (interleave cs ;; Construct the binding block
                         (map (fn [parg] `(a/go ~parg)) pargs))
          args (map (fn [c] `(a/<! ~c)) cs)] ;; Take each result off its channel
      `(let [~@bs] (~f ~@args)))))
