2014-11-11

This is a presentation of hacking a simple algorithm into the new dev-friendly
branch of H2O, h2o-dev.

This is one of two “Hacking Algorithms into H2O” blogs. Both blogs
start out the same - getting the h2o-dev code and building it. They are the
same until the section titled Building Our Algorithm: Copying from the
Example, and then the content is customized for each
algorithm. This blog describes computing Quantiles.

What is H2O-dev?

As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our
only version. What does “dev-friendly” mean? It means:

No classloader: The classloader made H2O very hard to embed in
other projects. No more!
Witness H2O’s embedding in Spark.

Fully integrated into IdeaJ: You can right-click debug-as-junit any of
the junit tests and they will Do The Right Thing in your IDE.

Fully gradle-ized and maven-ized: Running gradlew build will download
all dependencies, build the project, and run the tests.

These are all external points. However, the code has undergone a major
revision internally as well. What worked well was left alone, but what
was… gruesome… has been rewritten. In particular, it’s now much easier to
write the “glue” wrappers around algorithms and get the full GUI, R, REST &
JSON support on your new algorithm. You still have to write the math, of
course, but there’s not nearly so much pain on the top-level integration.

At some point we’ll flip the usual H2O
github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does
not contain all the functionality in H2O, so it is in its own repo.

Building H2O-dev

I assume you are familiar with basic Java development and how github
repos work - so we’ll start with a clean github clone of h2o-dev:
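    git clone https://github.com/h2oai/h2o-dev.git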

This will download the h2o-dev source base; it took about 30 seconds for me, from
home, onto my old-school Windows 7 box. Then do an initial build:
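    cd h2o-dev
    gradlew build       # on Linux/Mac: ./gradlew build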

The first build took about 12 minutes, including all the test runs. Incremental
gradle-based builds are somewhat faster:
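    gradlew build       # the second time around, up-to-date modules are skipped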

But faster yet will be IDE-based builds. There’s also a functioning Makefile
setup for old-schoolers like me; it’s a lot faster than gradle for incremental
builds.

While that build is going, let’s look at what we got. There are 4 top-level
directories of interest here:

h2o-core: The core H2O system - including clustering, clouding,
distributed execution, distributed Key-Value store, the web, REST and JSON
interfaces. We’ll be looking at the code and javadocs in here - there are a
lot of useful utilities - but not changing it.

h2o-algos: Where most of the algorithms lie, including GLM and Deep
Learning. We’ll be copying the Example algorithm and turning it into
a quantiles algorithm.

h2o-web: The web interface and JavaScript. We will use jar files
from here in our project, but probably not need to look at the code.

h2o-app: A tiny sample Application which drives h2o-core and h2o-algos,
including the one we hack in. We’ll add one line here to teach H2O about our
new algorithm.

Within each top-level directory, there is a fairly straightforward maven’ized
directory structure:
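    h2o-core/src/main/java  - Java source code   (likewise for h2o-algos, h2o-web, h2o-app)
    h2o-core/src/test/java  - Java test code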

In the Java directories, we further use water directories to hold core H2O
functionality and hex directories to hold algorithms and math:
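    h2o-core/src/main/java/water   - Core H2O functionality
    h2o-algos/src/main/java/hex    - Algorithms and math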

Ok, let’s set up our IDE. For me, since I’m using the default IntelliJ IDEA setup:
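    gradlew idea    # if the build applies Gradle's 'idea' plugin; otherwise just Open the project directly, below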

Running H2O-dev Tests in an IDE

Then I switched to IDEAJ from my command window. I launched IDEAJ, selected
“Open Project”, navigated to the h2o-dev/ directory and pressed Open. After IDEAJ opened, I pressed the Make project button (or Build/Make Project or ctrl-F9) and
after a few seconds, IDEAJ reports the project is built (with a few dozen
warnings).

Let’s use IDEAJ to run the JUnit test for the Example algorithm I mentioned
above. Navigate to the ExampleTest.java file. I used a quick double-press of
Shift to bring up the generic project search, then typed some of
ExampleTest.java and selected it from the picker. Inside the one obvious
testIris() function, right-click and select Debug testIris(). The testIris code
should run, pass pretty quickly, and generate some output:

Ok, that’s a pretty big pile of output - but buried in it is some cool stuff we’ll need to be able to pick out later, so let’s break it down a little.

The yellow stuff is H2O booting up a
cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster
setup problems, including the git build version info, memory assigned to the
JVM, and the network ports found and selected for cluster communication. This
section ends with a line like:
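    Cloud of size 1 formed [/127.0.0.1:54321]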

This tells us we formed a Cloud of size 1: one JVM will be running our program,
and its IP address is given.

The light blue stuff is our
ExampleTest JUnit test starting up and loading some test data (the venerable
iris dataset with headers, stored in the H2O-dev repo’s
smalldata/iris/ directory). The printout includes some basic stats about the
loaded data (column header names, min/max values, compression ratios).
Included in this output are the lines Start Parse and Done Parse. These
come directly from the System.out.println("Start Parse") lines we can see in
the ExampleTest.java code.

Finally, the green stuff is our Example
algorithm running on the test data. It is a very simple algorithm (finds the
max per column, and does it again and again, once per requested _max_iters).

Building Our Algorithm: Copying from the Example

Now let’s get our own algorithm framework to start playing with in place.

We want to compute Quantiles, so let’s call the code quantile. I cloned the
main code and model from the h2o-algos/src/main/java/hex/example/ directory
into h2o-algos/src/main/java/hex/quantile/, and also the test from
h2o-algos/src/test/java/hex/example/ directory into
h2o-algos/src/test/java/hex/quantile/.

Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas
with Example in the name (ExampleHandler.java, ExampleModelV2.java,
ExampleV2.java) to their Quantile* variants.

I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java
file to its Quantile variant. Finally I renamed files and file contents
from Example to Quantile.

I also dove into h2o-app/src/main/java/water/H2OApp.java and copied the two
Example lines and made their Quantile variants. Because I’m old-school,
I did this with a combination of shell hacking and Emacs; about 5 minutes all
told.

At this point, back in IDEAJ, I navigated to QuantileTest.java, right-clicked
debug-test testIris again - and was rewarded with my Quantile clone running
a basic test. Not a very good Quantile, but definitely a start.

What’s in All Those Files?

What’s in all those files? Mainly there is a Model and a ModelBuilder, and
then some support files.

A model is a mathematical representation of the world, an effort to
approximate some interesting fact with numbers. It is a static concrete
unchanging thing, completely defined by the rules (algorithm) and data used
to make it.

A model-builder builds a model; it is transient and active. It exists as
long as we are actively trying to make a model, and is thrown away once we
have the model in-hand.

In our case, we want quantiles as a result - a mathematical representation of
the world - so that belongs in the QuantileModel.java file. The algorithm
to compute quantiles belongs in the model builder, the Quantile.java file.

We also split Schemas from Models - to isolate slow-moving external APIs from
rapidly-moving internal APIs: as a Java dev you can hack the guts of Quantile
to your heart’s content - including the inputs and outputs - as long as the
externally facing V2 schemas do not change. If you want to report new stuff or
take new parameters, you can make a new V3 schema - which is not compatible
with V2 - for the new stuff. Old external V2 users will not be affected by
your changes (you’ll still have to make the correct mappings in the V2 schema
code from your V3 algorithm).

One other important hack: quantiles is an unsupervised algorithm - no response
column in the training data tells it what the results “should” be. So we need to hack
the word Supervised out of all the various class names it appears in. After
this is done, your QuantileTest probably fails to compile, because it is trying
to set the response column name in the test, and unsupervised models do not
get a response to train with. Just delete the line for now:
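    // In my copy this was the response-column assignment in the test setup;
    // the exact field name may differ in your snapshot:
    parms._response_column = "class";  // delete this line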

At this point we can run our test code again (still finding the max-per-column).

The Quantile Model

The Quantile model, in the file QuantileModel.java, should contain what we
expect out of quantiles: the quantile value, a single double. Usually people
request a set of Q probabilities (e.g., the 1%, 5%, 25% and 50% probabilities),
so we’ll report a double[/*Q*/] of quantiles. Also, we’ll report them for
all N columns in the dataset, so really a double[/*N*/][/*Q*/]. For our Q
probabilities, this will be:
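    public double[/*N*/][/*Q*/] _quantiles;  // Q quantiles for each of the N columns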

Inside the QuantileModel class, there is a class for the model’s output:
class QuantileOutput. We’ll put our quantiles there. The various support
classes and files will make sure our model’s output appears in the correct REST &
JSON responses, and gets pretty-printed in the GUI. There is also the
left-over _maxs array from the old Example code; we can delete that now.

And finally a quick report on the effort used to build the model: the number
of iterations we actually took. Our quantiles will run in iterations,
improving with each iteration until we get the exact answer. I’ll describe the
algorithm in detail below, but it’s basically an aggressive
radix sort.

My final QuantileOutput class looks like:
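    // Roughly - base-class details vary across h2o-dev snapshots:
    public static class QuantileOutput extends Model.Output {
      public int _iters;                      // Effort: number of iterations run
      public double[/*N*/][/*Q*/] _quantiles; // Our quantiles, per column
      public QuantileOutput( Quantile b ) { super(b); }
    }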

Now, let’s turn to the input to our model-building process. These are stored
in the class QuantileModel.QuantileParameters. We already inherit an input
training dataset (returned with train()), and some other helpers (e.g. which
columns to ignore if we do an expensive scoring step with each iteration). For
now, we can ignore everything except the input dataset from train().

However, we want some more parameters for quantiles: the set of probabilities.
Let’s also put in some default probabilities (the GUI & REST/JSON layer will let
these be overridden, but it’s nice to have some sane defaults). Define it next
to the left-over _max_iters from the old Example code (which we might as well
also nuke):
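    // For example - any sane default set will do:
    public double[] _probs = new double[]{0.01, 0.05, 0.10, 0.25, 0.333, 0.50, 0.667, 0.75, 0.90, 0.95, 0.99};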

My final QuantileParameters class looks like:
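    // Again roughly, omitting boilerplate your snapshot may require:
    public static class QuantileParameters extends Model.Parameters {
      // Probabilities to compute; the GUI & REST/JSON layers can override these
      public double[] _probs = new double[]{0.01, 0.05, 0.10, 0.25, 0.333, 0.50, 0.667, 0.75, 0.90, 0.95, 0.99};
    }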

A bit on field naming: I always use a leading underscore _ before all
internal field names - it lets me know at a glance whether I’m looking at a
field name (stateful, can be changed by other threads) or a function parameter
(functional, private). The distinction becomes interesting when you are
sorting through large piles of code. There’s no other fundamental reason to
use (or not) the underscores. External APIs, on the other hand, generally do
not care for leading underscores. Our JSON output and REST URLs will strip the
underscores from these fields.

To make the GUI functional, I need to add my new probabilities field to the external schema
in h2o-algos/src/main/java/hex/schemas/QuantileV2.java:
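    // Something like this; the @API annotation drives the GUI/REST/JSON glue,
    // and your snapshot may also want the field named in the schema's field list:
    @API(help="Probabilities for the requested quantiles, each between 0 and 1")
    public double[] probs;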

And I need to add my result fields to the external output schema in
h2o-algos/src/main/java/hex/schemas/QuantileModelV2.java:
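    // Again, something like:
    @API(help="Iterations executed") public int iters;
    @API(help="Quantiles, one set per column, one value per requested probability")
    public double[][] quantiles;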

The Quantile Model Builder

Let’s turn to the Quantile model builder, which includes some boilerplate we
inherited from the old Example code, and a place to put our real algorithm.
There is a basic Quantile constructor which calls init:
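    // Roughly - the ModelBuilder super-constructor varies by snapshot:
    public Quantile( QuantileModel.QuantileParameters parms ) {
      super("Quantile", parms);
      init(false);  // Only cheap argument-checking for now
    }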

In this case, the init(false) means “only do cheap stuff in init“. Init is
defined a little ways down and does basic (cheap) argument checking.
init(false) is called every time the mouse clicks in the GUI, and is used to
let the front-end sanity-check parameters as people type. In this case
“only do cheap stuff” really means “only do stuff you don’t mind waiting on
while clicking in the browser”. No computing quantiles in the init() call!

Speaking of the init() call, the one we got from the old Example code only
sanity-checks the now-deleted _max_iters. Let’s add some lines to check that
our _probs are sane:
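    // A sketch; error() is the builder's argument-validation helper:
    @Override public void init( boolean expensive ) {
      super.init(expensive);
      if( _parms._probs != null )
        for( double p : _parms._probs )
          if( p < 0.0 || p > 1.0 )
            error("_probs", "Probabilities must be between 0 and 1, found " + p);
    }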

In the Quantile.java file there is a trainModel call that is used when you
really want to start running quantiles (as opposed to just checking arguments).
In our case, the old boilerplate starts a QuantileDriver in a background
thread. Not required, but for any long-running algorithm it is nice to have it
run in the background. We’ll get progress reports from the GUI (and from
REST/JSON) with the option to cancel the job, or inspect partial results as the
model builds.
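
The boilerplate itself is short - roughly like this (the exact start() signature
varies across h2o-dev snapshots):

    @Override public Job<QuantileModel> trainModel() {
      return start(new QuantileDriver(), /*work units*/ 1);
    }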

The class QuantileDriver holds the algorithmic meat. The compute2() call
will be called by a background Fork/Join worker thread to drive all the hard
work. Again, there is some brief boilerplate we need to go over.

First up: we need to record Keys stored in H2O’s DKV: Distributed
Key/Value store, so a later cleanup, Scope.exit();, will wipe out any temp
keys. When working with Big Data we have to be careful to clean up after
ourselves - or we can swamp memory with Big Temps.
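
The pattern is a try/finally bracketing the whole driver body:

    Scope.enter();   // Begin tracking temp Keys...
    try {
      // ... all the model-building work ...
    } finally {
      Scope.exit();  // ...then wipe the temps out, no matter how we exit
    }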

Next: we need to prevent the input datasets from being manipulated by other
threads during the model-build process:
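    // Something like this - the exact read-lock spelling varies by snapshot:
    Frame fr = _parms.train();  // The training frame
    fr.read_lock(_key);         // Cooperative lock against deletes/overwrites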

Locking prevents situations like accidentally deleting or loading a new dataset
with the same name while quantiles is running. Like the Scope.exit() above,
we will unlock in a finally block. While it might be nice to use Java locking,
or even JDK 5.0 locks, we need a distributed lock, which is not provided by
JDK 5.0. Note that H2O locks are strictly cooperative - H2O cannot
enforce them at the JVM level the way the JVM enforces its own monitors.

Next, we make an instance of our model object (with no quantiles yet) and place
it in the DKV, locked (e.g., to prevent another user from overwriting our
model-in-progress with an unrelated model).
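
In code, roughly (constructor arguments vary a bit by snapshot):

    QuantileModel model = new QuantileModel(dest(), _parms, new QuantileModel.QuantileOutput(Quantile.this));
    model.delete_and_lock(_key);  // Write-lock our model-in-progress in the DKV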

Also, near the file bottom is a leftover class Max from the old Example code.
Might as well nuke it now.

The Quantile Main Algorithm

Finally we get to where the Math is!

Quantiles starts with some clusters, generally picked from the dataset
population, and then optimizes the cluster centers and cluster
assignments. The easiest (but not the best!) way to pick clusters is just to
pick points at (pseudo) random. So ahead of our iteration/main loop, let’s pick
some clusters.

My Quantile now has a leftover loop from the old Example code running up to
some max iteration count. This sounds like a good start to computing quantiles - we’ll
need several stopping conditions, and max-iterations is one of them.

Let’s also stop if somebody clicks the “cancel” button in the GUI:
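    if( !isRunning(self()) ) return;  // Stopped/cancelled; the exact Job-status test varies by snapshot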

I removed the “compute Max” code from the old Example code in the loop body.
Next up, I see code to record any new model (e.g. clusters, mse), and save
the results back into the DKV, bump the progress bar, and log a little bit of
progress:
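    // Roughly:
    model.update(self());  // Save the model's new state back into the DKV
    update(1);             // One unit of work done: bump the progress bar
    Log.info("Iteration " + model._output._iters + " done");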

The Quantile Main Loop

And now we need to figure out what to do in our main loop. Somewhere between the
loop-top-isRunning check and the model.update() call, we need to compute
something to update our model with! This is the meat of Quantile - for each
point, assign it to the nearest cluster center, then compute new cluster
centers from the assigned points, and iterate until the clusters quit moving.

Anything that starts out with the words “for each point” when you have a billion
points needs to run in-parallel and scale-out to have a chance of completing
fast - and this is exactly what H2O is built for! So let’s write code that runs
scale-out for-each-point… and the easiest way to do that is with an H2O
Map/Reduce job - an instance of MRTask. For Quantile, this is an instance of
Lloyd’s basic algorithm. We’ll call it from the main-loop like this, and
define it below (extra lines included so you can see how it fits):
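    // A sketch of the main-loop body; averageNewClusters is my stand-in name for
    // the averaging step, and the Job/Model calls vary a bit by snapshot:
    for( int iter = 0; iter < _parms._max_iters; iter++ ) {
      if( !isRunning(self()) ) return;                    // Stopped/cancelled
      Lloyds ll = new Lloyds(clusters).doAll(fr);         // One scale-out pass over all the data
      clusters = averageNewClusters(ll._sums, ll._rows);  // New centers: mean of assigned points
      double mse = ll._se / fr.numRows();                 // Watch the error fall over iterations
      Log.info("Iteration " + iter + " MSE=" + mse + " rows=" + java.util.Arrays.toString(ll._rows));
      // ... copy the new clusters and mse into model._output here ...
      model.update(self());                               // Save progress back to the DKV
      update(1);                                          // Bump the progress bar
    }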

Basically, we just called some not-yet-defined Lloyds code, computed some
cluster centers by computing the average point from the points in the new
cluster, and copied the results into our model. I also printed out the Mean
Squared Error and row counts, so we can watch the progress over time. Now
class Lloyds can be coded as an inner class to the QuantileDriver class:
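    // The skeleton; map and reduce get filled in below:
    private static class Lloyds extends MRTask<Lloyds> {
      private final double[][] _clusters; // Input: the current cluster centers
      double[][] _sums;  // Output: per-cluster, per-column sums of the assigned points
      long[]     _rows;  // Output: per-cluster counts of assigned points
      double     _se;    // Output: total squared error
      Lloyds( double[][] clusters ) { _clusters = clusters; }
      @Override public void map( Chunk[] chks ) { /* shown below */ }
      @Override public void reduce( Lloyds ll ) { /* shown below */ }
    }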

A Quick H2O Map/Reduce Diversion

This isn’t your Hadoop-Daddy’s Map/Reduce. This is an in-memory super-fast
map-reduce… where “super-fast” generally means “memory bandwidth limited”,
often 1000x faster than the usual hadoop-variant - MRTasks can often touch a
gigabyte of data in a millisecond, or a terabyte in a second (depending on how much
hardware is in your cluster - more hardware is faster for the same amount of
data!)

The map() call takes data in Chunks - where each Chunk is basically a
small array-like slice of the Big Data. Data in Chunks is accessed with basic
at0 and set0 calls (vs accessing data in Vecs with at and set). The
output of a map() is stored in the Lloyds object itself, as a Plain Olde
Java Object (POJO). Each map() call has private access to its own fields and
Chunks, which implies there are lots of instances of Lloyds objects
scattered all over the cluster (one such instance per Chunk of data…
well, actually one instance per call to map(), but each map call is handed an
aligned set of Chunks, one per feature or column in the dataset).

Since there are lots of little Lloyds running about, their results need to be
combined. That’s what reduce does - combine two Lloyds into one.
Typically, you can do this by adding similar fields together - often array elements are
added side-by-side, similar to a saxpy operation.

All code here is written in a single-threaded style, even as it runs in
parallel and distributed. H2O handles all the synchronization
issues.

A Quick Helper

A common op is to compute the distance between two points. We’ll compute
it as the squared Euclidean distance (squared so as to avoid an expensive
square-root operation):
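    // Squared Euclidean distance between a cluster center and a data point
    static double distance( double[] center, double[] point ) {
      double sum = 0;
      for( int i = 0; i < center.length; i++ ) {
        double d = center[i] - point[i];
        sum += d * d;
      }
      return sum;
    }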

Lloyds

Back to Lloyds: we loop over the Chunk[] of data handed to the map call,
loading each row into a double[] for easy handling:
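    @Override public void map( Chunk[] chks ) {
      _sums = new double[_clusters.length][chks.length];
      _rows = new long  [_clusters.length];
      double[] ds = new double[chks.length];  // One row's worth of data
      for( int row = 0; row < chks[0]._len; row++ ) {
        for( int col = 0; col < chks.length; col++ )
          ds[col] = chks[col].at0(row);       // Load the row as plain doubles
        // ... assign the point to its nearest cluster, below ...
      }
    }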

Then we need to find the nearest cluster center:
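    // Nearest cluster center, using the distance() helper from above
    int nearest = 0;
    double minDist = Double.MAX_VALUE;
    for( int c = 0; c < _clusters.length; c++ ) {
      double d = distance(_clusters[c], ds);
      if( d < minDist ) { minDist = d; nearest = c; }
    }
    _se += minDist;  // Accumulate squared error as we go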

And then add the point into our growing pile of points in the new clusters
we’re building:
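    for( int col = 0; col < ds.length; col++ )
      _sums[nearest][col] += ds[col];  // Grow the per-cluster sums...
    _rows[nearest]++;                  // ...and row counts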

And that ends the map call and the Lloyds main work loop. To recap, here it
is all at once:
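    @Override public void map( Chunk[] chks ) {
      _sums = new double[_clusters.length][chks.length];
      _rows = new long  [_clusters.length];
      double[] ds = new double[chks.length];
      for( int row = 0; row < chks[0]._len; row++ ) {
        for( int col = 0; col < chks.length; col++ )
          ds[col] = chks[col].at0(row);
        int nearest = 0;
        double minDist = Double.MAX_VALUE;
        for( int c = 0; c < _clusters.length; c++ ) {
          double d = distance(_clusters[c], ds);
          if( d < minDist ) { minDist = d; nearest = c; }
        }
        _se += minDist;
        for( int col = 0; col < chks.length; col++ )
          _sums[nearest][col] += ds[col];
        _rows[nearest]++;
      }
    }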

The reduce needs to fold together the returned results: the _sums, the
_rows, and the _se:
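    @Override public void reduce( Lloyds ll ) {
      for( int c = 0; c < _sums.length; c++ ) {
        for( int col = 0; col < _sums[c].length; col++ )
          _sums[c][col] += ll._sums[c][col];  // Fold the sums side-by-side
        _rows[c] += ll._rows[c];              // Fold the row counts
      }
      _se += ll._se;                          // Fold the squared error
    }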

And that completes Quantile!

Running Quantile

Running the QuantileTest returns:

You can see the Mean Squared Error dropping over time, and the clusters stabilizing.

At this point there are a zillion ways we could take this Quantile code:

Report MSE on the validation set

Report MSE per-cluster (some clusters will be ‘tighter’ than others)

Run K-Means with different seeds, which will likely build different
clusters - and report the best cluster, or even sort and group them by MSE.

Handle categoricals (e.g. compute Manhattan distance instead of Euclidean)

Normalize the data (optionally, on a flag). Without normalization, features
with larger absolute values will have larger distances (errors) and will get
priority over other features.

Aggressively split high-variance clusters to see if we can get K-Means out of
a local minimum.

Handle clusters that “run dry” (zero rows), possibly by splitting the point
with the most error/distance out from its cluster to make a new one.

I’m sure you can think of lots more ways to extend K-Means!

Good luck with your own H2O algorithm,

Cliff
