2015-10-21


Appearing originally on the Typesafe blog, this article covers two emerging (and somewhat frightening) machine learning technologies for the JVM – Deeplearning4J and BIDData.

The explosive interest in building and deploying message-driven, elastic, resilient and responsive Reactive applications in enterprises continues to drive the need for real-time data streaming and instant decision-making. We see tools like Apache Spark, Cassandra, Riak, Kafka, Akka and Slick embracing this trend already. Additionally, the reality of Reactive Streams 1.0.0 is helping to pave the way for a new generation of somewhat alarming tools in the fields of Machine Learning and Deep Learning, which some believe may precipitate the SkyNet takeover of Earth…

Nonetheless, there are two emerging projects in our ecosystem – Deeplearning4J and BIDData – that will get architects and developers passionate about data-centric computing. Let’s quickly go over these projects in plain language.

What is Machine Learning and Deep Learning?

First, let’s talk about Machine Learning, and why we should care. Machine Learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It explores the construction and study of algorithms that can learn from data and execute predictive analysis. That’s right. Think Data from Star Trek, but without the humor (or nearly-indestructible android body).

The alarming part is that you’re basically teaching computers to act without being explicitly programmed. In the past decade, Machine Learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. It’s so pervasive that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.
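
To make the “not explicitly programmed” part concrete, here is a minimal sketch written for this article (plain Scala, no library): the program is never told the rule y = 2x + 1, yet it recovers it from example pairs by gradient descent – the same principle that, at vastly larger scale, underlies the systems above.

    // Learn y = w*x + b from examples instead of hard-coding the rule.
    val data = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0))

    var (w, b) = (0.0, 0.0)           // parameters: initially know nothing
    val lr = 0.01                     // learning rate (step size)
    for (_ <- 1 to 5000; (x, y) <- data) {
      val err = (w * x + b) - y       // prediction error on one example
      w -= lr * err * x               // nudge parameters to shrink the error
      b -= lr * err
    }
    println(f"learned: y = $w%.2f * x + $b%.2f")  // prints ≈ y = 2.00 * x + 1.00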

Deep Learning is a cutting-edge branch of Machine Learning that deals with a class of algorithms based on neural networks of a certain shape (i.e. deep networks). Its main goal is to help Machine Learning move ever closer to AI. Deep Learning is forward-thinking, and Facebook, Google, Baidu, and Netflix are heavily invested in it. A small selection of recent breakthroughs using Deep Learning includes (a toy sketch of what makes a network “deep” follows the list):

Self-driving cars

Genomic analysis

Voice recognition

Image recognition

Image labelling

Drug discovery
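
What all of these share is the same structural idea: a deep network is nothing more than many simple layers stacked on top of each other. Here is a toy illustration in plain Scala (written for this article, not taken from any library) of a forward pass through such a stack:

    // A layer is a weight matrix plus a bias vector; a deep network is a
    // Seq of layers. Depth comes from composing affine maps + nonlinearities.
    type Layer = (Array[Array[Double]], Array[Double])

    def relu(x: Double): Double = math.max(0.0, x)

    def forward(input: Array[Double], layers: Seq[Layer]): Array[Double] =
      layers.foldLeft(input) { case (in, (weights, biases)) =>
        weights.zip(biases).map { case (row, bias) =>
          relu(row.zip(in).map { case (w, x) => w * x }.sum + bias)
        }
      }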

One prominent thing to note is that training these systems takes a lot of fairly specialised computing power. A key component is the Graphics Processing Unit (GPU), which has been showing robust performance for this type of processing and could serve as an alternative to CPUs in the future. The technology is new, but it is already the source of revolutionary achievements across large segments of Machine Learning problems.

If, as a developer, you’re interested in these new use cases and algorithms, the following technologies aim to get you started.

Deeplearning4J – The Future of AI on the JVM

Deeplearning4J (DL4J), as the name implies, is Deep Learning on the JVM. As a matter of fact, it’s the only project offering a commercial-grade, open-source, distributed deep-learning library for Java and Scala. DL4J is aimed at developers and businesses rather than researchers, so it’s integrated with Hadoop and Spark.
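
To give a feel for the style, here is a hedged sketch of building a small feed-forward network with DL4J’s configuration builder. The method names follow the project’s published examples from this period and may differ between versions, so treat it as an illustration rather than copy-paste code.

    import org.deeplearning4j.nn.conf.NeuralNetConfiguration
    import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
    import org.nd4j.linalg.lossfunctions.LossFunctions

    val conf = new NeuralNetConfiguration.Builder()
      .seed(123)                       // fixed seed for reproducible runs
      .list()
      .layer(0, new DenseLayer.Builder()
        .nIn(784).nOut(256).activation("relu").build())   // hidden layer
      .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .nIn(256).nOut(10).activation("softmax").build()) // 10-class output
      .build()

    val model = new MultiLayerNetwork(conf)
    model.init()
    // model.fit(trainIter)  // trainIter would be a DataSetIterator, e.g. over MNIST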

Its competitors are PyLearn (Python), Caffe (C++) and Torch (Lua). Just to give it a little more credibility: the main authors, Chris Nicholson and Adam Gibson, are huge Scala fans, excited about OSS, and have built a library of Deep Learning software tools that are freely available to everyone. Adam presented DL4J at Scala Days 2015 Amsterdam.

BIDData – Performant GPU-based Machine Learning built with Scala

A project that takes a different direction is BIDData, an umbrella for the companion projects BIDMat, BIDMach and Kylix, developed in John Canny’s lab at Berkeley. Instead of one complete library, it is a set of general-purpose libraries for numerical and Machine Learning algorithms, focused on how best to implement them on GPUs.
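
A hedged sketch of what that looks like with BIDMat (names taken from the project’s documentation of the time; details may vary by version): CPU and GPU matrices share one interface, so moving a computation onto the GPU is mostly a change of type, not a rewrite.

    import BIDMat.{FMat, GMat}
    import BIDMat.MatFunctions._
    import BIDMat.SciFunctions._

    val a: FMat = rand(4096, 4096)   // single-precision matrix on the CPU
    val b: FMat = rand(4096, 4096)

    val ga = GMat(a)                 // copy the data into GPU memory
    val gb = GMat(b)
    val gc = ga * gb                 // same * operator, now a CUDA matrix multiply

    val c = FMat(gc)                 // copy the result back to the CPU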

As the sketch above suggests, the goal is to expose a Scala interface to the GPU, which lets you code for it in Scala without having to deal with how the hardware works. This library is the top performer, bar none, in many contexts: just check out these benchmarks. It beats the performance of both Spark and other competitors, even in the most disadvantageous setups. The project has been covered by NVIDIA’s dev blog and elsewhere, but the gist is that the author computes the best possible runtime (here, the most data processed per second) and derives the shape of his implementation from that, an approach called roofline design. The approach is shared with the next best competitor, Vowpal Wabbit (from Yahoo Research, then Microsoft Research). Yet BIDMach is still faster.
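
The roofline model itself fits in one line; this is the textbook formula, not BIDData code:

    // Attainable throughput is capped either by the chip's peak compute rate
    // or by memory bandwidth times the kernel's arithmetic intensity
    // (flops performed per byte moved).
    def rooflineGflops(peakGflops: Double, bandwidthGBps: Double, flopsPerByte: Double): Double =
      math.min(peakGflops, bandwidthGBps * flopsPerByte)

    // e.g. a memory-bound kernel at 0.5 flops/byte on a 200 GB/s GPU tops out
    // at rooflineGflops(4000, 200, 0.5) == 100.0 Gflop/s, regardless of raw compute.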

But parts of the BIDData project go beyond “running super fast on GPUs”. The concept of Butterfly Mixing, and their implementation of it, called Kylix, also solves a fundamental problem with big data engines such as Hadoop or Spark: their communication patterns make all the machines in a cluster contend for the network at the same time, and therefore overwhelm it. Kylix achieves a better runtime by making those machines alternate between computing and transferring data over the network, in lockstep.
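
To see why the butterfly pattern avoids that contention, here is a toy simulation written for this article (not the Kylix API) of a butterfly allreduce over a power-of-two number of nodes: in each round, every node exchanges with exactly one partner, so computation and communication alternate in lockstep and no link is oversubscribed.

    // After log2(n) rounds, every node holds the sum of all initial values.
    def butterflyAllReduce(values: Array[Double]): Array[Double] = {
      val n = values.length
      require((n & (n - 1)) == 0, "node count must be a power of two")
      var current = values.clone()
      for (r <- 0 until Integer.numberOfTrailingZeros(n)) {
        val next = new Array[Double](n)
        for (i <- 0 until n) {
          val partner = i ^ (1 << r)              // round r: exchange across bit r
          next(i) = current(i) + current(partner) // combine the two partial sums
        }
        current = next
      }
      current
    }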

John Canny, perhaps unsurprisingly, is also a huge Scala fan, and much of BIDData is written in Scala. He’s done amazing work with a small team of just a few PhD students, one of whom happens to be hearing impaired. You can watch John’s recorded talk, Machine Learning at the Limit, from a Machine Learning Meetup in San Francisco.
