Thomaswdinsmore.com

IBM and Spark

2016-02-13

On June 15, 2015, IBM announced a major commitment to Spark. As we approach Spark Summit East, I thought it would be fun to check back and see how IBM’s accomplishments compare with the goals stated back in June.

Before we start, I’d like to note that any contribution to Spark moves the project forward, and is a good thing. Also, simply by endorsing Spark, IBM has changed the conversation. In early 2015, some analysts and journalists claimed that Spark was overhyped and “not enterprise ready.” We haven’t heard a peep from this crowd since IBM’s announcement. For that alone, IBM should get some kind of prize. :-)

In its announcement, IBM detailed six initiatives:

IBM will build Spark into the core of the company’s analytics and commerce platforms.

IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.

IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.

IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.

IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.

IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

Let’s see where things stand.

Spark in IBM Analytics and Commerce Platforms

IBM has an expansive definition of “analytics”, reporting $17.9 billion in business analytics revenue in 2015. IDC, which tracks the market, credits IBM with $4.5 billion in business analytics software revenue in 2014. The remaining $13.4 billion, it seems, is services and fluff, neither of which counts when the discussion is “platforms.”

Of that $4.5 billion, the big dogs are InfoSphere (DataStage), DB2, PureData System for Analytics (Netezza), IBM Business Analytics (Cognos) and IBM Predictive Analytics (SPSS), so this is where we should look when IBM says it is building Spark into its products.

Currently, Cloudant is the only IBM data source with a published Spark connector. Want to access DB2 with Spark? It’s a science project. Of course, you can always use the JDBC connector if you’re patient, but the standard is a parallel high-speed connector, like SAS has offered for years. An IBM insider tells me that there is a project underway to build a Spark connector for Netezza, which will be a good thing when it’s available.
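To make the "if you’re patient" point concrete, here is a minimal sketch of why a plain JDBC read is slow and what a range-partitioned read does about it. It is loosely modeled on the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options of Spark's JDBC data source, but it is a simplified illustration and not Spark's actual implementation. The function name `jdbc_partition_predicates` and the `ORDER_ID` column are hypothetical.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split a numeric column range into WHERE clauses, one per connection.

    Simplified illustration only; not Spark's actual implementation.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if num_partitions == 1 or upper <= lower:
        # One connection scans the whole table: the slow default case.
        return ["1=1"]

    stride = max((upper - lower) // num_partitions, 1)
    bound = lower + stride

    # First slice is open below, so NULLs and rows under `lower` are kept.
    predicates = [f"{column} < {bound} OR {column} IS NULL"]
    for _ in range(num_partitions - 2):
        predicates.append(
            f"{column} >= {bound} AND {column} < {bound + stride}"
        )
        bound += stride
    # Last slice is open above, so rows over `upper` are kept.
    predicates.append(f"{column} >= {bound}")
    return predicates


if __name__ == "__main__":
    # Hypothetical column and bounds, for illustration only.
    for clause in jdbc_partition_predicates("ORDER_ID", 0, 100, 4):
        print(clause)
```

Each clause would drive its own database connection, so four slices mean four concurrent queries. Even so, every row still travels through the JDBC driver, which is why a purpose-built parallel connector remains the benchmark.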

Last October in Budapest, an IBM VP (one of 137 IBM Veeps of “Big Data”) claimed that InfoSphere (DataStage) supports Spark now. Searching documentation for InfoSphere 11.3, the most current version, produces this:

IBM appears to have discovered a new kind of product management where you build features into a product, then omit them from the documentation.

That was the approach taken for IBM Analytic Server Release 2.1, which was packaged up and shoved out the door so fast the documentation folks forgot to mention the Spark pushdown. That said, Release 2.1.0.1 is an improvement; all functions that Analytic Server can push down to MapReduce now push down to Spark, and IBM supports the product on Cloudera and MapR as well as Hortonworks and BigInsights.

It’s not clear, though, why IBM thinks that licensing Analytic Server as a separate product is a smart move. Most of the value in analytics is at the top of the stack (e.g., SPSS or Cognos), where users can see results. Spark pushdown is rapidly becoming table stakes for analytics software; the smarter move for IBM is to bundle Analytic Server for free into SPSS, Cognos and BigInsights, to build value in those products.

It’s also curious that IBM continues to peddle Analytic Server while donating SystemML to open source. Why not push down to Spark through SystemML?

So far as Spark is concerned, IBM is leaving SPSS Statistics users out in the cold unless they want to add Modeler and Analytic Server to the stack. For these customers, Alteryx and RapidMiner look attractive.

A search for Spark in Cognos documentation yields a big fat zero, which explains why Gartner just tossed IBM from the Leaders quadrant in BI.

Spark in IBM Watson Health Cloud

IBM insiders tell me that Watson relies heavily on Spark. Okay. Watson is a completely opaque product, so it’s impossible to verify whether IBM powers Watson with Spark or an army of trained crickets.

SystemML to Open Source

Though initially skeptical about SystemML, I’m more excited about its potential as I learn more about the software. I understand that, rather than simply building an interface to the native Spark machine learning library (MLlib), IBM has completely rebuilt the algorithms. That’s a good thing: some of the folks I know who have tried to use MLlib aren’t impressed. Without getting into the details of the issues, suffice it to say that it’s good for Spark to have multiple initiatives building functional libraries on top of the Spark core.

IBM’s Fred Reiss, chief architect at IBM’s Spark Technology Center, is scheduled to present about SystemML at Spark Summit East next week.

Spark in Bluemix

IBM introduced Spark-as-a-Service in Bluemix as a beta in July 2015 and moved it to general availability in October.

The service includes Spark, Jupyter Notebooks and Swift object storage. It’s a bit lost in the jumble of services available in Bluemix, but the Catalogue has a handy search tool.

As of this writing, Bluemix offers Spark 1.4. Although two dot releases behind, that is competitive with Qubole Data Service. Databricks is still the best bet if you want the most up-to-date release.

IBM People to Spark Projects

3,500 researchers and developers. Wow! That’s a lot of butts in seats. Let’s break that down into four categories:

(1) IBM people who actively contribute to Spark.

(2) IBM developers building interfaces from IBM products to Spark.

(3) IBM developers building IBM products on top of Spark.

(4) IBM consultants building custom applications on top of Spark.

Note that of the four categories, only (1) actually moves the Spark project forward. Of course, anyone who uses Spark has the potential to contribute feedback, but ultimately someone has to cut code. While IBM tracks the Spark JIRAs to which it contributes, IBM executives could not answer a simple question: of the 248 people who contributed to Spark Release 1.6, how many work for IBM?
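For what it’s worth, a rough answer to that question can be sketched from commit metadata. The snippet below is a minimal illustration, assuming you have already extracted author email addresses from the release’s commit range (for example with `git log --format='%ae'`). The helper names and the sample addresses are made up for illustration and are not real contributor data. It will undercount any company whose people commit from personal email accounts.

```python
from collections import Counter


def count_by_domain(author_emails):
    """Count distinct authors per email domain (case-insensitive)."""
    unique = {e.strip().lower() for e in author_emails if "@" in e}
    return Counter(e.rsplit("@", 1)[1] for e in unique)


def authors_under(counts, suffix):
    """Sum authors whose domain is `suffix` or a subdomain of it."""
    return sum(
        n for domain, n in counts.items()
        if domain == suffix or domain.endswith("." + suffix)
    )


if __name__ == "__main__":
    # Made-up sample addresses, for illustration only.
    sample = [
        "alice@us.ibm.com",
        "bob@databricks.com",
        "alice@us.ibm.com",   # same author, second commit
        "carol@ibm.com",
        "dave@gmail.com",
    ]
    counts = count_by_domain(sample)
    print(authors_under(counts, "ibm.com"))
```

Counting distinct addresses rather than commits matters here, since the question is about people, not volume of code.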

I suspect that most of those 3,500 researchers and developers are in categories (3) and (4).

Spark Training

Like the Million Man March, “training a million people” sounds like one of those PR-driven claims that nobody is expected to take seriously, especially since it’s not time-boxed.

Anyway, the details:

AMPLab offers occasional training in the complete BDAS stack under the AMPCamp format. IBM funds AMPLab, but it does not appear that AMPLab is doing anything now that it wasn’t already doing last June.

DataCamp does not offer Spark training.

MetiStream offers public and private Spark training with a defined curriculum and service offering. The training program is certified by Databricks.

Galvanize does not offer Spark training.

Big Data University offers a two-part MOOC in Spark fundamentals.

The Big Data University courses are free, and four hours apiece, so a million enrollees is plausible, eventually at least. Interestingly, MetiStream developed the second of the two BDU courses. So the press release should read “MetiStream and IBM, but mostly MetiStream, will train a million….”