Thomaswdinsmore.com

Big Analytics Roundup (February 8, 2016)

2016-02-08

We have a bumper crop of explainers, plus lots of news on this snowy New England Monday:

IBM announces new cloud services

Gartner shakes up the BI MQ

Microsoft Ventures adds ten machine learning startups to incubator

Redis Labs releases benchmark of Spark with Redis

SAS ekes out 2% revenue growth in 2015

MSFT Azure IoT goes GA

MapR patents its proprietary bits

New Mesos release

Yes, I saw the Teradata earnings report and will blog separately on that train wreck.

On the CBInsights blog, an anonymous author reports on VC funding for AI startups; Steven Loeb summarizes. CB’s analysis is a muddle; they include RapidMiner and H2O in the AI category, but exclude Palantir, which raised $1.3 billion in 2015.

Apache Tajo, yet another SQL-on-Hadoop engine, announces a new release, which should please both of its users.

In MIS Asia, Zafirah Salim notes that a study of LinkedIn profiles reveals that the number of people who call themselves “data scientists” has doubled in four years, thereby revealing that anyone can call themselves a “data scientist.”

Databricks’ Wayne Chan posts two customer case studies, one for Eyeview and the other for Inneractive.

Explainers

— On SlideShare, CapitalOne’s Slim Baltagi explains Flink.

— On the Data Artisans blog, Jamie Grier explains how he won Twitter Hack Week with Flink.

— At Java Code Geeks, Chase Hooley delivers an “essential guide” to Flink.

— In case you were sitting around thinking “you know, I really need to integrate NiFi and Flink”, Brian Bende explains how.

— On KDnuggets, Andy Grove (!) explains Spark RDD, DataFrame and Dataset APIs.

— In Datanami, Craig Lukasik explains deployment and runtime options for Spark, and how to get started.

— On the Databricks blog, Tathagata Das and Shixiong Zhu explain the new stateful stream processing features available in Spark 1.6.

— Also on the Databricks blog, Grega Kespret and Denny Lee explain how to do advertising analytics in Spark.

— On SlideShare, Erin LeDell explains H2O machine learning with Python. Michal Malohlava explains how to use H2O and Spark in Databricks Cloud, extending the watery metaphor.

— On the Google Cloud Big Data blog, Tyler Akidau and Frances Perry explain differences between Dataflow/Beam and Spark programming models.

— In Datamation, Ken Hess explains the differences between Spark and MapReduce, and gets it mostly right.

— On the getindata blog, Kamil Gorlo explains geospatial analytics on Hadoop.

Perspectives

— On SlideShare, Till Rohrmann of Data Artisans opines on the right way to do streaming.

— In InfoWorld, David Mytton argues that Google and Microsoft have open sourced software to attract business to their cloud platforms. It’s a reasonable argument. He further notes that the open sourced version of TensorFlow runs on a single machine only, and speculates that Google will soon offer an elastic distributed TensorFlow service.

— In Motley Fool, Leo Sun largely agrees with Mytton, arguing that Microsoft released CNTK to build business for Azure. It’s an interesting argument, except that there is nothing stopping a user from installing CNTK on AWS. It only makes sense if MSFT plans to offer CNTK as a service.

(1) IBM Announces New Cloud Services

IBM announces four new services in IBM Cloud. You can read three dozen stories about this IBM announcement here, but most of them talk nonsense about Spark, which is in no way related to the four announced services:

— IBM Compose Enterprise, previously Compose, Inc, acquired by IBM last July and rolled into IBM Cloud.

— IBM Predictive Analytics: SPSS Modeler Gold in the cloud. This is a nice service for Modeler buffs. Unlike SAS, IBM understands that cloud means elastic pricing.

— IBM Analytics Exchange, which IBM describes as “an open data exchange that includes a catalog of more than 150 publicly available datasets that can be used for analysis…” The datasets come from two original data sources: the UN and the CIA. Maybe Analytics Exchange will amount to something in the future, but for now it’s kind of light.

— IBM Graph, featuring Apache TinkerPop, the graph API nobody is asking about.

The back end for IBM Graph is open source Titan, with some modifications. (Titan is currently the third most popular graph database, according to DB-Engines.) Why not use Spark GraphX, given IBM’s commitment to Spark? IBM wants to position IBM Graph as a transactional datastore, with dynamic vertices and edges, and its engineers believe the Spark GraphX model to be better suited to static data.

IBM Graph does not leverage System G, IBM’s internally developed graph engine assets.

(2) Gartner Shakes Up BI Magic Quadrant

Gartner releases its 2016 Magic Quadrant for Business Intelligence and Analytics Platforms. If you’re a Gartner client you can get the report here, or you can get a free copy from Qlik here.

The report will be something of a shocker to some, but it recognizes how visualization and self-service BI are disrupting the industry. Vendors with successful self-service solutions do well in this report, while IT-centric vendors are punished.

Here’s the MQ, with arrows indicating movement from last year. Alteryx, Qlik and Microsoft are the biggest winners. Even Tableau gets reined in from its commanding position last year.

Biggest losers: IBM and Information Builders.

Gartner recognizes just three BI Leaders: Qlik, Tableau and Microsoft. IBM, Information Builders, MicroStrategy, Oracle, SAP and SAS are all kicked out of the Leaders quadrant:

— IBM, MicroStrategy, SAP and SAS fall from Leaders to Visionaries.

— SAS is now rated just below Alteryx, which must have folks in Cary wondering what they get for their investment in analyst relations.

— IBM is lumped together with ClearStory Data, Logi Analytics, Pentaho, TIBCO and BeyondCore. This is equivalent to John Barrymore doing B-movies after he hit the bottle.

— Information Builders went from Leader to Niche. Too IT-centric.

— Oracle falls out of the report completely, which makes sense. Personally, I’ve never met anyone who actually uses Oracle BI.

— Actuate, Panorama, Prognoz, Salient and Targit also fall out of the report.

On the AtScale blog, BI industry vet Bruno Aziza dissects the report.

(3) Ten Machine Learning Startups to Watch

Microsoft Ventures announces ten data science and machine learning startups for its third Seattle Accelerator cohort. Four of these focus on unique tooling:

Agolo — Offers “summarization as a service” based on machine learning and natural language processing.

Clarify — Provides a service to extract meaningful features from audio or video files.

DefinedCrowd — Leverages cheap student labor to provide crowdsourced analytics.

simMachines — Delivers recommendations based on machine learning.

The remaining six focus on an application or industry:

Affinio — Uses machine learning and graph analytics to deliver insight for audience optimization.

Knomos — Delivers visual representations of complex legal concepts.

MedWhat — Uses artificial intelligence to deliver personalized health guidance.

OneBridge Solutions — Offers data-driven decision support to the pipeline industry.

Percolata — Optimizes retail staffing with store-level forecasts.

Plexuss — Social network that links prospective students and colleges.

(4) Redis Labs: Spark is Faster with Redis

Redis Labs announces integration with Spark SQL and releases a Spark-Redis connector, now available on the Spark Packages website. Redis Labs also releases a white paper summarizing results of an experiment performed with time series data. The benchmarking team forked the spark-timeseries package to use Redis Sorted Sets as a data source, then ran the use case with four execution strategies:

1. Uncached: data in HDFS, indices in RAM.

2. Tachyon: data serialized and cached, indices in RAM.

3. In Process: data and indices serialized and cached in RAM.

4. Data in Redis.

The chart below summarizes results. Not surprisingly, the use case runs faster when the data is serialized and cached in memory, as in experiments (2) and (3). Running against Redis is fastest of all.

Two comments:

— Before generalizing about Spark performance with Redis, keep in mind that time series analysis is a distinct use case. I look forward to seeing benchmarks with other use cases.

— This is a great example of Spark’s ability to operate with diverse storage back ends.

According to Dave Ramel, Redis Labs’ goal is to make Redis the de-facto data store for any Spark deployment. Well, hold on there, cupcake. Let’s see some benchmarks on other use cases before we go that far.

(5) SAS Reports Flat 2015 Revenue

Predictive analytics software leader SAS reports 2015 revenue of $3.2 billion, up 6% in constant dollars or 2% in actual dollars, the kind you can spend. In contrast, IBM reports Business Analytics revenue up 7%, to $17.9 billion. Oracle, SAP and Microsoft do not report analytics revenue separately. Consultant IDC forecast 10% growth in actual dollars for 2015. Like most people, IDC doesn’t buy into the “constant currency” BS.

This is not a bad performance, when you consider that SAS subscription fees are front-end loaded; it’s like a perpetual license, except that you don’t own the software. Renewals are 25-30% of the first-year fee so, like the Red Queen’s Race in Through the Looking-Glass, SAS must run as fast as it can to stay in one place.

It’s interesting to note that for the first time this year SAS reports operating revenue. The implication is that non-operating revenue isn’t pretty.

Also interesting to note that SAS reports 8% nominal growth in first-year fees, which is a good thing. However, if first-year fees are up 8% and total fees up 2%, the implication is that renewal fees are down significantly, which means people are pulling the plug on SAS.
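To put rough numbers on that inference, here is a small sketch. It assumes total fee growth is a weighted average of first-year growth and renewal growth, weighted by each component's share of the prior year's fees. The function name and the first-year share values are hypothetical, chosen only for illustration; the actual mix is not given above.

```python
def implied_renewal_growth(total_growth, first_year_growth, first_year_share):
    """Solve total = s * first_year + (1 - s) * renewal for renewal growth.

    first_year_share (s) is first-year fees as a fraction of the prior
    year's total fees. It is an assumed input, not a reported figure.
    """
    if not 0 <= first_year_share < 1:
        raise ValueError("first_year_share must be in [0, 1)")
    renewal_share = 1.0 - first_year_share
    return (total_growth - first_year_share * first_year_growth) / renewal_share


if __name__ == "__main__":
    # 2% total growth and 8% first-year growth come from the text above;
    # the shares below are hypothetical scenarios.
    for share in (0.10, 0.25, 0.40):
        growth = implied_renewal_growth(0.02, 0.08, share)
        print(f"first-year share {share:.0%}: implied renewal growth {growth:+.1%}")
```

Under this simple model, the larger the first-year share of fees, the steeper the implied decline in renewals; the break-even point sits at a first-year share of 25% (2% divided by 8%).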

SAS reports 80,000 software instances, or “sites” licensed; previously, SAS reported 75,000 sites. (SAS no longer reports user headcount.) The company also reports that Visual Analytics, SAS’ “Tableau Killer” now has 14,000 licenses worldwide, versus 3,400 sites last year. This is an apples to oranges comparison; in SAS parlance, sites have multiple licenses. Tableau reports 35,000 customers, up from 23,000 last January. BI accounts for 25% of SAS’ revenue, and these numbers square with Gartner’s assessment, reported above, that SAS has fallen behind industry leaders Tableau, Qlik and Microsoft.

(6) Azure IoT Goes GA

Microsoft announces general availability of Azure IoT Hub, tech press goes bananas: stories here and here.

(7) MapR Patents Converged Data Platform

MapR announces that the U.S. Patent and Trademark Office has granted a patent for its Converged Data Platform, a bundle that includes Hadoop, Spark, Drill, and other open source components plus MapR’s proprietary file system, database and streaming engine. The patent claim covers the following:

Container-based architecture with optimized replication and fault tolerance.

Consistent cluster-wide transactional read-write-update semantics.

Node-wise recovery during transactional updates.

Scalable update techniques.

On CMSWire, Virginia Backaitis reports.

(8) New Apache Mesos Release

The Mesos team announces release 0.27.0. Key new bits:

Support for resource quota, which provides resource guarantees without tying reservations to specific agents.

Multiple-disk support for reliability and high performance.

Support for implicit roles.

Improved performance of large cluster state endpoints.

Plus, more than 167 bug fixes.

55 developers contributed.