Thomaswdinsmore.com

Big Analytics Roundup (September 6, 2016)

2016-09-06

Jim Kyung-Soo Liew and Tamas Budavari of Johns Hopkins ask whether Tweet sentiments still predict the stock market. Short Version: they do, but the market has arbitraged away any advantage from trading on the information. So there you have it: the stock market is efficient with respect to fundamental information, technical information, and Tweets.

Enterra’s Stephen DeAngelis celebrates the “Algorithmic Economy,” which makes as much sense as a restaurateur who celebrates the “Recipe Economy.”

Jules Damji digests Spark news from the past two weeks.

Jaspreet Sandu writes a concise history of neural networks. It’s a pretty good history; however, while she mentions the exclusive-or problem, she fails to mention what solved that problem.

Google successfully plants a story in Newsweek. You’ll have to click through to get the joke.

Jody Avirgan delivers the news that internet tracking has moved beyond cookies. Actually, it did that in 2010.

Benchmarks

Nicole Hemsoth explains the results of a benchmark study performed with deep learning tools by a team in Hong Kong. The team tested six problems on Caffe, CNTK, TensorFlow and Torch, running on multi-core desktop machines, multi-core servers and three types of GPU.

Caffe delivered the best performance on Fully Connected Network (FCN) problems. TensorFlow performed best on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

In most cases, GPUs delivered 10-20X better performance than any CPU-based system. For all problems and frameworks, the Nvidia GTX 1080 outperformed the GTX 980 and K80 GPUs.

Competitions

Kaggle launches the Melbourne University AES/Mathworks/NIH Seizure Prediction challenge.

Explainers

— Sital Kedia et al. explain how they used Spark to replace Hive in an analytic app at Facebook.

— Chris Merrick explains why Spark Streaming didn’t work for Stitch. Short Version: it wasn’t an analytics use case. As Kafka moves upstream into the analytics space, we may hear more stories like this.

— On the Google Research Blog, Alex Alemi explains how to improve Inception and image classification in TensorFlow.

— On YouTube, Oracle’s Brad Carlile explains performance in Spark 2.0.

— Confluent’s Neha Narkhede and dataArtisans’ Stephan Ewen explain the differences between Apache Flink and Apache Kafka Streams.

— On Linux.com, Ian Murphy summarizes a presentation by Seshu Adunuthula, who explains how eBay uses Apache software for its analytics infrastructure.

— Jack Vaughan explains the new bits in Vertica 8.0, the 27th most popular database in DB-Engines’ ranking. Jack also notes that Actian has pulled the plug on the Actian Analytics Platform.

— Kathleen Casey explains Microsoft Azure’s managed services for Big Data.

— Frank Evans tries to explain the Apache Spark ecosystem but delivers a jumbled mess. Pro Tip: if you want to write about the Spark ecosystem, Spark Packages is a good start.

Perspectives

— Alex Woodie argues that fast networks eliminate the need for Hadoop-style co-located computing and storage. The argument would be compelling except that storage giant EMC reports declining product sales while Hadoop distributors grow by double digits.

— Michael Wolfe argues, persuasively, that Marketing-Mix Modeling as we know it is dead.

— Doug Henschen touts HPE’s Haven OnDemand library, a collection of old toss that HPE cannibalized from the wreck of Autonomy. It’s all window dressing for HPE’s pending fire sale of software assets.

— George Leopold opines on the impact of SAP’s rumored acquisition of Altiscale.

— Ken Gaebler complains that Big Data is too slow and touts Alluxio’s in-memory file system. He thinks it can cure cancer.

— Sam Dean touts PaddlePaddle, a deep learning framework from Baidu.

Open Source News

— Facebook announces MyRocks, an open source project that integrates RocksDB as a storage engine for MySQL. Serdar Yegulalp reports.

— Apache Drill releases Drill 1.8.0, with a few minor enhancements.

Commercial Announcements

— Databricks announces Notebook Workflows, a set of APIs enabling users to combine notebooks into a production pipeline.

— Splice Machine names a new CMO.

— Huawei and Alluxio announce a high-performance storage solution that integrates Huawei’s FusionStorage and Alluxio’s in-memory file system.