2014-10-01

Real-time analysis in Hadoop moved a step closer to enterprise reality on Monday after the Apache Software Foundation promoted Storm to a top-level project. The status upgrade charts a future for the continued development of the platform under the same community umbrella that governs the batch processing framework and its sprawling component ecosystem.

Like most of the other technologies in the extended Hadoop family, Apache Storm started as an internal project at a web company that encountered a challenge no existing solution could adequately address. In this case, the firm was BackType Inc., a little-known social media marketing startup that briefly appeared on the industry radar after becoming part of Twitter Inc. in August 2011.

The company had already open-sourced much of its homegrown analytics technology by the time of the acquisition, namely ElephantDB, a key-value store for exporting information from Hadoop, and Cascalog, a library designed to simplify work in the data processing framework. A week after announcing the purchase, Twitter revealed that it would continue BackType’s tradition of giving back to the community and release the source code for the last remaining ace up the startup’s sleeve under a free license. The rest is history.

Storm became an Apache Incubator project last September and garnered the support of several industry heavy hitters over the course of its 12-month incubation, including Cisco Systems Inc., engineering powerhouse PARC and Yahoo Inc., the birthplace of Hadoop. Along the way, the real-time event processor has been enhanced with features such as support for the YARN resource management and scheduling tool, which lets different types of workloads run in the same deployment and so makes fast-paced stream analysis more practical from a cost standpoint.

Hadoop distributor Hortonworks Inc. – itself a supporter of Storm – only recently started working on integrating YARN with Spark, the other real-time data crunching engine that has been making waves in the ecosystem lately. Yet while there are significant technological overlaps between the projects, they’re built for two fundamentally different tasks.

The product of UC Berkeley’s AMPLab, Spark is designed to reduce processing times for the traditional batch workloads Hadoop was built to handle. Storm, in contrast, extends the framework to analyze fast-moving data streams such as tweets and sensor readings. The technology therefore fills a much bigger functionality gap in the upstream component ecosystem, which makes its graduation especially significant for the future development of Hadoop. Expect more and more vendors to start integrating their products with the project and contributing code now that it has reached stable ground.

photo credit: Striking Photography by Bo Insogna via photopin cc
