2015-08-31

A Tale of Two Platforms

Elasticsearch is a great tool for document indexing and powerful full-text search. Its JSON-based domain-specific query language (DSL) is simple and powerful, making it the de facto standard for search integration in any web app. But is it the best tool to manage your entire analytics pipeline? Is it really a Hadoop killer?

Let’s start by remembering the context in which an analytics system is typically built. It usually starts when the project has outgrown a simple analytics tool like Mixpanel or Google Analytics, and product management’s questions are getting more difficult to answer. They’re starting to ask for things that can only be answered if you have complete control to slice and dice your raw data. So you decide it’s time to start collecting log data and build a full analytics pipeline. After a bit of research, you find that while a lot of legacy systems are built from the ground up around Hadoop and the core concepts of big data management, more and more developers are starting to consider Elasticsearch for this application as well. What’s going on here? Is a search engine really the best tool for analytics? Or are we just trying to make a shoe fit because it’s comfortable?

Elasticsearch For Analytics

The open source search engine Elasticsearch has become increasingly popular over the last few years as an unexpected player in the web analytics space. Together with Logstash, its open source tool for server-side log tailing, and Kibana, its popular open source visualization tool, Elastic’s ELK analytics stack is gaining momentum for three reasons:

It is very easy to get a toy instance of Elasticsearch running with a small sample dataset.

Elasticsearch’s JSON-based query language is much easier to master than more complex systems like Hadoop’s MapReduce.

Application developers who already use Elasticsearch for search are more comfortable maintaining a second integration with it than adopting a completely new technology stack like Hadoop.

These reasons are all compelling to nascent analytics teams looking to get something up and running fast. But how does a search engine perform in comparison to a highly scalable database platform when it comes to data ingestion and complex data analysis?

Streaming Ingestion Problems

Not well on the ingestion side, it turns out. As more and more people have implemented production-scale analytics platforms on Elasticsearch over the past few years, a well-documented problem of packet-loss-inducing split-brain has emerged. It seems that as clusters scale up in production, they can start spanning multiple racks in a data center, and they experience data loss when a minor network outage breaks the connection between two or more master-eligible nodes [1][2][3][4].


Figure: Various Network Failure Modes Between Elasticsearch Nodes

Network reliability in data centers is extremely difficult to track, but industry feedback suggests that these types of failures can be up to a daily occurrence on AWS [5]. Even though Elastic’s engineers have been working hard to address this problem, bringing the total amount of data lost during a network failure down from around 90% to comparatively negligible amounts, tests as recent as April 2015 still found that Elasticsearch instances drop data in all of the network failure modes evaluated [6][7].
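The standard mitigation at the time of writing is to require a quorum of master-eligible nodes before any master can be elected, though the test results above suggest this reduces rather than eliminates the problem. A minimal elasticsearch.yml sketch, assuming a hypothetical cluster with three master-eligible nodes:

```
# elasticsearch.yml (Elasticsearch 1.x)
# Quorum formula: (master-eligible nodes / 2) + 1 = (3 / 2) + 1 = 2.
# A master can only be elected when at least 2 master-eligible nodes agree,
# which keeps two isolated halves of the cluster from each electing one.
discovery.zen.minimum_master_nodes: 2
```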

It’s acceptable for a search tool to occasionally miss data from routinely repeatable tasks like web crawling. Streaming analytics data, on the other hand, is non-reproducible. This means if you care about maintaining a complete analytics dataset, you should store your data in a durable datastore such as Hadoop, MongoDB, or Amazon Redshift, and periodically replicate it into your Elasticsearch instance for analysis. Elasticsearch on its own is not suitable as the sole system of record for your analytics pipeline.

This new persistence layer adds a significant level of complexity to what seemed like an easy solution. The Logstash collector doesn’t support output to any mainstream databases other than MongoDB [8], so developers may need to substitute a more flexible collection tool like the open source project Fluentd. Luckily, Fluentd is much easier to configure than Logstash, and supports output to almost 500 destinations, including Elasticsearch [9].
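For illustration, here is a minimal Fluentd configuration sketch that fans each event out to both a durable file archive and Elasticsearch via the copy plugin; the log path, tag, and host below are hypothetical placeholders:

```
# fluentd.conf (sketch; paths, tags, and hosts are hypothetical)
<source>
  type tail
  path /var/log/app.log
  tag app.events
  format json
</source>

<match app.**>
  type copy
  # durable archive for the historical record
  <store>
    type file
    path /var/archive/app
  </store>
  # searchable copy for analysis
  <store>
    type elasticsearch
    host localhost
    port 9200
    logstash_format true
  </store>
</match>
```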


Figure: Lambda Architecture With Fluentd

Using Fluentd, developers can quickly configure a lambda architecture that sends their analytics data both to a reliable database for historical archiving and to Elasticsearch for analysis. Of course, even this architecture suffers the same split-brain data loss on Elasticsearch’s ingestion path, so developers looking for complete integrity in their analytics reports would need to store their data in a data lake and use a connector to periodically push an updated dataset into Elasticsearch.


Figure: Lossless Data Pipeline with Elasticsearch For Analytics

Production Resource Management

Configuring an Elasticsearch instance for stability in production is also much more difficult than it seems. There’s a lot of trial and error involved, and a lot of settings need to be tweaked as you scale up in data volume [10].

For example, the number of shards per index must be set when the index is first created, and can never be changed without creating a new index. Setting too many shards for a small dataset creates unnecessary fragmentation that degrades search performance, while choosing too few shards for a large dataset can cause your cluster to hit the shards’ maximum size limit as it grows.
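A minimal sketch of that decision point, using a hypothetical events index against the Elasticsearch 1.x API; the counts are illustrative, not recommendations:

```
# Shard count is fixed at index creation time.
curl -XPUT 'http://localhost:9200/events' -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'
# Changing number_of_shards later means creating a brand-new
# index and reindexing every document into it.
```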

To combat this problem, Shay Banon, the founder of Elasticsearch, recommends creating time-bracketed indexes for streaming data, to prevent any one index from growing endlessly [11]. This works for fast analysis of your data over a period of days or weeks, but introduces more complexity into your queries when you want to look back over a year’s worth of data spanning 26 indexes or more. It also creates index management headaches as your historical dataset grows and must be archived yet still remain available for querying.
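A hedged sketch of what that fan-out looks like in practice, assuming hypothetical time-bracketed indexes named events-2015.01, events-2015.02, and so on:

```
# One logical question, many physical indexes: wildcard the index name.
curl -XGET 'http://localhost:9200/events-2015.*/_search' -d '{
  "query": { "match": { "status": "404" } },
  "aggs": {
    "per_month": {
      "date_histogram": { "field": "timestamp", "interval": "month" }
    }
  }
}'
# Every matching index must still be open (not archived) for this to work.
```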

Schema-Free ≠ Pain Free Uploads

You may have been led to believe, by Hadoop or other NoSQL technologies, that schemaless means hassle-free upload of data in any key/value format. This is not the case with Elasticsearch. While you can just throw anything into it, Elastic strongly recommends you transform any data with variable key names into more generic key/value pairs [13]. For example:

Figure: Elastic’s Suggested JSON Transformation
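The original figure isn’t reproduced here; a hypothetical before-and-after in the same spirit, where a per-user key name becomes a fixed set of generic fields:

```
// Before: the key itself varies per user, so Lucene treats
// every "user_1234_click" as a brand-new field.
{ "user_1234_click": "/pricing" }

// After: key names are fixed; only the values vary.
{ "user_id": "1234", "event": "click", "page": "/pricing" }
```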

It turns out that without this transformation, Lucene creates a new field for each distinct key name, causing the size of your Elasticsearch instance to explode over time [14][15]. This transformation is a killer when iterating over millions of rows of historical analytics data. It also forces you to keep updating your Grok patterns in Logstash every time your system starts tracking a new event.

Time Consuming Bulk Uploads

Another painful issue when working with large datasets in Elasticsearch is its handling of bulk uploads. The default request size limit is 100 MB, which works well for uploading a small sample dataset and playing around in your terminal. But if you exceed this limit during an upload, Elasticsearch throws a silent OutOfMemory error and stops the upload. The data that was indexed before the memory error is still available for querying, though, which means it can take you a long time to figure out that something went wrong [16]. Not to mention that uploads can take hours, only to fail and have to be retried.
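One common workaround is to slice the upload into many smaller _bulk requests and check each response for errors. A minimal Python sketch, assuming a local Elasticsearch 1.x node and a hypothetical events index:

```
import json

import requests  # third-party HTTP client; assumes a local ES node on 9200

BULK_URL = "http://localhost:9200/_bulk"
DOCS_PER_REQUEST = 5000  # tune so each request stays well under the 100 MB limit

def flush(lines):
    # The _bulk body is newline-delimited JSON and must end with a newline.
    body = "\n".join(lines) + "\n"
    resp = requests.post(BULK_URL, data=body)
    resp.raise_for_status()
    if resp.json().get("errors"):
        raise RuntimeError("some documents failed to index")

def bulk_upload(docs, index="events", doc_type="event"):
    buf = []
    for doc in docs:
        buf.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        buf.append(json.dumps(doc))
        if len(buf) >= DOCS_PER_REQUEST * 2:  # two lines per document
            flush(buf)
            buf = []
    if buf:
        flush(buf)
```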

Lack of Powerful Analytics Functions

Elasticsearch’s aggregation and full-text search functions are great for answering basic web analytics questions like counts of 404 errors, pageviews, and simple demographic information. But it lacks the full power of the window functions that come standard in SQL. These functions allow you to answer bigger questions, such as top viewed pages broken out by country, moving averages on key metrics, or pre-trigger event traces, with a single query. Elasticsearch doesn’t support outputting query results into intermediate datasets for additional processing or analysis, nor does it support transformation of datasets (i.e., a 1-billion-row table on its way to becoming another 1-billion-row table). Instead, your analysis is more or less limited to what a search tool does best: aggregating data into smaller sets according to filtering parameters [17].
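To make the gap concrete, here is a sketch of one such question in standard SQL, assuming a hypothetical pageviews table with country and page columns:

```
-- Top 3 most-viewed pages per country, in a single query.
SELECT country, page, views
FROM (
  SELECT country, page, COUNT(*) AS views,
         ROW_NUMBER() OVER (
           PARTITION BY country
           ORDER BY COUNT(*) DESC
         ) AS rank
  FROM pageviews
  GROUP BY country, page
) ranked
WHERE rank <= 3;
```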

Also missing are complex manipulation features like JOINs. Elasticsearch compensates for this by letting you set up-front alias fields on documents, for example, setting a user_name field on each interaction event so a join with a user table isn’t required. It also supports nesting of documents, for example, nesting click events under a user_persona document. This requires even more data pre-processing in your ETL pipeline, and forces you to specify how you’d like to interact with your data at the ingestion stage. Elasticsearch on its own does not support the full flexibility of historical analysis common in other types of datastores [18].
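A hedged sketch of the denormalization this implies, with hypothetical field names; the join is baked in at write time rather than expressed at query time:

```
// Instead of joining events to a users table at query time,
// each event is written with the user fields already embedded.
{
  "event": "click",
  "page": "/pricing",
  "user_name": "jdoe",          // denormalized from the user table
  "user_persona": "power_user"  // must be decided at ingestion
}
```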

What About Hadoop?

How does all this stack up against Hadoop, the distributed data processing system we all know and love [19]? For starters, HDFS separates data from state in its node architecture, using a single master node that manages cluster state and several worker nodes that store the data [20]. The data nodes execute commands from their master node, and every operation is logged to a static file on disk. This allows a standby master to quickly recreate the state of the system from that log during failover, without needing to talk to another live master. This design is extremely fault tolerant, and prevents the split-brain scenario that causes data loss among masters that must communicate with each other to restore state.

Figure: Architecture Comparison, Elasticsearch v. Hadoop

Hadoop also has a broad ecosystem of tools that support bulk uploading of data, and SQL engines that provide the full querying power you expect from any standard database. On the other hand, for reasons beyond the scope of this post, it can be argued that Hadoop is significantly more complex to deploy and manage in production. It requires specialized knowledge and expertise to get even a small instance up and running, and while cloud options do exist for deploying on AWS, many people still find it more straightforward to run Hadoop directly on their own on-premise hardware. Thus, the raw power and stability of Hadoop come at the price of heavy setup and maintenance costs.

Hadoop’s powerful MapReduce framework is robust enough to handle any data aggregation or transformation job, with innovative companies coming up with new applications for the system every time an engineer leaves Facebook. But mastering the intricacies of MapReduce is a lot of overhead for the simple operations needed in most web analytics tasks. This means an analytics system built on Hadoop will also need to deploy a query engine layer, driven either by HiveQL or by Facebook’s real-time Presto engine, so analysts can interact with the dataset using familiar SQL instead of MapReduce. These engines are incredibly powerful, but they add yet another layer of analytics infrastructure to be maintained.
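For a sense of what that layer buys you, a hedged HiveQL sketch over a hypothetical access_logs table; Hive compiles this to MapReduce jobs behind the scenes, so the analyst never writes one:

```
-- Daily pageview counts straight from raw logs, in familiar SQL.
SELECT to_date(event_time) AS day, COUNT(*) AS pageviews
FROM access_logs
WHERE event = 'pageview'
GROUP BY to_date(event_time)
ORDER BY day;
```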

Conclusion: Hadoop Still Wins

Even with these caveats, Hadoop remains the reigning champion of the analytics world. While Elasticsearch is a great tool for simple web analytics, its unforgivable sin of losing streaming data during ingestion and its arduous ETL process make it unsuitable as the foundation of an analytics pipeline. It remains a great tool for interactive data analysis and plug-and-play visualization, but its production scalability issues mean this technology isn’t ready for prime time as the backbone of your pipeline.

Implementing a Hadoop instance as the backbone of an analytics system has a steep learning curve, but it’s well worth the effort. In the end, you’ll be much better off for its rock-solid data ingestion and broad compatibility with a number of third-party analytics tools, including Elasticsearch. If you’d like to shortcut the process and avoid having to build all of this on your own, check out Treasure Data. We’re a cloud-based solution that you can integrate into your web or mobile app with just a few lines of code and start capturing data instantly. We let you store and query the raw data, and help you transform it for output to many third-party tools. We just announced an Elasticsearch connector that launches in a few weeks.
