Blog.cloudera.com

New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap

2014-09-22

With 1.4, Impala’s performance lead over the SQL-on-Hadoop ecosystem gets wider, especially under multi-user load.

As noted in our recent post about the Impala 2.x roadmap (“What’s Next for Impala: Focus on Advanced SQL Functionality”), Impala’s ecosystem momentum continues to accelerate, with nearly 1 million downloads since the GA of 1.0, deployment by most of Cloudera’s enterprise data hub customers, and adoption by MapR, Amazon, and Oracle as a shipping product. Furthermore, in the past few months, independent sources such as IBM Research have confirmed that “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce- or Tez-based runtime.”

Cloudera’s performance engineering team recently completed a new round of benchmark testing based on Impala 1.4 and the most recent stable releases of the major SQL engine options for the Apache Hadoop platform (see previous results for 1.1 here, and for 1.3 here). As you’ll see in the results below, Impala has extended its performance lead since our last post, especially under multi-user load:

For single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average.

For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average — or nearly three times faster on average for multi-user queries than for single-user ones.

Now, let’s review the details.

Configuration

As always, all tests were run on precisely the same 21-node cluster. This time around we ran on a smaller (64GB per node) memory footprint across all engines to correct a misperception that Impala only works well with lots of memory (Impala performed similarly well on larger memory clusters):

2 processors, 12 cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz

12 disk drives at 932GB each (one for the OS, the rest for HDFS)

64GB memory

Comparative Set

This time, we dropped Shark from the comparative set as it has since been retired in favor of the broad-based community initiative to speed up batch processing with Hive-on-Spark. We also added the current version of Spark SQL to the mix for those interested:

Impala 1.4.0

Hive-on-Tez: The final phase of the 18-month Stinger initiative (aka Hive 0.13)

Spark SQL 1.1: A new project that enables developers do inline calls to SQL for sampling, filtering, and aggregations as part of a Spark data processing application

Presto 0.74: Facebook’s query engine project

Queries

Just like last time, to ensure a realistic Hadoop workload with representative data-size-per-node, queries were run on a 15TB scale-factor dataset across 21 nodes.

We ran precisely the same open decision-support benchmark derived from TPC-DS described in our previous rounds of testing (with queries categorized into Interactive, Reporting, and Analytics buckets).

Due to the lack of a cost-based optimizer in all tested engines except Impala, we tested all engines with queries that had been converted to SQL-92 style joins (the same modifications used in previous rounds of testing). For consistency, we ran those same queries against Impala — although Impala produces identical results without these modifications.

We selected the most optimal file formats across all engines, consistently using Snappy compression to ensure apples-to-apples comparisons. Furthermore, each engine was assessed on a file format that ensured the best possible performance and a fair, consistent comparison: Impala on Apache Parquet (incubating), Hive-on-Tez on ORC, Presto on RCFile, and Spark SQL on Parquet.

The standard rigorous testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.

Results: Single User

Impala outperformed all alternatives on single-user workloads across all queries run. Impala’s performance advantaged ranged from 2.1x to 13.0x and on average was 6.7x faster. This result indicates Impala’s performance advantage is actually widening over time as the previous gap was 4.8x on average based on our previous round of testing. The widening of its performance advantage is most pronounced when compared to “Stinger”/Hive-on-Tez (from an average of 4.9x to 9x) and Presto (from an average of 5.3x to 7.5x):

Next, let’s turn to multi-user results. We view performance under concurrent load as the more meaningful metric for real-world comparison.

Results: Multiple Users

We re-ran the same Interactive queries as the previous post, running 10 users at the same time. Our experience was the same as the past several posts: Impala’s performance advantage widens under concurrent load. Impala’s average advantage widens from 6.7x to 18.7x when going from single user to concurrent user workloads. Advantages varied from 10.6x to 27.4x depending on the comparison. This is a bigger advantage than in our previous round of testing where our average advantage was 13x. Again, Impala is widening the gap:

Note that Impala’s speed under 10-user load was nearly half that under single-user load — whereas the average across the alternatives was just one-fifth that under single-user load. We attribute that advantage to Impala’s extremely efficient query engine, which demonstrated query throughput that is 8x to 22x better (and by an average of 14x) than alternatives:

Conclusion

Our vision for Impala is for it to become the most performant, compatible, and usable native analytic SQL engine for Hadoop. These results demonstrate Impala’s widening performance lead over alternatives, especially under multi-user workloads. We have a number of exciting performance and efficiency improvements planned so stay tuned for additional benchmark posts. (As usual, we encourage you to independently verify these results by running your own benchmarks based on the open toolkit.)

While Impala’s performance advantage has widened, the team has been hard at work adding to Impala’s functionality. Later this year, Impala 2.0 will ship with a great many feature additions including commonly requested ANSI SQL functionality (such as analytic window functions and subqueries), additional datatypes, and additional popular vendor-specific SQL extensions.

Impala occupies a unique position in the Hadoop open source ecosystem today by being the only SQL engine that offers:

Low-latency, feature rich SQL for BI users

Ability to handle highly-concurrent workloads

Efficient resource usage in a shared workload environment (via YARN)

Open formats for accessing any data from any native Hadoop engine

Low lock-in multi-vendor support, and

Broad ISV support

As always, we welcome your comments and feedback!

Justin Erickson is Director of Product Management at Cloudera.

Marcel Kornacker is Impala’s architect and the Impala tech lead at Cloudera.

Dileep Kumar is a Performance Engineer at Cloudera.

David Rorke is a Performance Engineer at Cloudera.