2016-06-27

Pivotal HDB, the Apache Hadoop® native SQL database powered by Apache HAWQ (incubating), has successfully passed internal testing on both the ODPi Reference Implementation of Hadoop and one of the first ODPi Runtime Compliant distributions, Hortonworks HDP 2.4. Pivotal HDB testing was done with no modifications to standard installation steps, making it the first big data application successfully tested to be interoperable with multiple ODPi Runtime Compliant distributions.

This is a major milestone for both Pivotal HDB and ODPi. For those unfamiliar with it, ODPi is a nonprofit organization committed to simplifying and standardizing the big data ecosystem through a common reference specification. Software that works with one ODPi Runtime Compliant distribution should need little to no re-engineering to run on other ODPi Runtime Compliant distributions. For ISVs, this means they can address a larger portion of the Hadoop install-base with a single version of their software. For end users, ODPi will help increase the breadth of technology solutions available and provide assurance that applications are interoperable across a wider range of commercial Hadoop distributions.

In this blog, I’d like to explain why ODPi is so important to the future of the big data ecosystem and detail how we conducted the internal testing of ODPi Runtime Compliant distributions with Pivotal HDB. Hopefully this will inspire other software vendors to test interoperability of their applications with ODPi Runtime Compliant distributions.

Why ODPi Is Important To The Big Data Industry

Before ODPi and ODPi Runtime Compliant distributions were available, creating big data solutions for Hadoop was unnecessarily complicated. Each vendor’s Hadoop distribution includes its own idiosyncrasies and vendor-specific code for installation and administration.

For software vendors that build products that leverage Hadoop, supporting multiple distributions often requires creating multiple versions and obtaining multiple certifications of the same software. This means redundant testing, paying extra royalties for multiple versions of third party drivers, and committing developers—who otherwise would be creating new features—to maintaining multiple versions of the same software.

This is why industry-leading companies that use Hadoop have banded together to create ODPi. Members of ODPi collaboratively develop the ODPi specifications, which reduce the need for ISVs to create and maintain customized versions of their software in order to support multiple ODPi Runtime Compliant distributions.

Integration Needs To Be Simpler For Hadoop

Pivotal HDB is powered by the open source project Apache HAWQ (incubating). Both databases are designed to help data analysts and data scientists run fast ad hoc queries using familiar SQL to uncover insights in large or complex datasets.

In fact, these queries are so fast that in our internal testing HDB continues to show superior performance over other Hadoop native databases. It also shows the highest degree of SQL completeness: in our internal tests based on the TPC-DS benchmark, only HDB was able to complete all of the queries, and it did so significantly faster than the alternatives. This is due in part to its use of the ORCA query optimizer as well as HDB’s heritage in PostgreSQL and Pivotal Greenplum.

Pivotal HDB is a prime example of a Hadoop native technology. Pivotal defines Hadoop native as open source-based software that integrates with HDFS, YARN and other Hadoop-related projects in ways that improve the larger Hadoop ecosystem. Hadoop native technologies such as Pivotal HDB improve and extend Hadoop in natural and intuitive ways as an integrated part of a larger data infrastructure.

This is where it starts to get complex. For example, HDB delivers integration with YARN resource management, resilient and scalable storage directly in HDFS, and strong operationalization capabilities with complete deployment and management functionality through Apache Ambari. HDB also provides federated query integration with Apache Hive via HCatalog, as well as integration with Apache Zeppelin for web-based visual analytics. Imagine having to test all of this on multiple incompatible distributions, not just one.
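To make the integration story a little more concrete, here is a minimal sketch of what querying HDB can look like in practice. Because HDB inherits the PostgreSQL wire protocol, a standard PostgreSQL driver such as Python's psycopg2 can typically connect to the HDB master, and Hive tables registered in HCatalog can be queried through the hcatalog namespace. The host name, database, user, and table names below are hypothetical placeholders, not details from our test environment.

    # Minimal sketch: querying Pivotal HDB / Apache HAWQ from Python.
    # Assumes an HDB master reachable at hdb-master:5432 and a Hive table
    # named sales_2016 in Hive's "default" database -- both hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="hdb-master", port=5432,
                            dbname="gpadmin", user="gpadmin")

    with conn.cursor() as cur:
        # Ordinary ad hoc SQL against a native HDB table.
        cur.execute("SELECT count(*) FROM fact_orders;")
        print("orders:", cur.fetchone()[0])

        # Federated query: HAWQ's HCatalog integration exposes Hive tables
        # under the hcatalog namespace as hcatalog.<hive_db>.<table>.
        cur.execute("SELECT count(*) FROM hcatalog.default.sales_2016;")
        print("hive rows:", cur.fetchone()[0])

    conn.close()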

As a Hadoop-native software developer, Pivotal felt the pain of supporting multiple Hadoop distributions. This is why Pivotal became a founding member of ODPi: to reduce the costs of supporting multiple Hadoop distributions and focus investment on innovative new features for Pivotal HDB, all while still allowing as many Hadoop users as possible to access our technology.

How We Tested ODPi Interoperability of Pivotal HDB

Below is an overview of the testing procedure we ran to test Pivotal HDB with the ODPi Reference Implementation of Hadoop as well as one of the first ODPi Runtime Compliant distributions, Hortonworks HDP 2.4. We tested the same version of the Pivotal HDB software using the same configuration settings against both platforms. The installation was successful in both cases, and the tests produced identical results. A sketch of the smoke-test step follows the table.

Test: Install an ODPi-compliant Hadoop distribution onto the cluster.
Expected result: Hadoop cluster alive and functioning normally.

Test: Install Pivotal HDB.
Expected result: The installation process completes without errors.

Test: Smoke test with 1 GB of data.
Expected result: Load 1 GB of data, run simple short queries, and confirm that behavior operates as expected.

Test: 3 TB TPC-DS-derived benchmark test with the HDB default resource manager.
Expected result: Execute a long-running benchmark based on TPC-DS with a 3 TB data set and observe the expected behavior and performance.

Test: 3 TB TPC-DS-derived benchmark test with HDB configured to use YARN as the resource manager.
Expected result: Execute a long-running benchmark based on TPC-DS with a 3 TB data set, with YARN resource management configured, and observe the expected behavior and performance.
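As a rough illustration of what the smoke-test step involves, the sketch below creates a small table, loads synthetic rows, and verifies a few short queries. It is illustrative only; the table name, row counts, and connection details are assumptions, and the real test loads roughly 1 GB of data rather than the small batch generated here.

    # Illustrative smoke-test sketch against a freshly installed HDB cluster.
    import psycopg2

    conn = psycopg2.connect(host="hdb-master", port=5432,
                            dbname="gpadmin", user="gpadmin")
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS smoke_events;")
        cur.execute("CREATE TABLE smoke_events (id bigint, payload text) "
                    "DISTRIBUTED BY (id);")
        # Load a batch of synthetic rows (the real test uses ~1 GB of data).
        cur.execute("INSERT INTO smoke_events "
                    "SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) g;")
        # Simple short queries: results should match what was just loaded.
        cur.execute("SELECT count(*), min(id), max(id) FROM smoke_events;")
        rows, lo, hi = cur.fetchone()
        assert (rows, lo, hi) == (1000000, 1, 1000000)

    conn.close()
    print("smoke test passed")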

We designed a benchmark test based on the 99 standard TPC-DS queries and executed the benchmark using 3TB of data on 10 AWS nodes. Each node had 64GB of RAM and only one large disk. The NameNode, YARN Resource Manager, and the HDB Master were co-located on DataNodes instead of having dedicated nodes. Both resource managers (YARN and the HDB default) were tested successfully.
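For readers who want to run something similar, one simple way to drive such a benchmark is to loop over the query files and record wall-clock execution times, as in the sketch below. The queries/ directory layout, file naming, and output file are assumptions made for illustration; they are not the actual harness we used.

    # Sketch of a driver that times a set of TPC-DS-derived queries on HDB.
    # Assumes the 99 queries live in ./queries/query01.sql .. query99.sql
    # (a hypothetical layout) and that the 3TB data set is already loaded.
    import csv
    import glob
    import time

    import psycopg2

    conn = psycopg2.connect(host="hdb-master", port=5432,
                            dbname="benchmark", user="gpadmin")

    with open("timings.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["query", "seconds"])
        for path in sorted(glob.glob("queries/query*.sql")):
            with open(path) as f, conn.cursor() as cur:
                start = time.time()
                cur.execute(f.read())
                cur.fetchall()  # drain the result set
                writer.writerow([path, round(time.time() - start, 2)])
            conn.rollback()     # keep each query's transaction isolated

    conn.close()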

Pivotal HDB executed all 99 queries of the benchmark test without issue. The execution time was also consistent with the performance we’ve seen in previous runs of this benchmark.

In short, the successful testing of Pivotal HDB with ODPi Runtime Compliant distributions shows that, thanks to ODPi, the technical barriers to supporting multiple Hadoop distributions are significantly lower today.

Why Software Vendors Should Join ODPi

In Pivotal’s experience, supporting multiple Hadoop distributions with varying operational and management functionality is a costly endeavor. Pivotal maintains broad connectivity to multiple Hadoop distributions for its open source projects Greenplum Database and Spring Framework, and does so at the cost of a degree of engineering overhead.

To achieve interoperability with Hadoop for HAWQ’s early versions, Pivotal released its own Hadoop distribution to ensure the best operating environment for HAWQ.  Now that other ODPi Runtime Compliant Hadoop distributions are available, Pivotal can spend less time supporting different Hadoop distributions and can instead focus on developing HAWQ into the best Hadoop native SQL database on the market.

Similarly, we believe other software vendors will also benefit from supporting ODPi and testing their software with ODPi Runtime Compliant distributions. We offer our testing procedure as one example of how to demonstrate ODPi interoperability. We encourage software vendors to consider joining ODPi to show their customers and Hadoop-distribution vendors that they, too, seek to simplify the big data ecosystem for the benefit of all involved.
