Vision.cloudera.com

Open Standards in Apache Hadoop: Apache HBase

2015-04-29

This blog was penned by Justin Kestelyn, Alex Gutow and Michael Crutcher.

Continuing our series about balancing stability and innovation through open standards, let’s take a look at Apache HBase, the open source, NoSQL standard for Apache Hadoop. Tightly integrated with HDFS, HBase opens data stored in Hadoop to users across the company requiring real-time data access. As an integrated part of Hadoop, HBase also works closely with other popular, open standard tools such as Apache Kafka, Apache Flume, Apache Spark, and Impala, to build complete workflows for real-time applications, all within a unified platform such as Cloudera’s enterprise data hub.

Not only is HBase an open standard – with an active community driving quality and innovation, including the popular HBaseCon, and wide vendor support – it is also deployed in production across a wide range of use cases, including:

Opower is a company that provides relevant insights into household energy use to help reduce consumption. For Opower, “HBase is the means by which we are able to deliver insights at scale, in real time, to any user,” says Eric Chang, Opower’s technical lead for data services.

The Financial Industry Regulatory Authority (FINRA) is an independent regulator, overseen by the Securities and Exchange Commission, that is obligated to monitor the markets it operates. Using HBase as part of Cloudera’s enterprise data hub allows FINRA to achieve very rapid data access and to reduce response times for certain complex queries by orders of magnitude.

As with any open standard, ecosystem compatibility is key and HBase integrations are no exception. Below is a look at how other players in the ecosystem view HBase and its future.

In your opinion, why has Apache HBase become a standard for real-time applications?

Cask:

Apache HBase provides real-time, random read and write access to tables while storing billions of rows and millions of columns of data. This supports data-intensive use cases such as Internet of Things applications, anomaly detection with massive data inputs, and in-depth analysis of customer behavior. Beyond capacity, HBase also makes applications more interactive in real time due to fast look-up latencies independent of data size. Finally, the Apache HBase community has made significant progress supporting specific properties required by real-time applications such as high availability, data consistency, and partition tolerance. Along with the technical merits of HBase, another major factor contributing toward HBase’s momentum in real-time applications is that HBase is an intrinsic element of the broader Hadoop platform. As Hadoop is used more for real-time use cases, HBase becomes more and more of a standard.

Celer Technologies:

Real-time applications are a must in the finance industry, with valuable data being generated every second. HBase’s ability to store and access high volumes of data has not gone unnoticed in the software industry. A lot of projects are becoming aware of this fact and realising the previously untapped value and opportunities that HBase unlocks.

Cloudera:

Simply, HBase has become an open standard for real-time applications because it provides the functionality that real-time application providers need. It provides strictly consistent, extremely low-latency reads and writes. It provides production-ready functionality with high availability, snapshots, and replication options. It has very mature integration with the rest of the Hadoop ecosystem and serves as the base on which a diverse set of partner products are built. It’s been battle-hardened and proven at scale, supporting many important applications that all of us interact with on a daily basis without even realizing it.

Pepperdata:

At Pepperdata, we use HBase to back the infrastructure that serves interactive customer requests, a usage pattern that we see in many of our customers, including Opower. It’s great at serving the kind of applications that are increasingly being written on Hadoop – fast, stable, and mature as it has really had the snot beaten out of it.

Splice Machine:

Apache HBase has become a standard for real-time, non-SQL applications because of fast reads and writes, high availability, strong consistency, and close integration with the Hadoop ecosystem. For SQL applications, Splice Machine extends HBase to provide ANSI SQL support, joins, secondary indexes and ACID transactions.

What is next for your organization and Apache HBase?

Cask:

HBase is a foundational technology ingredient in our flagship offering, the Cask Data Application Platform (CDAP). We provide a data abstraction layer making it easier for developers to take advantage of capabilities in HBase and other data stores such as HDFS. We are leveraging the features in HBase to support continued evolution of our platform. For example, in an upcoming release of CDAP, we will deliver namespace capability to provide multi-tenancy and enhanced partitioning of CDAP datasets to improve performance on data queries. Both utilize features in HBase. Beyond utilizing HBase, we’re also contributors to the community, making it a better database. Currently, we’re working on support of un-deletes, supporting a transaction engine for HBase, and additional features we expect to be committed in the core project.

Celer Technologies:

HBase has become the standard for Celer; it is the basis for all the systems that we have built. Looking forward, HBase will further entrench itself into our products, as we continually find new ways to provide our clients with exciting features that only HBase has allowed.

Cloudera:

Cloudera will continue to invest in HBase as a core and widely deployed component of our open source and Enterprise offering. HBase’s maturity as an enterprise product has advanced rapidly in recent times, with improvements to key attributes like mean time to recovery, backup and replication options, and high availability, leading to demonstrably better user satisfaction. We’ll continue to invest heavily in testing and appropriate targeted features to ensure that HBase retains and improves its position as a solid basis for users to build mission-critical applications on. We’re also excited about contributing to new tunable consistency options emerging in HBase that give tools to users wanting to further emphasize HBase’s already impressive low-latency performance. Additionally, we think HBase will continue to expand the diversity of applications that it’s appropriate for, and we look forward to helping refine new features such as the ability to store larger objects in HBase.

Pepperdata:

We’re going to keep throwing more data at our HBase installation as we grow – we’re adding several terabytes of metrics a month – and everything we see indicates the same is true for our customers.

Splice Machine:

HBase will remain the scale-out architecture for Splice Machine. We will continue to work with the HBase community to improve its performance and reliability.

As mentioned above, the development around HBase doesn’t seem to be slowing down anytime soon, especially with the recent HBase 1.0 release (included with CDH 5.4) and HBaseCon 2015 (May 7th in San Francisco) reflecting the most diverse group of operators we’ve seen yet. In addition, here’s a look at some other recent innovations:

Apache HBase 1.0 is Released

New in Cloudera Labs: Spark on HBase

New in CDH 5.2: Improvements for Running Multiple Workloads on a Single HBase Cluster

To learn more about Apache HBase, Cloudera offers the leading HBase Training to help you get started.

For more details on Cloudera’s view of open source and open standards, check out “Cloudera’s Commitment to Open Source and Open Standards,” and be on the lookout for the next blog examining Impala as an open standard.