2014-05-21

Learn how HiveServer2, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.

In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:

HiveServer2 in Apache Hive, which provides multi-user support

Cloudera Impala, which provides interactive SQL capabilities on data in HDFS

Apache Sentry (incubating), which provides a standard mechanism for finer-grained security and access control over that data

Getting these pieces to work together seamlessly in a Kerberos-secured environment is still tricky. The most common approach is to connect to the cluster using Kerberos authentication, which brings some challenges that make working with Hadoop different from other JDBC sources:

Client-side challenges

JDBC connections require a Kerberos session initiation. This introduces a host of complexities, such as the administrative overheads of setting up two-way cross-realm trust between LDAP and the Kerberos Key Distribution Center, or configuring your LDAP server to authenticate against the KDC.

JDBC connections to Apache Hive and Impala require referencing the principal of the system user running the HiveServer2 or Impala daemon in addition to the user credentials, and that system’s hostname must be resolvable by the client’s DNS. This creates a usability problem for the end user and is a non-starter if the client is on the other side of a DMZ.
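For example, a Kerberized HiveServer2 JDBC URL has to embed the server’s own service principal; the host name and realm here are placeholders:

    jdbc:hive2://hs2host.example.com:10000/default;principal=hive/hs2host.example.com@EXAMPLE.COM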

Server-side challenges

You need all the cluster users in the KDC, or

You need one-way cross-realm trust set up between the cluster KDC and Active Directory or LDAP controller, and

You need user accounts available on every node in the cluster.

Furthermore, people often have questions about how to get JDBC working with Cloudera’s platform: which JDBC driver to use, what files and folders are needed on the client classpath (and where to get them), and the best way to test and debug a JDBC connection.

In this how-to, I’ll address these challenges and answer those questions, as well. (More information about tools that use ODBC is coming soon.)

LDAP Authentication from HiveServer2 or Impalad

Although it’s documented, the LDAP support in HiveServer2 and Impala deserves special emphasis: When HiveServer2 and Impala are configured with LDAP support, they allow users to connect to Kerberized clusters, with Sentry enabled and with no special client-side configuration.

Specifically, this setup is ideal because:

HiveServer2 handles authentication against LDAP and runs without impersonation enabled, so it requires no additional user management on the cluster aside from a Sentry policy file. That makes setting up a test/certification environment easy.

It requires no initialization of a Kerberos session or a kinit on the client side, reducing errors and complexity.

It requires no changes of the JDBC string on the client side, reducing errors and complexity.

If all you’re doing is reading and writing to Hive or Impala, this setup requires no additional users or shell access on the cluster.

Kerberos setup is easy because you don’t have to worry about how to set up cross-realm trust between the cluster’s KDC and the LDAP server.

Next, let’s walk through an end-to-end example of setting up such an environment.

Getting it Working

You will be setting up the following:

CDH 5 cluster

Linux client system

Windows client system with JDK 7 and Eclipse installed, plus Active Directory

The CDH Windows client package (see link to follow)

Setting up the CDH 5 Cluster with JDBC

First, you need a cluster. (Any cluster will do, but for test/cert clusters, I like to use the AWS m2.4xlarge instance because it’s a decent approximation of a Hadoop node.) Before running the Cloudera Manager installer, run this script to prepare the nodes. The script isn’t required, but it saves time and ensures that Cloudera Manager comes up reporting healthy by moving the necessary directories to instance storage, giving them enough space for health checks. From Cloudera Manager, install Hive, Impala, and Hue.

The Hue sample setup loads the sample tables you’ll be using.

Cross-platform JDBC in an Unsecured Environment

Next, spin up two additional machines for client purposes. One is a CentOS system running the Cloudera Manager agent with a gateway role assigned to it. This is a baseline client system.

The other is running Windows 2012 Server and is unknown to Cloudera Manager. This will be your Active Directory server as well as your Windows JDBC client machine, completing the test environment.

For JDBC client testing, you’ll use two tools: Beeline and an arbitrary JDBC client. Beeline alone isn’t sufficient because it’s installed by Cloudera Manager and configured with the correct classpath and configuration, which doesn’t help debug JDBC connectivity from systems not managed by Cloudera Manager. The client you’re using here is similar to the JDBC client hosted at Apache, with the following tweaks:

It queries the sample tables initialized by Hue.

It doesn’t create or load a table, making iteration easier.

It accepts JDBC URL, username, and password information on the command line – which lets you iterate without rebuilding the code.
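For instance, once compiled, the client might be invoked like this; the class name JdbcTest, the directory names, and the credentials are all placeholders (on Windows, use ; rather than : as the classpath separator):

    java -cp "cdh-client-package/*:client-configs:." JdbcTest \
        "jdbc:hive2://hs2host.example.com:10000/default" user1 password1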

The test cycle is to run queries over JDBC from Beeline first. This will help rule out client and server misconfigurations as well as operator error with connection strings. When the Beeline instance on a gateway node managed by Cloudera Manager can successfully query over JDBC, you can attempt to run the same query from the arbitrary client on the Windows machine.

Getting JDBC working on Windows requires a properly set-up classpath, including select Hive/JDBC client JARs as well as cluster configuration files. Cloudera doesn’t package CDH for Windows, so it’s hard for customers and partners to know which JARs and files to include on the classpath. To that end, I put together this CDH JDBC client package, which contains everything you need from CDH to make a JDBC connection. Included in the client package is the winutils.exe file, built using a free version of Microsoft Visual Studio from my Windows 2012 server, following these directions but on the source tarball of CDH. Furthermore, it’s important to generate the client configurations from Cloudera Manager and include them in your classpath. The simplest approach is to unzip the client configs into the same directory as the client package.

Setting Up Active Directory

Now is a good time to set up Active Directory on the Windows server. Add two users: user1@cloudera.com and user2@cloudera.com.

Testing JDBC

After the sample tables have been set up in Hue, SSH to the Linux gateway and run Beeline.

For Hive testing:
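A representative invocation, assuming HiveServer2 runs on a host called hs2host.example.com on its default port of 10000:

    beeline -u "jdbc:hive2://hs2host.example.com:10000/default" -n admin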

 

And start running a test:
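For example, against the sample tables that Hue installs (assuming the samples created the usual sample_07 table):

    SELECT code, description, salary FROM sample_07 LIMIT 10;
    SELECT count(*) FROM sample_07;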

 

Also test Impala:
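Impala accepts the same JDBC protocol on port 21050; on an unsecured cluster the URL also needs auth=noSasl (the host name is a placeholder):

    beeline -u "jdbc:hive2://impalad-host.example.com:21050/;auth=noSasl"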

 

And run the same query (it’ll be a lot faster… because Impala):
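For example, repeating the count from the Hive test:

    SELECT count(*) FROM sample_07;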

 

When Hive and Impala both return success, you’ve established a baseline with JDBC and Beeline.

JDBC from Windows

From the Windows server, install JDK 7 and Eclipse. Place the Windows client package on the machine somewhere, and add its contents to the Eclipse project build path along with the unpacked client configurations from Cloudera Manager. Set HADOOP_HOME to the location of the client package. Create Eclipse “run configurations” for connecting to Hive and to Impala.
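The two run configurations differ only in their program arguments. A sketch of what each might pass to the test client (hosts and credentials are placeholders):

    Hive:   jdbc:hive2://hs2host.example.com:10000/default admin password
    Impala: jdbc:hive2://impalad-host.example.com:21050/;auth=noSasl admin password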

 

When both run configurations return success, you’ve established a baseline with a generic JDBC client on a non-Cloudera managed Windows machine.

Now you can start securing the cluster.

Enabling Kerberos Security

First, SSH to the Cloudera Manager node and begin to enable Kerberos security. For test/dev purposes, Eric Sammer’s krb-bootstrap script is a great place to start. It installs a KDC for you, generates the principals from Cloudera Manager, and puts them in the right place. It also gives you a krb5.conf file to distribute to the rest of the cluster.

Go ahead and install krb5-workstation on every node in the cluster, and then distribute the krb5.conf file clusterwide. This sets you up to follow the documentation, starting with Step 6.
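A minimal sketch of that distribution step, assuming a cluster-hosts.txt file listing the node hostnames and passwordless root SSH:

    for host in $(cat cluster-hosts.txt); do
      ssh root@$host 'yum -y install krb5-workstation'
      scp /etc/krb5.conf root@$host:/etc/krb5.conf
    done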

At this point you should have a cluster with Kerberos security enabled. The JDBC client setup you previously tested is now broken. To connect to Hive and Impala, you must:

1. kinit as the principal for the user that you’ll be using

2. Modify the JDBC connection string to refer to Kerberos security

3. Ensure the user you’re authenticating exists on every node in the cluster

(There’s still no table-level security as provided by Sentry.)

Your Linux client with Beeline looks like this after a kinit:
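Something along these lines, with host name and realm as placeholders:

    kinit user1
    beeline -u "jdbc:hive2://hs2host.example.com:10000/default;principal=hive/hs2host.example.com@EXAMPLE.COM"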

 

With Impala it looks largely the same:
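Only the port and the service principal change (placeholders again):

    beeline -u "jdbc:hive2://impalad-host.example.com:21050/;principal=impala/impalad-host.example.com@EXAMPLE.COM"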

 

At this point, you could get Kerberos client-side authentication working on Windows, but the goal is to make Impala and HiveServer2 authenticate against the AD server running on Windows. You don’t want to have to change any client configuration; rather, you want to authenticate user1 and user2 from LDAP, and have HiveServer2 and/or Impala apply the correct rules in Sentry. This is accomplished by adding the hive.server2.authentication and hive.server2.authentication.ldap.url entries to the HiveServer2 role in Cloudera Manager:
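As a sketch, the two entries might look like this (the LDAP URL is a placeholder for your AD server):

    hive.server2.authentication=LDAP
    hive.server2.authentication.ldap.url=ldap://ad-server.cloudera.com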

To enable LDAP authentication for Impala, you do it on the impalad command line:
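The relevant impalad startup flags look something like the following (the URI is a placeholder):

    --enable_ldap_auth --ldap_uri=ldap://ad-server.cloudera.com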

You should now have authentication via LDAP; test it from Beeline by trying to log in to Hive or Impala as an Active Directory user:
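For example, passing the AD credentials with -n and -p (the password is a placeholder):

    beeline -u "jdbc:hive2://hs2host.example.com:10000/default" -n user1@cloudera.com -p Password1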

 

You should now be able to add a run configuration for the AD user on the Windows system.

Enabling Sentry

Now, let’s enable Sentry. You’ll create a rule where user1@cloudera.com is an admin user who can do everything, but user2@cloudera.com is a read-only user.

To enable Sentry for HiveServer2 and Impala, check the Sentry authentication box for each service in Cloudera Manager. Also, be sure to disable HiveServer2 Impersonation. When Sentry is enabled, all jobs run as a super user and then Sentry determines access based on the given credentials.

Let’s also tell Sentry to get the user group information from the policy file and not from the Linux authentication system, which in this example has no reference to the Active Directory users:
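In this setup, that means pointing Sentry’s provider at the policy file’s own group mapping; a sketch, assuming the provider class name from the Sentry policy-file documentation of that era:

    hive.sentry.provider=org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider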

For our example, let’s use a policy file that says user1@cloudera.com is an admin, but user2@cloudera.com has read-only access. In this example, the policy file contains duplicate entries with the long and short names because of IMPALA-956.
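One plausible layout for such a file (the group and role names are invented for illustration; server1 is the conventional Sentry server name):

    [users]
    user1 = admin_group
    user1@cloudera.com = admin_group
    user2 = readonly_group
    user2@cloudera.com = readonly_group

    [groups]
    admin_group = admin_role
    readonly_group = readonly_role

    [roles]
    admin_role = server=server1
    readonly_role = server=server1->db=default->table=*->action=select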

 

At this point, you should be able to experiment with Beeline or your own custom JDBC client. When authenticated as user2, you shouldn’t be able to create tables.

Debugging Sentry

Sentry logs all the facts that lead up to an authorization decision at the debug level. Therefore, if you don’t understand why Sentry is denying access, the best way to debug is to temporarily turn on debug logging.

That can be done by adding log4j.logger.org.apache.sentry=DEBUG to the log4j.properties file for your service. Specifically, you want to look for exceptions and permission-evaluation messages.
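The exact wording varies by version; paraphrased, a denied request produces debug output along these lines:

    DEBUG: FilePermission server=server1->db=default->table=*->action=select,
           RequestPermission server=server1->db=default->action=*, result false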

 

These messages indicate each evaluation Sentry makes. The FilePermission comes from the policy file, while the RequestPermission is the privilege required for the query. For a given RequestPermission, Sentry iterates over all appropriate FilePermissions until a match is found. If no matching privilege is found, Sentry returns false ("access denied").

Conclusion

Although the process above is a little involved, the end result is a setup that demonstrates Sentry, Kerberos Security, and JDBC access to Hive and Impala without adding any users to the cluster, and without initiating Kerberos sessions on the client side.

Secure clusters set up this way will accept JDBC connections from clients that are completely unaware of Kerberos. We encourage all partners and customers wanting to test or certify in Kerberos secured environments to take this approach.

Jeff Bean has been at Cloudera since 2010, serving on our Solutions Architect Team, Support Team, Training Team, and more recently, Partner Engineering Team.
