Rittmanmead.com

Testing Oracle Direct Connector for HDFS as an Alternative to Hive ODBC for OBIEE11g

2014-02-06

In a post on the blog a couple of weeks ago, I looked at loading up a set of flight delays data into Apache Hadoop, then analysing it using Apache Hive, Cloudera Impala and OBIEE. In this scenario, OBIEE connects to the Hadoop cluster using Hive and Impala ODBC drivers, and then either Hive (through MapReduce jobs) or Impala (through its in-memory distributed query engine) returns the data to OBIEE, for display on the dashboard.

In my initial tests, as you’d expect Hive was fairly slow (as its optimised towards large (TB+) batch jobs), whilst Impala was fast, on a par with regular Oracle database queries albeit with a much more limited set of SQL functions supported. One of the questions I got asked after I posted the blog article though, was how this approach compared to accessing the Hadoop data via Oracle’s “big data connectors” – either Oracle Loader for Hadoop (OLH) or Oracle Direct Connector for HDFS? Could these give us similar performance to Hive or Impala, but make the Hadoop data more “Oracle-compatible” and suitable for querying with OBIEE?

For anyone not all that familiar with Oracle’s big data connectors, there’s two that are particularly relevant to BI/DW use-cases; Oracle Loader for Hadoop (OLH) is a bulk-loader between Hadoop and the Oracle Database that uses MapReduce on the Hadoop-side to prepare, sort and transform data before efficiently loading it into an Oracle database. There’s already lots of ways you can move data in and out of Oracle from a Hadoop source – sqoop for example – but presumably Oracle feel OLH is a particularly-efficient way of doing it, exploits Oracle direct/bulk-loading capabilities and parallelism, and so forth – but it’s a loader, not a transparent reading mechanism, so it’s only really appropriate for ETL-type situations where you want to copy the data out of Hadoop and into an Oracle data warehouse.

Oracle Direct Connector for HDFS (ODCH) is a bit different though, in that it uses the Oracle external table feature to map an Oracle data dictionary table onto HDFS files, or optionally Apache Hive tables (and therefore onto HDFS files, NoSQL databases or whatever). So you could potentially use ODCH to create Oracle database table representations of your Hive tables / HDFS files, then map these into the OBIEE repository and work with them just like any other Oracle (external) table. What this means then is that you’re not reliant on Hive or Impala ODBC drivers, or the BI Server having to generate HiveQL or ImpalaQL queries (with Impala of course not yet being officially supported), making your work from the OBIEE side a lot easier than if you’re trying to work with Hive or Impala directly.

So how well does it work then? I decided to give it a try using the new BigDataLite VM I mentioned on the blog last week, along with the Airline Delays dataset I used in the post on Impala and Hive the other week. My starting point then was a Hive database with four tables – flight performance (19m rows), dest and origin (2k rows each) and carrier (1.5k rows), like this:

Now obviously Oracle’s big data connectors aren’t installed as part of Hadoop by default, but the BigDataLite VM does have them pre-installed, along with an Oracle Database 12c database, so a lot of the work in setting things up is done for you. By default, an install of the big data connectors connects to the Hadoop environment on the same machine, so all of the setup commands I use will assume that the Hive server and so on are on the same server – localhost in this case – although of course you can configure them to connect to a remote Hadoop server or cluster.

The way ODCH works is that it uses the Oracle external table “pre-processor” feature to stream HDFS file content into your query session, with the pre-processor in this scenario being a utility provided by the big data connectors to connect the two environments together. This blog post from Oracle explains the process in a bit more detail, but the key things to understand are that it’s conceptually similar to accessing regular file data via external tables, giving us the benefit of fairly seamless access to this external data (vs. using SQL*Loader, or OLH in this case), but it’s also not “real” Oracle table data so there’s no indexes or any other performance structures that can make queries run faster – thought like regular external tables you can run ODCH loading in-parallel.

So let’s use the setup in the BigDataLite VM to create some external tables in the Oracle database that map to my Hive tables, and underlying HDFS data files. Before we do this though we need to set up a few things; first, we need to create a temporary directory on the Linux filesystem to hold the metadata files created by ODCH, and we also need to specify where the ODCH HDFS file streamer utility can be found. Logging in as oracle/welcome1 on the VM, I first create the temporary directory, and then set an environment variable the rest of the process will be looking for:

Now I log into SQL*Plus and create a database user that I’ll create the external tables in, and grant it access to the USERS tablespace:

Next, I need to create two database directory objects; one to point to where the ODCH HDFS stream pre-processor lives, and the other to point to my temporary directory:

Now, re-purposing the scripts on the BigDataLite VM provided to create external tables over the Moviework Hive database, each of my tables requires two setup files; one, a shell script, to call the ODCH utility and reference an XML parameter file, and the second, the XML file, which contains the definition of the external table. To take an example, the shell script (genloc_origin_hive.sh) to create an external table over my “origin” Hive table looks like this:

whereas the accompanying XML file (origin_hive.xml) looks like this:

Within the XML file, we specify the name of the external table that the utility will create in Oracle for you (ORIGIN_EXT_TAB_HIVE within the BI_AIRLINES schema); that it’s a Hive (rather than HDFS file) source, and the Hive database name and table name to source data from. With the operating system directory that you pass to it, the utility stores metadata within this to point to the individual HDFS files that contain your data, which means that whilst the HDFS data itself isn’t copied to Oracle, a hard reference to the file is, which means if you add more files to the Hive table’s HDFS directory, you’ll need to re-run the utility to register them – so its not completely automatic beyond this point, but more or less OK if you’ve got a single file providing the data, as I have in this slightly non-representative case.

So let’s run the shell script and create one of the external tables, passing in the Oracle database password for the “bi_airlines” user when prompted:

Note the HDFS file reference at the end – this is the metadata created by the utility, which we’d need to update if more data files are added to the underlying Hive table.

So let’s fire-up SQL*Plus and take a look at the tables:

Not bad, but just over a minute to do a quick row-count on the table. Let’s try the same thing using Hive itself – note that ODCH and Hive aren’t using the same access technique to the underlying data, and one might more sense than the other for particular query situations.

Ok, a bit longer, 76 seconds vs. 69 seconds – but as I said, they’re different access mechanisms and you’d probably get different results if you joined the data, filtered it etc.

Let’s try it in Impala now:

Ah, that’s more like it – the count this time was returned in just over a second, which is what you’d expect as Impala is optimised for ad-hoc queries, Hive for larger, batch-style SQL access and transformations. Moving over to OBIEE though, after creating a TNSNAMES connection through to the BigDataLite Oracle Database 12c database, I can import the external tables in just like any regular Oracle source:

Building the RPD against these tables is thereafter the same as any Oracle database source – no need to worry about Hive drivers, SQL dialects, Impala not being supported and so on.

But … when you come to run queries, it’s slow. As slow as accessing Hadoop data via Hive, in the order of minutes to return a result set to the dashboard.

And this is what you’d expect if you connected OBIEE to a file data source, or an Oracle external table. There’s no indexes on the underlying tables, no aggregates and so forth, and you’re not benefiting from the query optimisations you’d get with technologies such as Impala, Stinger from Hortonworks or whatever. In reality, if what you’re after is the convenience of connecting to Oracle tables rather than Hive or Impala ones, you’d just use the external tables you created to then load the data into regular Oracle tables, and just query those.

What ODCH is really for is data extraction and ETL, it’s not a high-performance ad-hoc query tool, and realistically you’d only really use it like this for initial prototyping or if your use-case really suited direct query access to HDFS file data. So – back to Impala then for this type of Hadoop access, though it’s a neat feature to be aware of, particularly in the context of ODI and bulk-loading Hadoop data into Oracle.