2014-04-25

Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.

I’ve used Cloudera Manager to install Hadoop and Impala. For this test I’ve (intentionally) used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

Purpose: Namenode, Hive metastore, etc. + datanodes
Server specs: 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives

Purpose: Datanodes only
Server specs: 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive

 

As you can see, those are pretty old servers; the only thing I’ve changed is adding a 4TB drive to be able to store more data. Hadoop provides redundancy at the server level (it writes 3 copies of the same block across the datanodes), so we do not need RAID on the datanodes (we do still need redundancy for the namenode, though).

Data Export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:

 
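A minimal sketch of such an export (the output path, delimiter, and table name here are illustrative, not necessarily the exact statement from the original post):

    -- Dump the ontime table into a delimited text file on the MySQL server
    SELECT * FROM ontime
    INTO OUTFILE '/tmp/ontime.csv'
    FIELDS TERMINATED BY ',';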

(You can use “|” or any other symbol as a delimiter.) Alternatively, you can download data directly from www.transtats.bts.gov using this simple script:

 
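For reference, a hypothetical version of such a script (the URL pattern on www.transtats.bts.gov changes over time, so treat it as an assumption and check the site for the current one):

    #!/bin/bash
    # Fetch the monthly "On-Time Performance" zip files, then unzip them.
    # NOTE: the URL pattern below is an assumption, not the original script.
    for year in $(seq 1988 2013); do
      for month in $(seq 1 12); do
        wget "http://www.transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${year}_${month}.zip"
      done
    done
    unzip '*.zip'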

Load into Hadoop HDFS

The first thing we need to do is load the data into HDFS as a set of files. Hive and Impala work with a directory into which you have imported your data, and will concatenate all files inside that directory. In our case, the easiest approach is simply to copy all of our files into a directory inside HDFS.

 
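The copy itself is just a couple of shell commands (the HDFS directory and file names here are illustrative):

    # Create a directory in HDFS and copy the exported file(s) into it
    hdfs dfs -mkdir /data/ontime
    hdfs dfs -put /tmp/ontime.csv /data/ontime/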

Create External Table in Impala

Now that we have all the data files loaded, we can create an external table:

 
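A sketch of such a table definition, shortened to a few of the ontime columns (the real table has many more) and using illustrative table and directory names:

    CREATE EXTERNAL TABLE ontime_csv (
      yeard INT,
      DayOfWeek INT,
      FlightDate STRING,
      Carrier STRING,
      OriginState STRING,
      DestState STRING,
      ArrDelayMinutes INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/ontime/';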

Note the “EXTERNAL” keyword and the LOCATION clause (LOCATION points to a directory inside HDFS, not to a file). Impala will create meta information only (it will not modify the data). We can query this table right away; however, Impala will need to scan all files (a full scan) for every query.

Example:

 
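For example, a simple aggregation along these lines (the actual query in the post may differ):

    -- Count flights per year; Impala reads every file under the table's LOCATION
    SELECT yeard, count(*)
    FROM ontime_csv
    GROUP BY yeard;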

(Note that GROUP BY will not sort the rows, unlike MySQL. To sort we will need to add ORDER BY yeard.)

Explain plan:

 
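The plan can be obtained by prefixing the same query with EXPLAIN (table name as in the sketch above):

    EXPLAIN SELECT yeard, count(*) FROM ontime_csv GROUP BY yeard;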

As we can see, it will scan 45GB of data.

Impala with Columnar Format and Compression

The great benefit of Impala is that it supports columnar formats and compression. I’ve tried the new Parquet format with the Snappy compression codec. As our table is very wide (and denormalized), using a columnar format helps a lot. To take advantage of the Parquet format we will need to load the data into it, which is easy to do when we already have a table inside Impala and the files inside HDFS:

 
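One way to do this, assuming the illustrative table names from above (Snappy is Impala’s default Parquet compression codec, and CREATE TABLE ... LIKE ... STORED AS PARQUET is one supported route):

    -- Create a Parquet copy of the table and load it from the CSV-backed table
    CREATE TABLE ontime_parquet_snappy LIKE ontime_csv STORED AS PARQUET;
    INSERT INTO ontime_parquet_snappy SELECT * FROM ontime_csv;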

Then we can test our query against the new table:

 
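That is, the same aggregation as before, now pointed at the Parquet table:

    SELECT yeard, count(*)
    FROM ontime_parquet_snappy
    GROUP BY yeard;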

As we can see, it will scan a much smaller amount of data: 3.95GB (with compression) compared to 45GB.

Results:

 

And the response time is much better as well.

Impala Complex Query Example

I’ve used the complex query from my previous post. I had to adapt it for use with Impala: it does not support the sum(ArrDelayMinutes>30) notation, but sum(if(ArrDelayMinutes>30, 1, 0)) works fine.

 
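A sketch of that kind of query (the column names come from the ontime dataset; the exact filters, thresholds, and LIMIT in the original post may differ):

    -- Delayed-flight rate per carrier, weekdays only, continental US only
    SELECT
      min(yeard) AS min_year,
      max(yeard) AS max_year,
      Carrier,
      count(*) AS cnt,
      sum(if(ArrDelayMinutes > 30, 1, 0)) AS flights_delayed,
      round(sum(if(ArrDelayMinutes > 30, 1, 0)) / count(*), 2) AS rate
    FROM ontime_parquet_snappy
    WHERE
      DayOfWeek NOT IN (6, 7)
      AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
      AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
    GROUP BY Carrier
    HAVING cnt > 100000
    ORDER BY rate DESC
    LIMIT 1000;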

The query is intentionally designed so that it does not take advantage of indexes: most of the conditions filter out less than 30% of the data.

Impala results:

 

15.28 seconds is significantly faster than the original MySQL results (15 min 56.40 sec without parallel execution and 5 min 47 sec with parallel execution). However, this is not an “apples to apples” comparison:

MySQL will scan 45GB of data and Impala with Parquet will only scan 3.5GB.

MySQL will run on a single server, Hadoop + Impala will run in parallel on 6 servers.

Nevertheless, Hadoop + Impala shows impressive performance and the ability to scale out of the box, which can help a lot with analysis of large data volumes.

Conclusion

Hadoop + Impala will give us an easy way to analyze large datasets using SQL with the ability to scale even on old hardware.

Alexander joined Percona in 2013. He has worked with MySQL since 2000 as a DBA and application developer. Before joining Percona he spent over 7 years doing MySQL consulting as a principal consultant (starting with MySQL AB in 2006, then Sun Microsystems, and then Oracle).
