Set up your own, or even a shared, environment for doing interactive analysis of time-series data.
Although software engineering offers several methods and approaches for producing robust and reliable components, data analysts need a more lightweight and flexible approach: they do not build “products” per se, but they still need high-quality tools and components. So, recently, I tried to find a way to reuse existing libraries and datasets already stored in HDFS with Apache Spark.
The use case involved information flow analysis based on time-series and network data. In this use case, all measured data (primary data) is stored in time-series buckets, which are Hadoop SequenceFiles with keys of type Text and values of type VectorWritable (from Apache Mahout 0.9). In my testing, I found the Spark shell to be a useful tool for this kind of interactive data analysis, especially since the code involved can be modularized, reused, and even shared.
In this post, you’ll learn how to set up your own, or even a shared, environment for doing interactive data analysis of time series within the Spark shell. Instead of developing an application, you will use Scala code snippets and third-party libraries to create reusable Spark modules.
What is a Spark Module?
Using existing Java classes inside the Spark shell requires a solid deployment procedure and some dependency management. In addition to the Scala Simple Build Tool (sbt), Apache Maven is really useful here, too.
Figure 1: For simple and reliable usage of Java classes and complete third-party libraries, we define a Spark Module as a self-contained artifact created by Maven. This module can easily be shared by multiple users.
For this use case, you will need to create a single JAR file containing all dependencies. In some cases, it is also really helpful to provide some library wrapper tools. Such helper classes should be well tested and documented. That way, you can achieve a kind of decoupling between data analysis and software development tasks.
Next, let’s go through the steps of creating this artifact.
Set Up a New Maven Project
First, confirm you have Java 1.7 and Maven 3 installed. Create a new directory for your projects and use Maven to prepare a project directory.
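The exact commands depend on your environment; as a minimal sketch, you might create a working directory and let Maven's interactive archetype generator build the project skeleton (the directory name below is just an example):

```bash
# Create a working directory for your Spark modules (name is just an example)
mkdir -p ~/spark-modules && cd ~/spark-modules

# Generate a project skeleton interactively; Maven will prompt for an archetype
# and a version to use
mvn archetype:generate
```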
Maven will ask which version to use; select (6) to work with Spark 1.1 locally, or another number according to the settings for your cluster.
Now, add some dependencies to the automatically generated POM file.
The Mahout 0.9 libraries are also dependencies of Spark, so you will need to add the scope “provided” to their entries in the dependency section; otherwise, Maven will pull in the library and all of its classes will be added to your final single JAR file. (As our time-series buckets are SequenceFiles and contain objects of type VectorWritable, they require this version of Mahout.)
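As a sketch, the dependency entries might look like the following; the Spark and Mahout artifacts are standard Maven Central coordinates, but double-check the versions against your cluster:

```xml
<dependencies>
  <!-- Spark itself is provided by the Spark shell at runtime -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>
  <!-- Mahout 0.9 supplies the VectorWritable type used in the SequenceFiles;
       "provided" keeps it out of the final single JAR -->
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```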
Another reason to package third-party libraries is for creating charts inside the Spark shell. If you have Gnuplot installed, it is really easy to plot results with the scalaplot library. Just add this dependency definition to your pom.xml file and you are ready to plot:
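A sketch of such an entry follows; verify the exact scalaplot coordinates and version on Maven Central:

```xml
<dependency>
  <groupId>org.sameersingh.scalaplot</groupId>
  <artifactId>scalaplot</artifactId>
  <!-- version is an example; use the latest release from Maven Central -->
  <version>0.0.4</version>
</dependency>
```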
In this specific scenario, the plan is to do some interactive time-series analysis within the Spark shell. First, you’ll want to evaluate the datasets and algorithms; you have to learn more about the domain before a custom application can be built and deployed. Finally, you can use Apache Oozie actions to execute the code, but even in this case all third-party libraries have to be available as one artifact.
It is worthwhile to invest a few minutes in building such a single JAR file with all dependencies, especially for projects that are more than just a hack, and in sharing this artifact among all the data scientists in your group.
But what about libraries that are not available in Maven Central, such as those hosted on SourceForge or Google Code?
Download and Deploy a Third-Party Library as Part of a Spark Module
You’ll need to prepare a location for all third-party libraries that are not available via Maven Central but are required in this particular project.
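For example, a project-local directory can serve as a simple file-based Maven repository (the path is just a suggestion):

```bash
# Directory that will act as a file-based Maven repository for this project
mkdir -p ~/spark-modules/local-repo
```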
Now download the required artifacts, e.g. the JIDT library from Google Code, and decompress the zip file:
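The download URL and version below are placeholders; use the actual link from the JIDT project page:

```bash
# Placeholder URL and version; replace them with the real JIDT distribution link
wget https://example.org/downloads/infodynamics-dist-X.Y.zip
unzip infodynamics-dist-X.Y.zip -d jidt
```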
Maven can deploy the artifact for you using the mvn deploy:deploy-file goal:
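Here is a sketch of that call, assuming the local repository created above; the file name and the groupId/artifactId/version coordinates are illustrative and should be adjusted to your copy of the library:

```bash
mvn deploy:deploy-file \
  -Dfile=jidt/infodynamics.jar \
  -DgroupId=org.example.jidt \
  -DartifactId=infodynamics \
  -Dversion=1.0 \
  -Dpackaging=jar \
  -Durl=file://$HOME/spark-modules/local-repo
```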
Now, you are ready to add this locally available library to the dependencies section of the POM file of the new Spark Module project:
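Assuming the coordinates used in the deploy step, the POM needs a repository entry that points at the local directory plus the dependency itself (names and versions are illustrative):

```xml
<repositories>
  <repository>
    <id>project-local-repo</id>
    <url>file://${user.home}/spark-modules/local-repo</url>
  </repository>
</repositories>

<!-- Add this entry to the existing <dependencies> section -->
<dependency>
  <groupId>org.example.jidt</groupId>
  <artifactId>infodynamics</artifactId>
  <version>1.0</version>
</dependency>
```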
The next step is to add the Maven Assembly Plugin to the plugins section of the pom.xml file. It merges all of the dependency JAR files into a single file during the build.
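A typical configuration looks roughly like this (the plugin version is just an example):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptorRefs>
          <!-- Produces an additional *-jar-with-dependencies.jar artifact -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```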
Use the above build snippet and place it inside the project section.
Now you are ready to run the Maven build.
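With the assembly plugin bound to the package phase, a regular build is enough:

```bash
mvn clean package
```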
The result will be a single JAR file with the defined libraries built in. The file is located in the target directory. As a next step, run the Spark shell and test the settings.
Run and Test the Single-JAR Spark Module
To run Spark in interactive mode via the Spark shell, just define an environment variable named ADD_JARS. If more than one JAR file should be added, use a comma-separated list of paths. Now run the Spark shell with this command:
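As a sketch, with the assembled artifact from the target directory (the JAR name depends on your artifactId and version):

```bash
# JAR name is an example; use the *-jar-with-dependencies.jar built by Maven
export ADD_JARS=target/spark-module-1.0-jar-with-dependencies.jar
spark-shell
```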
A quick validation can be done via the web UI of the Spark shell application, which is available on port 4040; open http://localhost:4040/environment/ in a browser to check it.
Figure 2: Validation of Spark environment settings. The JAR files that are available to the Spark nodes are shown in the marked field. One has to specify all additional paths in the property spark.jars.
Another test can now be done inside the Spark shell: just import some of the required Java classes, such as the MatrixUtils class, from the third-party library. You just have to type:
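Assuming the JIDT library deployed above, the import would look something like this; the package path comes from that distribution and will differ for other libraries:

```scala
// MatrixUtils ships with the JIDT library in the infodynamics.utils package
import infodynamics.utils.MatrixUtils
```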
At this point, you may well wonder how to save the Scala code you entered into the Spark shell. After a successful interactive session, you can simply extract your input from the Spark shell history. The Spark shell logs all commands in a file called .spark-history in the user’s home directory. In a Linux terminal, run the tail command to capture the latest commands before you go on.
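For example (the history file name is as given above; the line count and the output file name are up to you):

```bash
# Capture the last 50 shell commands into a reusable script file
tail -n 50 ~/.spark-history > timeseries-analysis.scala
```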
This command lets you preserve the commands as a simple reusable script or as a base for further development in an IDE. Now you can run this script file, containing your Scala functions and custom code, just by using the :load command. Inside the Spark shell, you enter:
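For example, with the script saved above:

```scala
:load timeseries-analysis.scala
```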
And don’t forget to share your code! If you want to publish this module via GitHub, you can quickly follow the instructions here.
Because visual investigation is an important time saver, the scalaplot library was added to this module earlier. Now you can easily create some simple charts from the variables stored in the Spark shell. Because this post is not about RDDs and working with large datasets, but rather about preparing the stage, just follow the steps from the scalaplot documentation to plot a simple sine wave.
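A minimal sketch, based on the example in the scalaplot documentation (the exact API may differ between scalaplot versions):

```scala
import org.sameersingh.scalaplot.Implicits._

// Sample the range [0, 2*pi) and plot sine and cosine in a Gnuplot window
val x = 0.0 until 2.0 * math.Pi by 0.1
output(GUI, xyChart(x -> (math.sin(_), math.cos(_))))
```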
If your system shows a window with two waves now and no error messages appear, you are done for today.
Congratulations, the Spark shell is now aware of your project libraries, including the plotting tools and the “legacy” libraries containing the data types used in your SequenceFiles, all bundled in your first Spark module!
Conclusion
In this post, you learned how to manage and use external dependencies (especially Java libraries) and project-specific artifacts in the Spark shell. Now it is really easy to share and distribute the modules within your data analyst working group.
Mirko Kämpf is the lead instructor for the Cloudera Administrator Training for Apache Hadoop for Cloudera University.