2013-11-14

Today for my 30 day challenge, I decided to learn how to extract text and images from web links using the Java programming language. This is a very common requirement for content discovery websites like Prismatic. In this blog post, we will learn how to use a Java library called boilerpipe to accomplish this task.

Prerequisite

Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can either install OpenJDK 7 or Oracle JDK 7. OpenShift supports both OpenJDK 6 and 7.

Sign up for an OpenShift account. It is completely free and Red Hat gives every user three free gears on which to run applications. At the time of this writing, the combined resources allocated to each user are 1.5 GB of memory and 3 GB of disk space.

Install the rhc client tool on your machine. rhc is a Ruby gem, so you need Ruby 1.8.7 or above on your machine. To install rhc, just type
sudo gem install rhc
If you already have rhc installed, make sure it is the latest version. To update rhc, execute the command
sudo gem update rhc
For additional assistance setting up the rhc command-line tool, see the following page: https://openshift.redhat.com/community/developers/rhc-client-tools-install

Set up your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your ssh keys to the OpenShift server.

Step 1 : Create a JBoss EAP application

We will start by creating the demo application, named newsapp, with the JBoss EAP cartridge:
rhc create-app newsapp jbosseap

If you have access to medium gears, you can request one with the -g (gear size) option:
rhc create-app newsapp jbosseap -g medium

This will create an application container for us, called a gear, and set up all of the required SELinux policies and cgroup configuration. OpenShift will also set up a private git repository for us and clone it to the local system. Finally, OpenShift will propagate the DNS to the outside world. The application will be accessible at http://newsapp-{domain-name}.rhcloud.com/. Replace domain-name with your own unique OpenShift domain name (also sometimes called a namespace).

Step 2 : Add Maven dependencies

In the pom.xml file add the following dependency:
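As a sketch, the entry could look like the fragment below. The coordinates and version are an assumption based on how boilerpipe 1.2.0 was published at the time, so check them against the repository you actually use.

```xml
<!-- Assumed coordinates for boilerpipe 1.2.0; verify before relying on them -->
<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.0</version>
</dependency>
```

boilerpipe parses HTML with NekoHTML and Xerces. If the build or the deployed application reports missing classes from those libraries, they may need to be declared as dependencies as well.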

You will also need to add a new repository
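A repository entry along these lines goes into pom.xml. The id is a hypothetical label, and the URL is an assumption: it is the Google Code location the boilerpipe project used for its Maven artifacts, and it may no longer resolve.

```xml
<repositories>
    <repository>
        <!-- Hypothetical id; any unique name works -->
        <id>boilerpipe-m2-repo</id>
        <!-- Assumed URL of the project's own Maven repository -->
        <url>http://boilerpipe.googlecode.com/svn/repo/</url>
    </repository>
</repositories>
```

If pom.xml already contains a repositories element, add only the inner repository element to it.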

Also update the Maven project to Java 7 by updating a couple of properties in the pom.xml file:
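A minimal sketch, assuming the generated pom.xml drives the compiler level through the standard maven.compiler.source and maven.compiler.target properties:

```xml
<properties>
    <!-- Compile against, and emit bytecode for, Java 7 -->
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
</properties>
```

If your pom.xml configures the maven-compiler-plugin directly instead, change the source and target values there.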

Now update the Maven project: right-click the project > Maven > Update Project.

Step 3 : Enable CDI

We will be using CDI for dependency injection. CDI, or Contexts and Dependency Injection, is a Java EE 6 specification that enables dependency injection in a Java EE 6 project. It defines a type-safe dependency injection mechanism for Java EE, and almost any POJO can be injected as a CDI bean.

Create a new XML file named beans.xml in the src/main/webapp/WEB-INF folder. Replace the content of beans.xml with the following:
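In Java EE 6 an empty beans element is enough to switch CDI on for the archive, so a minimal file, using the CDI 1.0 schema, can look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://java.sun.com/xml/ns/javaee"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
    <!-- Intentionally empty: the file's presence marks the WAR as a CDI bean archive -->
</beans>
```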

Step 4 : Create BoilerpipeContentExtractionService

Now we can create a BoilerpipeContentExtractionService class that takes a URL and extracts the title and article text from the page behind it.
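A minimal sketch of such a service is shown here. It needs the boilerpipe and CDI libraries on the classpath. The method name content, the @ApplicationScoped scope and the Content value object (a simple holder for a title and a text) are assumptions made for illustration, not necessarily what the sample application uses.

```java
import java.net.URL;

import javax.enterprise.context.ApplicationScoped;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

@ApplicationScoped
public class BoilerpipeContentExtractionService {

    // "content" is a hypothetical method name; Content is an assumed
    // value object with a (title, text) constructor.
    public Content content(String url) {
        try {
            // Download the raw HTML for the given URL
            final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));

            // Parse the HTML into boilerpipe's block-based document model
            final TextDocument doc =
                    new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();

            // Read the page title recorded during parsing
            final String title = doc.getTitle();

            // Keep the blocks classified as article text, dropping navigation and other boilerplate
            final String text = ArticleExtractor.INSTANCE.getText(doc);

            return new Content(title, text);
        } catch (Exception e) {
            // Fetching, parsing and extraction all throw checked exceptions;
            // this sketch wraps them in a single unchecked one
            throw new RuntimeException("Could not extract content from " + url, e);
        }
    }
}
```

ArticleExtractor is the extractor boilerpipe tunes for news-style articles; the library ships other extractors that may suit different kinds of pages better.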

The above-mentioned code does the following:

It first fetches the document at the given URL.

Then it parses the HTML document and returns a TextDocument.

Next we get the title from the text document.

Finally, we extract the content from the text and return a new instance of the application value object.
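The value object returned in the last step can be very small. Below is a sketch, assuming it is called Content and carries just a title and the extracted text; both the class name and its fields are illustrative choices.

```java
/** Immutable holder for the title and article text extracted from a page. */
final class Content {

    private final String title;
    private final String text;

    Content(String title, String text) {
        // Treat a missing title or text as empty so callers never see null
        this.title = title == null ? "" : title;
        this.text = text == null ? "" : text;
    }

    // Public getters so a JSON provider can expose both fields as properties
    public String getTitle() {
        return title;
    }

    public String getText() {
        return text;
    }

    @Override
    public String toString() {
        return "Content{title='" + title + "', textLength=" + text.length() + "}";
    }
}
```

In the project this class would live in its own Content.java file and would typically be declared public.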

Step 5 : Enable JAX-RS

To enable JAX-RS, create a class which extends javax.ws.rs.core.Application and specify the application path using the javax.ws.rs.ApplicationPath annotation as shown below.
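A minimal sketch of that class follows. The class name JaxRsActivator and the base path /api/v1 are assumptions; pick whatever fits your application.

```java
import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;

// "/api/v1" is an assumed base path under which all REST resources are served
@ApplicationPath("/api/v1")
public class JaxRsActivator extends Application {
    // Intentionally empty: with no classes registered explicitly, a Java EE 6
    // container discovers the resource and provider classes packaged in the WAR
}
```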

Step 6 : Create ContentExtractionResource

Now we will create our ContentExtractionResource class which will return a content object as JSON. Create a new class named ContentExtractionResource and replace the code with the contents shown below:
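One way to write the resource is sketched here. It assumes the Step 4 service exposes a content(String) method returning a Content value object, and it uses a hypothetical /content path with a url query parameter; none of these names are confirmed by the sample application.

```java
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// "/content" is an assumed path, relative to the JAX-RS application path
@Path("/content")
public class ContentExtractionResource {

    // Injected by CDI, which beans.xml enabled in Step 3
    @Inject
    private BoilerpipeContentExtractionService extractionService;

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Content extract(@QueryParam("url") String url) {
        // Reject requests that do not carry a link to extract from
        if (url == null || url.trim().isEmpty()) {
            throw new WebApplicationException(Response.Status.BAD_REQUEST);
        }
        // content(...) is the assumed method name of the Step 4 service
        return extractionService.content(url);
    }
}
```

If the JAX-RS application path were /api/v1, a GET request to /api/v1/content?url= followed by the encoded article link would return the title and text as a JSON object, provided the server has a JSON provider available for the Content class.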

Deploy to OpenShift

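Finally, deploy the changes to OpenShift by committing them and pushing to the application's git repository:
git add .
git commit -m "Content extraction with boilerpipe"
git push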

After the code is pushed and the war is successfully deployed, we can view the application running at http://newsapp-{domain-name}.rhcloud.com. My sample application is running at http://newsapp-t20.rhcloud.com.

Now you can test it by submitting a link in the application UI.



That's it for today. Keep giving feedback.

What's Next

Sign up for OpenShift Online

Get your own private Platform As a Service (PaaS) by evaluating OpenShift Enterprise

Need Help? Ask the OpenShift Community your questions in the forums

Showcase your awesome app in the OpenShift Developer Spotlight. Get in the OpenShift Application Gallery today.
