Typo3-solr.com

Index external websites with Apache Nutch

2011-09-23

Life is not always easy for search engines nowadays. They have to provide a ton of features, scale up and down, and still deliver good search results.

Apache Solr is a state-of-the-art enterprise search technology. It has proven many times that it does its job pretty well. But what if you need a feature that Solr does not support?
Besides indexing our own TYPO3 website we also want to index external websites. Unfortunately, Apache Solr does not ship with a web crawler of its own.
Fortunately, Apache Solr integrates smoothly with the rest of the Apache ecosystem.

Apache Nutch is a highly scalable web crawler that has a Solr integration. In this article I will show you how to set up, configure and use Nutch with Solr.
We will use Nutch 1.3 which supports Apache Solr 3.X.

Installation

Let's get our hands dirty. We change to the directory /opt and download Apache Nutch 1.3:
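
A minimal sketch of these two steps. The archive URL and file name are assumptions based on the usual Apache archive layout; check the Nutch download page and use the link it gives if it differs.

```shell
cd /opt
# Assumed archive location and file name; verify before running.
wget http://archive.apache.org/dist/nutch/apache-nutch-1.3-bin.tar.gz
```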

When the download is finished we unpack the archive:
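
Unpacking could look like this, assuming the downloaded file is named apache-nutch-1.3-bin.tar.gz and unpacks into the directory nutch-1.3:

```shell
# Assumed name of the downloaded archive; adjust it to the file you actually fetched.
tar xzf apache-nutch-1.3-bin.tar.gz
```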

Configuration

We move on to the directory nutch-1.3/runtime/local and make the command nutch executable:
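
As a sketch, assuming we are still in /opt and the launcher script lives at bin/nutch inside runtime/local:

```shell
cd nutch-1.3/runtime/local
# Make the Nutch launcher script executable.
chmod +x bin/nutch
```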

Please ensure that you have set the variable JAVA_HOME. Nutch needs to know where your Java installation is located:
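
For example (the path below is only an assumption; point it at the Java installation on your own machine):

```shell
# Example path only; replace it with the location of your Java installation.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
echo "JAVA_HOME is set to $JAVA_HOME"
```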

In the next step we configure the crawler. We open the file conf/nutch-site.xml and save the following configuration:
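
A minimal sketch of such a configuration, written from the shell. It sets only http.agent.name, which Nutch requires before it will fetch anything. The agent name is a hypothetical placeholder, and your setup may need further properties.

```shell
# conf/ already exists inside runtime/local; mkdir -p is then a no-op.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Hypothetical agent name; choose one that identifies your crawler. -->
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
```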

We create the folder urls. Nutch reads every file inside urls to find the websites it should visit.
Therefore we create the file urls/seeds with the following content:
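
Both steps can be done from the shell. The seed URL below is only a placeholder (an assumption); list the websites you actually want to crawl, one per line.

```shell
mkdir -p urls
# Placeholder seed URL; replace it with the site(s) you want to index.
echo "http://nutch.apache.org/" > urls/seeds
```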

Nutch ships its own Solr schema, located in conf/schema.xml. We copy the schema to our Solr installation after making a small fix.
In the definition of the content field we need to change the attribute stored from false to true.

By setting the option stored to true we enable saving the crawled website's content in the Solr index.
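
The fix can also be scripted. This is a sketch, assuming the content field is declared on a single line with stored="false", as in the schema shipped with Nutch 1.3. The helper name fix_content_field is hypothetical.

```shell
# Hypothetical helper: sets stored="true" on the content field of a schema file.
fix_content_field() {
  # -i.bak edits the file in place and keeps a backup copy next to it.
  sed -i.bak 's/\(<field name="content"[^>]*stored="\)false"/\1true"/' "$1"
}

# Apply the fix only if the schema is present in the current directory tree.
if [ -f conf/schema.xml ]; then
  fix_content_field conf/schema.xml
fi
# Afterwards, copy conf/schema.xml into the conf directory of your Solr installation.
```

Other fields are left untouched because the pattern matches only the field named content.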

Usage

After finishing the configuration we are ready to enjoy the power of Nutch.
To crawl the configured website we use the command crawl. We pass it the URL of our running Solr server and the depth of links to follow:
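
A sketch of such a call, run from runtime/local. It assumes Solr is listening on its default local address; the depth and topN values are arbitrary examples.

```shell
# Assumptions: Solr runs at http://localhost:8983/solr/; 3 and 50 are example values.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50
```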

The command supports the following options:

The option solr defines the Solr server that receives the crawled documents.

The option depth defines the depth of links to follow.

The option topN limits the number of pages fetched at each depth level.

You can view the indexed websites using the Solr admin interface. If you do not want to search external websites via your TYPO3 website, we can provide alternative solutions like Tempo.

Conclusion

Apache Nutch provides an easy-to-use solution for crawling and indexing external websites. It integrates seamlessly with Apache Solr.
This use case proves the great flexibility of the Apache ecosystem once again. If you want to search external websites using your Solr installation, or are interested in an alternative solution for displaying search results, get in contact with us!

Resources

Apache Solr:
http://lucene.apache.org/solr/

Apache Nutch:
http://nutch.apache.org/

Tempo with Solr:
http://tempojs.com/examples/solr