2014-08-03

I haven't done any crawling with Python before - we use Apache Nutch (NutchGORA, actually) for large crawls, and it's been a while since I had to do a small, focused crawl. Back when I did, my tool of choice was WebSPHINX, a small but powerful Java library for building crawlers.

We recently needed to crawl a small site, and I had heard good things about Scrapy, a Python toolkit for building crawlers and crawl pipelines, so I figured this might be a good opportunity to learn how to use it. This post describes my experience, as well as the resulting code. At a high level, this is what I did:

1. Create a Scrapy crawler and download all the pages as HTML, along with some document metadata. This writes everything to a single large JSON file.

2. Pull the HTML out of the JSON into multiple HTML documents, one HTML file per web page.

3. Parse the HTML and merge all the metadata back into individual JSON files, one JSON file per document.

I installed Scrapy using apt-get, based on the advice on this page. I had earlier tried "pip install", but it failed with unknown libffi errors.
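For reference, the apt-get route boils down to something like the following - the exact package name depends on your distribution and on whether you use Scrapy's own apt repository:

```
sudo apt-get install python-scrapy
```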

Once it is installed, the first thing to do is create a Scrapy project. Unlike most other Python modules, Scrapy is both a command-line toolkit and a library. The following command creates a stub project.
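The project name here is just a placeholder:

```
scrapy startproject mycrawler
```

This generates a scrapy.cfg and a project package containing, among other files, items.py, settings.py and a spiders/ directory.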

Based on the files in the project stub, I am pretty sure that the three steps I show above could have been done in one shot, but for convenience I just used Scrapy as a crawler, deliberately keeping the interaction with the target site as minimal as possible. I figured that fewer steps translate to fewer errors, and thus less chance of having to run this step multiple times. Once the pages were crawled, I could then iterate as many times as needed against the local files without bothering the website.

Scrapy needs an Item and a Spider implementation. The Spider's parse() method yields Request objects (for outlinks found in the page) or Item objects (representing the document being crawled). The default output mechanism captures the results of the crawl as a list of Items, serialized here into a single JSON file.

The Item class is defined in items.py. After my changes, items.py looks like this:
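Something along these lines - the item and field names below are my guesses, covering the URL, the type and sample names, and the raw page HTML:

```python
# items.py - field names are illustrative; they correspond to the metadata
# referred to later in this post (type_name, sample_name) plus the raw HTML.
import scrapy


class SamplePageItem(scrapy.Item):
    url = scrapy.Field()
    type_name = scrapy.Field()
    sample_name = scrapy.Field()
    html = scrapy.Field()
```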

And here is the code for the Spider. Because the primary use case for Scrapy appears to be scraping single web pages for interesting content, the tutorial doesn't provide much help for multi-page crawls. My code is heavily adapted from this post by Milinda Pathirage.
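The sketch below captures the general shape of the spider described in the next paragraph; the spider name, the allowed domain, the start URL and the query parameters used to recover sample_name and type_name are all placeholders for the real (undisclosed) site:

```python
# spiders/sample_spider.py - a sketch; domain, start URL and query parameter
# names are placeholders.
import urlparse  # Python 2; use urllib.parse on Python 3

import scrapy

from mycrawler.items import SamplePageItem  # hypothetical project/item names


class SampleSpider(scrapy.Spider):
    name = "samples"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/browse.asp"]

    def __init__(self, *args, **kwargs):
        super(SampleSpider, self).__init__(*args, **kwargs)
        # (sample_name, type_name) pairs seen so far, to avoid duplicates
        self.seen = set()

    def parse(self, response):
        # follow outlinks that point to directory (browse.asp) or
        # document (sample.asp) pages
        for href in response.xpath("//a/@href").extract():
            url = urlparse.urljoin(response.url, href)
            if "browse.asp" in url or "sample.asp" in url:
                yield scrapy.Request(url, callback=self.parse)
        # if the current page is itself a document page, emit an Item for it
        if "sample.asp" in response.url:
            params = urlparse.parse_qs(urlparse.urlparse(response.url).query)
            sample_name = params.get("sample_name", [""])[0]  # assumed parameter
            type_name = params.get("type_name", [""])[0]      # assumed parameter
            if (sample_name, type_name) not in self.seen:
                self.seen.add((sample_name, type_name))
                item = SamplePageItem()
                item["url"] = response.url
                item["sample_name"] = sample_name
                item["type_name"] = type_name
                item["html"] = response.body  # raw page HTML
                yield item
```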

The site consists of two kinds of pages we are interested in - browse.asp serves directory-style pages and sample.asp serves pages representing actual documents. The code above looks for outlinks in each page as it visits it, and if an outlink's URL contains browse.asp or sample.asp, it yields a Request for the crawler to follow. If the page itself was returned by sample.asp, it also saves the page content (along with some metadata) as an Item in the crawler output. We haven't specified a maximum depth to crawl - since a sample is uniquely identified by its sample_name and type_name, we maintain a set of (sample_name, type_name) pairs crawled so far and skip any we have already seen. The crawler is run using the following command:
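With the spider named samples, as in the sketch above, the invocation is along these lines (-t json selects the JSON feed exporter; newer Scrapy versions infer the format from the file extension):

```
scrapy crawl samples -o results.json -t json
```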

This brings back 5,169 items. One issue with results.json is that Scrapy forgets to put in the terminating "]" character - it could be a bug, or something about my environment. In any case, I was unable to parse the file (or view it in Chrome) until I added the terminating "]" myself. The data is a single JSON array, with one object per crawled item.
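Before the downstream steps could read the file, that closing bracket had to be there; a quick patch-and-load along these lines does the trick:

```python
# patch the missing "]" (if needed) and load the crawl results; each element
# is one crawled item carrying the fields defined in items.py
import json

with open("results.json") as f:
    data = f.read().strip()
if not data.endswith("]"):
    data += "]"
items = json.loads(data)
print(len(items))  # number of crawled items
```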

A nice convenience is the Scrapy shell (a customized Python REPL), which lets you test your XPath expressions against live pages. I used it here, as well as later when parsing the HTML files. You can invoke it like so:
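The URL is a placeholder for one of the site's sample.asp pages:

```
scrapy shell "http://www.example.com/sample.asp?sample_name=foo&type_name=bar"
```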

The next step is to extract the HTML from the large JSON file into multiple small files, one document per file. We do this with the simple program below:
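In outline, it looks like this - the output directory name is a placeholder, and the field names match the items.py sketch above:

```python
# split results.json into one HTML file per crawled page, laid out as
# data/<type_name>/<sample_name>.html, and count unique contents via MD5
import hashlib
import json
import os

with open("results.json") as f:
    items = json.load(f)  # assumes the terminating "]" has been added

digests = set()  # MD5 digests of page contents
pages = set()    # unique (type_name, sample_name) pairs

for item in items:
    content = item["html"].encode("utf-8")
    type_name = item["type_name"]
    sample_name = item["sample_name"]
    digests.add(hashlib.md5(content).hexdigest())
    outdir = os.path.join("data", type_name)
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    with open(os.path.join(outdir, sample_name + ".html"), "wb") as out:
        out.write(content)
    pages.add((type_name, sample_name))

print("unique by MD5:", len(digests))
print("unique (type, sample):", len(pages))
```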

The program above reads the crawled data and writes out a directory structure organized as type_name/sample_name. I check for uniqueness of content by computing the MD5 digest of each page's contents, and find that there are 5,117 unique documents. However, because the same document can be reached via different paths, and those copies presumably differ slightly in HTML markup, the actual number of unique documents across type and sample is 4,835.

We then parse the HTML files and the directory metadata back into a flat JSON format, one file per sample. There are only 2,224 unique files, because the same document can be mapped to multiple categories.
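The parsing code is sketched below; the leading regular expression and the disclaimer phrase are placeholders for the site-specific values, and the heuristics themselves are explained in the next paragraph:

```python
# parse each data/<type_name>/<sample_name>.html file, extract the dynamic
# text block, and write one flat JSON file per sample under json/
import json
import os
import re

from scrapy.selector import Selector

TEXT_START = re.compile(r"^Sample \d+:")  # placeholder: pattern that begins most texts
DISCLAIMER_PHRASE = "for informational purposes only"  # placeholder phrase


def extract_text(html):
    """Pull out the dynamic text block using the heuristics described below."""
    sel = Selector(text=html)
    candidates = []
    # heuristic 2: contents of div blocks carrying a text-align style attribute
    candidates.extend(sel.xpath(
        "//div[contains(@style, 'text-align')]//text()").extract())
    # heuristic 3: very long (> 700 character) text blocks
    candidates.extend(t for t in sel.xpath("//text()").extract() if len(t) > 700)
    # heuristics 2 and 3 also pick up the boilerplate disclaimer, so drop it
    candidates = [c.strip() for c in candidates if DISCLAIMER_PHRASE not in c]
    # heuristic 1: prefer the block that starts with the usual leading pattern
    for cand in candidates:
        if TEXT_START.match(cand):
            return cand
    return candidates[0] if candidates else None


samples = {}
for type_name in os.listdir("data"):
    for fname in os.listdir(os.path.join("data", type_name)):
        sample_name = os.path.splitext(fname)[0]
        with open(os.path.join("data", type_name, fname), "rb") as f:
            html = f.read().decode("utf-8")
        rec = samples.setdefault(sample_name, {
            "sample_name": sample_name, "type_names": [], "text": None})
        rec["type_names"].append(type_name)
        if rec["text"] is None:
            rec["text"] = extract_text(html)

if not os.path.isdir("json"):
    os.makedirs("json")
for sample_name, rec in samples.items():
    with open(os.path.join("json", sample_name + ".json"), "w") as out:
        json.dump(rec, out)
```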

As mentioned earlier, the site is completely dynamic and renders its pages from ASP templates. The objective of the code above is to extract the dynamic block of text from the page template. Unfortunately, there does not seem to be a single reliable way of recognizing this block. The three heuristics I used were: check for a regular expression that seems to begin the majority of the texts (in this case the text did not contain line breaks); check the contents of div blocks with a "text-align" style attribute; and finally, check for lines longer than 700 characters. The last two also match the disclaimer text (which is constant across all the pages on the site), so I use a phrase from the disclaimer to eliminate those blocks. For this I use Scrapy's XPath API, as well as some plain old regex hacking. An output file now contains the extracted text along with the sample and type metadata, flattened into a single JSON record.

And that's all I have for today. The next step is to analyze these files - I will share if I learn something new or find something interesting. By the way, if you are wondering about the Pig Latin in the examples, it's deliberate, done to protect the website I was crawling for this work. The actual text for these examples was generated by this tool.
