R-bloggers.com

“Just the text ma’am” – Web Site Content Extraction with XSLT & R

2015-07-09

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

Sometimes you just need the salient text from a web site, often as a first step towards natural language processing (NLP) or classification. There are many ways to achieve this, but XSLT (Extensible Stylesheet Language Transformations) was purpose-built for slicing, dicing and transforming XML (and, by extension, HTML that has been parsed into a document tree), so it can make more sense, and even be speedier, to use XSLT transformations than to write a hefty bit of R (or other language) code.

R has had XSLT processing capabilities in the past. Sxslt and SXalan both provided extensive XSLT/XML processing capabilities, and Carl Boettiger (@cboettig) has resurrected Sxslt on github. However, it has some legacy memory bugs (as does the XML package; these bugs were there long before Carl did his reanimation) and is a bit more heavyweight than I, at least, needed.

Thus, xslt was born. It’s based on libxml2 and libxslt, so it plays nicely with xml2, and it partially wraps xmlwrapp, which is itself a C++ wrapper for libxml2 and libxslt.

The github page for the package has installation instructions (you’ll need to be somewhat adventurous until the package matures a bit), but I wanted to demonstrate the utility before refining it.

Using XSLT in Data Analysis Workflows

At work, we maintain an ever-increasing list of public breaches known as the VERIS Community Database (VCDB). Each breach is a github issue, and in each issue we store links to the news stories and other sources that document or report the breach. Coding breaches is pretty labor-intensive work and we have not really received a ton of volunteers (the “C” in “VCDB” stands for “Community”), so we’ve been looking at ways to at least auto-classify the breaches and get some details from them programmatically. This means that getting just the salient text from these news stories/reports is critical.

With the xslt package, we can use an XSLT transformation (that XSLT file is a bit big, mostly due to my XSLT being rusty) in an rvest/xml2 pipeline to extract just the text.
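To give a feel for what a text-extraction stylesheet does, here is a much smaller XSLT 1.0 sketch. It is not the stylesheet used in this post, just an illustration of the idea. It assumes the page was parsed as HTML (so elements carry no namespace, as with libxml2's HTML parser) and that the text worth keeping lives in paragraph, heading and list-item elements; real news sites will need more rules than this.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Emit plain text rather than markup -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Start at the body so <head> content (title, meta) is skipped -->
  <xsl:template match="/">
    <xsl:apply-templates select="//body"/>
  </xsl:template>

  <!-- Assumption: these elements rarely carry article text, so emit nothing -->
  <xsl:template match="script | style | noscript | nav | header | footer | aside | form"/>

  <!-- Put each non-empty paragraph, heading or list item on its own line,
       with internal runs of whitespace collapsed -->
  <xsl:template match="p | h1 | h2 | h3 | h4 | h5 | h6 | li">
    <xsl:if test="normalize-space(.) != ''">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Suppress stray text that sits outside the block elements above -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

Everything not matched by an explicit rule falls through to XSLT's built-in templates, which simply keep walking down the tree. That is why the stylesheet only has to name what to drop and what to print, and it is also why a production version grows quickly once you start handling site-specific markup.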

Here’s a sample of it in action with apologies for the somewhat large text chunks:

(those are links from three recent breaches posted to VCDB).

Those operations are also pretty fast:

(more benchmarks that exclude the randomness of download speeds will be forthcoming).

Rather than focus on handling tags, attributes and doing some fancy footwork with regular expressions (like all the various readability ports do), you get to focus on the data analysis pipeline, with text that’s pretty clean (you can see it misses some things) and also pretty much ready for LDA or other text analysis.

The xmlwrapp C++ library doesn’t have much functionality beyond the transformation function, so there may not be much more added to this package. There is one extra option—to pass parameters to XSLT transformation scripts—that will be coded up in short order.

If you find a use for xslt (or a bug), drop us a note here or on github.

