2016-12-30

In an earlier post, we introduced the Sentiment Analysis algorithm and showed how easy it was to retrieve the sentiment score from text content through an API call.

In this post, we’ll show how to build a sentiment analysis pipeline that grabs all the links from a web page, extracts the text content from each URL, and then returns the sentiment of each page.

While text content is sometimes easily retrieved through a database query, other times it’s not so simple to extract. For instance, it’s notoriously difficult to retrieve text content from websites, especially when you don’t want everything at a URL, just the text content of specific pages.

For example, if you want to derive the sentiment from specific pages on a website, you can easily spend hours finding an appropriate web scraper, and then weeks labeling data and training a model for sentiment analysis.

Maybe you want to find the sentiment of specific pages of your website to make sure your content is sending the right message. Or, perhaps you are trying to analyze news articles by department and sentiment.

Confused? Check out our handy introduction to sentiment analysis.

No matter what your use case is, we’ll show you how to retrieve all the URLs from a website, extract only the text, and then get the sentiment of each page’s content in just a few API calls.

Step 1: Install the Algorithmia Client

This tutorial is in Python, but it could be built with any of the supported clients, such as JavaScript, Ruby, or Java. See the Python client guide for more information on using the Algorithmia API.

Install the Algorithmia client from PyPi:

pip install algorithmia

You’ll also need a free Algorithmia account, which includes 5,000 free credits a month – more than enough to get started with crawling, extracting, and analyzing web data.

Sign up here, and then grab your API key.

Step 2: Retrieve Links from URL

The first step is to gather our list of links. We’ll use the Get Links algorithm, which takes a URL and returns an array of the links found on the page.

We’ve used the Algorithmia website as an example, and here is a sample of the URLs extracted:

To start using Get Links, replace your_algorithmia_api_key with the key you got in the previous step and change the input to your URL.
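Here’s a minimal sketch of that call. Note the assumptions: the algorithm path "web/GetLinks/0.1.5" (including the version suffix) and the `.pipe(...).result` access pattern reflect the Algorithmia Python client at the time of writing, so check the algorithm’s page for the current version before running, and swap in your own API key.

```python
def get_links(client, url):
    """Ask the Get Links algorithm for every link found at `url`.

    `client` is an Algorithmia client; the version suffix on the algorithm
    path is an assumption -- check the algorithm's page for the latest.
    """
    algo = client.algo("web/GetLinks/0.1.5")
    return algo.pipe(url).result


if __name__ == "__main__":
    import Algorithmia

    # Replace the placeholder with the API key from your account dashboard.
    client = Algorithmia.client("your_algorithmia_api_key")
    links = get_links(client, "http://algorithmia.com")
    print(links)
```

Passing the client in as an argument keeps the helper easy to reuse in the later steps, which all call other algorithms through the same client.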

Step 3: Extract Content from URLs

Next, we’ll pass the links to the URL-2-Text algorithm, which takes the URLs returned from Get Links and then extracts the text content for each.

Since we want only the content returned from the Algorithmia blog, we’ll limit the links used to extract content to ones that start with “http://blog.algorithmia.com” and create a dictionary holding the URL and the content.
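The filtering and extraction described above can be sketched like this. The filter is plain Python; the algorithm path "util/Url2Text/0.1.4" is an assumption, so confirm it on the URL-2-Text algorithm’s page before running.

```python
def filter_blog_links(links, prefix="http://blog.algorithmia.com"):
    """Keep only links under the blog, de-duplicated, in original order."""
    seen = set()
    kept = []
    for link in links:
        if link.startswith(prefix) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


def extract_content(client, links):
    """Fetch each page's text via URL-2-Text and pair it with its URL.

    Returns a list of {"url": ..., "document": ...} dictionaries, the shape
    we'll feed to sentiment analysis in the next step.
    """
    pages = []
    for url in links:
        text = client.algo("util/Url2Text/0.1.4").pipe(url).result
        pages.append({"url": url, "document": text})
    return pages
```

De-duplicating up front also saves credits, since each extracted URL is a separate algorithm call.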

Step 4: Find the Sentiment of Text Content

In our last step, we’re going to pass the extracted text content to our sentiment analysis algorithm. This algorithm takes either a dictionary holding a single document or a list of dictionaries holding many documents; we’ll use the latter. Each document could be a word, a sentence, a paragraph, or even an article.

This algorithm returns a list of dictionaries holding each document’s sentiment and the document itself. This makes it easy to create a new list that holds the URL, content, and the sentiment of each web page that we’ve passed through our algorithms.
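A sketch of that merge step follows. The algorithm path "nlp/SentimentAnalysis/1.0.5" and the {"document": ...} input shape are assumptions based on the client conventions above, so verify both against the algorithm’s page.

```python
def analyze_sentiment(client, pages):
    """Score each page's text and merge the result back with its URL.

    `pages` is a list of {"url": ..., "document": ...} dictionaries, as built
    in the previous step. The algorithm returns one result per document, in
    order, so zip() lines each score up with its source page.
    """
    docs = [{"document": page["document"]} for page in pages]
    results = client.algo("nlp/SentimentAnalysis/1.0.5").pipe(docs).result
    return [
        {"url": page["url"],
         "content": page["document"],
         "sentiment": scored["sentiment"]}
        for page, scored in zip(pages, results)
    ]
```

Sending all the documents in one call, rather than one call per page, keeps the pipeline to a single sentiment request no matter how many links you crawled.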

Here is a sample of the output with some of the text shortened for brevity:

And that’s it! You’ve now built a sentiment analysis pipeline that retrieves the links you want from a URL, extracts the text content from each page, and then finds the sentiment for each document.

Next, it might be fun to get the noun phrases from your text by using the named entities algorithm or utilize other natural language processing techniques, such as profanity detection.

Tools Used:

Get Links

URL-2-Text

Sentiment Analysis

For easy reference, find this code in our Recipes repo, or just copy the whole code snippet:

The post Building a Sentiment Analysis Pipeline for Web Scraping appeared first on Algorithmia.
