Who swears more? Do Twitter users who mention Donald Trump swear more than those who mention Hillary Clinton? Let’s find out by taking a natural language processing (NLP) approach to analyzing tweets.
This walkthrough provides a basic introduction to help developers of all backgrounds and abilities get started with the NLP microservices available on Algorithmia. We’ll show you how to chain them together to perform light analysis on unstructured text. Unfamiliar with NLP? Our gentle introduction to NLP will help you get started.
We know that getting started with a new platform or developer tool is an investment in time and energy. Sometimes it can be hard to find the information you need in order to start exploring on your own. That’s why we’ve centralized all our information in the Algorithmia Developer Center and API Docs, where users will find helpful hints, code snippets, and getting started guides. These guides are designed to help developers integrate algorithms into applications and projects, learn how to host their trained machine learning models, or build their own algorithms for others to use via an API endpoint.
Now, let’s tackle a project using some algorithms to retrieve content, and analyze it using NLP. What better place to start than Twitter, and analyzing our favorite presidential candidates?
Twitter, Trump, and Profanity: An NLP Approach
First, let’s find the Twitter-related algorithms on Algorithmia. Go to the search bar at the top of the navigation and type in “Twitter”.
You’ll get quite a few results. Find the one called Retrieve Tweets with Keyword and check out its algorithm page, which lists the algorithm’s description, pricing, and permissions.
If you are interested in learning more about the basics of the platform, including the algorithm profile page, visit the Developer Center’s Basic Guides section.
The algorithm description provides information about the input and output data structures expected, as well as the details regarding any other requirements. For instance, Retrieve Tweets with Keyword requires your Twitter API authentication keys.
At the bottom of every algorithm page, we provide code samples showing the input, the output, and how to call the algorithm in Python, Rust, Ruby, JavaScript, NodeJS, cURL, CLI, Java, or Scala. If you have questions about the details of using the Algorithmia API, check out the API docs.
Alright, let’s get started!
Here’s the overall structure of our project:
You’ll need a free Algorithmia account to complete this project. Sign up for free and receive an extra 10,000 credits. Overall, the project will process around 700 tweets, with emoticons and other special characters stripped out. This means that a tweet containing only URLs and emoticons won’t be analyzed. Once we pull our data from the Twitter API, we’ll clean it up with some regex, remove stop words, and then find our swear words.
Step One: Retrieve Tweets by Keyword
We’ll start with the Retrieve Tweets with Keyword algorithm, which queries tweets from the Twitter Search API.
Okay, let’s go over the main parts of the code snippet. This algorithm takes a nested dictionary called ‘input’ that contains the keys ‘query’, ‘numTweets’, and ‘auth’, the last of which is itself a dictionary. The value of ‘query’ comes from a global variable called q_input, which holds the command-line argument passed when executing the script. In our case, it will hold the name of a presidential nominee. The key ‘numTweets’ is set to the number of tweets you want to extract, and the dictionary ‘auth’ holds the Twitter authentication keys and tokens that you got from Twitter.
As you write the pull_tweets() function, pay attention to the line that sets the variable ‘client’ to ‘Algorithmia.client(algorithmia_api_key)’. This is where you pass in the API key that you were assigned when you signed up for an account with Algorithmia. If you don’t recall where to find it, look in the Credentials section of the My Profile page.
Next, notice the variable ‘algo’. This is where we pass in the path to the algorithm we’re using. Each algorithm’s documentation gives you the appropriate path in the code examples section at the bottom of the algorithm page.
Last, the list ‘tweet_list’ holds our data. It is built with a list comprehension that loops over algo.pipe(input).result, the call that passes our input variable to the algorithm.
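As a rough illustration of the input structure described here, the following Python sketch assembles the nested dictionary. The helper names (query_from_argv, build_input), the key names inside ‘auth’, and the placeholder credentials are assumptions made for this example; check the algorithm page for the exact fields it expects.

```python
def query_from_argv(argv):
    """Return the search query passed as the first command-line argument."""
    if len(argv) < 2:
        raise SystemExit("usage: python <script>.py '<query>'")
    return argv[1]


def build_input(query, num_tweets, auth):
    """Assemble the nested dictionary with 'query', 'numTweets' and 'auth'."""
    return {"query": query, "numTweets": num_tweets, "auth": auth}


# Placeholder credentials. The key names inside 'auth' are assumed here,
# not taken from the algorithm's documentation.
example_auth = {
    "app_key": "YOUR_TWITTER_APP_KEY",
    "app_secret": "YOUR_TWITTER_APP_SECRET",
    "oauth_token": "YOUR_TWITTER_OAUTH_TOKEN",
    "oauth_token_secret": "YOUR_TWITTER_OAUTH_TOKEN_SECRET",
}

# In a real script you would pass sys.argv; a literal list is used here
# so the example runs on its own.
q_input = query_from_argv(["script.py", "Donald Trump OR Trump"])
tweet_input = build_input(q_input, 700, example_auth)
```

Keeping the dictionary construction in its own function makes it easy to inspect the input before spending credits on an API call.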
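One possible shape for pull_tweets() is sketched below. Unlike the function described above, this version receives the client and the algorithm path as parameters so it can be tried without network access. It also assumes that each item in the result is a dictionary with a ‘text’ field, which may not match the algorithm’s actual output.

```python
def pull_tweets(client, algo_path, tweet_input):
    """Call the tweet-retrieval algorithm and return a list of tweet texts.

    client     -- an object created with Algorithmia.client(api_key)
    algo_path  -- the path copied from the bottom of the algorithm page
    """
    algo = client.algo(algo_path)
    result = algo.pipe(tweet_input).result
    # Assumption: every item in the result is a dict with a 'text' key.
    tweet_list = [tweet["text"] for tweet in result]
    return tweet_list


# Typical usage (requires the Algorithmia package and a valid API key):
#   import Algorithmia
#   client = Algorithmia.client(algorithmia_api_key)
#   tweets = pull_tweets(client, "<path from the algorithm page>", tweet_input)
```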
Now, you simply write your data to a CSV file that is named after your query. Note: if your query is a space-separated string, the script joins the words with dashes.
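The file-writing step might look something like the sketch below. The function names, the one-tweet-per-row layout, and the default ‘data’ folder are assumptions made for this example.

```python
import csv
import os


def csv_name(query):
    """Turn a space-separated query into a dash-joined CSV file name."""
    return "-".join(query.split()) + ".csv"


def write_tweets(tweet_list, query, folder="data"):
    """Write one tweet per row to <folder>/<query-with-dashes>.csv."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, csv_name(query))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for tweet in tweet_list:
            writer.writerow([tweet])
    return path
```

For example, csv_name("Donald Trump OR Trump") yields "Donald-Trump-OR-Trump.csv".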
Step Two: Collecting Data
It’s time to call our script with the query ‘Donald Trump OR Trump’, which will grab tweets containing the terms ‘Donald Trump’ or ‘Trump’ and then write a file called ‘Donald-Trump-OR-Trump.csv’ to your data folder.
Try running the script again, but this time passing in ‘Hillary Clinton OR Hillary’ as the query.
With both CSV files in our data folder, we can now create a script called profanity_analysis.py.
Step Three: Data Preprocessing
In this next script, we’ll first clean up our dirty data by getting rid of emoticons, hashtags, RTs, etc. Then, we’ll explore the English stop words and profanity algorithms.
Our first step in cleaning up the data is to use some regex to remove emoticons and numbers. Then, we call the Retrieve Stop Words algorithm to further scrub our data. This also helps the Profanity algorithm run a little faster, since it doesn’t have to parse through all the common English words that provide no value.
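A minimal sketch of this preprocessing step follows. The regular expressions are illustrative rather than the exact patterns used in the original script. Removing @mentions is an added assumption, and the stop-word list is passed in as a plain Python list, which in practice would come from the stop words algorithm.

```python
import re


def clean_tweet(text):
    """Strip URLs, RT markers, hashtags, mentions, numbers and emoticons."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\bRT\b", " ", text)         # retweet markers
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s']", " ", text)   # digits, emoticons, symbols
    tokens = [word.strip("'") for word in text.lower().split()]
    return [word for word in tokens if word]


def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop-word list."""
    stop_set = {word.lower() for word in stop_words}
    return [word for word in words if word not in stop_set]
```

Note that the last pattern keeps only ASCII letters, so this sketch is suited to English-language tweets.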
That’s it for cleaning up our tweets!
Step Four: Checking Tweets for Profanity
Now, we’ll check out the Profanity Detection algorithm and discover the swear words in our tweets. This algorithm does a basic string match against a list of around 340 words from noswearing.com. Check out the Profanity algorithm page to learn more about the details of the algorithm and how you can customize the word list by adding your own offensive words, since fun, new offensive colloquialisms are added to the English language every day. Don’t believe us? Just check out Urban Dictionary for some new favorites that have popped up.
The profanity function is fairly straightforward.
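To give a feel for what a basic string match with a customizable word list involves, here is a small local stand-in. It is not the hosted algorithm’s implementation: the function name and the mild placeholder words are invented for this example, and it matches whole tokens only, whereas the hosted algorithm may behave differently.

```python
from collections import Counter


def count_matches(words, word_list, extra_words=()):
    """Count how often each listed word appears among the given tokens."""
    targets = {w.lower() for w in word_list} | {w.lower() for w in extra_words}
    return dict(Counter(w.lower() for w in words if w.lower() in targets))


# Mild placeholder words stand in for a real profanity list.
tokens = ["darn", "heck", "darn", "nice"]
matches = count_matches(tokens, ["darn"], extra_words=["heck"])
```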
You’re simply passing in the list of words that have been cleaned of English stop words. We’ve joined them into a single corpus since we’re interested in the total profanity of all the tweets in our data, rather than the profanity of each tweet. Our function profanity() prints out both the result of the algorithm and the total number of swear words. At the time of this writing, the query ‘Donald Trump OR Trump’ returned 30 swear words and ‘Hillary Clinton OR Clinton’ returned 8.
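A profanity() function along these lines could be sketched as follows. The client and algorithm path are passed in as parameters here, and the sketch assumes the algorithm returns a dictionary mapping each swear word it found to a count. Verify both against the Profanity algorithm page before relying on it.

```python
def profanity(client, algo_path, words):
    """Join the cleaned words into one corpus and report the profanity found."""
    corpus = " ".join(words)
    result = client.algo(algo_path).pipe(corpus).result
    # Assumption: result looks like {"word": count, ...}.
    total = sum(result.values())
    print(result)
    print("Total swear words:", total)
    return total


# Typical usage (requires the Algorithmia package and a valid API key):
#   import Algorithmia
#   client = Algorithmia.client(algorithmia_api_key)
#   profanity(client, "<path from the algorithm page>", cleaned_words)
```

Returning the total as well as printing it makes it easier to compare the two queries programmatically.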