Blog.kaggle.com

February 2016: Scripts of the Week

2016-04-04

February's batch of Scripts of the Week highlights some of the month's best content produced by Kagglers on our public datasets. It also includes a great getting-started script that predicts outcomes of the 2016 NCAA basketball tournaments for March Machine Learning Mania 2016. Stay tuned for the following:

A prediction of fine food review sentiment comparing the performance of three classification algorithms. (The winner may surprise you.)

A simple, but compelling visualization about the status of women's rights in the world.

A teacher’s getting-started script that introduces his students to the basics of building and testing regression models.

Several tips on how to streamline your analysis. Spoiler: use the right tool for the job at hand and think collaboratively!

February 5: Building a Prediction Model

Created by: Guillaume Payen
Public Dataset: Amazon Fine Food Reviews
Language: Python

What motivated you to create this script?

Actually, I'm quite new to Kaggle, and before entering into the jungle of competitions, I wanted to train on datasets. I believe that training on datasets is a good way to start on Kaggle: there is no deadline, no competition.

I chose this dataset because it was a natural fit for sentiment analysis, which is a classic text-mining exercise. Using sentiment analysis for prediction was a great challenge for me. The other advantage of using this dataset is that it could generate results that are quite easy to understand, even for people who know nothing about data science.

I really wanted to work on a prediction model, and the fact is that for this dataset, Kagglers had built many exploratory analyses, but no prediction models. So it was a good opportunity for me, and I'm glad I could take my analysis through to significant results.

What did you learn from the code/output?

The dataset contained about 500K comments, and after cleaning the data by stemming and normalizing the comments, the final model was built on about 30K words. Computing TF-IDF on 500K documents and about 30K variables generated a huge sparse matrix. Fortunately, Scikit-learn did an awesome job of storing the information in a format that keeps only the nonzero occurrences.
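To make the sparse-matrix point concrete, here is a minimal sketch of TF-IDF with Scikit-learn. The four toy reviews are invented for illustration and the default vectorizer settings are an assumption; the original script's cleaning and stemming steps are not reproduced here.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for review texts (assumption: not taken from the real dataset).
reviews = [
    "great coffee, rich taste",
    "terrible taste, stale coffee",
    "my dog loves these treats",
    "treats arrived stale and broken",
]

# fit_transform learns the vocabulary and returns a SciPy sparse matrix
# with one row per review and one column per term.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)

n_docs, n_terms = tfidf.shape
stored = tfidf.nnz  # number of explicitly stored (nonzero) entries
density = stored / (n_docs * n_terms)

print(f"{n_docs} documents x {n_terms} terms")
print(f"{stored} stored values, density {density:.0%}")
```

Only the nonzero weights are stored, so memory grows with the number of word occurrences rather than with documents times vocabulary size. That difference is what matters at the scale of 500K documents and 30K terms.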

At first, I tried to perform the same analysis with R, but the process was sooooo long, and I got stuck with the sparse matrix. I could not find a way around this, so I switched to Python. The analysis was much faster, and Scikit-learn was just brilliant! I realized then that I should not get fixated on a tool: if one does not work, I should try another. The result is what matters!

The results of the analysis are good, though not outstanding. I believe that I could spend some more time optimizing the data cleaning. For instance, I realized that removing stopwords was a bad choice: it basically lowered the accuracy of the results. I also use stemming, but if the stemming is too strong, it might also have an impact on the results.

What can more novice data scientists learn from your script/output?

After reading a bit on the Internet, I found that, basically, sentiment prediction is said to work best with Bayesian models. The fact is that a logistic regression brought better results. It does not mean that what I had read was wrong, only that for THIS dataset, there was another method that could produce better results.

My advice to more novice Kagglers is that you should never rely on a single model to make predictions on a dataset. Your instinct could tell you that you already know which method is going to work, but you can never be sure. Of course, you must select a set of methods that are suited to your data. Work on them, and use evaluation methods to compare the results provided by each model. The ROC curve is one example, but there are others.
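The advice above can be sketched as a small side-by-side comparison. This is an illustration under stated assumptions, not the script itself: the reviews are synthetic, "Bayesian model" is taken to mean multinomial naive Bayes, and the area under the ROC curve stands in for the full curve.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic reviews (assumption: invented word lists, not the real data).
rng = random.Random(0)
positive_words = ["great", "tasty", "love", "fresh", "perfect"]
negative_words = ["stale", "awful", "bland", "broken", "refund"]
neutral_words = ["coffee", "tea", "snack", "box", "order"]


def make_review(sentiment_words):
    """Build a five-word review from two sentiment words and three neutral ones."""
    words = rng.choices(sentiment_words, k=2) + rng.choices(neutral_words, k=3)
    rng.shuffle(words)
    return " ".join(words)


texts = [make_review(positive_words) for _ in range(40)]
texts += [make_review(negative_words) for _ in range(40)]
labels = [1] * 40 + [0] * 40

# Hold out a stratified test set so both classes appear in it.
train_text, test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

models = {
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    auc_by_model[name] = roc_auc_score(y_test, scores)
    print(f"{name}: ROC AUC = {auc_by_model[name]:.3f}")
```

The toy data is trivially separable, so the scores say nothing about which model wins on the real reviews. The point is the pattern: same split, same features, one metric computed for every candidate.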

See the code on Scripts

February 12: Gender Equality

Created by: Rob Harrand aka tentotheminus9
Public Datasets: World Development Indicators
Language: R

What motivated you to create this script?

My motivation for all things Kaggle is to learn. I'm relatively new to data science and R, and I find the datasets to be a far more interesting way of learning than the dry, academic tutorials that are often found in books. I was motivated by the subject of gender equality as I'm a keen supporter of a couple of charities that work in this area. Women's rights and status in the world were brought into sharp focus for me just over a year ago when my daughter was born, and I'm often shocked by some of the archaic attitudes around the world.

What did you learn from the code/output?

The code allowed me to play around with maps and animations, which is something I've recently discovered in R and have become slightly obsessed with. The output demonstrates how different the world can be when it comes to gender equality, and how attitudes and opinions that seem like something from the ancient past are still rife.

What can more novice data scientists learn from your script/output?

My script basically produces two plots, but I tried to make it slightly more interesting by fading one into the other. I'm finding that sometimes the difference between something interesting and something 'cool' can be quite small and trivial such as this. I think that including some element of movement in a visualisation can work really well, and goes a long way to grabbing the attention of the viewer.

See the code on Scripts

February 19: March Mania - Getting Started

Created by: Jared Cross
Active Competition: March Machine Learning Mania 2016
Language: RMarkdown

What motivated you to create this script?

This script was written for the students in my Probability and Statistics class who are participating in March Machine Learning Mania 2016. They've used R to wrangle and visualize data (using dplyr and ggplot2) and we're using this project to learn to build and test regression models. I'm hoping that this code helps them get familiar with the available data sets and see how they can join them to each other. The code also shows them how to create a simple linear model and use it to create a .csv of predictions that they can submit to Kaggle. I'm hoping to create additional scripts in the next couple of weeks that introduce logistic regression and cross validation among other things.
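The workflow described here (join the tables, fit a simple linear model, write a .csv of predictions) might look roughly like this. The original script is in RMarkdown; this sketch uses Python with pandas and NumPy instead, and every table, column name, team ID and the Id/Pred layout is a hypothetical stand-in rather than the competition's actual files.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy tables (assumption: not the competition's real data files).
seeds = pd.DataFrame({"team": [1101, 1102, 1103, 1104], "seed": [1, 4, 8, 16]})
games = pd.DataFrame({
    "team_a": [1101, 1101, 1102, 1102, 1103, 1101],
    "team_b": [1102, 1103, 1103, 1104, 1104, 1104],
    "a_won": [1, 1, 0, 1, 1, 1],  # 1 if team_a won the game
})


def with_seeds(pairs):
    """Join each team's seed onto a table of matchups and add the seed gap."""
    joined = (
        pairs
        .merge(seeds.rename(columns={"team": "team_a", "seed": "seed_a"}), on="team_a")
        .merge(seeds.rename(columns={"team": "team_b", "seed": "seed_b"}), on="team_b")
    )
    joined["seed_diff"] = joined["seed_b"] - joined["seed_a"]
    return joined


# Fit a straight line: win indicator as a function of seed difference.
train = with_seeds(games)
slope, intercept = np.polyfit(train["seed_diff"], train["a_won"], 1)

# Predict every possible pairing and clip the line's output into [0, 1].
matchups = pd.DataFrame(list(combinations(seeds["team"], 2)), columns=["team_a", "team_b"])
matchups = with_seeds(matchups)
matchups["Pred"] = np.clip(intercept + slope * matchups["seed_diff"], 0.0, 1.0)
matchups["Id"] = (
    "2016_" + matchups["team_a"].astype(str) + "_" + matchups["team_b"].astype(str)
)

submission = matchups[["Id", "Pred"]]
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

A straight line is a crude model of a win probability, which is why the output has to be clipped. Logistic regression, mentioned as a planned follow-up, avoids that problem.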

February 26: Explore SF Salary Data

Created by: Michael Griffiths aka msjgriffiths
Public Dataset: SF Salaries
Language: RMarkdown

What motivated you to create this script?

In my case, most of the data I interact with on a day to day basis is broadly the same. I don't often have the opportunity to dive into a totally fresh dataset.

I really wanted to get more practice with EDA (exploratory data analysis), and the SF Salary dataset looked like an interesting, self-contained project that wouldn't take that much time to examine (sometimes, you can go down the rabbit hole...).

What did you learn from the code/output?

I learned quite a bit about San Francisco Salaries!

More seriously, I learned that I need to get a lot better at formatting the code for output. Other scripts that use datatable() are much easier to read through. I also learned that I need a more systematic approach to EDA - I don't think all of the steps I took made the most sense, and in at least one case I missed an insight only to discover it later (the fact that only 2014 employees were classified into PT/FT).

What can more novice data scientists learn from your script/output?

Hopefully, what mistakes to avoid!

More broadly, if I summarise my approach, I think there are a few things. First, check your inputs - make sure what you've loaded in is what you think it is!

Second, try to explain all variation in the dataset across every dimension. That can get really hard to scale, but not conditioning on dimensions can hide useful variation.

Third, use the best tool for the job - and in the case of EDA, "best" is frequently synonymous with "fastest." I re-used mannychapman's great script SF Salaries by JobType to categorize a sizable fraction of the jobs instead of doing it myself; and instead of converting the logic to R, I just issued the SQL query. That saved me quite a lot of time!

Finally, the end point of EDA should be some initial conclusions, and also avenues for further analysis. If someone was going to build on your work, what should they look at?
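The first two tips can be sketched in a few lines of pandas. The original script is in RMarkdown, so this is a Python illustration only, and the rows and column names (`Year`, `Status`, `TotalPay`) are assumptions made up to mimic the PT/FT discovery mentioned earlier.

```python
import pandas as pd

# Hypothetical rows (assumption: invented values, not the real SF Salaries data).
salaries = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013, 2014, 2014, 2014],
    "Status": [None, None, None, None, "FT", "PT", "FT"],
    "TotalPay": [91000.0, 45000.0, 98000.0, 52000.0, 104000.0, 31000.0, 99000.0],
})

# Tip 1: check the inputs. Are the types and missing-value counts what you expect?
print(salaries.dtypes)
print(salaries.isna().sum())

# Tip 2: condition on a dimension. The share of rows with a known Status,
# computed per year, shows at a glance which years carry the PT/FT label.
status_coverage = salaries["Status"].notna().groupby(salaries["Year"]).mean()
print(status_coverage)

# Splitting pay by year and status keeps the unlabeled rows visible
# instead of silently dropping them.
median_pay = (
    salaries
    .assign(Status=salaries["Status"].fillna("unknown"))
    .groupby(["Year", "Status"])["TotalPay"]
    .median()
)
print(median_pay)
```

Run early, a per-year coverage check like this surfaces the kind of gap that is otherwise easy to miss until much later in the analysis.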

See the code on Scripts
