2015-01-05

On December 3rd, Stanford's Data Mining & Analysis course (STATS202) wrapped up a heated Kaggle InClass competition, "Getting a Handel on Data Science". Beating out 92 other teams, "TowerProperty" came in first place. Below, TowerProperty outlines the competition and their journey to the top of the leaderboard.

Kaggle InClass is provided free of charge to academics as a statistical and data mining learning tool for students. Instructors from any course dealing with data analysis may get involved!



Screenshot of the interactive map we made for exploratory analysis

Team TowerProperty

The three of us met as students enrolled in the Data Mining & Analysis graduate course at Stanford University. We’re all working professionals and took the course for credit through Stanford’s Center for Professional Development.

What was the Stanford Stats 202 InClass challenge?

We were given a database of several thousand accounts with anonymized information about orchestra patrons and their history of subscriptions and single ticket purchases. The challenge was to predict who would subscribe to the 2014-2015 season and at what subscription level. We were given the output variable (total subscriptions) for a subset of the accounts in order to train our model.

What processing methods did you use?

From the very beginning, our top priority was to develop useful features. Knowing that we would learn more powerful statistical learning methods as our Stanford course progressed, we made sure that we had the features ready so we would be able to apply various models to them quickly and easily. Broadly, the features we developed are grouped as follows:

Artistic preferences – We parsed the historical concert data for the conductor and the artists performing at each concert, as well as the composer whose work was being performed. We then compared this to the number of subscriptions bought that year by each account. Our goal was to see if we could classify accounts as Bach, Haydn, or Vivaldi lovers and use that for predictions (perhaps Bach lovers will be more likely to buy subscriptions this year as well because there is another Bach concert in the performance schedule).

Spatial features – We computed geodesic distances from each account to the orchestra’s venue locations. Even though we were not given street addresses for each account, we were able to approximate distance by assigning latitude and longitude coordinates to each account according to the centroid of its billing zip code (a minimal sketch of this computation appears at the end of this section).

Derivative features – Using Hadley Wickham’s tidy data methods and R packages, we prepared a large set of predictors based on price levels, the number of tickets bought by an account in a given year, and so on. We also smoothed some features to avoid over-fitting. For most of the competition, we used predictors from the last few years, but in the winning model we actually used only the last two years plus values averaged over the last five years.

When we later applied the boosted decision tree model, we derived additional predictors that expressed the variance in the number of subscriptions bought – theorizing that the trees would more easily separate “stable” accounts from “unstable” ones.

We created 277 features in the end, which we applied in different combinations. Surprisingly, our final model used only 18 of them.
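As an illustration of the spatial features described above, here is a minimal sketch of the distance computation. It assumes a hypothetical zip-to-centroid lookup table (zip_centroids, with columns zip, lon, and lat) and hypothetical column names on the accounts table, and it uses the geosphere package for the geodesic distance; the venue coordinates are placeholders.

    library(geosphere)

    venue <- c(lon = -122.17, lat = 37.43)   # placeholder venue coordinates

    # Look up the centroid of each account's billing zip code (match() keeps
    # the original record order), then compute the great-circle distance from
    # that centroid to the venue, in kilometers.
    idx <- match(accounts$billing.zip, zip_centroids$zip)
    accounts$lon <- zip_centroids$lon[idx]
    accounts$lat <- zip_centroids$lat[idx]

    # Accounts whose zip has no centroid get NA here (part of the
    # "missingness" we mention later).
    accounts$dist.from.venue.km <-
      distHaversine(as.matrix(accounts[, c("lon", "lat")]), venue) / 1000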

What supervised learning methods did you use?

Most importantly – and from the very beginning – we used 10-fold cross-validation error as the metric to compare different learning methods and for optimization within models.
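To make that concrete, here is a minimal sketch of the kind of 10-fold cross-validation scorer we relied on; the data frame train_data and the response column total.2014 are hypothetical names used only for illustration.

    # Minimal 10-fold CV scorer: split the rows into k folds, fit on k-1 folds,
    # predict the held-out fold, and average the per-fold RMSE.
    cv_rmse <- function(data, fit_fun, predict_fun,
                        response = "total.2014", k = 10, seed = NULL) {
      if (!is.null(seed)) set.seed(seed)   # fixed seed = same folds for every model
      folds <- sample(rep(seq_len(k), length.out = nrow(data)))
      fold_err <- sapply(seq_len(k), function(i) {
        train <- data[folds != i, ]
        test  <- data[folds == i, ]
        model <- fit_fun(train)
        preds <- predict_fun(model, test)
        sqrt(mean((test[[response]] - preds)^2))
      })
      mean(fold_err)
    }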

We started with multiple linear regression models. These simple models helped us become familiar with the data while we concentrated our initial efforts on preparing features for later use.
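For example, a baseline linear model can be scored with the same routine (again, all names are hypothetical):

    # Score an ordinary least-squares baseline over all available features;
    # the fixed seed keeps the folds identical across the models we compare.
    lm_cv <- cv_rmse(
      train_data,
      fit_fun     = function(d) lm(total.2014 ~ ., data = d),
      predict_fun = function(m, d) predict(m, newdata = d),
      seed        = 202
    )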

We then applied and tested the following methods:

support vector machines (SVM),

Bayesian Additive Regression Trees (BART),

K-Nearest Neighbors (KNN), and

boosted regression trees.

We didn’t have much luck with SVM, BART, or KNN. Perhaps we did not put enough effort into them, but since we already had very good results from boosted trees, the bar was quite high. Our biggest effort soon turned to tuning the boosted regression tree model’s parameters.

Using cross-validation error, we tuned the following parameters: the number of trees, the bagging fraction, the shrinkage, and the tree depth. We then tuned the minimum-observations-per-node parameter – we saw significant improvements when adjusting it downwards from its default setting.
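For concreteness, assuming R's gbm package (where these knobs appear as n.trees, interaction.depth, shrinkage, bag.fraction, and n.minobsinnode), a single tuning candidate could be scored with the cross-validation routine sketched earlier. The values below are illustrative, not our final settings.

    library(gbm)

    # One candidate configuration, scored with the 10-fold CV scorer from above
    gbm_cv <- cv_rmse(
      train_data,
      fit_fun = function(d) gbm(
        total.2014 ~ .,                # hypothetical response name
        data              = d,
        distribution      = "gaussian",
        n.trees           = 3000,
        interaction.depth = 4,
        shrinkage         = 0.01,
        bag.fraction      = 0.5,
        n.minobsinnode    = 5          # pushed below the default, as described above
      ),
      predict_fun = function(m, d) predict(m, newdata = d, n.trees = 3000),
      seed        = 202
    )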

Our tuning process was both manual and automated. We wrote R scripts that randomly changed the parameters and the set of predictors and then computed the 10-fold cross-validation error on each permutation. But these scripts were usually used only as a guide for approaches that we then investigated further by hand. We used this as a kind of modified forward selection process.
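In spirit, those scripts looked something like the sketch below: draw a random predictor subset and a random parameter setting, score the combination with 10-fold CV, and log the result. All names and parameter ranges here are hypothetical.

    # One random trial: random predictor subset + random gbm parameters,
    # scored with the 10-fold CV routine defined earlier.
    candidates <- setdiff(colnames(train_data), "total.2014")

    random_trial <- function() {
      preds <- sample(candidates, size = sample(10:30, 1))
      pars  <- list(n.trees           = sample(c(1000, 2000, 3000), 1),
                    interaction.depth = sample(2:6, 1),
                    shrinkage         = runif(1, 0.005, 0.05),
                    n.minobsinnode    = sample(3:10, 1))
      form <- reformulate(preds, response = "total.2014")
      err <- cv_rmse(
        train_data[, c("total.2014", preds)],
        fit_fun     = function(d) gbm(form, data = d, distribution = "gaussian",
                                      n.trees = pars$n.trees,
                                      interaction.depth = pars$interaction.depth,
                                      shrinkage = pars$shrinkage,
                                      n.minobsinnode = pars$n.minobsinnode),
        predict_fun = function(m, d) predict(m, newdata = d, n.trees = pars$n.trees)
      )
      data.frame(pars, n.predictors = length(preds), cv.rmse = err)
    }

    results <- do.call(rbind, replicate(100, random_trial(), simplify = FALSE))
    results[order(results$cv.rmse), ][1:5, ]   # best trials guide the manual search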



In the end, the model that gave us the lowest 10-fold CV error and won the competition was surprisingly simple and small: it used only 18 predictors.



How did you choose your final submission?

The rule we used to choose our final two submissions was simple: select the one that produced the lowest public Kaggle score, and the one that gave us the lowest 10-fold cross-validation error.

This actually turned out to be somewhat of a tough decision, because we had other submissions with better public Kaggle scores. Nevertheless, we decided to trust our 10-fold cross-validation, and it turned out to be the right choice. (It was also a tough call to choose the model that did not use all of the additional predictors we had spent so much time creating!)

In the end, the more complicated model – the one that produced our best public score – was beaten by the simpler one in the final standings.

What was your most important insight into the data?

Probably our most important insight was that there was a pattern to the order of the data we received. In other words, even though the data was anonymized, it still retained useful information in its record sequence.

We noticed this by analyzing the accounts that were buying subscriptions and changing the level of subscriptions over the years, as well as those that bought last year. We observed that the distribution of these values in the file provided was far from random. In fact, the distribution seemed to have a temporal pattern.

The non-random nature of the data file also showed up in the “missingness” of some spatial features, which was easy to see when we plotted them.

We felt that decision trees are naturally good at finding the right splits, which can be useful for separating some accounts from the others. So, based on that, we started including “account.num” as a predictor, and we discovered that it did in fact improve our cross-validation error significantly. In the winning model, this simple predictor actually turned out to be the fourth most important!
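As a rough sketch of that step, assuming the same gbm setup as above and using the row order of the training file as a stand-in for the record identifier:

    # Add the record-order variable and look at gbm's relative-influence table.
    # In the real data this identifier came straight from the file; here the
    # row order stands in for it.
    train_data$account.num <- seq_len(nrow(train_data))

    fit_order <- gbm(total.2014 ~ ., data = train_data, distribution = "gaussian",
                     n.trees = 3000, interaction.depth = 4, shrinkage = 0.01,
                     n.minobsinnode = 5)

    summary(fit_order)   # relative influence per predictor; in our winning model
                         # account.num ranked fourth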

Were you surprised by any of your insights?

As the competition progressed, we realized that we were probably using too many predictors. This was somewhat of a surprise to us newbies. We discovered that it was often beneficial to drop a predictor that had high importance in the splits!

We first noticed this phenomenon with our spatial predictors like “billing.city” and “distance-from-venue.” They were always getting chosen as very influential … and yet they were consistently increasing the final CV error. Once we detected this phenomenon, we grew more and more suspicious of any predictor that we were using.

Similarly, it turned out that the more aggressive we were in eliminating older predictors, the smaller the error. In our final model, we used only subscriptions for the 2013-14 and 2012-13 seasons plus values averaged over the last five years.

It turned out that simplifying the model (by removing predictors) actually resulted in substantial improvements.
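A rough sketch of the drop-one check behind this insight, reusing the hypothetical names from the earlier snippets:

    # Remove one predictor at a time, re-score with 10-fold CV, and flag any
    # removal that lowers the error relative to the full predictor set.
    current <- setdiff(colnames(train_data), "total.2014")

    score_subset <- function(preds) {
      form <- reformulate(preds, response = "total.2014")
      cv_rmse(
        train_data[, c("total.2014", preds)],
        fit_fun     = function(d) gbm(form, data = d, distribution = "gaussian",
                                      n.trees = 2000, interaction.depth = 4,
                                      shrinkage = 0.01, n.minobsinnode = 5),
        predict_fun = function(m, d) predict(m, newdata = d, n.trees = 2000),
        seed        = 202   # same folds for every subset, so scores are comparable
      )
    }

    baseline  <- score_subset(current)
    drop_gain <- sapply(current, function(p) baseline - score_subset(setdiff(current, p)))
    sort(drop_gain, decreasing = TRUE)   # positive values: dropping that predictor helps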

What have you taken away from this competition?

Keep it simple and build a proper cross-validation approach into your work.

What tools did you use?

We used GitHub to host the code and collaborate (you can take a look at it here). This was especially important as our team members lived in distant cities. Our code was written entirely in R (although we sometimes used Excel to derive new predictors).

About team “TowerProperty”:

Michal Lewowski lives in the Seattle area and works on search engines at Microsoft.

Matt Schumwinger lives in Milwaukee, Wisconsin and is a data consultant at Big Lake Data.

Kartik Trasi lives in the Seattle area and works on big data analytics platforms for the telecommunications industry at IBM.
