The Home Depot Product Search Relevance competition challenged Kagglers to predict the relevance of product search results. Over 2000 teams with 2553 players flexed their natural language processing skills in attempts to feature engineer a path to the top of the leaderboard. In this interview, the second place winners, Thomas (Justfor), Sean (sjv), Qingchen, and Nima, describe their approach and how diversity in features brought incremental improvements to their solution.
What was your background prior to entering this challenge?
Thomas is a pharmacist with a PhD in Informatics and Pharmaceutical Analytics who works in Quality in the pharmaceutical industry. On Kaggle he has joined earlier competitions and received the Script of the Week award.
Sean is an undergraduate student in computer science and mathematics at the Massachusetts Institute of Technology (MIT).
Qingchen is a data scientist at ORTEC Consulting and a PhD researcher at the Amsterdam Business School. He has experience competing on Kaggle, but this was his first competition related to natural language processing.
Nima is a PhD candidate at the Lassonde School of Engineering at York University, focusing on research in data mining and machine learning. He also has experience competing on Kaggle, but until now had focused on other types of competitions.
Between the four of us, we have quite a bit of experience with Kaggle competitions and machine learning, but little experience in natural language processing.
What made you decide to enter this competition?
For all of us, the primary reason was that we wanted to learn more about natural language processing (NLP) and information retrieval (IR). This competition turned out to be great for that, especially in providing practical experience.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
All of us have strong theoretical experience with machine learning in general, and it naturally helps with the understanding and implementation of NLP and IR methods. However, none of us have had any real experience in this domain.
Let's get technical
What preprocessing and supervised learning methods did you use?
The key to this competition was mostly preprocessing and feature engineering as the primary data is text. Our processed text features can broadly be grouped into a few categories: categorical features, counting features, co-occurrence features, semantic features, and statistical features.
Categorical features: Put words in categories such as colors, units, brands, core. Count the number of those words in the query and in the title, and count the number of words shared by query and title for each category.
Counting features: Length of query, number of common n-grams between query and title, Jaccard similarity, etc.
Co-occurrence features: Measures of how frequently words appear together. e.g., Latent Semantic Analysis (LSA).
Semantic features: Measure how similar the meaning of two words is.
Statistical features: Compare queries with unknown score to queries with known relevance score.
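To make the counting and categorical features above more concrete, here is a minimal sketch of how such features could be computed for a single query/title pair. It is not the team's actual code: the tokenizer, the function names, the feature names, and the tiny COLORS vocabulary are all hypothetical and chosen only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical category vocabulary, for illustration only.
COLORS = {"white", "black", "red", "blue"}

def pair_features(query, title):
    """Counting and categorical features for one query/title pair."""
    q, t = tokenize(query), tokenize(title)
    q_uni, t_uni = ngrams(q, 1), ngrams(t, 1)
    q_bi, t_bi = ngrams(q, 2), ngrams(t, 2)
    q_colors = COLORS & set(q)
    t_colors = COLORS & set(t)
    return {
        "query_len": len(q),
        "common_unigrams": len(q_uni & t_uni),
        "common_bigrams": len(q_bi & t_bi),
        "jaccard_unigrams": jaccard(q_uni, t_uni),
        "query_colors": len(q_colors),
        "title_colors": len(t_colors),
        "common_colors": len(q_colors & t_colors),
    }

features = pair_features("white door knob", "Satin Nickel Door Knob, White")
```

The same pattern could be repeated for other categories (units, brands) and for other text fields, which is one way a feature set of this kind grows quickly.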
It seems that a lot of the top teams had similar types of features, but the implementation details are probably different. For our ensemble we used different variations of xgboost along with a ridge regression model.
Word cloud of Home Depot product search terms.
For models and ensembling we started with random forest, extra trees, and GBM models, and we also focused on xgboost and ridge regression. Shortly before the end of the competition we found that first random forest and then extra trees no longer helped our ensembles, so we concentrated on xgboost, GBM, and ridge.
Our best single model was an xgboost model that scored 0.43347 on the public LB. The final ensemble consists of 19 models based on xgboost, GBM, and ridge. The xgboost models were built with different parameters, including binarizing the target, objective reg:linear, and objective count:poisson. We found that ridge regression helped in nearly every case, so we included it in the final ensemble.
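The interview does not say how the 19 models were combined. One common choice is a weighted average of the individual predictions, and the sketch below illustrates only that idea. The weights, the toy numbers, and the function names are made up; the clipping range of 1 to 3 and the RMSE metric are assumptions about the competition setup rather than details given here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def blend(predictions, weights, low=1.0, high=3.0):
    """Weighted average of per-model predictions, clipped to [low, high].

    predictions: one list of predictions per model, all the same length.
    weights: one non-negative weight per model (normalized internally).
    """
    total = sum(weights)
    blended = []
    for row in zip(*predictions):
        value = sum(w * p for w, p in zip(weights, row)) / total
        blended.append(min(high, max(low, value)))
    return blended

# Toy validation data, invented for illustration (not competition data).
y_valid = [3.0, 2.33, 1.0, 2.67]
xgb_pred = [2.8, 2.5, 1.4, 2.4]    # stand-in for an xgboost model
ridge_pred = [2.6, 2.1, 0.9, 2.9]  # stand-in for a ridge model

blended = blend([xgb_pred, ridge_pred], weights=[0.7, 0.3])

print("xgboost RMSE:", rmse(y_valid, xgb_pred))
print("ridge RMSE:  ", rmse(y_valid, ridge_pred))
print("blend RMSE:  ", rmse(y_valid, blended))
```

In practice the weights would be chosen on held-out data, for example by comparing validation RMSE across a few candidate weightings. A linear model such as ridge tends to make different errors than tree-based models, which is one plausible reason it can help a blend.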
Our data processing pipeline.
Were you surprised by any of your findings?
A surprising finding was the large number of features that had predictive ability. In particular, when we teamed up, it was better to combine our features than to ensemble our results. This is quite unusual: most of the time, new features are more likely to cause overfitting, but not in this case. As a result, adding more members to a team was highly likely to improve its score, which is why the top 10 were all teams of at least 3 people.
Which tools did you use?
We mainly used Python 3 and Python 2. Python 2 was still needed because some of the libraries we used are not yet available for Python 3. In our processing chain we used the standard Python tools for machine learning (scikit-learn, nltk, pandas, numpy, scipy, xgboost, gensim). Nima used R for feature generation.
How did you spend your time on this competition?
After teaming up, Sean and Nima spent most of their time on feature engineering and Thomas and Qingchen spent most of their time on model tuning.
What was the run time for both training and prediction of your winning solution?
In general, training/prediction time is very fast (minutes), but we used some xgboost parameters that took much longer to train (hours) for small performance gains. Text processing and feature engineering took a very long time (easily over 8 hours for a single feature set).
Words of wisdom
What have you taken away from this competition?
First of all, quite a lot of Kaggle ranking points, and Thomas got his Master badge! Overall this was a very difficult competition and we learned a lot about natural language processing and information retrieval in practice. It now makes sense why Google is able to use such a large number of features in their search algorithm, as many seemingly insignificant features in this competition were still able to provide a tangible performance boost.
How did your team form?
Initially Thomas and Sean teamed up, as Sean had strong features and Thomas had experience with models and Kaggle. Their models complemented each other well, and ensembling brought the team into the top 10. A further boost came when Qingchen joined with his features and models. At this point we (and other teams) realized that it was necessary to form larger teams in order to be competitive, as combining features really helps improve performance. We decided to ask Nima to join us as he had an excellent track record and was also doing quite well on his own.
Working together was quite interesting as we are from Germany, the US, the Netherlands, and Canada. The different time zones made direct communication difficult, so we opted for email. On the other hand, the time zones helped us get results and keep working on ideas around the clock.
Dr. Thomas Heiling is a pharmacist with a PhD in Informatics and Pharmaceutical Analytics who works in Quality in the pharmaceutical industry.
Sean J. Vasquez is a second year undergraduate student at the Massachusetts Institute of Technology (MIT), studying computer science and mathematics.
Qingchen Wang is a Data Scientist at ORTEC Consulting and a PhD researcher in Data Science and Marketing Analytics at the Amsterdam Business School.
Nima Shahbazi is a second-year PhD student in the Data Mining and Database Group at York University. He previously worked in big data analytics, specifically on the Forex market. His current research interests include mining data streams, big data analytics, and deep learning.