Blog.kaggle.com

Kaggle Master, data scientist, & author: An interview with Luca Massaron

2016-07-14

We're always fascinated to learn about what Kagglers are up to when they're not methodically perfecting their cross-validation procedures or hitting refresh on the competitions page. Today I'm sharing with you Kaggle Master Luca Massaron's impressive story. He started out like many of us self-learners out there: passionate about data and possessing an unquenchable thirst for the educational and collaborative opportunities available on Kaggle.

In this interview, Luca tells us how he got started in data science, what he's learned from competing in nearly 100 Kaggle competitions, and how it has led to success in a career in the field. And not only that, he's given back to the autodidact community by authoring several books on topics in data science including his most recent, Machine Learning For Dummies. Naturally I ask him what advice he has for aspiring data scientists and, in essence, his reply is to get hands-on with the data. But truly, his words of wisdom are dense throughout and there are more than a few nuggets of thoughtful insight for anyone who has something to learn about data science.

Luca's fourth book, Machine Learning For Dummies.

Background

Megan Risdal: Let’s get started by first learning a bit about you and your background. What is your academic and professional background?

Luca Massaron: I've obtained the equivalent of a master's degree in Political Sciences from the University of Bologna in Italy. Although the four-year course of university studies provided me with solid foundations in statistical inference and research methodology, I actually spent more time studying international politics, philosophy and history (and my final thesis was actually about modern history). You can really picture me as being divided between two cultures, a pretty humanistic one at the foundations of my education, and a practical, scientific one derived mostly from my professional experience.

Luca Massaron's Competitions achievements.

A few years later, after starting my professional career, along the way I also obtained an executive MBA, thus strengthening my understanding of business (especially marketing) and management. From a professional point of view, I started working in a Web startup measuring the behavior of Internet users using a longitudinal panel study. Then I worked for a few years in the advertising sector (in a WPP company) specializing in direct marketing. Around 2005, I started working independently as a consultant and a marketing research director and that lasted for over 10 years, mainly operating in oil & gas and manufacturing.

As of three years ago, I operate exclusively as a data scientist. First at a Danish company, Velux, empowering their local sales by machine learning, then briefly at a company in Berlin, Searchmetrics, developing software and studies for search engine optimization (SEO). Now, it has already been a year that I am in the insurance sector, working as a senior data scientist at Axa Italy.

Megan: What sparked your interest in data science?

Luca: My interest dates back to when the term data science was still completely unknown and data mining was the more common term, indicating someone interested in digging out information from databases using algorithms and specialized software like SAS or SPSS. At the beginning of my professional career, in 2000, working in a Web startup measuring the Web proved a really lucky and unique formative experience. It was not only a data-driven company, but most of its data (and for those times there was quite a lot available in the storage systems of the company) was simply being left aside, unused.

Realizing that and perceiving an opportunity, I asked for permission and started working on those data repositories using very simple tools (Microsoft Office and some Pascal code I wrote myself). After playing with the data for a while, I managed to produce some reports that actually became my first data product. They were immediately sold to our customers of that time, local Italian branches of companies like Microsoft, Excite!, Altavista and Yahoo. At the time I was so surprised that data could be so valuable and that important companies were paying for it. I quickly developed the intuition that being able to do more and more with data would be a promising path for my career.

As a secondary, but no less important, takeaway, I also found out that investigating data for insights that could be used for something practical made me passionate and satisfied my curiosity and need for discovery. After all, I had longed for an academic career but ended up in business. Working with data was something that closely resembled my initial dream career.

Megan: Tell us about your experiences self-learning data science?

Luca: Here I really have to distinguish between before and after Kaggle! Before Kaggle was founded and I had discovered it, I trained myself using the classical learning instruments: papers, books, some websites like KDNuggets, and expensive courses in the quantitative field from business schools. I actually started from books, my first purchase being Data Mining by Han and Kamber.

Another very important tool that helped me learn was R. I discovered open source and R in 2002 and it soon became my favorite tool for analysis and modeling because I couldn't get access to commercial tools such as SAS and SPSS. I remember I spent long hours browsing the list of R libraries, exploring vignettes and help files and trying all the examples I could. I had some foundations in statistics, the company I was working for relied on statistical analysis for its core business, and I had been given great freedom in learning and experimenting with data. It was an ideal condition for a growing data scientist, except for the fact that I was the only data practitioner around. I had no mentor and it was difficult to build a network.

Experimenting with small datasets can help you figure out how some random noise can really change your machine learning solution. Do experiment a lot. From Machine Learning For Dummies, Wiley 2016.

Discovering Kaggle changed everything. First of all, Kaggle helped me move from a statistical paradigm, based on an uncritical curve/distribution fitting approach, to a real data science paradigm based on experimentation and empirical validation. It may appear obvious, but actually it is not: you really need a scientific mindset to be a data scientist, and that can only derive from your educational background and your direct experience.

Megan: I wonder if you could elaborate on what you mean by needing a scientific mindset to be a data scientist?

Luca: As Leo Breiman (the creator of Random Forests) pointed out in his famous “Statistical Modeling: The Two Cultures”, when using statistical models and trying to reach conclusions from data, there are two cultures: the previously dominant one, the data modelling culture, and the algorithmic modelling culture, the real engine behind data science.

The data modelling culture is founded on a priori assumptions about the distribution and the characteristics of data. In a few words, you assume that facts of the world behave in a certain way, and consequently you fit a statistical model to data according to that assumption. If your model is working properly, you conclude that your assumptions are valid and so is the model. It works perfectly when you really have a deep understanding of how things work in the World.

The challenging paradigm, the algorithmic modelling culture, just relies on black boxes (the machine learning algorithms) to find a reasonable way to map a response to an input, as empirically seen in the world. If the mapping passes a series of validation tests, then you can accept the black box as a good approximator of the observations of the World that you are interested in. Unfortunately, the black box cannot tell you anything about the World, but it can help you anticipate the World itself, to predict what will happen in the future or has happened in the past, based on some inputs.

If you take the road of the algorithmic modelling culture, which is the very foundation of data science, in the end, you have to realize that really “it is generalization that counts” (as said by Pedro Domingos in “A Few Useful Things to Know about Machine Learning”). And that you can generalize only on the condition that:

You try everything, every hypothesis, with no a priori assumption, because no solution can work for everything and you cannot know beforehand what will work or not (an idea which is reminiscent of the no free lunch theorem).

You test everything you try, without a priori assumptions or knowledge which could change or distort the results of your test.

For a data scientist, this translates into mastering a certain number of machine learning algorithms and knowing how to correctly cross-validate every hypothesis, which requires always putting to the test data transformations, algorithm choice, and the choice of the algorithm’s parameters, because a hypothesis is made of all that.

Again, it may appear obvious, but I’ve seen many situations, both during Kaggle competitions and on many projects at work, when some data scientists didn’t follow this path, thus obtaining ambiguous or disastrous results. For a data scientist, having a scientific mindset means, in my experience, seriously abiding by the algorithmic modelling culture, accepting its strict empiricism based on data, and supporting your work using only scientific tests (namely properly executed cross-validation) to prove the effectiveness of the chosen algorithmic solution.
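As a minimal sketch of what cross-validating a whole hypothesis (data transformation, algorithm and parameters together) might look like, the example below uses scikit-learn. The library choice, the synthetic dataset and the grid values are all assumptions made for illustration; the interview does not name any specific code.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real problem (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# The transformations live inside the pipeline, so they are re-fitted on each
# training fold and never see the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each grid entry is one hypothesis: transformation + algorithm + parameters.
param_grid = [
    {
        "reduce": [PCA(n_components=5), "passthrough"],
        "model": [LogisticRegression(max_iter=1000)],
        "model__C": [0.1, 1.0, 10.0],
    },
    {
        "reduce": ["passthrough"],
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [50, 100],
    },
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Keeping the scaler and the dimensionality reduction inside the pipeline is the design choice that matters here: every candidate is scored on folds that played no part in fitting any step of it.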

The road to Kaggle Master

Megan: You’ve been on Kaggle for five years, participated in many competitions, and landed in the top 10% twenty-two times by my count! I’d love to hear about your journey to Master Kaggler status and how it kick started your career as a data scientist.

Megan: How did you get started on Kaggle?

Luca: I cannot clearly remember how I found out about Kaggle; most likely I read some news about it on KDNuggets, the website on data mining and knowledge discovery managed by Gregory Piatetsky-Shapiro. I remember anyway that the idea of data competitions really appealed to me, since I was curious to test myself on new problems with data and algorithms, but, on the other hand, since Kaggle competitions are open, I didn’t like so much the idea of a direct confrontation with other data experts. Somehow, I feared discovering that what I knew about data mining was ineffective for the problems proposed on Kaggle or easily surpassed by more skillful and up-to-date players. I bet this sentiment is quite common among many players starting, or considering starting, to participate in Kaggle competitions.

In the end I broke the ice because of a competition whose topic really appealed to me: “Psychopathy Prediction Based on Twitter Usage”. As a marketing researcher, having done quite a few studies before on very similar topics, I was really curious to understand if there was any relation between what a person publicly expressed on Twitter and his or her personality profile. I thought that by experimenting directly with the competition data and by trying to model it, I could better understand both the problem and the value of Twitter data for guessing others’ personality, also from a business perspective. What happened after I entered the competition is that I immediately discovered a simple trick to fit a Random Forests model when many features had missing values, a lot of ways to pre-process data, and that I was applying machine learning methods the wrong way: during the competition I overfitted badly, dropping from 3rd position on the public leaderboard to 57th on the private one. In the end, in spite of the not so glorious result, I really got dragged into competitions.

Since that first participation of mine, I have taken part in over 95 competitions and my intention is to go on after passing 100 :-). Among them all, I enjoyed some competitions more than others, and in some of them I really spent huge amounts of time (on the other hand, in some competitions I just tried to beat the benchmark or just spent a few submissions to understand how to solve the data problem at an acceptable level).

Luca has participated in an impressive number of competitions during his tenure on Kaggle.

Megan: How did Kaggle give you the direct exposure you required to advance your abilities?

Luca: Apart from pushing me to revise my learning foundations, Kaggle also set my learning schedule. New tricky competitions, new data with unknown features, and new business problems behind the competition (because when working with data you tend to specialize in certain problems related to your company) led me to discover and master new tools (Python, Vowpal Wabbit, XGBoost, and now recently deep neural networks) but also to get a grasp of new domains of analysis (NLP, information retrieval, streaming algorithms, big data storage and handling). Or it simply forced me to get out of my comfort zone of using Windows (if you want to try advanced libraries and applications, Linux is better), and to amass tons of practical experience in feature manipulation and creation. Finally, Kaggle helped me build a network of great data scientist colleagues, which is fundamental for friendship, learning and career.

Kaggle also led me to MOOCs, and I found it beneficial to alternate between Kaggle competitions and edX and Coursera courses. New competitions led me to take on certain courses, and new courses inspired me to take on available competitions! For certain periods I exclusively immersed myself in machine learning and coding courses; in others I just worked on Kaggle, testing multiple ideas and methods that I had previously learned from MOOCs and other competitions.

It is really difficult to figure out how feature creation works, but it is not magic, just another way to overcome the limitations of the algorithm you are using. If you are using a line, how can you otherwise separate two concentric circles? From Machine Learning For Dummies, Wiley 2016.

If I have to mention something essential that Kaggle competitions taught me, well, that's the magic of feature creation. After a certain point participating in Kaggle, you can really create in your mind an internal feature library and, posed a new problem, immediately figure out what features to look for or to create in order to optimize the predictions on specific problems or error measures.
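The concentric-circles question can be made concrete with a short sketch. It assumes scikit-learn and a synthetic dataset from `make_circles`; the created feature (squared distance from the origin) is one illustrative choice, not a method taken from the interview or the book.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two concentric circles: no straight line in (x1, x2) separates the classes.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

linear_model = LogisticRegression()
raw_accuracy = cross_val_score(linear_model, X, y, cv=5).mean()

# Created feature: squared distance from the origin. In the augmented space
# the two circles sit at different "heights", so a linear boundary suffices.
radius_sq = (X ** 2).sum(axis=1, keepdims=True)
X_augmented = np.hstack([X, radius_sq])
augmented_accuracy = cross_val_score(linear_model, X_augmented, y, cv=5).mean()

print(f"raw features:       {raw_accuracy:.2f}")
print(f"with radius**2:     {augmented_accuracy:.2f}")
```

The algorithm is unchanged between the two runs; only the representation of the data differs, which is the point of the caption's question.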

Now, because of professional work and daily life, I have less and less time to properly take part in a Kaggle competition, yet I prefer spending my time taking on any new competition on Kaggle, exploring the datasets, reading the comments and discussions on the forum, and trying to set up a decent submission for every problem. Kaggle is my secret insurance for staying always updated and always on the fringe of applied data science.

Megan: What have been your favorite competitions on Kaggle?

Luca: If I have to pick my favourite ones, there are two competitions that made me really passionate (and consequently I spent a lot of time on them): “Dogs vs. Cats”, trying to correctly classify photos of dogs from those of cats, and “Multi-label Bird Species Classification - NIPS 2013”, where the challenge was to recognize, in a recording taken in a wood, what species of birds were singing. Both were an occasion to refine and improve my mastery of Vowpal Wabbit, the online learning software developed by John Langford and, in the case of Dogs vs. Cats, it was the first time for me to use a technique related to deep learning, by applying the pretrained network DeCAF (the predecessor of Caffe, both created by Yangqing Jia) to the images in order to create the features necessary for the training.

Somehow, they also both pushed me beyond my limits because they were not the usual machine learning on a pre-processed dataset. Instead, both required handling unusual quantities of data, thus requiring streaming the data one data point at a time, and they required special transformations of the data: in one case based on a pre-trained neural network, in the other, appropriately using portions of data (data windows) and contrasting examples in order to let the algorithm associate a bird with its song (actually, in each recording used for training you had several birds singing and just a bunch of labels, like a bag of words, with no indication of the exact time a bird started singing).
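To illustrate what learning from data streamed one point at a time can look like, here is a minimal sketch. It swaps Vowpal Wabbit for scikit-learn's `SGDClassifier` and uses a synthetic stream; `example_stream` and `true_weights` are hypothetical names introduced for the example, and nothing here reproduces Luca's actual competition code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
true_weights = rng.normal(size=10)  # hypothetical ground truth for the stream


def example_stream(n_examples):
    """Yield one (features, label) pair at a time, as if read from disk."""
    for _ in range(n_examples):
        x = rng.normal(size=10)
        yield x, int(x @ true_weights > 0)


model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Progressive validation: predict on each example before learning from it,
# so every prediction is made on data the model has not yet seen.
correct = 0
seen = 0
for i, (x, label) in enumerate(example_stream(2000)):
    x_row = x.reshape(1, -1)
    if i > 0:
        correct += int(model.predict(x_row)[0] == label)
        seen += 1
    model.partial_fit(x_row, [label], classes=classes)

progressive_accuracy = correct / seen
print(f"progressive accuracy: {progressive_accuracy:.2f}")
```

Only one example is held in memory at any time, so the same loop would work on a file far larger than RAM.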

Megan: How has competing on Kaggle helped you in your career path?

Luca: Kaggle offers direct opportunities: the job forum (and there are plenty of openings, every week), a way to make yourself known to recruiters, and great ground for successful job interviews, especially if your interviewer took on a competition you also participated in (so you can explain how you approached your submissions and show your creativity and expertise in dealing with the problem, especially if you have some public script to share). Apart from these, Kaggle helped me in a less direct, yet more decisive way: by making me a better data scientist.

In no particular order, I will mention that Kaggle helped me:

Realizing my strong and weak points in data science and developing a clear plan to develop my knowledge and skills.

If it weren’t for my first participation in a Kaggle competition, I would have never understood how I was doing learning wrong and I would have kept overfitting (now my favorite competitions are those where the risk of overfitting is higher, actually :-)). After the competition, reading the forum and trying to replicate the best solution really proved a reality check of my competencies in data science and helped me to draft a plan to catch up. On that specific occasion, I decided to take on Coursera’s Machine Learning course by Andrew Ng; on other occasions it has been a matter of studying a chapter of a book or a paper, or experimenting with new functions or packages by myself.

Every new competition is a great occasion for me to check what I know about the problem, about the data and the features, about a possible solution, about how to handle the data and the learning. If I notice something in a Kaggle competition that I do not know, well, it is definitely the signal to start learning about it, because it is or soon will be recognized by the data science community as mainstream.

Keeping me informed of machine learning developments.

Coming from an R background, and having handled everything with R before, I didn’t feel compelled to learn a new language for data analysis. At least until I realized that I was missing an opportunity to score high in Kaggle competitions. Kaggle showed me it was important to master Python and its core ML package, Scikit-learn. But it also prompted me to take notice of new algorithms and software such as Vowpal Wabbit, LibFM, XGBoost, and H2O. Now it is the time of deep learning with the Keras library, developed by François Chollet and promoted by many Kaggle competitions. If you pay attention to what successful Kaggle participants use, you will always be on top of data science innovation.

Informing me of possible features to be found in a data problem and teaching me how to properly and creatively do feature creation with them.

Actually, there are no more well-kept secrets that can help you score great in a competition. Once there were Vowpal Wabbit and LibFM, but now XGBoost and deep learning are within everyone’s reach. We are all using the same tools, in the end. Therefore it is how we use them that makes the difference. Ensembling and stacking solutions is a great part of the game, though not a completely decisive one. Still, what makes the difference is what features you can derive from the existing ones. But feature creation is still more of an art than a science, and, yes, it cannot be easily automated as an ensembling or stacking procedure would be.

Learning how to do the right and most effective feature creation in a data science project is part of the domain knowledge of data science, just like statistics and hacking. Participating in Kaggle competitions and studying their winning solutions will definitely help you to know what to expect and what to do with data with respect to different problems and different error measures to be optimized. Even if at your firm there is some feature library (that is, a record of all the features used in the data projects), it really does make a difference to know by heart what should be done rather than to uncritically replicate what other data scientists have done before you. Being a senior data scientist is not just a matter of seniority on the job or of how many packages or algorithms you know; it is about what you know how to do with data.

Posing data problems that I couldn’t encounter during my daily work.

We all work in specific domains, and that can really limit the kind of problems and data you can encounter. Since job mobility for data scientists has become high, not having previous experience of a particular data problem can become a limitation. Unless you participate in Kaggle competitions derived from business sectors different from yours. That can really make you very competitive in the job market and help prevent you from getting too bored in your current job if you have little variety of data and problems at hand.

Building a network of contacts with other great practitioners and friends.

I will never cease to stress how Kaggle is a great place for networking. You can get to know many incredibly expert and skilled data scientists, participate together in competitions and even start projects together (like opening a company or writing a book). I met two of my co-authors, Alberto Boschetti and Bastiaan Sjardin, thanks to Kaggle competitions. Moreover, you can get from your Kaggle network lots of information about job openings, meetups, and conferences.

Experimenting with new technologies, ideas, techniques that I learned about by self-studying or I developed an intuition for by myself

After all, Kaggle is just like the real thing in data science. Yes, data is often almost clean (something quite rare when working with a company’s data), but problems, features, and number of examples are very realistic. Even if it won’t skyrocket you to the head of the leaderboard, a Kaggle competition is the right place for testing new ideas and technologies. The data is too large? The computational time is too much? Build a NoSQL database, use virtual machines or Docker machines on the cloud, control all your EC2 instances with boto, just to name a few.

And actually I found out that in job interviews, they were more interested, when talking about a competition, knowing what I did from a practical point of view to overcome a limitation than the results that I achieved in the end.

From Kaggle to a career in data science

Megan: I’d love to hear about a particular project or problem you worked on in your career to which you were able to apply a technique or something you learned on Kaggle. Do you have a project like this that you’re proud of that you could tell us about?

Luca: Actually, there are many professional projects that benefited from my participation in Kaggle competitions. That happens every day, even in these very days. I always find a way to use what I learn from Kaggle. Usually what I employ is a particular feature creation, data processing or machine learning tool that I learned or mastered during one or more competitions. In the particular case of the project I want to share, though, I applied a methodology that I had previously defined and tested in a Kaggle competition. The competition was CONNECTOMICS and it consisted of estimating the strength of connection between pairs of neurons in a Zebrafish, given some traces of how neurons fire together in response to stimuli.

I had previous experience in causal analysis and decomposition of effects in a generalized linear model (a sophisticated linear regression), which I immediately connected to this Kaggle competition, but the size of the dataset was really daunting for my i3 core computer (which is still being used in recent competitions!). That led me to find out a lot of ways to preprocess data by R, using a sqlite database (something that I learned from the competition Practice Fusion Diabetes Classification) and reduce it to a more manageable variance-covariance matrix for relative importance decomposition.
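A minimal sketch of the general idea of streaming rows out of a SQLite database and reducing them to a variance-covariance matrix follows. Luca describes doing this in R; this sketch uses Python with `sqlite3` and NumPy instead, on synthetic data, and the table name `signals` and its columns are hypothetical. It stops at the covariance matrix and does not attempt the relative importance decomposition.

```python
import sqlite3

import numpy as np

# Synthetic data loaded into an in-memory database (illustrative only).
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 4))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (a REAL, b REAL, c REAL, d REAL)")
conn.executemany("INSERT INTO signals VALUES (?, ?, ?, ?)", data.tolist())

# Accumulate sufficient statistics chunk by chunk, so the full table never
# has to fit in memory at once.
n = 0
total = np.zeros(4)          # running column sums
cross = np.zeros((4, 4))     # running sum of outer products
cursor = conn.execute("SELECT a, b, c, d FROM signals")
while True:
    rows = cursor.fetchmany(1000)
    if not rows:
        break
    chunk = np.asarray(rows, dtype=float)
    n += chunk.shape[0]
    total += chunk.sum(axis=0)
    cross += chunk.T @ chunk
conn.close()

# Unbiased sample covariance from the accumulated sums.
mean = total / n
covariance = (cross - n * np.outer(mean, mean)) / (n - 1)
print(covariance.round(3))
```

The sum-of-products formula is simple but can lose precision when column means are large relative to their spread; centring the data first, or using an incremental update of the mean, would be a safer choice in that case.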

The solution of the decomposition hinted at the connection between two neurons, and the way I handled the amount of data came in very handy a few weeks later, when I was entrusted with a business project to discover the determinants of satisfaction and motivation to re-purchase of over 5 million customers of a travel service. I immediately recalled the methodology I had experimented with in the Kaggle competition and related it to the sparsity (and apparent distance) of the different experiences of the travellers I was studying. Actually, the connection wasn't a kind of aha or Eureka moment, but was derived from data exploration and visualization: inspecting portions of the travellers' data matrix and visually translating them into graphs, I almost immediately recalled the same visualizations I had made from the CONNECTOMICS data.

After applying the same methodology, the results proved extremely useful for the management and they were incorporated into a managerial dashboard, inspiring their customer strategy for the next two years. Therefore, what could be learned from the brain of a Zebrafish has been successfully abstracted and applied to a completely different context, and that's an important take-away from Kaggle competitions, because even the most original competitions can teach you something you can apply somewhere else.

Incidentally, the story of my participation in CONNECTOMICS doesn't end here, because, when I was interviewed for a job in Berlin, I found out that the lead data scientist interviewing me had also participated in the same competition and that he was amazed at both the solution and its subsequent impact on a completely different problem. So there's also a second take-away, because in Kaggle competitions, even if your solution is not the winning one, it still can prove itself invaluable because of its originality and its possible generalization to completely different problems.

Megan: Can you describe how you make a connection between participating in machine learning competitions and business applications of data science?

Luca: It is not often said, but Kaggle competitions offer a lot of “meta-data” beyond the datasets and the scripts. Start by looking at the company sponsoring the competition and gather all the information you can about their business and whether they have a data science team or data science products. There are many websites with useful public information and you may even find patents and scientific papers related to the target of the competition.

Make clear in your mind how the solution of the competition could improve their business and reason why they decided to optimize a certain error or score measure (You can really discover a lot by reasoning on the measure and learn that you can also use that instead of the classical MSE or accuracy).

Then inspect the data, observe the features, unless they have been made anonymous, and try to figure out how they generated that data and if the data generation process is a cheap or costly one (or something unique to them, a real competitive advantage on competitors and new entrants in their business sector).

Since my first Kaggle competition, which attracted me because of its use of Twitter data, I’ve always questioned each competition I’ve participated in in such a fashion. When I entered the insurance sector, I immediately understood the challenges and opportunities just by looking back at the Kaggle contests held by insurance companies.

Nevertheless, business ideas can also be generated by observing other, not so closely related business contests. For instance, another competition among my favorites has been the “Amazon.com - Employee Access Challenge” (https://www.kaggle.com/c/amazon-employee-access-challenge). Is the problem it poses really exclusive to Amazon, or can we find a very similar problem in all large organizations?

Megan: What has been the biggest hurdle in your experience as you transferred your self-taught skills to a professional business environment? How did you overcome it?

Luca: Everything has a context, and when you are learning by yourself, you are really working outside of any context. You are not at school, where everything is arranged for you to get a diploma. In self-study you are simply lost in the middle of an ocean of knowledge, and the direction is left to you. It is very easy to head in the wrong direction, to misunderstand what you learned or simply to forget everything you memorized (or you think you have memorized) very, very quickly.

There are two remedies: try to teach what you learned (for instance by writing a diary or a blog) or try to practice it. Practicing also helps you successfully transfer what you learned to a professional business environment, because you will directly experience the pros and cons of the technique. Some techniques, for instance, are great in certain situations but in others they cannot work because of unsuitable data or because they have a level of complexity that renders them unusable for the application you need (for example, the training time or the prediction time is really too long).

Discover that by yourself. When you have just learned something new, try immediately to insert it into an existing project of yours. It may seem strange to your supervisors that you suddenly, out of the blue, want to try word2vec or deep learning in projects that have no need for such solutions, but it will help you learn, understand and make that technique definitely part of your skills. You can just justify yourself by saying that science needs experimentation <img src="https://s.w.org/images/core/emoji/72x72/1f609.png" alt="