Kaggle is a platform for data science competitions. I’ve spent about 3 months of my life competing on Kaggle, and that’s a lot of time! It’s about 1% of the working part of my life (in the best case scenario, without any meteorites or nuclear wars). So I need to justify this with some positive, practical perks!
1. Overcoming overfitting
In one of the competitions, overfitting hit me pretty hard. Together with my teammates, we held 1st place on the public leaderboard, but because of overfitting we ended up in 3rd place on the private leaderboard.
We were overconfident. We had good cross-validation at first, but over time, as our model became bigger, it was hard to track a correct cross-validation score, and we believed the public leaderboard more and more. That turned out to be a big mistake.
After that failure, I am now extremely careful about overfitting in real data science work. For example, at some point I was working on text classification and noticed that my RandomSearch found strange hyperparameters. After a short investigation, I found that the model was overfitting to the number of steps in a test (I was working on our Rainforest tests, which are all made of different numbers of steps, where each step is a few lines of text). Below on the left you can see a noisy Partial Dependence Plot from that model, where the x-axis shows the number of steps and the y-axis the target probability. However, after I fixed the validation (by forcing different step numbers into different folds), the overfitting was gone, and I got the nice, smooth plot you can see on the right.
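The "different step numbers into different folds" fix above maps directly onto group-aware cross-validation. Here is a minimal sketch using scikit-learn's `GroupKFold`; the data is synthetic, and treating the step count as the group is my reconstruction of the idea, not the original code:

```python
# Sketch of group-aware validation with scikit-learn's GroupKFold.
# The data is synthetic; in the original task the "group" was the
# number of steps in each test document.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 5))          # stand-in feature matrix
y = rng.integers(0, 2, size=n_samples)       # binary target
steps = rng.integers(1, 11, size=n_samples)  # hypothetical step counts

# GroupKFold keeps all samples with the same step count in one fold,
# so the model cannot exploit "number of steps" as a leaky feature.
gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=steps):
    train_groups = set(steps[train_idx])
    valid_groups = set(steps[valid_idx])
    assert train_groups.isdisjoint(valid_groups)
```

With an ordinary `KFold`, the same step count would appear on both sides of the split, which is exactly how the leak slips in.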
Sometimes it’s pretty easy to build validation, but often it isn’t. I’d recommend reading these two short Kaggle posts about non-trivial validation: – validation for pairs of items with a time component – time-series validation when you need to predict for some specific time in the future
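For the time-series case, the core rule is that every fold must train only on the past and validate on the future. A minimal sketch with scikit-learn's `TimeSeriesSplit` (synthetic data, assuming the row order is the time order):

```python
# Time-ordered validation sketch: each fold trains on the past and
# validates on the future, mimicking "predict for a specific time ahead".
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 100
X = np.arange(n).reshape(-1, 1)  # the row index doubles as a timestamp

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, valid_idx in tscv.split(X):
    # Training indices always precede validation indices in time.
    assert train_idx.max() < valid_idx.min()
```

A shuffled `KFold` on the same data would let the model peek at the future, which is the time-series analogue of the leak described above.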
2. Choosing a metric
Kaggle will not teach you how to choose a metric for your specific task – one is already chosen for you when you enter the competition. But pretty quickly you’ll learn how to optimize a given metric with your algorithms. And, after a few more competitions, you’ll have a feeling for which metric is suitable for your task.
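A small illustration of "optimizing a given metric": with an imbalanced target, the F1-optimal decision threshold is rarely the default 0.5, so competitors routinely tune it. The numbers below are made up for the example:

```python
# Tuning the classification threshold for F1 instead of accepting 0.5.
# The labels and predicted probabilities here are synthetic.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
proba = np.array([0.45, 0.6, 0.4, 0.3, 0.2, 0.1, 0.35, 0.25, 0.15, 0.05])

# Scan a grid of thresholds and keep the one with the best F1.
thresholds = np.arange(1, 10) / 10
best_t = max(thresholds, key=lambda t: f1_score(y_true, proba >= t))
print(best_t, f1_score(y_true, proba >= best_t))
```

The same predictions would score quite differently under log loss or accuracy, which is why each metric rewards its own kind of post-processing.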
3. Feature engineering
Everyone knows how important feature engineering is, but in practice, when you’re solving a new task, it’s not so easy to do a good job there. However, after competing on Kaggle a bunch of times, you’ll get a pretty good feeling for a lot of possible features suitable for different tasks.
- Text features and graph features when you have pairs of items
- Features for TimeSeries data
- Features for Executable files
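As one concrete example from the list above, time-series features are often just lags and rolling statistics. A small sketch with pandas (the column names and values are made up for illustration):

```python
# Time-series feature engineering sketch: lag and rolling-window features.
# The data and column names are invented for the example.
import pandas as pd

sales = pd.DataFrame({
    "day": range(1, 8),
    "units": [10, 12, 9, 15, 14, 13, 18],
})

# Lag features: what happened 1 and 2 days ago.
sales["units_lag1"] = sales["units"].shift(1)
sales["units_lag2"] = sales["units"].shift(2)
# Rolling mean over the previous 3 days, shifted by one so that
# today's target never leaks into its own feature.
sales["units_roll3"] = sales["units"].shift(1).rolling(3).mean()
```

The `shift(1)` before `rolling` is the important detail: without it, the rolling window would include the current day's value, which is the feature-engineering version of the leakage discussed in section 1.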
4. Cooperation
Kaggle is about competition in theory, but in reality it’s more about cooperation.
During competitions, many people share their approaches in public scripts, and you can team up with experts from around the world. After a competition, it’s standard practice for top finishers to explain their approaches, and sometimes they even publish full source code – so it’s pretty easy to learn and make friends!
5. Deep learning
While taking part in Machine Learning competitions, you’ll learn a lot of practically useful things, even if not everything (defining the problem, collecting and cleaning data, and choosing the metric and objective all fall outside the scope of competitions). But they’re especially useful if you want to get into Deep Learning: you’ll learn what data size you need, what results to expect, and how to train tricky models. After such a Deep Learning competition, you can tackle a similar task in practice with great results. If you take Jeremy Howard’s Deep Learning course, you’ll see that he uses Kaggle throughout to show how to apply Deep Learning.
Just a few links:
- if you have a classification problem
- if you have a segmentation problem
- if you need to count objects in an image
Try Data Science competitions, and if you see me there, let’s team up! Cheers!
PS: I’ve spent much more time in Fallout 1 and 2, but I’m not going to justify that 🙂