Balázs Kégl (LAL/CNRS)
This is a Kaggle data challenge on predicting the probability that a driver will initiate an auto insurance claim in the next year.
%matplotlib inline
import os
import glob
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pandas as pd
The repo contains mock data in data/, simulating the format of the official Kaggle data but smaller in size and with random features. If you want to execute the notebook on the official Kaggle data, sign up to the challenge, download train.7z and test.7z, unzip them, and place them in kaggle_data/. If you want to use the starting kit to generate output in the correct Kaggle submission format, you will also need to download sample_submission.7z, unzip it, and place it in kaggle_data/.
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)
data.head()
data.describe()
data.dtypes
data.count()
np.unique(data['target'])
data.groupby('target').count()[['id']]
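As an extra exploration step (not part of the original kit), you can also print the fraction of positive targets directly; on the official Kaggle data the classes are heavily imbalanced, whereas the mock data is random.
# Fraction of positive targets: small on the official (imbalanced) Kaggle
# data, arbitrary on the random mock data.
data['target'].mean()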
To submit at the RAMP site, you will have to write two classes, saved in two different files: FeatureExtractor, which extracts the features used for classification from the dataset and produces a numpy array of size (number of samples $\times$ number of features), and Classifier, which predicts the target.
The feature extractor implements a transform member function. It is saved in the file submissions/starting_kit/feature_extractor.py. It receives the pandas dataframe X_df defined at the beginning of the notebook and should produce a numpy array representing the extracted features, which are then used for classification.
Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.
%%file submissions/starting_kit/feature_extractor.py
class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        pass

    def transform(self, X_df):
        # The baseline simply passes the raw column values on to the classifier.
        return X_df.values
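As an illustration of a feature extractor that actually fits something, the following sketch (not part of the starting kit, and assuming a scikit-learn version that provides sklearn.impute.SimpleImputer) learns per-column median imputation in fit and applies it in transform. To try it, save it under a new folder in submissions/ and test it with the --submission switch described below.
from sklearn.impute import SimpleImputer


class FeatureExtractor():
    def __init__(self):
        # Median imputation as a simple, purely illustrative preprocessing step.
        self.imputer = SimpleImputer(strategy='median')

    def fit(self, X_df, y):
        self.imputer.fit(X_df.values)

    def transform(self, X_df):
        return self.imputer.transform(X_df.values)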
The classifier follows a classical scikit-learn classifier template. It should be saved in the file submissions/starting_kit/classifier.py. In its simplest form it takes a scikit-learn pipeline, assigns it to self.clf in __init__, then calls its fit and predict_proba functions in the corresponding member functions.
%%file submissions/starting_kit/classifier.py
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class Classifier(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        # A deliberately tiny random forest, used as a minimal working baseline.
        self.clf = RandomForestClassifier(
            n_estimators=2, max_leaf_nodes=2, random_state=61)
        self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
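Since the classifier can also wrap a full scikit-learn pipeline assigned to self.clf in __init__ (as described above), here is a hedged sketch of that variant; the imputer and the forest hyperparameters are illustrative choices, not part of the starting kit, and sklearn.impute.SimpleImputer requires a recent scikit-learn.
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        # The whole pipeline is assigned to self.clf in __init__; fit and
        # predict_proba then simply delegate to it.
        self.clf = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('forest', RandomForestClassifier(
                n_estimators=10, max_leaf_nodes=10, random_state=61)),
        ])

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)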
It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.
First pip install ramp-workflow or install it from the GitHub repo. Make sure that the python files classifier.py and feature_extractor.py are in the submissions/starting_kit folder, and that the data files train.csv and test.csv are in data/. Then run
ramp_test_submission
If it runs and prints training and test errors on each fold, then you can submit the code.
Note that kaggle_data/test.csv is the actual Kaggle test file, so we have no test labels. To keep the test from crashing, we mock the test labels with all zeros. This means that the test scores are not meaningful (only the valid scores are).
!ramp_test_submission
You can use the --quick-test switch to test the notebook on the mock data sets in data/. Since the data is random, the scores will not be meaningful, but it can be useful to run this first to make sure your submissions execute without errors.
!ramp_test_submission --quick-test
You can also keep several other submissions in your work directory submissions/ and test them using
ramp_test_submission --submission <submission_name>
where <submission_name> is the name of the folder in submissions/.
You can use this starting kit to train models and submit their predictions to Kaggle. problem.save_y_pred implements the output of the predictions. You can turn this on using the --save-y-preds switch:
ramp_test_submission --submission <submission_name> --save-y-preds
This will create the following directory tree:
submissions/<submission_name>/training_output
├── bagged_test_scores.csv
├── bagged_train_valid_scores.csv
├── fold_0
│ └── y_pred_test.csv
├── ...
├── fold_<k-1>
│ └── y_pred_test.csv
└── y_pred_bagged_test.csv
You can find the test prediction vectors in each fold folder submissions/<submission_name>/training_output/fold_<i>, and the bagged prediction vector in submissions/<submission_name>/training_output/y_pred_bagged_test.csv. It is the latter that you should submit to Kaggle.
If your goal is to use this starting kit to optimize your Kaggle submission, then besides optimizing your feature extractor and classifier, you can also tune the CV bagging scheme by changing the type of cross-validation, the number of folds, and the test proportion in problem.get_cv. We found that test_size=0.5 worked well with an extremely large number of folds, typically n_splits=64, but these parameters depend on the classifier you are testing, so they may need fine-tuning.
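For reference, a RAMP get_cv is typically just a generator of scikit-learn splits, so tuning the scheme amounts to editing a few lines in problem.py. The sketch below shows the general shape under that assumption; it is not the actual code shipped with this kit.
from sklearn.model_selection import StratifiedShuffleSplit


def get_cv(X, y):
    # n_splits and test_size are the knobs discussed above; the values here
    # are illustrative.
    cv = StratifiedShuffleSplit(n_splits=64, test_size=0.5, random_state=57)
    return cv.split(X, y)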
If you are eligible, you can join the team at ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then request a sign-up to the event kaggle_seguro. Both sign-ups are controlled by RAMP administrators, so there can be a delay between asking for a sign-up and being able to submit.
Once your sign-up request is accepted, you can go to your sandbox and copy-paste (or upload) feature_extractor.py and classifier.py from submissions/starting_kit. Save the submission, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get an email, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions locally (adding feature transformations, selecting models, perhaps doing some AutoML/hyperopt, etc.) and to check them with ramp_test_submission. The script prints mean cross-validation and test scores:
----------------------------
train ngini = 0.119 ± 0.007
train auc = 0.559 ± 0.003
train acc = 0.964 ± 0.0
train nll = 0.156 ± 0.0
valid ngini = 0.114 ± 0.005
valid auc = 0.558 ± 0.002
valid acc = 0.964 ± 0.0
valid nll = 0.156 ± 0.0
test ngini = 0.229 ± 0.256
test auc = 0.307 ± 0.064
test acc = 1.0 ± 0.0
test nll = 0.037 ± 0.0
and bagged cross-validation and test scores
valid ngini = 0.167
test ngini = -0.324
The latter combines the cross-validation models pointwise on the validation and test sets, and usually leads to a better score than the mean CV score. The RAMP leaderboard displays this bagged score.
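Concretely, for classification this pointwise combination essentially amounts to averaging the per-fold predicted probabilities for each point, as in the toy example below (made-up numbers).
import numpy as np

# Made-up positive-class probabilities for three test points, one row per
# CV fold; bagging averages them pointwise.
fold_probas = np.array([
    [0.02, 0.10, 0.05],
    [0.03, 0.08, 0.07],
    [0.01, 0.12, 0.06],
])
bagged_probas = fold_probas.mean(axis=0)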
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the normalized Gini ("ngini"), so the relevant line in the output of ramp_test_submission is valid ngini = 0.167. When the score is good enough, you can submit it at the RAMP site.
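If you want to compute the normalized Gini yourself on a held-out set, a common rank-based implementation is sketched below; the authoritative definition is the one in the kit's problem.py (and Kaggle's own evaluation), so treat this only as an illustration.
import numpy as np


def gini(actual, predicted):
    # Sort the true labels by decreasing predicted score (stable sort keeps
    # the original order for ties) and measure the average cumulative gain
    # relative to what a random ordering would give.
    order = np.argsort(-np.asarray(predicted, dtype=float), kind='mergesort')
    a = np.asarray(actual, dtype=float)[order]
    n = len(a)
    cumulative = np.cumsum(a) / a.sum()
    return cumulative.sum() / n - (n + 1) / (2.0 * n)


def normalized_gini(actual, predicted):
    # Normalize by the Gini of a perfect ranking so that 1 is the best score.
    return gini(actual, predicted) / gini(actual, actual)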
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.