Emanuela Boros (LIMSI/CNRS), Balázs Kégl (LAL/CNRS)
This is an introductory project meant to familiarize you with RAMP and show you how it works.
The goal is to develop prediction models able to identify which news statements are fake.
The data we will manipulate comes from http://www.politifact.com. The input consists of short statements by public figures (and sometimes anonymous bloggers), plus some metadata. The output is a truth level, judged by journalists at Politifact. They use six truth levels, which we coded into integers to obtain an ordinal regression problem:
0: 'Pants on Fire!'
1: 'False'
2: 'Mostly False'
3: 'Half-True'
4: 'Mostly True'
5: 'True'
Your goal is to classify each statement (plus its metadata) into one of these categories.
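For reference, the coding above corresponds to the following mapping (the truth column of the training data stores the integer codes):

# Politifact's verbal truth levels and their ordinal codes.
TRUTH_LEVELS = {
    'Pants on Fire!': 0,
    'False': 1,
    'Mostly False': 2,
    'Half-True': 3,
    'Mostly True': 4,
    'True': 5,
}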
In addition, an NLTK dataset needs to be downloaded:
python -m nltk.downloader popular
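Equivalently, you can run the download from within Python:

import nltk

nltk.download('popular')  # same corpora as `python -m nltk.downloader popular`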
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename, sep='\t')
data = data.fillna('')
data['date'] = pd.to_datetime(data['date'])
data
data.dtypes
data.describe()
data.count()
The original training data frame has 13,000+ instances. In the starting kit, we give you a subset of 7569 instances for training and 2891 instances for testing.
Most columns are categorical, and some have high cardinality.
print(np.unique(data['state']))
print(len(np.unique(data['state'])))
data.groupby('state').count()[['job']].sort_values(
'job', ascending=False).reset_index().rename(
columns={'job': 'count'}).plot.bar(
x='state', y='count', figsize=(16, 10), fontsize=18);
print(np.unique(data['job']))
print(len(np.unique(data['job'])))
data.groupby('job').count()[['state']].rename(
columns={'state': 'count'}).sort_values(
'count', ascending=False).reset_index().plot.bar(
x='job', y='count', figsize=(16, 10), fontsize=18);
If you want to use the journalists and the editors as input, you will need to split these fields, since an instance sometimes lists more than one of them; see the sketch below.
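As a sketch (assuming the names within a field are comma-separated, and a pandas version recent enough to have Series.explode), such a column can be split and flattened like this:

# One row per editor: split the multi-valued field and flatten it.
# The comma separator is an assumption; adjust it to the actual format.
editors = (data['edited_by']
           .str.split(',')
           .explode()
           .str.strip())
print(editors.value_counts().head())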
print(np.unique(data['edited_by']))
print(len(np.unique(data['edited_by'])))
data.groupby('edited_by').count()[['state']].rename(
columns={'state': 'count'}).sort_values(
'count', ascending=False).reset_index().plot.bar(
x='edited_by', y='count', figsize=(16, 10), fontsize=10);
print(np.unique(data['researched_by']))
print(len(np.unique(data['researched_by'])))
data.groupby('researched_by').count()[['state']].sort_values(
'state', ascending=False).reset_index().rename(
columns={'state': 'count'}).plot.bar(
x='researched_by', y='count', figsize=(16, 10), fontsize=6);
There are 2000+ different sources.
print(np.unique(data['source']))
print(len(np.unique(data['source'])))
data.groupby('source').count()[['state']].rename(
columns={'state': 'count'}).sort_values(
'count', ascending=False).reset_index().loc[:100].plot.bar(
x='source', y='count', figsize=(16, 10), fontsize=10);
The goal is to predict the truthfulness of statements. Let us group the data according to the truth column:
data.groupby('truth').count()[['source']].reset_index().plot.bar(x='truth', y='source');
For submitting at the RAMP site, you will have to write two classes, saved in two different files: FeatureExtractor, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features), and Classifier, which will be used to predict the truth level.
The feature extractor implements a transform member function. It is saved in the file submissions/starting_kit/feature_extractor.py. It receives the pandas dataframe X_df defined at the beginning of the notebook and should produce a numpy array representing the extracted features, which will then be used for the classification.
Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.
%%file submissions/starting_kit/feature_extractor.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import string
import unicodedata
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.utils.validation import check_is_fitted
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler
def clean_str(sentence, stem=True):
english_stopwords = set(
[stopword for stopword in stopwords.words('english')])
punctuation = set(string.punctuation)
punctuation.update(["``", "`", "..."])
if stem:
stemmer = SnowballStemmer('english')
return list((filter(lambda x: x.lower() not in english_stopwords and
x.lower() not in punctuation,
[stemmer.stem(t.lower())
for t in word_tokenize(sentence)
if t.isalpha()])))
return list((filter(lambda x: x.lower() not in english_stopwords and
x.lower() not in punctuation,
[t.lower() for t in word_tokenize(sentence)
if t.isalpha()])))
def strip_accents_unicode(s):
try:
s = unicode(s, 'utf-8')
except NameError: # unicode is a default on python 3
pass
s = unicodedata.normalize('NFD', s)
s = s.encode('ascii', 'ignore')
s = s.decode("utf-8")
return str(s)
from sklearn.feature_extraction.text import TfidfVectorizer
class FeatureExtractor(TfidfVectorizer):
"""Convert a collection of raw documents to a matrix of TF-IDF features. """
def __init__(self):
super(FeatureExtractor, self).__init__(
input='content', encoding='utf-8',
decode_error='strict', strip_accents=None, lowercase=True,
preprocessor=None, tokenizer=None, analyzer='word',
stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
ngram_range=(1, 1), max_df=1.0, min_df=1,
max_features=None, vocabulary=None, binary=False,
            dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
sublinear_tf=False)
def fit(self, X_df, y=None):
"""Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters
----------
        X_df : pandas.DataFrame
            The dataframe whose `statement` column holds the raw documents.
Returns
-------
self
"""
self._feat = np.array([' '.join(
clean_str(strip_accents_unicode(dd)))
for dd in X_df.statement])
super(FeatureExtractor, self).fit(self._feat)
return self
    def fit_transform(self, X_df, y=None):
        self.fit(X_df)
        return self.transform(X_df)
def transform(self, X_df):
X = np.array([' '.join(clean_str(strip_accents_unicode(dd)))
for dd in X_df.statement])
        check_is_fitted(self, '_feat', msg='The tfidf vector is not fitted')
X = super(FeatureExtractor, self).transform(X)
return X
First, we preprocess the text; this is often called text normalization.
The first step is tokenization: splitting sentences into words.
The most frequent words often do not carry much meaning. Examples: the, a, of, for, in, .... This stopword list is available in the NLTK library as stopwords.words('english'). We also throw away unwanted tokens such as punctuation marks (e.g., "``", "`", "...") and numbers.
Stemming is optional. English words like look can be inflected with a morphological suffix to produce looks, looking, looked. They share the same stem look. Often (but not always) it is beneficial to map all inflected forms onto the stem. The most commonly used stemmer is the Porter stemmer, named after its developer, Martin Porter. Here we use SnowballStemmer('english') from NLTK; the stemmer is called Snowball because Porter created a programming language with this name for developing new stemming algorithms.
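A small illustration of these preprocessing steps (tokenization, stopword and punctuation removal, stemming), mirroring the clean_str function in the feature extractor above; the example sentence is made up:

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

english_stopwords = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# All inflected forms map onto the same stem.
print([stemmer.stem(w) for w in ['look', 'looks', 'looking', 'looked']])
# -> ['look', 'look', 'look', 'look']

# Tokenize, drop stopwords and non-alphabetic tokens, and stem the rest.
sentence = "The economy is growing faster than expected."
print([stemmer.stem(t.lower()) for t in word_tokenize(sentence)
       if t.isalpha() and t.lower() not in english_stopwords])
# -> ['economi', 'grow', 'faster', 'expect']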
strip_accents_unicode transforms accentuated unicode symbols into their plain counterparts, e.g., è -> e.
Before going through the code, we first need to understand how tf-idf works. The term frequency (tf) is a count of how many times a word occurs in a given document (synonymous with bag of words). The inverse document frequency (idf) decreases with the number of documents in the corpus in which the word occurs. tf-idf weights words according to how informative they are: words used frequently across many documents get a lower weight, while infrequent ones get a higher weight.
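A tiny illustration with scikit-learn (the three-document corpus is made up for the example):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the dog sat', 'the dog barked']
vec = TfidfVectorizer()
vec.fit(docs)
# 'the' occurs in all three documents, so it gets the lowest idf weight;
# 'cat' and 'barked' occur in only one document each, so the highest.
terms = vec.get_feature_names_out()  # get_feature_names() on older scikit-learn
print(dict(zip(terms, vec.idf_.round(2))))
# -> {'barked': 1.69, 'cat': 1.69, 'dog': 1.29, 'sat': 1.29, 'the': 1.0}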
class FeatureExtractor(TfidfVectorizer) inherits from TfidfVectorizer, which is a CountVectorizer followed by a TfidfTransformer.
CountVectorizer converts a collection of text documents to a matrix of token (word) counts. This implementation produces a sparse representation of the counts, which is passed to the TfidfTransformer.
The TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation.
A TfidfVectorizer performs these two steps in one.
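The equivalence can be checked directly (a quick sketch with default parameters):

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['the cat sat', 'the dog sat', 'the dog barked']

counts = CountVectorizer().fit_transform(docs)       # sparse token counts
two_step = TfidfTransformer().fit_transform(counts)  # counts -> tf-idf
one_step = TfidfVectorizer().fit_transform(docs)     # both steps at once

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True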
The feature extractor overrides fit by providing the TfidfVectorizer with the preprocessing steps presented above.
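If you want to try the feature extractor interactively, you can define the class in a regular notebook cell (instead of saving it with %%file) and run a quick check like this sketch, which assumes data is the training dataframe loaded above:

# Hypothetical interactive check of the feature extractor.
fe = FeatureExtractor()
X_sparse = fe.fit_transform(data)
print(X_sparse.shape)  # (number of samples, number of tf-idf features)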
The classifier follows a classical scikit-learn classifier template. It should be saved in the file submissions/starting_kit/classifier.py. In its simplest form it takes a scikit-learn pipeline, assigns it to self.clf in __init__, then calls its fit and predict_proba functions in the corresponding member functions.
%%file submissions/starting_kit/classifier.py
# -*- coding: utf-8 -*-
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = RandomForestClassifier()

    def fit(self, X, y):
        # The feature extractor outputs a sparse matrix; convert it to a
        # dense array with toarray() (the np.matrix returned by todense()
        # is rejected by recent scikit-learn versions).
        self.clf.fit(X.toarray(), y)
        return self

    def predict(self, X):
        return self.clf.predict(X.toarray())

    def predict_proba(self, X):
        return self.clf.predict_proba(X.toarray())
It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.
First, pip install ramp-workflow, or install it from the GitHub repo. Make sure that the Python files feature_extractor.py and classifier.py are in the submissions/starting_kit folder, and that the data files train.csv and test.csv are in the data folder. Then run
ramp_test_submission
If it runs and prints training and test errors on each fold, then you can submit the code.
!ramp_test_submission --quick-test
Once you have found a good feature extractor and classifier, you can submit them to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then find an open event on this particular problem, for example the event fake_news (Saclay Datacamp, DataFest Tbilisi) for this RAMP. Sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking to sign up and being able to submit.
Once your signup request is accepted, you can go to your sandbox (Saclay Datacamp, DataFest Tbilisi) and copy-paste (or upload) feature_extractor.py and classifier.py from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions (Saclay Datacamp, DataFest Tbilisi). Once it is trained, you get a mail, and your submission shows up on the public leaderboard (Saclay Datacamp, DataFest Tbilisi).
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions (Saclay Datacamp, DataFest Tbilisi). You can click on the error to see part of the trace.
After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions locally (adding feature transformations, selecting models, perhaps doing some AutoML/hyperopt, etc.) and to check them with ramp_test_submission. The script prints the mean cross-validation scores:
----------------------------
train sacc = 0.77 ± 0.012
train acc = 0.983 ± 0.01
train tfacc = 0.835 ± 0.014
valid sacc = 0.361 ± 0.05
valid acc = 0.144 ± 0.119
valid tfacc = 0.575 ± 0.101
test sacc = 0.355 ± 0.013
test acc = 0.197 ± 0.023
test tfacc = 0.544 ± 0.021
The official score in this RAMP is smoothed accuracy: it is the first score column after "historical contributivity" on the leaderboard (Saclay Datacamp, DataFest Tbilisi). The relevant line in the output of ramp_test_submission is therefore valid sacc = 0.361 ± 0.05. When the score is good enough, you can submit it at the RAMP site.
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.