Boosting Claims Predictions: an Analysis on Kaggle Data

Abstract

Predictive modeling on data from Porto Seguro's Safe Driver Prediction competition, hosted on the Kaggle platform, is performed leveraging the machine learning boosting algorithms AdaBoost and XGBoost. We refer to the document "On Boosting: Theory and Applications" to complement the analysis presented in this notebook.

Getting Started with Python and Jupyter Notebook

In this section, Jupyter Notebook and Python settings are initialized. For Python code, the PEP8 standard ("PEP" = Python Enhancement Proposal) is enforced, with minor variations to improve readability.

In [ ]:
# Notebook settings
###################

# resetting variables
get_ipython().magic('reset -sf') 

# formatting: cell width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# plotting
%matplotlib inline
In [ ]:
# installing xgboost. REMARK: run this cell only if the xgboost import below fails; remove it once xgboost is installed
%pip install xgboost
In [ ]:
# loading Python packages
#########################

# scientific packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy

# boosting
import xgboost
from xgboost import XGBClassifier
from xgboost import plot_importance

# scipy
from scipy.stats import chi2

# pandas: selected modules and functions
from pandas.plotting import scatter_matrix

# sklearn: selected modules and functions
from sklearn import decomposition
from sklearn.decomposition import PCA

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.utils import shuffle

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from sklearn.metrics import roc_curve
from sklearn.metrics import auc 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import zero_one_loss

# utilities
from datetime import datetime

Data Import

For the given Kaggle competition, two datasets are provided: one with training data and one for generating the predictions to be submitted. The latter lacks the target variable (denoted by target in the former) and is therefore not imported in this notebook; the whole analysis is carried out on the original 'training' data. A remark on the data acquisition procedure: the data are downloaded directly from the Kaggle Porto Seguro Challenge website into a landing folder on a local machine; the original files are then copied into a separate folder and unzipped there. The unzipped data are imported with pandas as follows:

In [ ]:
# data import: specify the path to the competition training data
path = '...'
data = pd.read_csv(path)

The imported dataset, i.e. data, comprises 595212 records and 59 columns:

In [ ]:
print('Structure of imported data:', data.shape)

Memory usage of data:

In [ ]:
print('Memory usage of `data` (in bytes):', pd.DataFrame.memory_usage(data, index=True, deep=True).sum())

Structural Data Analysis

From the data description on the Kaggle Porto Seguro Challenge website, the following information on the features in data is provided:

  • Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).
  • Feature names include the postfix bin to indicate binary features and cat to indicate categorical features.
  • Features without these designations are either continuous or ordinal.
  • Values of -1 indicate that the feature was missing from the observation.
  • The target column target signifies whether or not a claim was filed for that policy holder.

The variable target is therefore the label used to train the classifiers. Data types for all columns in data are shown below:

In [ ]:
# data types
data.info()

Both integer and float data types are present; the variable id uniquely identifies data records; a quick check shows that no duplicate record exists:

In [ ]:
# duplicates? None
data.drop_duplicates().shape
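
The claim that id uniquely identifies the records can also be verified explicitly (a quick additional check):

In [ ]:
# id should uniquely identify records: True if no id value appears more than once
print('id is unique:', data['id'].is_unique)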

One continues by checking the first 10 entries of the data dataset before moving to a high-level summary using describe().

In [ ]:
# imported data: first 10 entries
data.head(10).T
In [ ]:
# imported data: statistical summaries with describe
data.describe().T

Missing Values

Missing values are encoded as -1, as discussed in the official data description on the Kaggle Porto Seguro Challenge website; the following code therefore provides an overview of all variables with missing values:

In [ ]:
# missing values (encoded as -1)
feat_missing = []

for f in data.columns:
    missings = data[data[f] == -1][f].count()
    if missings > 0:
        feat_missing.append(f)
        missings_perc = missings/data.shape[0]
        
        # printing summary of missing values
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

# how many variables do present missing values?
print()
print('In total, there are {} variables with missing values'.format(len(feat_missing)))

Missing value imputation will be discussed in the section Imputation of missing values. Due to the high number and different types of features in data, a univariate analysis is performed to gather insights on the provided data, as shown in the forthcoming section.

Univariate Analysis

Meta-Information Encoding

One starts the univariate analysis of data by recording the feature type into a meta-information data frame, as shown in B. Carremans' kernel. This encoding will be extremely useful in the following, allowing for a feature-type-targeted analysis.

In [ ]:
# B. Carremans: recording meta-information for each column in train, following the official data description on the Kaggle Porto Seguro Challenge
info = []

for f in data.columns:

    # defining the role (target and id have to be separated from the other features)
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # defining the levels    
    
    # _bin postfix = binary feature (target is binary as well)
    if 'bin' in f or f == 'target':
        level = 'binary'
    
    # _cat postfix = categorical feature
    elif 'cat' in f or f == 'id':
        level = 'categorical'
        
    # continuous or ordinal features: those which are neither _bin nor _cat    
    elif data[f].dtype == float:
        level = 'interval'
    else:
        level = 'ordinal'    
        
    # initialize 'keep' to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # defining the data type 
    dtype = data[f].dtype
    
    # creating a dictionary that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    info.append(f_dict)

# collecting all meta-information into a meta dataframe
meta = pd.DataFrame(info, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace = True)

In summary, the variables in data are of level categorical, binary, ordinal (i.e. categorical variables with an ordering of levels) and interval (i.e. discrete or semi-continuous numerical variables). By construction, the meta data frame collects all this meta-information.

In [ ]:
# showing meta-information data frame
print(meta.shape)
meta

An aggregated view of meta is obtained by grouping by role and level:

In [ ]:
# showing meta-information aggregated view
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()

Frequency tables, bar-plots and histograms

Thanks to the meta-information encoding, a quick check on frequency tables for all variables in data is easily performed; for each feature level (i.e. binary, ordinal, categorical and interval) one generates a dictionary whose keys are the features of the selected level and whose values are the frequencies of the corresponding feature levels. More explicitly:

Binary

In [ ]:
# choose feature level 
level = 'binary'

# creating the dictionary with feature level counts 
ctabs = {}

for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f]=( pd.value_counts(data[f]) / data.shape[0] ) * 100
    
# printing the dictionary, with rounding of frequencies
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print() 

Categorical

The 14 categorical variables show different numbers of levels; they are summarized using the following code:

In [ ]:
# categorical variables: summary of distinct levels

# choose feature level 
level = 'categorical'

for f in meta[(meta.level == 'categorical') & (meta.keep)].index:
    dist_values = data[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
In [ ]:
# categorical variables: tabulation

# choose feature level 
level = 'categorical'

# creating the dictionary with feature level counts 
ctabs = {}

for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
    
# printing the dictionary, with rounding of frequencies    
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print() 

Interval

Interval variables are characterized by quite different distributions; in fact, some variables are semi-continuous (e.g. ps_reg_03 or ps_car_12), while others are discretized (e.g. ps_reg_01 or ps_calc_01): the code sample below provides a complete overview.

In [ ]:
# interval variables: tabulation 

# choose feature level 
level = 'interval'

# creating the dictionary with feature level counts (missing values, encoded as -1, appear as a separate level)
ctabs = {}

for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
    
# printing the dictionary, with rounding of frequencies    
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print() 

Therefore, ps_reg_01, ps_reg_02, ps_calc_01, ps_calc_02 and ps_calc_03 are displayed with bar charts, while ps_reg_03, ps_car_12, ps_car_13, ps_car_14 and ps_car_15 are displayed with histograms (missing values are removed to improve the visualization).

In [ ]:
# producing bar charts for selected variables
v = list(['ps_reg_01', 'ps_reg_02', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03'])
sns.set( rc = {'figure.figsize': (10, 10)})

for f in v:
    plt.figure()
    sns.countplot(x = f, data = data, linewidth=2, palette="Blues")
    plt.show()

The variables ps_calc_01, ps_calc_02 and ps_calc_03 show a highly homogeneous distribution.

In [ ]:
# producing histograms for selected variables
v = list(['ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15'])
sns.set( rc = {'figure.figsize': (10, 10)})

for f in v:
    plt.figure()
    sns.distplot(data.loc[data[f] != -1, f], kde = False)
    plt.show()

The variables ps_reg_03, ps_car_12, ps_car_13 and ps_car_14 show right-skewed distributions. Additional considerations on the interval car variables are collected in the forthcoming section.

On interval car variables: ps_car_12, ..., ps_car_15

car variables are vehicle-related features, as stated in the Porto Seguro Kaggle challenge description. Therefore, it would be interesting to make inferences about the nature of such variables, within the limits imposed by anonymization and the lack of additional information. The most relevant case is ps_car_15, which appears to be the square root of an integer-valued feature:

In [ ]:
squared = data['ps_car_15'] ** 2
squared.value_counts()

Therefore, one could conjecture that ps_car_15**2 represents the car's year of manufacture, with 0.0 = 2000, 1.0 = 2001, ..., 14.0 = 2014, although no final confirmation of this statement is possible. Going backwards, one can try squaring the remaining car variables and attempt to map the results to typical car-related information used in insurance (engine displacement, weight, value, etc.) for the purpose of feature selection; a quick sketch is given below. However, in the absence of additional information on the provided car variables, no further interpretation is attempted.
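
The sketch below squares the remaining interval car variables and counts the distinct values obtained; it is purely exploratory, and no interpretation of the results is claimed.

In [ ]:
# squaring the remaining interval car variables and counting distinct values (exploratory only, missing values excluded)
for f in ['ps_car_12', 'ps_car_13', 'ps_car_14']:
    squared = (data.loc[data[f] != -1, f] ** 2).round(4)
    print('{}: {} distinct squared values'.format(f, squared.nunique()))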

Ordinal

The 16 ordinal variables are quickly displayed below. All variables except ps_ind_01, ps_ind_03, ps_ind_14, ps_ind_15 and ps_car_11 are of the calc type.

In [ ]:
# ordinal features: visualization

# choose feature level 
level = 'ordinal'

# producing histograms
v = meta[(meta.level == level) & (meta.keep)].index
sns.set( rc = {'figure.figsize': (10, 10)})

for f in v:
    plt.figure()
     
    sns.distplot(data[f], kde = False)
    plt.show()

ps_ind_14 is strongly skewed across its ordinal levels, with about 99% of records concentrated at ps_ind_14 == 0 and a monotonic decrease over the remaining levels. For ps_car_11, on the other hand, the trend is monotonically increasing, while level 5 of ps_ind_01 breaks monotonicity. An overview of the ordinal features is provided by tabulating the level frequencies, as done below.

In [ ]:
# ordinal features: tabulation

# creating the dictionary with feature level counts (missing values, encoded as -1, appear as a separate level)
ctabs = {}
level = 'ordinal'

for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
    
# printing the dictionary    
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print() 

target variable: class imbalance

The target variable in the data dataset is highly imbalanced, as shown in the following code chunk:

In [ ]:
# levels for the target variable 
lev_target = ( pd.crosstab(index = data['target'], columns = 'Frequency') / data.shape[0] ) * 100
lev_target.round(2)

Data Visualization and Multivariate Analysis

Analysis of correlation for interval features

Pearson and Spearman correlations of the interval variables are discussed.

In [ ]:
# Pearson correlation matrix: computation and visualization

# use method='pearson' resp. method='spearman' to compute Pearson resp. Spearman correlations
def corr_heatmap(v):
    correlations = data[v].corr(method='pearson')
    fig = plt.subplots(figsize=(10, 10))

    sns.heatmap(correlations,  center=0, fmt='.2f', cbar=False,
                square=True, linewidths=1, annot=True,  cmap="YlGnBu")
    plt.xticks(rotation=90) 
    plt.yticks(rotation=0) 
    plt.show()

# one applies the corr_heatmap function on the interval features    
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)

Scatterplots of the highly correlated interval variables are provided below instead (warning: computationally intensive visualization).

In [ ]:
# scatterplot of highly correlated interval variables
high = pd.Index(['ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_15'])
pd.plotting.scatter_matrix(data[high], alpha = 0.2, figsize = (40, 40), diagonal = 'kde')

target vs. features

target vs. interval features

The 10 interval features can be plotted against the target variable in either jitter plots or boxplots for the purpose of visually exploring their distributions. Below, a simple code chunk producing jitter plots for the variables ps_reg_03 and ps_car_12 is provided; however, for the sake of clarity in the exposition, boxplots will be preferred.

In [ ]:
# jitter plots: an example
sns.set( rc = {'figure.figsize': (10, 10)})
feat = list(['ps_reg_03', 'ps_car_12'])

for f in feat:
    plt.figure()
    sns.stripplot(y=f, x='target', data=data, jitter=True, palette="Blues")
    plt.show()

Additional diagrams are given below. One starts with ps_reg_01:

In [ ]:
# ps_reg_01
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_reg_01', data=data, linewidth=2, palette="Blues")

No relevant discrimination of the variable distributions along the target levels is detected. Then ps_reg_02:

In [ ]:
# ps_reg_02
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x= "target", y ="ps_reg_02", data=data, linewidth=2, palette="Blues")

The distribution along the target=1 level shows a wider spread than the distribution along target=0. A few outliers are detected. One continues with ps_reg_03:

In [ ]:
# ps_reg_03
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y ='ps_reg_03', data=data, linewidth=2, palette="Blues")

A substantial number of outliers is identified; missing values are present for records with both values of the target variable. On the other hand, for ps_car_12:

In [ ]:
# ps_car_12
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_12', data=data, linewidth=2, palette="Blues")

Missing values are not present for records with target=1. Moving to ps_car_13:

In [ ]:
# ps_car_13
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_13', data=data, linewidth=2, palette="Blues")

Outliers are strongly present, especially for target=0. Then ps_car_14:

In [ ]:
# ps_car_14
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_14', data=data, linewidth=2, palette="Blues")

Outliers and missing values for both target levels are detected. Then ps_car_15:

In [ ]:
# ps_car_15
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_15', data=data, linewidth=2, palette="Blues")

ps_car_15 also shows outliers and missing values for both target levels. We finish with the calc features, starting with ps_calc_01:

In [ ]:
# ps_calc_01
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_01', data=data, linewidth=2, palette="Blues")

Continuing with ps_calc_02

In [ ]:
# ps_calc_02
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_02', data=data, linewidth=2, palette="Blues")

and finishing with ps_calc_03

In [ ]:
# ps_calc_03
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_03', data=data,linewidth=2, palette="Blues")

All calc variables, i.e. ps_calc_01, ps_calc_02 and ps_calc_03, show no discrimination along the target variable levels.

target vs. binary, categorical and ordinal features

One moves on to the visual analysis of target against the remaining variables in data, together with cross-tabulations for the binary features.

In [ ]:
# binary features
sns.set( rc = {'figure.figsize': (10, 10)})
feat = meta[(meta.level == 'binary') & (meta.keep)].index

for f in feat:
    plt.figure()
    sns.countplot(x=data[f], hue=data.target, data=data, palette="Blues")
    plt.show()
In [ ]:
# binary features: cross-tabulation (rows: feature levels; columns: target 0/1 as row proportions; 'All' column: absolute counts)
v = meta[(meta.level == 'binary') & (meta.keep)].index.drop('target')

for f in v:
    crosstab = pd.crosstab(index=data[f], columns=data['target'], margins=True) 
    cross = pd.DataFrame(data=crosstab.div(crosstab['All'], axis=0).drop('All', axis=1))
    cross['All'] = crosstab.iloc[:,2]
    print(cross)
    print()    
    print()

A quick visual check suggests that no binary variable has a level whose target distribution deviates significantly from the global 96.4-3.6 noise-signal ratio. Moreover, no zero-frequency cell is identified; zero cells can lead to numerical instability during training (e.g. for logistic regression models). A complementary chi-square check is sketched below, before moving on to the cross-tabulation of target against the ordinal variables.
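
The sketch below runs a chi-square test of independence between each binary feature and target using scipy.stats.chi2_contingency; note that with roughly 600,000 records even tiny deviations can turn out statistically significant, so the p-values should be read with care.

In [ ]:
# chi-square test of independence: binary features vs. target
from scipy.stats import chi2_contingency

v = meta[(meta.level == 'binary') & (meta.keep)].index.drop('target')

for f in v:
    table = pd.crosstab(data[f], data['target'])
    chi2_stat, p_value, dof, _ = chi2_contingency(table)
    print('{}: chi2 = {:.1f}, p-value = {:.4f}'.format(f, chi2_stat, p_value))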

In [ ]:
# ordinal features: cross-tabulation (rows: feature levels; columns: target 0/1 as row proportions; 'All' column: absolute counts)
v = meta[(meta.level == 'ordinal') & (meta.keep)].index

for f in v:
    crosstab = pd.crosstab(index=data[f], columns=data['target'], margins=True) 
    cross = pd.DataFrame(data=crosstab.div(crosstab['All'], axis=0).drop('All', axis=1))
    cross['All'] = crosstab.iloc[:,2]
    print(cross)
    print()    
    print()    

The calc variables are characterized by zero cells for a limited number of levels. For example, zero cells appear for ps_calc_12 at levels ps_calc_12 == 9 and ps_calc_12 == 10. The number of records associated with levels for which complete separation holds, i.e. for which zero cells emerge in the cross-tabulation against target, is extremely low for all calc variables (see the quick check below). In the absence of additional information on the provided data, no inference on those data points is possible: they could represent outliers to remove before modeling, or records for which the absence of target == 1 is fully justified by business considerations.
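
As an illustration for ps_calc_12 only, the quick check below counts the records falling into the zero-cell levels and shows their target values:

In [ ]:
# records in the zero-cell levels of ps_calc_12 (levels 9 and 10) and their target values
rare = data[data['ps_calc_12'] >= 9]
print('Records with ps_calc_12 >= 9:', rare.shape[0])
print(rare['target'].value_counts())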

Feature Engineering & Data Preparation for Modeling

In this section we perform a series of transformations on data to pave the way to predictive modeling.

On numerical calc variables and univariate screening

As mentioned in the document, the numerical calc variables are dropped from further analysis; a quick structural check on data is performed both before and after the drop.

In [ ]:
print('Structure of data before calc variable drop:', data.shape)
In [ ]:
# dropping 'ps_calc_01',... 'ps_calc_14' variables and updating meta information
vars_to_drop = ['ps_calc_01', 'ps_calc_02','ps_calc_03','ps_calc_04','ps_calc_05','ps_calc_06','ps_calc_07','ps_calc_08','ps_calc_09','ps_calc_10','ps_calc_11','ps_calc_12','ps_calc_13','ps_calc_14']
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False  
In [ ]:
print('Structure of data after calc variable drop:', data.shape)
In [ ]:
data.info()

Imputation of missing values

As discussed in the document, a straightforward imputation scheme is chosen and applied in the forthcoming code chunk:

In [ ]:
# dropping 'ps_car_03_cat', 'ps_car_05_cat' and updating meta information
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False  

# imputing with the mean or mode using Imputer from sklearn.preprocessing
# (note: in recent scikit-learn versions, Imputer has been replaced by SimpleImputer from sklearn.impute)
from sklearn.preprocessing import Imputer

mean_imp = Imputer(missing_values = -1, strategy = 'mean', axis = 0)
mode_imp = Imputer(missing_values = -1, strategy = 'most_frequent', axis = 0)

data['ps_reg_03'] = mean_imp.fit_transform(data[['ps_reg_03']]).ravel()
data['ps_car_12'] = mean_imp.fit_transform(data[['ps_car_12']]).ravel()
data['ps_car_14'] = mean_imp.fit_transform(data[['ps_car_14']]).ravel()
data['ps_car_11'] = mode_imp.fit_transform(data[['ps_car_11']]).ravel()

After imputation one has

In [ ]:
print(data.shape)

Dummies

Dummy variables for the categorical features in data are created. With drop_first = True, the function get_dummies converts each categorical variable into dummies, drops the original categorical column from the resulting dataset and also drops the first level; plain one-hot encoding, by contrast, keeps a dummy for every level.
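
The difference between plain one-hot encoding and drop_first = True can be seen on a toy data frame (a minimal illustration with a hypothetical colour column, unrelated to the competition data):

In [ ]:
# toy illustration: get_dummies with and without drop_first
toy = pd.DataFrame({'colour': ['red', 'blue', 'green', 'blue']})
print(pd.get_dummies(toy, columns=['colour']))                    # one-hot encoding: one dummy per level
print(pd.get_dummies(toy, columns=['colour'], drop_first=True))   # first level dropped: one dummy less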

In [ ]:
# selecting categorical variables
v = meta[(meta.level == 'categorical') & (meta.keep)].index
print('Before dummification we have {} variables in data'.format(data.shape[1]))

# creating dummy variables
data = pd.get_dummies(data, columns = v, drop_first = True)
print('After dummification we have {} variables in data'.format(data.shape[1]))

After generation of dummy variables, the memory usage of the data dataset is increased:

In [ ]:
print('Memory usage of `data` (in bytes):', pd.DataFrame.memory_usage(data,index=True, deep = True).sum())

One can check that no original categorical column is present in the resulting dataset, as expected:

In [ ]:
# no original categorical column should remain after dummification (expected result: empty index)
v = meta[(meta.level == 'categorical') & (meta.keep)].index
data.columns.intersection(v)

A quick check on the columns of data after dummification ends this section:

In [ ]:
print(data.columns.values)

On Randomness

We now fix a random_state for the reproducibility of results; it will be used both in the train vs. test split and in modeling.

In [ ]:
random_state = 123

train and test datasets

As mentioned in the document, the data dataset is randomly split into train and test datasets. A non-stratified approach is chosen, and the columns id and target are removed from the feature matrices.

In [ ]:
# split 80-20% (no stratification)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['id', 'target'], axis=1), 
                                                    data['target'], 
                                                    test_size=0.2,
                                                    random_state=random_state
                                                   )
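
For completeness, a stratified split preserving the class imbalance in both partitions could be obtained by passing the stratify argument; the sketch below is not executed in the remainder of the analysis and is shown only as an alternative.

In [ ]:
# alternative (not used below): stratified 80-20% split preserving the target imbalance
# X_train, X_test, y_train, y_test = train_test_split(data.drop(['id', 'target'], axis=1),
#                                                     data['target'],
#                                                     test_size=0.2,
#                                                     stratify=data['target'],
#                                                     random_state=random_state)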

After the split, the following checks on the train and test datasets are performed:

In [ ]:
# structural checks
print('Training dataset - dimensions:', X_train.shape)
print('Test dataset - dimensions:', X_test.shape)
print()
print('Random split check:', X_train.shape[0] + X_test.shape[0] == data.shape[0])
print()

# imbalancing: check
lev_target = ( pd.crosstab(index = data['target'], columns = 'count') / data.shape[0] ) * 100
lev_target_train = ( pd.crosstab(index = y_train, columns = 'count') / y_train.shape[0] ) * 100
lev_target_test = ( pd.crosstab(index = y_test, columns = 'count') / y_test.shape[0] ) * 100

print('target class imbalance data:')
print(lev_target)
print()
print('target class imbalance train:')
print(lev_target_train)
print()
print('target class imbalance test:')
print(lev_target_test)

We then remove the original dataset data to free up some memory.

In [ ]:
del data

We can export both X_train and X_test for later use, if needed:

In [ ]:
# insert paths to export training and test data sets
X_train.to_csv('...')
X_test.to_csv('...')

Modeling

Modeling Strategies: short description

For the overview of the forthcoming modeling strategies we refer to the document.

On the Porto Seguro challenge performance metric: normalized Gini coefficient

In the Porto Seguro Kaggle challenge, fitted models are evaluated on out-of-sample data with an ad-hoc performance measure, called the normalized Gini coefficient, as discussed here. The code for its computation in different programming languages is provided in this thread; the Python implementation is presented below.

In [ ]:
from sklearn.metrics import make_scorer

# Gini coefficient
def gini(actual, pred):
    
    # a structural check
    assert (len(actual) == len(pred))
    
    # introducing an array called all (shadowing the built-in name, as in the original Kaggle thread)
    all = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    
    # sorting by predicted values (descending order), breaking ties by the original index all[:, 2]
    all = all[np.lexsort((all[:, 2], -1 * all[:, 1]))]

    # towards the Gini coefficient
    totalLosses = all[:, 0].sum()
    giniSum = all[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

# normalized Gini coefficient
def gini_normalized_score(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

# score using the normalized Gini
score_gini = make_scorer(gini_normalized_score, greater_is_better=True, needs_threshold = True)
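
As a quick sanity check on synthetic labels and scores (a minimal sketch), one can verify that the normalized Gini coefficient matches 2*AUC - 1 when the scores contain no ties; this identity is also used further below to convert maximum AUC values into Gini coefficients.

In [ ]:
# sanity check on synthetic data: normalized Gini vs. 2*AUC - 1
rng = np.random.RandomState(random_state)
y_true = rng.binomial(1, 0.04, size=10000)           # imbalanced labels, similar to target
y_score = rng.uniform(size=10000) + 0.1 * y_true     # noisy scores with a weak signal
print('Normalized Gini:', gini_normalized_score(y_true, y_score).round(4))
print('2 * AUC - 1:    ', (2 * roc_auc_score(y_true, y_score) - 1).round(4))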

Boosting: AdaBoost

We start the boosting analysis with AdaBoost, implementing three separate strategies: ST0, ST1 and ST2.

ST0: default AdaBoost with AdaBoostClassifier()

Default AdaBoost with AdaBoostClassifier() is used for baseline modeling; we compare the SAMME and SAMME.R algorithms.

In [ ]:
# weak learner: tree stump 
tree = DecisionTreeClassifier(criterion='gini',                 #only Gini and information gain criteria are supported
                              random_state= random_state,       #DecisionTreeClassifier() is not fully deterministic 
                              max_depth=1)                      #stump
tree = tree.fit(X_train, y_train)

# SAMME AdaBoost
ada = AdaBoostClassifier(base_estimator=tree, 
                         algorithm='SAMME',
                         random_state= random_state)
ada = ada.fit(X_train, y_train)

# SAMME.R AdaBoost
ada_R = AdaBoostClassifier(base_estimator=tree, 
                           algorithm='SAMME.R', 
                           random_state= random_state)
ada_R = ada_R.fit(X_train, y_train)
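
The default hyper-parameters of AdaBoostClassifier() can be inspected directly on the fitted classifier before evaluating performance (a quick check):

In [ ]:
# confirming the default hyper-parameters of the SAMME AdaBoost classifier
print('n_estimators:', ada.get_params()['n_estimators'])
print('learning_rate:', ada.get_params()['learning_rate'])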

The default AdaBoost implementation uses n_estimators=50 and learning_rate=1.0. The out-of-sample performance of both SAMME and SAMME.R is computed below:

In [ ]:
# SAMME AdaBoost classifier performance
y_pred_proba_ada = ada.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_ada[:, 1])
roc_auc = auc(fpr, tpr)

# SAMME.R AdaBoost classifier performance
y_pred_proba_ada_R = ada_R.predict_proba(X_test)
fpr_R, tpr_R, _ = roc_curve(y_test, y_pred_proba_ada_R[:, 1])
roc_auc_R = auc(fpr_R, tpr_R)

# AUC on test dataset
print('AUC SAMME AdaBoost:', roc_auc_score(y_test, y_pred_proba_ada[:, 1]).round(3))
print('AUC SAMME.R AdaBoost:', roc_auc_score(y_test, y_pred_proba_ada_R[:, 1]).round(3))

ST1 (interlude): exploratory comparison between SAMME and SAMME.R

In [ ]:
# number of boosting iterations (insert the desired value here; 400 is used as an illustrative choice)
n_estimators = 400

# tree stump as base learner
tree = DecisionTreeClassifier(criterion='gini', 
                               random_state=random_state,
                               max_depth=1)
tree.fit(X_train, y_train)

# SAMME AdaBoost
ada = AdaBoostClassifier(base_estimator=tree,
                         algorithm='SAMME',
                         n_estimators=n_estimators,
                         random_state=random_state
                        )
ada.fit(X_train, y_train)

# SAMME.R AdaBoost
ada_R = AdaBoostClassifier(base_estimator=tree,
                           algorithm='SAMME.R',
                           n_estimators=n_estimators,
                           random_state=random_state
                          )       
ada_R.fit(X_train, y_train)
In [ ]:
# plotting in-sample and out-of-sample AUC of both the SAMME and SAMME.R AdaBoost algorithms as a function of boosting iterations;
# the maximum number of boosting iterations is given by n_estimators
fig = plt.figure()
ax = fig.add_subplot(111)

ada_err_test = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada.staged_predict_proba(X_test)):
    ada_err_test[i] = roc_auc_score(y_test, y_pred[:,1])
     
ada_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada.staged_predict_proba(X_train)):
    ada_err_train[i] =  roc_auc_score(y_train,  y_pred[:,1])        
        
ada_R_err_test = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_R.staged_predict_proba(X_test)):
    ada_R_err_test[i] =  roc_auc_score(y_test, y_pred[:,1])

ada_R_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_R.staged_predict_proba(X_train)):
    ada_R_err_train[i] =  roc_auc_score(y_train, y_pred[:,1])

ax.plot(np.arange(n_estimators) + 1, ada_err_test,
        label='SAMME AdaBoost Test AUC',
        color='red')

ax.plot(np.arange(n_estimators) + 1, ada_err_train,
        label='SAMME AdaBoost Train AUC',
        color='blue')

ax.plot(np.arange(n_estimators) + 1, ada_R_err_test,
        label='SAMME.R AdaBoost Test AUC',
        color='orange')

ax.plot(np.arange(n_estimators) + 1, ada_R_err_train,
        label='SAMME.R AdaBoost Train AUC',
        color='green')        
        
ax.set_ylim((0.5, 0.8))
ax.set_xlabel('n_estimators')
ax.set_ylabel('AUC')

leg = ax.legend(loc='lower right', fancybox=True)
leg.get_frame().set_alpha(0.7) 
In [ ]:
# save plot (insert path to .png and .pdf)
plt.savefig(r'...png')
plt.savefig(r'...pdf')
In [ ]:
# max Gini coefficient for each curve

# ada
print('Max SAMME train Gini:', (2*np.amax(ada_err_train)-1))
print('Max SAMME test Gini:',(2*np.amax(ada_err_test)-1))

# ada_R
print('Max SAMME.R train Gini:',(2*np.amax(ada_R_err_train)-1))
print('Max SAMME.R test Gini:',(2*np.amax(ada_R_err_test)-1))
In [ ]:
# similarly, max AUC for each curve

print('Max SAMME train AUC:', np.amax(ada_err_train))
print('Max SAMME test AUC:',np.amax(ada_err_test))

# ada_R
print('Max SAMME.R train AUC:', np.amax(ada_R_err_train))
print('Max SAMME.R test AUC:', np.amax(ada_R_err_test))
In [ ]:
# addendum: retrieve position maximum test AUC SAMME.R
np.argmax(ada_R_err_test)  # if x returned, the number of iterations is x+1

ST2: hyperparameter tuning: n_estimators, max_depth and learning_rate

In [ ]:
# GridSearch AdaBoost optimization (change the hyperparameters according to the run under consideration)
param_grid = {
              'learning_rate': [0.001, 0.01, 0.1, 1],
              'n_estimators': [100, 200, 300, 400]
             }

# tree as base learner
tree = DecisionTreeClassifier(criterion='gini',
                              random_state=random_state,
                              max_depth=1)

boost = AdaBoostClassifier(base_estimator=tree,
                           random_state=random_state)

# cross-validation
cv = StratifiedKFold(n_splits=5,
                     shuffle=False)   # random_state only has an effect when shuffle=True

grid_boost = GridSearchCV(estimator=boost, 
                          param_grid=param_grid, 
                          scoring=score_gini, 
                          cv=cv, 
                          verbose=10)

grid_boost.fit(X_train, y_train)
In [ ]:
# best model/classifier
clf_b = grid_boost.best_estimator_
clf_b
In [ ]:
# computing out-of-sample AUC for best classifier
clf_b.fit(X_train, y_train)

# test data: computing AUC
y_scores = clf_b.decision_function(X_test) 
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# plotting ROC and showing AUC (on test data)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()    

# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_scores).round(3))

Boosting: XGBoost

We continue the boosting analysis with XGBoost, implementing two separate strategies: ST0 and ST1.

ST0: default XGBoost with XGBClassifier()

Below, the hyper-parameters of the default XGBClassifier() implementation are displayed and the model is fitted on X_train.

In [ ]:
# default XGBoost classifier: displaying its hyper-parameters
xgb_default = XGBClassifier(random_state=random_state)
xgb_default.get_params()
In [ ]:
# fitting the default XGBoostClassifier()
xgb_default.fit(X_train, y_train)
In [ ]:
# default XGBoost classifier performance
y_pred_proba_xgb_default = xgb_default.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_xgb_default[:, 1])
roc_auc = auc(fpr, tpr)

# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_pred_proba_xgb_default[:, 1]).round(3))

ST1: hyperparameter tuning: max_depth, learning_rate, n_estimators with fixed subsampling

In [ ]:
# GridSearch XGBoost optimization (change the hyperparameters according to the run under consideration)
param_grid = {'max_depth': [1],                                                            
              'learning_rate': [0.001, 0.01, 0.1, 1],
              'n_estimators': [100, 300, 500],
              'subsample': [0.5],
              'colsample_bytree': [0.75]
             } 

# cross-validation
cv = StratifiedKFold(n_splits=5,
                     shuffle=False)   # random_state only has an effect when shuffle=True

xgb = GridSearchCV(estimator=XGBClassifier(random_state=random_state), 
                   param_grid=param_grid, 
                   scoring=score_gini, 
                   cv=cv, 
                   verbose=10)

xgb.fit(X_train, y_train)
In [ ]:
# best score on CV-data
print('Best Gini on CV-data:', xgb.best_score_)

# parameters for best model
print('Best set of hyperparameters:', xgb.best_params_)
In [ ]:
# best xgb model
xgb_best = xgb.best_estimator_
xgb_best.fit(X_train, y_train)

# best xgb predictions on test data and AUC
y_pred_proba_xgb = xgb_best.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_xgb[:, 1])
roc_auc = auc(fpr, tpr)

# plotting ROC and showing AUC (on test data)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()    

# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_pred_proba_xgb[:, 1]).round(3))

Finally, we plot the feature importances of the best XGBoost model found in ST1.

In [ ]:
# fitting the best model again (if not saved beforehand), using the hyperparameters selected by the grid search;
# alternatively, insert the values of xgb.best_params_ manually
xgb_best_ov = XGBClassifier(max_depth=xgb.best_params_['max_depth'],
                            learning_rate=xgb.best_params_['learning_rate'],
                            n_estimators=xgb.best_params_['n_estimators'],
                            subsample=0.5,
                            colsample_bytree=0.75,
                            random_state=random_state)
xgb_best_ov.fit(X_train, y_train)
In [ ]:
# plotting feature importance graph: top 10 features
plot_importance(xgb_best_ov, max_num_features=10)
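
As a complement to the plot, the raw importance scores of the fitted xgb_best_ov model can also be inspected as a sorted table (a minimal sketch):

In [ ]:
# feature importances as a sorted Series (top 10), matching the bar chart above
importances = pd.Series(xgb_best_ov.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10)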