Predictive modeling on data from the Porto Seguro's Safe Driver Prediction competition hosted on the Kaggle platform is performed leveraging machine learning boosting algorithms (AdaBoost and XGBoost). We refer to the document "On Boosting: Theory and Applications" to complement the analysis presented in this notebook.
In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP 8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.
# Notebook settings
###################
# resetting variables
get_ipython().magic('reset -sf')
# formatting: cell width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# plotting
%matplotlib inline
# importing xgboost
# REMARK: run this cell only if the xgboost import below fails; delete it once xgboost is installed
# (on recent pip versions, prefer running 'pip install xgboost' from a shell, since pip.main is no longer a public API)
import pip
pip.main(['install', 'xgboost'])
# loading Python packages
#########################
# scientific packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy
# boosting
import xgboost
from xgboost import XGBClassifier
from xgboost import plot_importance
# scipy
from scipy.stats import chi2
# pandas: selected modules and functions
from pandas.plotting import scatter_matrix
# sklearn: selected modules and functions
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.utils import shuffle
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import zero_one_loss
# utilities
from datetime import datetime
For the given Kaggle competition, two datasets are provided: one containing the training data and the other used for generating the predictions to be submitted. The latter lacks the target variable (denoted by target in the former) and is not imported in this notebook; the whole analysis is therefore carried out on the original 'training' data. A remark on the data acquisition procedure: the data are downloaded directly from the Kaggle Porto Seguro Challenge website into a landing folder on a local machine; the original files are then copied to a separate folder and unzipped there. The unzipped data are imported using pandas as follows:
# data import: specify the path to the competition training data
path = '...'
data = pd.read_csv(path)
The imported dataset, i.e. data, comprises 595212 records and 59 columns:
print('Structure of imported data:', data.shape)
Memory usage of data:
print('Memory usage of `data` (in bytes):', pd.DataFrame.memory_usage(data, index=True, deep=True).sum())
From the data description on the Kaggle Porto Seguro Challenge website, the following information on the features in data is provided: features belonging to similar groupings are tagged as such in their names (ind, reg, car, calc); the postfix bin indicates binary features and cat indicates categorical features; a value of -1 indicates that the feature was missing from the observation; target signifies whether or not a claim was filed for that policy holder. The variable target is therefore the label used to train the classifiers. Data types for all columns in data are shown below:
# data types
data.info()
Both integer and float data types are present; the variable id uniquely identifies data records, and a quick check shows that no duplicate record exists:
# duplicates? None
data.drop_duplicates().shape
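As a complementary check (a minimal sketch), one can verify directly that id takes a distinct value on every record:
# id uniqueness check: the number of distinct id values must equal the number of records
print('id is unique:', data['id'].nunique() == data.shape[0])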
One continues by checking the first 10 entries of the data dataset before moving to a high-level summary using describe().
# imported data: first 10 entries
data.head(10).T
# imported data: statistical summaries with describe
data.describe().T
Missing values are encoded as -1, as discussed in the official data description on the Kaggle Porto Seguro Challenge website; the following code therefore provides an overview of all variables with missing values:
# missing values (encoded as -1)
feat_missing = []
for f in data.columns:
    missings = data[data[f] == -1][f].count()
    if missings > 0:
        feat_missing.append(f)
        missings_perc = missings / data.shape[0]
        # printing a per-variable summary of missing values
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))
# how many variables present missing values?
print()
print('In total, there are {} variables with missing values'.format(len(feat_missing)))
Missing value imputation will be discussed in section Imputation of missing values.
Due to the high number and different types of the features in data, a univariate analysis is performed to gather insights on the provided data, as shown in the forthcoming section.
One starts the univariate analysis of data by recording the feature type in a meta-information data frame, as shown in B. Carremans' kernel. This encoding will be extremely useful in the following, allowing for a feature-type-targeted analysis.
# B. Carremans: recording meta-information for each column in train, following the official data description on the Kaggle Porto Seguro Challenge
info = []
for f in data.columns:
    # defining the role (target and id have to be separated from the other features)
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
    # defining the levels
    # _bin postfix = binary feature (target is binary as well)
    if 'bin' in f or f == 'target':
        level = 'binary'
    # _cat postfix = categorical feature
    elif 'cat' in f or f == 'id':
        level = 'categorical'
    # continuous or ordinal features: those which are neither _bin nor _cat
    elif data[f].dtype == float:
        level = 'interval'
    else:
        level = 'ordinal'
    # initialize 'keep' to True for all variables except id
    keep = True
    if f == 'id':
        keep = False
    # defining the data type
    dtype = data[f].dtype
    # creating a dictionary that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    info.append(f_dict)
# collecting all meta-information into a meta dataframe
meta = pd.DataFrame(info, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace = True)
In summary, the variables in data are of level categorical, binary, ordinal (i.e. categorical variables with an ordering of levels) and interval (i.e. discrete or semi-continuous numerical variables). The meta data frame collects all this meta-information, by construction.
# showing meta-information data frame
print(meta.shape)
meta
An aggregated view of meta is obtained by grouping by role and level:
# showing meta-information aggregated view
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
Thanks to the meta-information encoding, a quick check of the frequency tables for all variables in data is easily performed; for each feature level (i.e. binary, ordinal, categorical and interval) one generates a dictionary whose keys are the features of the selected level and whose values are the frequencies of the corresponding feature levels. More explicitly:
# choose feature level
level = 'binary'
# creating the dictionary with feature level counts
ctabs = {}
for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
# printing the dictionary, with rounded frequencies
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print()
The 14 categorical variables show different numbers of levels; they are summarized using the following code:
# categorical variables: summary of distinct levels
# choose feature level
level = 'categorical'
for f in meta[(meta.level == level) & (meta.keep)].index:
    dist_values = data[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
# categorical variables: tabulation
# choose feature level
level = 'categorical'
# creating the dictionary with feature level counts
ctabs = {}
for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
# printing the dictionary, with rounded frequencies
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print()
Interval variables are characterized by quite different distributions; in fact, some variables are semi-continuous (e.g. ps_reg_03 or ps_car_12), while others are discretized (e.g. ps_reg_01 or ps_calc_01): the code sample below provides a complete overview.
# interval variables: tabulation
# choose feature level
level = 'interval'
# creating the dictionary with feature level counts (the -1 missing-value code appears as its own level)
ctabs = {}
for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
# printing the dictionary, with rounded frequencies
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print()
Therefore, ps_reg_01, ps_calc_01, ps_calc_02, ps_calc_03 and ps_reg_02 are displayed with bar charts, while ps_reg_03, ps_car_12, ps_car_13, ps_car_14 and ps_car_15 are displayed with histograms (missing values are removed to improve the visualization).
# producing bar charts for selected variables
v = list(['ps_reg_01', 'ps_reg_02', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03'])
sns.set( rc = {'figure.figsize': (10, 10)})
for f in v:
    plt.figure()
    sns.countplot(x = f, data = data, linewidth=2, palette="Blues")
    plt.show()
The variables ps_calc_01, ps_calc_02 and ps_calc_03 show a highly homogeneous distribution.
# producing histograms for selected variables
v = list(['ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15'])
sns.set( rc = {'figure.figsize': (10, 10)})
for f in v:
    plt.figure()
    sns.distplot(data.loc[data[f] != -1, f], kde = False)
    plt.show()
The variables ps_reg_03, ps_car_12, ps_car_13 and ps_car_14 show distributions with right tails. Additional considerations on the interval car variables are collected in the forthcoming section.
car variables: ps_car_12, ..., ps_car_15

car variables are vehicle-related features, as stated in the Porto Seguro Kaggle challenge description. It would therefore be interesting to make inferences about the nature of such variables, within the limits imposed by anonymization and the lack of additional information. The most relevant case is ps_car_15, which is obtained as the square root of a feature with integer values:
squared = data['ps_car_15'] ** 2
squared.value_counts()
Therefore, one could suggest that ps_car_15**2 represents the car's year of manufacture, with 0.0 = 2000, 1.0 = 2001, ..., 14.0 = 2014, but no final confirmation of this statement is possible. Going backwards, one can square the remaining car variables and try to map the results to typical insurance-relevant vehicle information (engine displacement, weight, value, etc.) for the purpose of feature selection. However, in the absence of additional information on the provided car variables, no deeper analysis is pursued; for completeness, a minimal sketch of the squaring exercise is shown below.
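The sketch below squares ps_car_12, ps_car_13 and ps_car_14 (an illustrative selection) and counts the distinct squared values; records carrying the -1 missing-value code are excluded, and any mapping to physical quantities remains speculative.
# squaring the remaining interval car variables and counting the distinct squared values
for f in ['ps_car_12', 'ps_car_13', 'ps_car_14']:
    squared = data.loc[data[f] != -1, f] ** 2
    print('{}: {} distinct squared values'.format(f, squared.nunique()))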
The 16 ordinal variables are quickly displayed below. All variables except ps_ind_01, ps_ind_03, ps_ind_14, ps_ind_15 and ps_car_11 are of the calc type.
# ordinal features: visualization
# choose feature level
level = 'ordinal'
# producing histograms
v = meta[(meta.level == level) & (meta.keep)].index
sns.set( rc = {'figure.figsize': (10, 10)})
for f in v:
    plt.figure()
    sns.distplot(data[f], kde = False)
    plt.show()
ps_ind_14 shows strong skewness across its ordinal levels, with about 99% of the records concentrated at ps_ind_14 == 0 and a monotonic decrease over the remaining levels. For ps_car_11, on the other hand, the trend is monotonically increasing. Level 5 of ps_ind_01 breaks monotonicity. An overview of the ordinal features is provided by tabulating the frequencies of their levels, as done below.
# ordinal features: tabulation
# creating the dictionary with feature level counts (the -1 missing-value code appears as its own level)
ctabs = {}
level = 'ordinal'
for f in meta[(meta.level == level) & (meta.keep)].index:
    ctabs[f] = ( pd.value_counts(data[f]) / data.shape[0] ) * 100
# printing the dictionary, with rounded frequencies
for f in ctabs.keys():
    print(ctabs[f].round(2))
    print()
target variable: class imbalance

The target variable in the data dataset is highly imbalanced, as shown in the following code chunk:
# levels for the target variable
lev_target = ( pd.crosstab(index = data['target'], columns = 'Frequency') / data.shape[0] ) * 100
lev_target.round(2)
Pearson and Spearman correlations of the interval variables are discussed next.
# Pearson correlation matrix: computation and visualization
# use method='pearson' resp. method='spearman' to compute Pearson resp. Spearman correlations
def corr_heatmap(v):
    correlations = data[v].corr(method='pearson')
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(correlations, center=0, fmt='.2f', cbar=False,
                square=True, linewidths=1, annot=True, cmap="YlGnBu")
    plt.xticks(rotation=90)
    plt.yticks(rotation=0)
    plt.show()
# one applies the corr_heatmap function on the interval features
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)
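For completeness, the Spearman rank correlations of the same interval features can be inspected with an analogous heatmap (a minimal sketch mirroring the layout of corr_heatmap):
# Spearman rank correlation matrix on the interval features
correlations_s = data[v].corr(method='spearman')
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(correlations_s, center=0, fmt='.2f', cbar=False,
            square=True, linewidths=1, annot=True, cmap="YlGnBu")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()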
Scatterplots of the highly correlated interval variables are provided below instead (warning: this is a computationally intensive visualization).
# scatterplot matrix of highly correlated interval variables
high = pd.Index(['ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_15'])
pd.plotting.scatter_matrix(data[high], alpha = 0.2, figsize = (40, 40), diagonal = 'kde')
target vs. features

target vs. interval features

The 10 interval features can be plotted against the target variable in either jitter plots or boxplots for a visual exploration of their distributions. A simple code chunk producing jitter plots for ps_reg_03 and ps_car_12 is shown below; however, for the sake of clarity, boxplots will be preferred in the remainder of the exposition.
# jitter plots: an example
sns.set( rc = {'figure.figsize': (10, 10)})
feat = list(['ps_reg_03', 'ps_car_12'])
for f in feat:
    plt.figure()
    sns.stripplot(y=f, x='target', data=data, jitter=True, palette="Blues")
    plt.show()
Additional diagrams are given below. One starts with ps_reg_01:
# ps_reg_01
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_reg_01', data=data, linewidth=2, palette="Blues")
No relevant discrimination of the variable distributions along the target levels is detected. Then ps_reg_02:
# ps_reg_02
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x= "target", y ="ps_reg_02", data=data, linewidth=2, palette="Blues")
The variable distribution for the target=1 level shows a wider spread than the distribution for target=0. A few outliers are detected. One continues with ps_reg_03:
# ps_reg_03
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y ='ps_reg_03', data=data, linewidth=2, palette="Blues")
A substantial presence of outliers is identified; missing values occur for records with both 0 and 1 values of the target variable. On the other hand, for ps_car_12:
# ps_car_12
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_12', data=data, linewidth=2, palette="Blues")
Missing values are not present for records with target=1. Moving to ps_car_13:
# ps_car_13
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_13', data=data, linewidth=2, palette="Blues")
A strong presence of outliers is observed, especially for target=0. Then ps_car_14:
# ps_car_14
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_14', data=data, linewidth=2, palette="Blues")
Outliers and missing values are detected for both target levels. Then ps_car_15:
# ps_car_15
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_car_15', data=data, linewidth=2, palette="Blues")
It shows the presence of both outliers and missing values for both target levels. We finish with the calc features, starting with ps_calc_01:
# ps_calc_01
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_01', data=data, linewidth=2, palette="Blues")
Continuing with ps_calc_02:
# ps_calc_02
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_02', data=data, linewidth=2, palette="Blues")
and finishing with ps_calc_03:
# ps_calc_03
sns.set( rc = {'figure.figsize': (10, 10)})
sns.boxplot(x='target', y='ps_calc_03', data=data,linewidth=2, palette="Blues")
All calc variables, i.e. ps_calc_01, ps_calc_02 and ps_calc_03, show no discrimination along the target variable levels.
target vs. binary, categorical and ordinal features

One moves on to the visual analysis of plots of target against the remaining variables in data, together with cross-tabulations for the binary features.
# binary features
sns.set( rc = {'figure.figsize': (10, 10)})
feat = meta[(meta.level == 'binary') & (meta.keep)].index
for f in feat:
    plt.figure()
    sns.countplot(x=data[f], hue=data.target, data=data, palette="Blues")
    plt.show()
# binary features: cross-tabulation against target
# (rows: variable levels; columns: share of target 0/1 within each level, so each row sums to 1; 'All' holds the absolute count)
v = meta[(meta.level == 'binary') & (meta.keep)].index.drop('target')
for f in v:
    crosstab = pd.crosstab(index=data[f], columns=data['target'], margins=True)
    cross = pd.DataFrame(data=crosstab.div(crosstab['All'], axis=0).drop('All', axis=1))
    cross['All'] = crosstab.iloc[:,2]
    print(cross)
    print()
    print()
A quick visual check suggests that no binary variable presents 0 or 1 levels with a significant deviation from the global 96.4-3.6 noise-to-signal ratio. Moreover, no zero-frequency cell is identified; zero cells can lead to numerical instability during the training routines (e.g. for logistic regression models). One continues with the cross-tabulation of target against the ordinal variables.
# ordinal features: cross-tabulation against target
# (rows: variable levels; columns: share of target 0/1 within each level, so each row sums to 1; 'All' holds the absolute count)
v = meta[(meta.level == 'ordinal') & (meta.keep)].index
for f in v:
    crosstab = pd.crosstab(index=data[f], columns=data['target'], margins=True)
    cross = pd.DataFrame(data=crosstab.div(crosstab['All'], axis=0).drop('All', axis=1))
    cross['All'] = crosstab.iloc[:,2]
    print(cross)
    print()
    print()
The calc variables are characterized by zero cells for a limited number of variable levels. For example, zero cells appear for ps_calc_12 at the levels ps_calc_12 == 9 and ps_calc_12 == 10. The number of records associated with levels for which complete separation holds, i.e. for which zero cells emerge in the cross-tabulation against target, is extremely low for all calc variables; the code chunk below makes this check explicit. In the absence of additional information on the provided data, it is not possible to draw inferences about those data points, which could represent outliers to remove before modeling or records for which the absence of target == 1 is fully justified by business considerations.
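As a minimal sketch (the grouping approach below is an illustrative addition, not part of the original analysis), the zero-cell levels and their record counts can be listed for each ordinal variable:
# listing, for each ordinal variable, the levels with no positive target (zero cells) and their record counts
for f in meta[(meta.level == 'ordinal') & (meta.keep)].index:
    counts = data.groupby(f)['target'].agg(['size', 'sum'])
    zero_levels = counts[counts['sum'] == 0]
    if zero_levels.shape[0] > 0:
        print('Variable {}: zero-cell levels {} with record counts {}'.format(
            f, list(zero_levels.index), list(zero_levels['size'])))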
In this section we perform a series of transformations on data to pave the way for predictive modeling.
calc variables and univariate screening

As mentioned in the document, the numerical calc variables are dropped from further analysis; a quick structural check on data is performed before the drop and repeated afterwards.
print('Structure of data before calc variable drop:', data.shape)
# dropping 'ps_calc_01',... 'ps_calc_14' variables and updating meta information
vars_to_drop = ['ps_calc_01', 'ps_calc_02','ps_calc_03','ps_calc_04','ps_calc_05','ps_calc_06','ps_calc_07','ps_calc_08','ps_calc_09','ps_calc_10','ps_calc_11','ps_calc_12','ps_calc_13','ps_calc_14']
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False
print('Structure of data after calc variable drop:', data.shape)
data.info()
As discussed in the document, a straightforward imputation scheme is chosen and applied in the forthcoming code chunk:
# dropping 'ps_car_03_cat', 'ps_car_05_cat' and updating meta information
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False
# imputing with the mean or mode using Imputer from sklearn.preprocessing
# (Imputer is available in scikit-learn < 0.22; later versions provide SimpleImputer instead)
from sklearn.preprocessing import Imputer
mean_imp = Imputer(missing_values = -1, strategy = 'mean', axis = 0)
mode_imp = Imputer(missing_values = -1, strategy = 'most_frequent', axis = 0)
data['ps_reg_03'] = mean_imp.fit_transform(data[['ps_reg_03']]).ravel()
data['ps_car_12'] = mean_imp.fit_transform(data[['ps_car_12']]).ravel()
data['ps_car_14'] = mean_imp.fit_transform(data[['ps_car_14']]).ravel()
data['ps_car_11'] = mode_imp.fit_transform(data[['ps_car_11']]).ravel()
After imputation one has:
print(data.shape)
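As a quick sanity check (a minimal sketch), one can verify that the imputed columns no longer contain the -1 missing-value code:
# verifying that the imputed variables contain no residual -1 values
for f in ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']:
    print('{}: {} remaining missing values'.format(f, (data[f] == -1).sum()))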
Dummy variables for the categorical features in data are created. The function get_dummies converts each categorical variable into dummy variables, dropping from the resulting dataset both the original categorical column and (with drop_first = True) its first level; plain one-hot encoding, by contrast, does not drop the first level.
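As a toy illustration (the three-level feature color_cat below is hypothetical and not part of the competition data), drop_first = True yields one dummy column fewer than plain one-hot encoding:
# toy example: get_dummies with and without drop_first on a hypothetical three-level feature
toy = pd.DataFrame({'color_cat': ['red', 'green', 'blue', 'red']})
print(pd.get_dummies(toy, columns=['color_cat']))                   # three dummy columns (one-hot)
print(pd.get_dummies(toy, columns=['color_cat'], drop_first=True))  # two dummy columns (first level dropped)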
# selecting categorical variables
v = meta[(meta.level == 'categorical') & (meta.keep)].index
print('Before dummification we have {} variables in data'.format(data.shape[1]))
# creating dummy variables
data = pd.get_dummies(data, columns = v, drop_first = True)
print('After dummification we have {} variables in data'.format(data.shape[1]))
After the generation of dummy variables, the memory usage of the data dataset increases:
print('Memory usage of `data` (in bytes):', pd.DataFrame.memory_usage(data,index=True, deep = True).sum())
One can check that no categorical variable is present in the resulting dataset, as expected:
v = meta[(meta.level == 'categorical') & (meta.keep)].index
# none of the original categorical columns should remain among the columns of data
print([c for c in v if c in data.columns])
A quick check on the columns of data after dummification ends this section:
print(data.columns.values)
We now fix a random_state for reproducibility of results; it will be used in both the train vs. test dataset splitting and the modeling.
random_state = 123
train and test datasets

As mentioned in the document, the data dataset is randomly split into train and test datasets. A non-stratified approach is chosen, and both the id and target columns are removed from the feature matrix.
# split 80-20% (no stratification)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['id', 'target'], axis=1),
data['target'],
test_size=0.2,
random_state=random_state
)
After the split, the following checks on the train and test datasets are performed:
# structural checks
print('Training dataset - dimensions:', X_train.shape)
print('Test dataset - dimensions:', X_test.shape)
print()
print('Random split check:', X_train.shape[0] + X_test.shape[0] == data.shape[0])
print()
# class imbalance: check
lev_target = ( pd.crosstab(index = data['target'], columns = 'count') / data.shape[0] ) * 100
lev_target_train = ( pd.crosstab(index = y_train, columns = 'count') / y_train.shape[0] ) * 100
lev_target_test = ( pd.crosstab(index = y_test, columns = 'count') / y_test.shape[0] ) * 100
print('target class imbalance data:')
print(lev_target)
print()
print('target class imbalance train:')
print(lev_target_train)
print()
print('target class imbalance test:')
print(lev_target_test)
We then remove the original dataset data to free up some memory.
del data
We can export both X_train and X_test for later use, if needed:
# insert paths to export training and test data sets
X_train.to_csv('...')
X_test.to_csv('...')
For an overview of the forthcoming modeling strategies, we refer to the document.
In the Porto Seguro Kaggle challenge, the evaluation of fitted models on out-of-sample data is based on an ad-hoc performance measure, the normalized Gini coefficient, as discussed on the competition's evaluation page. Code for computing the normalized Gini coefficient in several programming languages is available in a public Kaggle forum thread; the Python implementation is presented below.
from sklearn.metrics import make_scorer
# Gini coefficient
def gini(actual, pred):
    # structural check
    assert (len(actual) == len(pred))
    # stacking actual, predicted and the original index column-wise (note: 'all' shadows the Python builtin)
    all = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # sorting by predicted probability (descending) and by the original index all[:, 2] in case of ties
    all = all[np.lexsort((all[:, 2], -1 * all[:, 1]))]
    # accumulating the Gini coefficient
    totalLosses = all[:, 0].sum()
    giniSum = all[:, 0].cumsum().sum() / totalLosses
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)
# normalized Gini coefficient
def gini_normalized_score(actual, pred):
    return gini(actual, pred) / gini(actual, actual)
# scorer using the normalized Gini
score_gini = make_scorer(gini_normalized_score, greater_is_better=True, needs_threshold = True)
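As a quick sanity check (and to motivate the 2*AUC - 1 conversion used later in this notebook), the normalized Gini coefficient can be compared with 2*AUC - 1 on synthetic data; the labels and scores below are illustrative only, and the two values should coincide up to tie handling and floating-point precision:
# sanity check on synthetic labels/scores: normalized Gini vs. 2*AUC - 1
rng = np.random.RandomState(0)
y_true_chk = rng.binomial(1, 0.05, size=10000)
y_score_chk = rng.uniform(size=10000) + 0.3 * y_true_chk
print('Normalized Gini:', round(gini_normalized_score(y_true_chk, y_score_chk), 4))
print('2*AUC - 1      :', round(2 * roc_auc_score(y_true_chk, y_score_chk) - 1, 4))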
We start the boosting analysis with AdaBoost. We implement three separate strategies: ST0, ST1 and ST2.
AdaBoostClassifier()

Default AdaBoost with AdaBoostClassifier() is used for baseline modeling; we compare both the SAMME and SAMME.R algorithms.
# weak learner: tree stump
tree = DecisionTreeClassifier(criterion='gini', #only Gini and information gain criteria are supported
random_state= random_state, #DecisionTreeClassifier() is not fully deterministic
max_depth=1) #stump
tree = tree.fit(X_train, y_train)
# SAMME AdaBoost
ada = AdaBoostClassifier(base_estimator=tree,
algorithm='SAMME',
random_state= random_state)
ada = ada.fit(X_train, y_train)
# SAMME.R AdaBoost
ada_R = AdaBoostClassifier(base_estimator=tree,
algorithm='SAMME.R',
random_state= random_state)
ada_R = ada_R.fit(X_train, y_train)
The default AdaBoost implementation uses n_estimators=50 and learning_rate=1.0. Out-of-sample performance for both SAMME and SAMME.R is computed below:
# SAMME AdaBoost classifier performance
y_pred_proba_ada = ada.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_ada[:, 1])
roc_auc = auc(fpr, tpr)
# SAMME.R AdaBoost classifier performance
y_pred_proba_ada_R = ada_R.predict_proba(X_test)
fpr_R, tpr_R, _ = roc_curve(y_test, y_pred_proba_ada_R[:, 1])
roc_auc_R = auc(fpr_R, tpr_R)
# AUC on test dataset
print('AUC SAMME AdaBoost:', roc_auc_score(y_test, y_pred_proba_ada[:, 1]).round(3))
print('AUC SAMME.R AdaBoost:', roc_auc_score(y_test, y_pred_proba_ada_R[:, 1]).round(3))
# number of boosting iterations
n_estimators = 400  # illustrative value (the original choice is not specified); adjust as needed
# weak learner: tree stump
tree = DecisionTreeClassifier(criterion='gini',
random_state=random_state,
max_depth=1)
tree.fit(X_train, y_train)
# SAMME AdaBoost
ada = AdaBoostClassifier(base_estimator=tree,
algorithm='SAMME',
n_estimators=n_estimators,
random_state=random_state
)
ada.fit(X_train, y_train)
# SAMME.R AdaBoost
ada_R = AdaBoostClassifier(base_estimator=tree,
algorithm='SAMME.R',
n_estimators=n_estimators,
random_state=random_state
)
ada_R.fit(X_train, y_train)
# plotting in-sample and out-of-sample error of both SAMME and SAMME.R AdaBoost algorithms as function of boosting iterations.
# the max number of boosting iterations is given by n_estimators
fig = plt.figure()
ax = fig.add_subplot(111)
ada_err_test = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada.staged_predict_proba(X_test)):
    ada_err_test[i] = roc_auc_score(y_test, y_pred[:,1])
ada_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada.staged_predict_proba(X_train)):
    ada_err_train[i] = roc_auc_score(y_train, y_pred[:,1])
ada_R_err_test = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_R.staged_predict_proba(X_test)):
    ada_R_err_test[i] = roc_auc_score(y_test, y_pred[:,1])
ada_R_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_R.staged_predict_proba(X_train)):
    ada_R_err_train[i] = roc_auc_score(y_train, y_pred[:,1])
ax.plot(np.arange(n_estimators) + 1, ada_err_test,
label='SAMME AdaBoost Test AUC',
color='red')
ax.plot(np.arange(n_estimators) + 1, ada_err_train,
label='SAMME AdaBoost Train AUC',
color='blue')
ax.plot(np.arange(n_estimators) + 1, ada_R_err_test,
label='SAMME.R AdaBoost Test AUC',
color='orange')
ax.plot(np.arange(n_estimators) + 1, ada_R_err_train,
label='SAMME.R AdaBoost Train AUC',
color='green')
ax.set_ylim((0.5, 0.8))
ax.set_xlabel('n_estimators')
ax.set_ylabel('AUC')
leg = ax.legend(loc='lower right', fancybox=True)
leg.get_frame().set_alpha(0.7)
# save plot (insert path to .png and .pdf)
plt.savefig(r'...png')
plt.savefig(r'...pdf')
# max Gini coefficient for each curve
# ada
print('Max SAMME train Gini:', (2*np.amax(ada_err_train)-1))
print('Max SAMME test Gini:',(2*np.amax(ada_err_test)-1))
# ada_R
print('Max SAMME.R train Gini:',(2*np.amax(ada_R_err_train)-1))
print('Max SAMME.R test Gini:',(2*np.amax(ada_R_err_test)-1))
# similarly, max AUC for each curve
print('Max SAMME train AUC:', np.amax(ada_err_train))
print('Max SAMME test AUC:',np.amax(ada_err_test))
# ada_R
print('Max SAMME.R train AUC:', np.amax(ada_R_err_train))
print('Max SAMME.R test AUC:', np.amax(ada_R_err_test))
# addendum: retrieve the position of the maximum test AUC for SAMME.R
np.argmax(ada_R_err_test)  # if x is returned, the corresponding number of boosting iterations is x + 1
n_estimators, max_depth and learning_rate

# GridSearch AdaBoost optimization (change the hyperparameters according to the run under consideration)
param_grid = {
'learning_rate': [0.001, 0.01, 0.1, 1],
'n_estimators': [100, 200, 300, 400]
}
# tree as base learner
tree = DecisionTreeClassifier(criterion='gini',
random_state=random_state,
max_depth=1)
boost = AdaBoostClassifier(base_estimator=tree,
random_state=random_state)
# cross-validation
cv = StratifiedKFold(n_splits=5,
                     shuffle=False,
                     random_state=random_state)  # note: random_state has no effect when shuffle=False
grid_boost = GridSearchCV(estimator=boost,
param_grid=param_grid,
scoring=score_gini,
cv=cv,
verbose=10)
grid_boost.fit(X_train, y_train)
# best model/classifier
clf_b = grid_boost.best_estimator_
clf_b
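Mirroring the checks performed for XGBoost below, the cross-validated score and the selected hyper-parameters can be read directly from the fitted grid search object (a minimal sketch):
# best cross-validated normalized Gini and corresponding hyper-parameters
print('Best Gini on CV-data:', grid_boost.best_score_)
print('Best set of hyperparameters:', grid_boost.best_params_)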
# computing out-of-sample AUC for best classifier
clf_b.fit(X_train, y_train)
# test data: computing AUC
y_scores = clf_b.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
# plotting ROC and showing AUC (on test data)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_scores).round(3))
We continue the boosting analysis with XGBoost. We implement two separate strategies: ST0 and ST1.
XGBClassifier()

Below, the hyper-parameters of the default XGBClassifier() implementation are presented and fitting on X_train is performed.
# instantiating the classifier with default hyper-parameters
xgb_default = XGBClassifier(random_state=random_state)
# calling set_params() with no arguments displays the estimator with its default hyper-parameters
xgb_default.set_params()
# fitting the default XGBoostClassifier()
xgb_default.fit(X_train, y_train)
# default XGBoost classifier performance
y_pred_proba_xgb_default = xgb_default.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_xgb_default[:, 1])
roc_auc = auc(fpr, tpr)
# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_pred_proba_xgb_default[:, 1]).round(3))
max_depth, learning_rate, n_estimators with fixed subsampling

# GridSearch XGBoost optimization (change the hyperparameters according to the run under consideration)
param_grid = {'max_depth': [1],
'learning_rate': [0.001, 0.01, 0.1, 1],
'n_estimators': [100, 300, 500],
'subsample': [0.5],
'colsample_bytree': [0.75]
}
# cross-validation
cv = StratifiedKFold(n_splits=5,
shuffle=False,
random_state=random_state)
xgb = GridSearchCV(estimator=XGBClassifier(random_state=random_state),
param_grid=param_grid,
scoring=score_gini,
cv=cv,
verbose=10)
xgb.fit(X_train, y_train)
# best score on CV-data
print('Best Gini on CV-data:', xgb.best_score_)
# parameters for best model
print('Best set of hyperparameters:', xgb.best_params_)
# best xgb model
xgb_best = xgb.best_estimator_
xgb_best.fit(X_train, y_train)
# best xgb predictions on test data and AUC
y_pred_proba_xgb = xgb_best.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_xgb[:, 1])
roc_auc = auc(fpr, tpr)
# plotting ROC and showing AUC (on test data)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
# AUC on test dataset
print('AUC:', roc_auc_score(y_test, y_pred_proba_xgb[:, 1]).round(3))
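Since the competition metric is the normalized Gini coefficient, the test AUC above can be converted through the identity Gini = 2*AUC - 1, the same conversion used in the AdaBoost section:
# normalized Gini on the test dataset, derived from the AUC
print('Gini:', (2 * roc_auc_score(y_test, y_pred_proba_xgb[:, 1]) - 1).round(3))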
We now plot the feature importances from the best XGBoost model found so far in ST1.
# fitting the best model again (if it was not saved beforehand)
# fill in max_depth, learning_rate and n_estimators with the best values reported by xgb.best_params_
xgb_best_ov = XGBClassifier(max_depth=,
                            learning_rate=,
                            n_estimators=,
                            subsample=0.5,
                            colsample_bytree=0.75,
                            random_state=random_state)
xgb_best_ov.fit(X_train, y_train)
# plotting feature importance graph: top 10 features
plot_importance(xgb_best_ov, max_num_features=10)
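To complement the chart, the numeric importance scores can be listed as a ranked table. This is a minimal sketch based on the sklearn wrapper's feature_importances_ attribute; depending on the xgboost version, this attribute and plot_importance may rely on different importance types ('weight' vs. 'gain'), so the ranking may differ slightly from the plot.
# ranking features by the fitted model's importance scores (top 10)
importances = pd.Series(xgb_best_ov.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))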