By Andreas Troxler, June 2022
In this Part II of the tutorial, you will learn techniques that can be applied in situations with few or no labels. This is very relevant in practice: text data is often available, but labels are missing or sparse!
Let’s get started.
This notebook is divided into the following parts:
Introduction.
We begin by explaining the prerequisites. Then we turn to loading and exploring the dataset – ca. 6k records of short property insurance claim descriptions, which we aim to classify by peril type.
Classify by peril type in a supervised setting.
To warm up, we apply supervised learning techniques you have learned in Part I to the dataset of this Part II.
Zero-shot classification.
This technique assigns each text sample to one element of a pre-defined list of candidate expressions. This allows classification without any task-specific training and without using the labels. This fully unsupervised approach is useful in situations with no labels (a minimal sketch follows this overview).
Unsupervised classification using similarity.
This technique encodes each input sequence and each candidate expression into an embedding vector. Then, pairwise similarity scores between each input sequence and each candidate expression are calculated, and the candidate expression with the highest similarity score is selected. Like zero-shot classification, this fully unsupervised approach is useful in situations with no labels (it is also sketched below).
Unsupervised topic modeling by clustering of document embeddings.
This approach extracts clusters of similar text samples and proposes verbal representations of these clusters. The labels are not required, but may be used in the process if available. This technique does not require prior knowledge of candidate expressions.
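Before we meet these approaches later in the notebook, the following sketch makes the two fully unsupervised techniques concrete. It is illustrative only: the example text, the candidate expressions and the model choices (the Hugging Face zero-shot pipeline with facebook/bart-large-mnli, and sentence-transformers/all-MiniLM-L6-v2 as a generic sentence encoder) are assumptions made for this sketch and differ from the models and candidate expressions used further below.
# Illustrative sketch only: zero-shot classification and embedding similarity.
# Example text, candidate labels and model names are chosen for illustration.
from transformers import AutoModel, AutoTokenizer, pipeline
import torch

example = "lightning damage at water tower"
candidates = ["fire", "lightning", "water damage", "vandalism"]

# (a) zero-shot classification: an NLI model scores each candidate label
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot(example, candidate_labels=candidates)["labels"][0])

# (b) similarity: embed the text and the candidates, pick the closest candidate
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state   # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # mean pooling

scores = torch.nn.functional.cosine_similarity(embed([example]), embed(candidates))
print(candidates[scores.argmax()])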
This notebook is computationally intensive. We recommend using a platform with GPU support.
We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).
Please note that the results may not be reproducible across platforms and versions.
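If you at least want run-to-run stability on a single machine, one option (not part of the original setup, shown here only as a sketch) is to fix the random seeds before any model is created:
import random
import numpy as np
import torch

SEED = 42                              # arbitrary choice
random.seed(SEED)                      # Python's built-in RNG
np.random.seed(SEED)                   # NumPy
torch.manual_seed(SEED)                # PyTorch (CPU)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)   # PyTorch (all GPUs)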
Make sure the following files are available in the directory of the notebook:
tutorial_utils.py - a collection of utility functions used throughout this notebook
peril.training.csv - the training data
peril.validation.csv - the validation data
This notebook will create the following subdirectories:
models - trained Transformer models
results - figures and Excel files
For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook. We also assume that you have worked through Part I of this tutorial.
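If you prefer, the two output subdirectories mentioned above can also be created up front; a minimal, optional snippet:
import os
for subdir in ("models", "results"):
    os.makedirs(subdir, exist_ok=True)   # no error if the directory already exists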
In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.
# Notebook settings
# clear the namespace variables
from IPython import get_ipython
get_ipython().run_line_magic("reset", "-sf")
# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
If you run this notebook on Google Colab, you will need to install the following libraries:
!pip install datasets
!pip install transformers
!pip install plotly
!pip install kaleido
!pip install pyyaml==5.4.1 ## https://github.com/yaml/pyyaml/issues/576
!pip install bertopic
Then, the libraries used in this notebook are imported:
import os
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy.special import softmax
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, trainer_utils, AutoModelForSequenceClassification
from transformers import pipeline
import torch
from sklearn.metrics import accuracy_score, f1_score
import plotly.express as px
from wordcloud import WordCloud
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from tutorial_utils import extract_sequence_encoding, get_xy, dummy_classifier, logistic_regression_classifier, evaluate_classifier
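Because the notebook is computationally intensive, it is worth checking whether PyTorch can see a GPU. A quick, optional check (torch was imported above):
# select a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")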
The dataset used throughout this tutorial concerns property insurance claims of the Wisconsin Local Government Property Insurance Fund (LGPIF), made available in the open text project of Frees. The Wisconsin LGPIF is an insurance pool managed by the Wisconsin Office of the Insurance Commissioner. This fund provides insurance protection to local governmental institutions such as counties, schools, libraries, airports, etc. It covers property claims on buildings and motor vehicles, and it excludes certain natural and man-made perils such as floods, earthquakes or nuclear accidents.
The data consists of 6’030 records (4’991 in the training set, 1’039 in the test set), each comprising a claim amount, a short English claim description and a hazard type with 9 different levels: Fire, Lightning, Hail, Wind, WaterW (weather-related water claims), WaterNW (non-weather-related water claims), Vehicle, Vandalism and Misc (any other).
The training and validation sets are available in separate csv files, which we load into Pandas DataFrames. We then create a single column containing the label: since the first nine columns are one-hot indicators of the peril type, multiplying them with the vector (0, 1, …, 8) yields the index of the indicated peril (for example, a row with a 1 in the Lightning column receives label 2). Finally, we combine the two DataFrames into a DatasetDict.
# load data
df_train = pd.read_csv("peril.training.csv")
df_valid = pd.read_csv("peril.validation.csv")
# extract label texts and create column "labels" which encodes the peril
labels = df_train.columns[:9].to_list()
df_train["labels"] = np.matmul(df_train.iloc[:, :9].values, np.array(range(9),).reshape((9,1)))
df_valid["labels"] = np.matmul(df_valid.iloc[:, :9].values, np.array(range(9),).reshape((9,1)))
# create dataset
ds = DatasetDict({"train": Dataset.from_pandas(df_train), "test": Dataset.from_pandas(df_valid)})
print(f"{ds}")
DatasetDict({
    train: Dataset({
        features: ['Vandalism', 'Fire', 'Lightning', 'Wind', 'Hail', 'Vehicle', 'WaterNW', 'WaterW', 'Misc', 'Loss', 'Description', 'labels'],
        num_rows: 4991
    })
    test: Dataset({
        features: ['Vandalism', 'Fire', 'Lightning', 'Wind', 'Hail', 'Vehicle', 'WaterNW', 'WaterW', 'Misc', 'Loss', 'Description', 'labels'],
        num_rows: 1039
    })
})
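A Dataset can be indexed like a list of dictionaries; for example, the following line (output omitted) prints the description and label of the first training record:
# inspect a single record of the training split
print(ds["train"][0]["Description"], ds["train"][0]["labels"])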
df_train.head()
|   | Vandalism | Fire | Lightning | Wind | Hail | Vehicle | WaterNW | WaterW | Misc | Loss | Description | labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6838.87 | lightning damage ... | 2 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2085.00 | lightning damage at Comm. Center ... | 2 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 11335.00 | lightning damage at water tower ... | 2 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1480.00 | lightning damge to radio tower ... | 2 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 600.00 | vandalism damage at recycle center ... | 0 |
Let's look at the distribution of peril types in the training and validation set:
stats = pd.DataFrame({
"peril": df_train.columns.values[:-3],
"train": df_train.groupby("labels")["labels"].count().values,
"valid": df_valid.groupby("labels")["labels"].count().values
})
summary = pd.DataFrame({"peril": ["Total"], "train": [stats["train"].sum()], "valid": [stats["valid"].sum()]})
stats = pd.concat([stats, summary], ignore_index=True)
stats
|   | peril | train | valid |
| --- | --- | --- | --- |
| 0 | Vandalism | 1774 | 310 |
| 1 | Fire | 171 | 46 |
| 2 | Lightning | 832 | 123 |
| 3 | Wind | 296 | 107 |
| 4 | Hail | 76 | 18 |
| 5 | Vehicle | 852 | 227 |
| 6 | WaterNW | 202 | 67 |
| 7 | WaterW | 426 | 38 |
| 8 | Misc | 362 | 103 |
| 9 | Total | 4991 | 1039 |
fig = px.bar(df_train["labels"].value_counts().sort_index()+df_valid["labels"].value_counts().sort_index(), width=640)
fig.update_layout(title="number of claims by peril type", xaxis_title="peril type",
yaxis_title="number of claims")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "peril_type"}})
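Because kaleido was installed above, the figure can also be exported as a static image, for example into the results subdirectory (the file name below is an arbitrary choice; os was imported above):
os.makedirs("results", exist_ok=True)        # make sure the target directory exists
fig.write_image("results/peril_type.svg")    # static export via kaleido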