Actuarial Applications of Natural Language Processing Using Transformers

A Case Study for Processing Text Features in an Actuarial Context

Part II – Case Studies on Property Insurance Claim Descriptions – Unsupervised Techniques

By Andreas Troxler, June 2022

In this Part II of the tutorial, you will learn techniques that can be applied in situations with few or no labels. This is very relevant in practice: text data is often available, but labels are missing or sparse!

Let’s get started.

Notebook Overview

This notebook is divided into six parts; they are:

  1. Introduction.
    We begin by explaining the prerequisites. Then we turn to loading and exploring the dataset – ca. 6,000 records of short property insurance claim descriptions, which we aim to classify by peril type.

  2. Classify by peril type in a supervised setting.
    To warm up, we apply the supervised learning techniques from Part I to the dataset used in this Part II.

  3. Zero-shot classification.
    This technique assigns each text sample to one element of a pre-defined list of candidate expressions, without any task-specific training and without using the labels. This fully unsupervised approach is useful in situations with no labels.

  4. Unsupervised classification using similarity.
    This technique encodes each input sequence and each candidate expression into an embedding vector. Then, pairwise similarity scores between each input sequence and each candidate expression are calculated, and the candidate expression with the highest similarity score is selected. Like zero-shot classification, this fully unsupervised approach is useful in situations with no labels. A brief code sketch of this and the previous technique follows this overview.

  5. Unsupervised topic modeling by clustering of document embeddings.
    This approach extracts clusters of similar text samples and proposes verbal representations of these clusters. The labels are not required, but may be used in the process if available. This technique does not require prior knowledge of candidate expressions.

  6. Conclusion
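As a preview of items 3 and 4, the minimal sketch below classifies a single made-up claim description with both approaches. It relies on the Hugging Face transformers zero-shot pipeline and the sentence-transformers library; the model names and candidate expressions are illustrative assumptions, not necessarily those used later in the notebook.

    # Illustrative sketch of zero-shot and similarity-based classification.
    from transformers import pipeline
    from sentence_transformers import SentenceTransformer, util

    claim = "lightning strike damaged the roof of the library"  # made-up example
    candidate_labels = ["fire", "lightning", "hail", "wind",
                        "water damage", "vehicle", "vandalism"]  # assumed candidates

    # Zero-shot classification: an NLI model scores each candidate label.
    zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    result = zero_shot(claim, candidate_labels=candidate_labels)
    print("zero-shot prediction:", result["labels"][0])

    # Similarity-based classification: embed the claim and the candidates,
    # then pick the candidate with the highest cosine similarity.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    claim_emb = encoder.encode(claim, convert_to_tensor=True)
    label_emb = encoder.encode(candidate_labels, convert_to_tensor=True)
    scores = util.cos_sim(claim_emb, label_emb)
    print("similarity prediction:", candidate_labels[int(scores.argmax())])

Both approaches produce a prediction without any task-specific training; the labels, where available, are only needed to evaluate the results.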

1. Introduction

In this section, we discuss the prerequisites and then load and inspect the dataset.

1.1. Prerequisites

Computing Power

This notebook is computationally intensive. We recommend using a platform with GPU support.

We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).

Please note that the results may not be reproducible across platforms and versions.

Local files

Make sure the following files are available in the directory of the notebook:

This notebook will create the following subdirectories:

Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook. We also assume that you have worked through Part I of this tutorial.

In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.

Importing Required Libraries

If you run this notebook on Google Colab, you will need to install the following libraries:
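The original install cell is not reproduced here. As an assumption, a minimal set of libraries covering the techniques in this notebook could be installed as follows (the exact list and versions in the original notebook may differ):

    # Assumed minimal installation for Google Colab; the library list is an
    # assumption based on the techniques used later in this notebook.
    !pip install transformers datasets sentence-transformers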

Then, the required libraries are imported:
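As a sketch, assuming the libraries above, the core imports might look like this (the original notebook may import more):

    # Illustrative core imports; not necessarily the author's exact list.
    import pandas as pd
    from datasets import Dataset, DatasetDict
    from transformers import pipeline
    from sentence_transformers import SentenceTransformer, util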

1.2. Loading the Data

The dataset used throughout this tutorial concerns property insurance claims of the Wisconsin Local Government Property Insurance Fund (LGPIF), made available in the open text project of Frees. The Wisconsin LGPIF is an insurance pool managed by the Wisconsin Office of the Insurance Commissioner. The fund provides insurance protection to local governmental institutions such as counties, schools, libraries and airports. It covers property claims on buildings and motor vehicles, and excludes certain natural and man-made perils such as flood, earthquake or nuclear accident.

The data consists of 6’030 records (4’991 in the training set, 1’039 in the test set), each comprising a claim amount, a short English claim description and a hazard type with 9 different levels: Fire, Lightning, Hail, Wind, WaterW (weather-related water claims), WaterNW (other, non-weather-related water claims), Vehicle, Vandalism and Misc (any other).

The training and validation sets are available as separate CSV files. We load them into Pandas DataFrames, create a single column containing the label, and finally create a dataset.
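A minimal sketch of this step is shown below, assuming hypothetical file names (train.csv, test.csv), a text column and one indicator column per hazard type; the actual file and column names shipped with the notebook may differ.

    # Sketch of the loading step; file and column names are assumptions.
    import pandas as pd
    from datasets import Dataset, DatasetDict

    hazard_types = ["Fire", "Lightning", "Hail", "Wind", "WaterW",
                    "WaterNW", "Vehicle", "Vandalism", "Misc"]

    def load_split(path):
        df = pd.read_csv(path)
        # Collapse the per-hazard indicator columns into a single label column.
        df["label"] = df[hazard_types].idxmax(axis=1)
        return df

    train_df = load_split("train.csv")  # assumed file name
    test_df = load_split("test.csv")    # assumed file name

    # Wrap the DataFrames into a Hugging Face dataset.
    dataset = DatasetDict({
        "train": Dataset.from_pandas(train_df),
        "test": Dataset.from_pandas(test_df),
    })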

1.3. Exploring the Data

The first records of the training dataset look like this:

Let's look at the distribution of peril types in the training and validation set:
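A short sketch of this inspection, assuming the DataFrames created in the loading sketch above:

    # Inspect the first records and the relative frequency of each peril type.
    print(train_df.head())

    distribution = pd.DataFrame({
        "train": train_df["label"].value_counts(normalize=True),
        "test": test_df["label"].value_counts(normalize=True),
    })
    print(distribution.round(3))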