Actuarial Applications of Natural Language Processing Using Transformers

A Case Study for Processing Text Features in an Actuarial Context

Part I – Introduction and Case Studies on Car Accident Descriptions

By Andreas Troxler, June 2022

An abundance of information is available to insurance companies in the form of text. However, language data is unstructured, sometimes multilingual, and single words or phrases taken out of context can be highly ambiguous. With the help of transformer models, text data can be converted into structured data and then used as input to predictive models.

In this Part I of the tutorial, you will discover how to use transformer models for text classification. Throughout this tutorial, the HuggingFace Transformers library will be used.

This notebook serves as a companion to the tutorial "Actuarial Applications of Natural Language Processing Using Transformers". The tutorial explains the underlying concepts, and this notebook illustrates the implementation. The tutorial, the dataset and the notebooks are available on GitHub.

After completing this tutorial, you will know:

Let’s get started.

Notebook Overview

This notebook is divided into seven parts; they are:

  1. Introduction

    1.1 Prerequisites

    1.2 Exploring the data

  2. A brief introduction to the HuggingFace ecosystem

    2.1 Loading the data into a DataSet

    2.2 Tokenization – splitting the raw text

    2.3 The transformer model

  3. Using transformers to extract features for classification or regression tasks

    3.1 Extracting the encoded text ...

    3.2 ... and using it in a classification model

    3.3 Case study: use accident descriptions to predict the number of vehicles involved

    3.4 Cross-lingual transfer

    3.5 Multi-lingual training

  4. Fine-tuning – improving the model

    4.1 Domain-specific fine-tuning

    4.2 Task-specific fine-tuning

  5. Understand prediction errors and interpret predictions

    5.1 Case study: use accident descriptions to identify bodily injury

    5.2 Investigate false positives and false negatives

    5.3 Use Captum and transformers-interpret to interpret predictions

  6. Using extractive question answering to process longer texts

  7. Conclusion

1. Introduction

1.1. Prerequisites

Computing Power

This notebook is computationally intensive. We recommend using a platform with GPU support.

We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).

Please note that results may not be exactly reproducible across platforms and library versions.
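As a quick sanity check, the following sketch verifies that PyTorch can see a GPU and fixes the random seeds via the set_seed helper from transformers; the seed value is arbitrary, and even with seeding, full reproducibility is not guaranteed on GPU hardware:

    import torch
    from transformers import set_seed

    # Use the GPU if one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Running on: {device}")

    # Seed Python, NumPy and PyTorch in a single call for repeatability.
    set_seed(42)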

Local files

Make sure the following files are available in the directory of the notebook:

This notebook will create the following subdirectories:

Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook.

In this section, Jupyter Notebook and Python settings are initialized. For Python code, the PEP 8 standard ("PEP" stands for Python Enhancement Proposal) is enforced, with minor deviations to improve readability.
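The exact settings live in the notebook's setup cells; a minimal sketch of typical initializations (the option values below are illustrative, not the notebook's actual configuration) looks like this:

    import warnings
    import pandas as pd

    # Show all DataFrame columns and widen text columns so that the
    # accident descriptions are not truncated in the output.
    pd.set_option("display.max_columns", None)
    pd.set_option("display.max_colwidth", 200)

    # Suppress noisy deprecation warnings from third-party libraries.
    warnings.filterwarnings("ignore", category=FutureWarning)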

Importing Required Libraries

The following libraries are required:

In addition, we require openpyxl to enable export from Pandas to Excel.
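A representative import cell might look as follows; the selection mirrors the tasks described in the overview above, and version pins are omitted:

    import numpy as np
    import pandas as pd
    import torch

    # HuggingFace ecosystem: pre-trained models/tokenizers and dataset handling.
    from transformers import AutoTokenizer, AutoModel
    from datasets import Dataset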

1.2. Exploring the Data

The data used throughout this tutorial is derived from a vehicle crash causation study conducted in the United States from 2005 to 2007. The dataset contains almost 7'000 records, each relating to one accident. For each case, a verbal description of the accident is available in English, summarizing road and weather conditions, the vehicles, drivers and passengers involved, preconditions, injury severities, etc. The same information is also encoded in tabular form, so that we can apply supervised learning techniques to train the NLP models and compare the information extracted from the verbal descriptions with the encoded data.

The original data consists of multiple tables. For this tutorial, we have aggregated it into a single dataset and added German translations of the English accident descriptions. The translations were generated using the DeepL Python API.
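For reference, producing such a translation with the DeepL Python API takes only a few lines; the sketch below uses a placeholder authentication key and an invented example sentence:

    import deepl

    # Authenticate with your personal DeepL API key (placeholder).
    translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

    # Translate an English accident description into German.
    result = translator.translate_text(
        "The crash occurred on a dry two-lane road.",
        target_lang="DE",
    )
    print(result.text)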

To explore the data, let's load it into a Pandas DataFrame and examine its shape, columns and data types:
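A minimal sketch (the file name is a placeholder; the notebook defines the actual path and format):

    import pandas as pd

    # Load the aggregated dataset prepared for this tutorial.
    df = pd.read_csv("nmvccs_extract.csv")

    # Inspect shape, column names and data types.
    print(df.shape)
    print(df.dtypes)
    df.head()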

The column SCASEID is a unique case identifier.

The columns SUMMARY_EN and SUMMARY_GE are strings representing the verbal descriptions of the accident in English and German, respectively.

NUMTOTV is the number of vehicles involved in the case. Let's have a look at the distribution of this feature:
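For example, assuming the DataFrame df loaded above:

    # Tabulate the number of vehicles involved per case ...
    counts = df["NUMTOTV"].value_counts().sort_index()
    print(counts)

    # ... and visualize the distribution as a bar chart.
    counts.plot(kind="bar", xlabel="Number of vehicles", ylabel="Number of cases")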