The Art of Natural Language Processing: RNNs for the Case Study

Authors: Andrea Ferrario, Mara Nägelin

Date: February 2020 (updated September 2020)

Notebook to run the RNNs in the Contemporary Approach, as described in the tutorial "The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification".

Table of contents

  1. Getting started with Python and Jupyter Notebook
  2. Import data
  3. Data preprocessing
    3.1. Remove duplicates
    3.2. Shuffle the data
    3.3. Minimal preprocessing (detailed)
    3.4. Minimal preprocessing with Keras
  4. Deep learning
    4.1. Train test split
    4.2. Define the model
      4.2.1. Shallow LSTM
      4.2.2. Shallow GRU
      4.2.3. Deep LSTM
    4.3. Train the model
    4.4. Evaluate the model on test data
  5. Final remarks

1. Getting started with Python and Jupyter Notebook

In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP" = Python Enhancement Proposal) is enforced with minor variations to improve readability.
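A minimal sketch of such an initialization cell; the specific imports, seed, and display options are illustrative assumptions, not the notebook's exact configuration:

```python
# Illustrative initialization cell (a sketch; the original notebook's
# exact imports and settings may differ).
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')   # suppress noisy library warnings
np.random.seed(123)                 # fix the seed for reproducibility
pd.set_option('display.max_colwidth', 100)  # show longer review snippets
```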

2. Import data

First, we import the raw data from the original 50'000 text files and save them to a dataframe. This step only needs to be run once; after that, one can start directly with Section 3. The following code snippet is based on the book Python Machine Learning by Raschka and Mirjalili, Chapter 8 (see tutorial).
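A sketch of this import step, assuming the reviews are stored in the standard aclImdb/{train,test}/{pos,neg} folder layout; the output file name movie_data.csv is a hypothetical choice:

```python
# Read the 50,000 raw text files into a single dataframe (a sketch,
# assuming the standard IMDB folder layout 'aclImdb/{train,test}/{pos,neg}').
import os

import pandas as pd

basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}

rows = []
for split in ('train', 'test'):
    for sentiment in ('pos', 'neg'):
        path = os.path.join(basepath, split, sentiment)
        for fname in sorted(os.listdir(path)):
            with open(os.path.join(path, fname), encoding='utf-8') as infile:
                rows.append([infile.read(), labels[sentiment]])

df = pd.DataFrame(rows, columns=['review', 'sentiment'])
df.to_csv('movie_data.csv', index=False, encoding='utf-8')  # hypothetical file name
```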

3. Data preprocessing

Next we prepare the raw data so that it can be used as input for a neural network. Again, we follow the example of Raschka and Mirjalili (Chapter 16). We perform the following steps:

  1. remove duplicate reviews from the data,
  2. shuffle the data,
  3. split each review into its words (tokenization),
  4. map each word to a unique integer index,
  5. pad all integer sequences to the same length.

The last three steps are written out in detail in Section 3.3 to give the reader an understanding of what exactly happens to the data. However, they can also be carried out, almost equivalently, using the high-level text preprocessing functionalities of the tensorflow.keras module, see Section 3.4. The user needs to run only one of these two subsections to preprocess the data.

The transformed data is stored in a dataframe for convenience. Hence Section 3 also needs to be run only once; afterwards, one can jump directly to Section 4.

The following can be used to reimport the dataframe with the raw data generated in Section 2 above.
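For example, assuming the hypothetical file name movie_data.csv from the sketch above:

```python
# Reload the raw data saved in Section 2.
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
print(df.shape)  # expected: (50000, 2)
```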

3.1. Remove duplicates
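A single pandas operation suffices; the column name review follows the import sketch above:

```python
# Drop duplicate reviews, keeping the first occurrence of each text.
print('Shape before:', df.shape)
df = df.drop_duplicates(subset='review', keep='first').reset_index(drop=True)
print('Shape after:', df.shape)
```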

3.2. Shuffle the data
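For example, with pandas and a fixed seed so the shuffle is reproducible:

```python
# Shuffle the rows of the dataframe; the fixed random_state makes the
# shuffle reproducible across runs.
df = df.sample(frac=1, random_state=123).reset_index(drop=True)
```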

3.3. Minimal preprocessing (detailed)

The following snippets are in part adapted from Raschka and Mirjalili (Chapter 16).
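A minimal sketch of the three detailed steps (tokenization, word indexing, padding); the sequence length of 100 is an illustrative choice, not necessarily the tutorial's value:

```python
# Sketch of the three detailed preprocessing steps.
from collections import Counter

import numpy as np

# 1. Tokenization: lowercase each review and split it into words.
tokenized = [review.lower().split() for review in df['review']]

# 2. Word indexing: map each word to an integer, with the most frequent
#    word mapped to 1 (0 is reserved for padding).
counts = Counter(word for review in tokenized for word in review)
word_to_int = {word: i for i, (word, _)
               in enumerate(counts.most_common(), start=1)}
sequences = [[word_to_int[word] for word in review] for review in tokenized]

# 3. Padding: left-pad (or left-truncate) every sequence to length maxlen.
maxlen = 100  # illustrative choice
padded = np.zeros((len(sequences), maxlen), dtype=int)
for i, seq in enumerate(sequences):
    padded[i, -len(seq[-maxlen:]):] = seq[-maxlen:]
```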

3.4. Minimal preprocessing with Keras
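The same steps can be carried out with the Keras Tokenizer and pad_sequences utilities; the vocabulary cap and sequence length below are illustrative choices:

```python
# The same preprocessing using the high-level Keras text utilities.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=20000)  # cap the vocabulary (illustrative)
tokenizer.fit_on_texts(df['review'])
sequences = tokenizer.texts_to_sequences(df['review'])
padded = pad_sequences(sequences, maxlen=100)  # pads/truncates on the left
```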

Lastly, we save the fully preprocessed data to CSV for further use.
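For example (the output file name is a hypothetical choice):

```python
# Save the padded sequences together with the labels.
out = pd.DataFrame(padded)
out['sentiment'] = df['sentiment'].values
out.to_csv('movie_data_preprocessed.csv', index=False)
```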

4. Deep learning

In this section, we reproduce the results from Sections 6.4.6-6.4.7 of the tutorial. We split the preprocessed data into a training and a testing set and define our RNN model (three different example architectures are given, see Sections 4.2.1-4.2.3). The model is compiled and trained using the high-level tensorflow.keras API. Finally, the evolution of loss and accuracy during training is plotted and the fitted model is evaluated on the test data.

WARNING: Note that training with a large training dataset for a large number of epochs is computationally intensive and might easily take a couple of hours on a normal CPU machine. We recommend subsetting the training and testing datasets and/or using an HPC infrastructure.

4.1. Train test split
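A sketch of a 50/50 split on the shuffled data (the tutorial's exact split may differ); subsampling to reduce training time can be done here by slicing smaller arrays:

```python
# 50/50 train-test split on the shuffled, preprocessed data.
split = len(padded) // 2
X_train, y_train = padded[:split], df['sentiment'].values[:split]
X_test, y_test = padded[split:], df['sentiment'].values[split:]
```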

4.2. Define the model

Each of the following subsections defines a distinct model architecture. The user can select and run one of them.

4.2.1. Shallow LSTM architecture

This is a shallow RNN with just one LSTM layer. It is the same architecture as used in the example by Raschka and Mirjalili.
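A sketch of this architecture with the high-level tensorflow.keras API; the embedding dimension and number of units are illustrative, not tuned values:

```python
# Shallow LSTM classifier (a sketch).
import tensorflow as tf

vocab_size = 20000  # must exceed the largest word index from Section 3

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128,
                              input_length=100),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary sentiment output
])
model.summary()
```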

4.2.2. Shallow GRU architecture

This is essentially the same shallow RNN as above with a GRU layer instead of the LSTM.
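The corresponding sketch, with the LSTM layer swapped for a GRU:

```python
# Shallow GRU classifier: identical to the LSTM version except for the
# recurrent layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128,
                              input_length=100),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```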

4.2.3. Deep LSTM architecture

We can easily deepen our network by stacking a second LSTM layer on top of the first.
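In the sketch below, the first LSTM layer must return its full sequence of hidden states so that the second layer receives a sequence as input:

```python
# Deep (stacked) LSTM classifier: return_sequences=True makes the first
# layer emit one hidden state per time step for the second layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128,
                              input_length=100),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```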

4.3. Train the model

WARNING: as mentioned in the tutorial, the following training routine is computationally intensive. We recommend sub-sampling the data in Section 4.1 and/or using an HPC infrastructure. Note that we ran all the machine learning routines presented in this section on the ETH High Performance Computing (HPC) infrastructure Euler, by submitting all jobs to a virtual machine consisting of 32 cores with 3072 MB RAM per core (total RAM: 98.304 GB). Therefore, notebook outputs are not available for the subsequent cells.
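A sketch of the compile/fit routine and the loss and accuracy plots; the optimizer, batch size, and number of epochs are illustrative settings:

```python
# Compile and train the selected model.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, y_train,
                    validation_split=0.2,
                    batch_size=256,
                    epochs=10)

# Plot the evolution of loss and accuracy over the training epochs.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, metric in zip(axes, ('loss', 'accuracy')):
    ax.plot(history.history[metric], label='training')
    ax.plot(history.history['val_' + metric], label='validation')
    ax.set_xlabel('epoch')
    ax.set_title(metric)
    ax.legend()
plt.show()
```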

4.4. Evaluate the model on test data
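For example:

```python
# Evaluate the fitted model on the held-out test set.
test_loss, test_acc = model.evaluate(X_test, y_test, batch_size=256)
print(f'Test loss: {test_loss:.4f} - test accuracy: {test_acc:.4f}')
```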

5. Final remarks

The above example RNNs are simple architectures, and none of their parameters were optimized for performance. To further improve the model accuracy, we could, for example, tune the preprocessing and model hyperparameters (vocabulary size, sequence length, embedding dimension, number and size of the recurrent layers), apply regularization such as dropout, or adjust the training parameters (batch size, number of epochs).

Finally, note that the size of the dataset is arguably still too small to allow for much improvement over the presented architectures and results.