# The Art of Natural Language Processing: RNNs for the Case Study

### Authors: Andrea Ferrario, Mara Nägelin

Date: February 2020 (updated September 2020)

Notebook to run the RNNs in the Contemporary Approach, as described in the tutorial 'The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification'.

1. Getting started with Python and Jupyter Notebook
2. Import data
3. Data preprocessing
3.1. Remove duplicates
3.2. Shuffle the data
3.3. Minimal preprocessing (detailed)
3.4. Minimal preprocessing with Keras
4. Deep learning
4.1. Train test split
4.2. Define the model
4.2.1. Shallow LSTM
4.2.2. Shallow GRU
4.2.3. Deep LSTM
4.3. Train the model
4.4. Evaluate the model on test data
5. Final remarks

# 1. Getting started with Python and Jupyter Notebook

In this section, Jupyter Notebook and Python settings are initialized. For Python code, the PEP 8 standard ("PEP" = "Python Enhancement Proposal") is enforced, with minor variations to improve readability.

# 2. Import data

First, we import the raw data from the original 50'000 text files and save them to a dataframe. This step only needs to be run once; after that, one can start directly with Section 3. The following code snippet is based on the book Python Machine Learning by Raschka and Mirjalili, Chapter 8 (see tutorial).
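A minimal sketch of this import step, assuming the raw files are laid out like the ACL IMDb release (`<basepath>/{train,test}/{pos,neg}/*.txt`); the directory names and the helper `load_reviews` are illustrative, not the exact code from the book:

```python
import os
import tempfile

import pandas as pd

def load_reviews(basepath):
    """Collect all review files under basepath into one dataframe."""
    labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('train', 'test'):
        for sentiment in ('pos', 'neg'):
            path = os.path.join(basepath, split, sentiment)
            if not os.path.isdir(path):
                continue
            for fname in sorted(os.listdir(path)):
                with open(os.path.join(path, fname), encoding='utf-8') as f:
                    rows.append((f.read(), labels[sentiment]))
    return pd.DataFrame(rows, columns=['review', 'sentiment'])

# Example with a throwaway directory standing in for the real corpus:
with tempfile.TemporaryDirectory() as tmp:
    d = os.path.join(tmp, 'train', 'pos')
    os.makedirs(d)
    with open(os.path.join(d, '0_9.txt'), 'w', encoding='utf-8') as f:
        f.write('A wonderful film.')
    df = load_reviews(tmp)

print(df.shape)  # (1, 2)
```

The resulting dataframe can then be written once to disk (e.g. with `df.to_csv(...)`) so the 50'000 files never need to be read again.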

# 3. Data preprocessing

Next, we prepare the raw data such that it can be used as input for a neural network. Again, we follow the example of Raschka and Mirjalili (Chapter 16). We perform the following steps:

• We remove all duplicates.
• We shuffle the data in a random permutation.
• We apply only minimal preprocessing (i.e. convert to lowercase and split on whitespaces and punctuation).
• We map each word bijectively to an integer value.
• We set each review to an equal length $T$ by padding with $0$ or slicing as required.

The last three steps are written out in detail in Section 3.3 to give the reader an understanding of what exactly happens to the data. However, they can also be carried out (almost equivalently) using the high-level text preprocessing functionalities of the tensorflow.keras module; see Section 3.4. The user needs to run only one of these two subsections to preprocess the data.
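The last three steps above can be sketched in plain Python; the two example reviews and the common length $T$ are illustrative:

```python
import re
from collections import Counter

reviews = [
    "This movie was great!",
    "This movie was terrible...",
]

# Minimal preprocessing: lowercase, split on whitespace and punctuation.
tokenized = [re.findall(r"[a-z0-9']+", text.lower()) for text in reviews]

# Map each word bijectively to an integer; index 0 is reserved for padding.
counts = Counter(word for tokens in tokenized for word in tokens)
word_to_int = {word: i for i, (word, _) in
               enumerate(counts.most_common(), start=1)}
encoded = [[word_to_int[w] for w in tokens] for tokens in tokenized]

# Bring every review to the same length T by left-padding with 0
# or slicing off the beginning.
T = 6
padded = [([0] * T + seq)[-T:] for seq in encoded]

print(padded[0])  # [0, 0, 1, 2, 3, 4]
```

Left-padding keeps the most recent words at the end of each sequence, which is where the recurrent layer's final hidden state is computed.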

The transformed data is stored in a dataframe for convenience. Hence Section 3 also needs to be run only once; afterwards one can jump directly to Section 4.

The following can be used to reimport the dataframe with the raw data generated in Section 2 above.

## 3.3. Minimal preprocessing (detailed)

The following snippets are in part adapted from Raschka and Mirjalili (Chapter 16).

## 3.4. Minimal preprocessing with Keras

Lastly, we save the fully preprocessed data to a csv file for further use.
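The Keras-based route can be sketched as follows; the two example reviews and `maxlen` are illustrative, and the API shown is the TF 2.x one used at the time of writing:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

reviews = ["This movie was great!", "This movie was terrible..."]

# Tokenizer lowercases, strips punctuation and maps words to integers
# by descending frequency (index 0 is reserved for padding).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews)
encoded = tokenizer.texts_to_sequences(reviews)

# pad_sequences left-pads with 0 (or truncates) to a common length T.
padded = pad_sequences(encoded, maxlen=6)
print(padded.shape)  # (2, 6)
```

The resulting integer array can then be saved to csv exactly as in the detailed route of Section 3.3.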

# 4. Deep Learning

In this section, we reproduce the results from Sections 6.4.6-6.4.7 of the tutorial. We split the preprocessed data into a training and a testing set and define our RNN model (three different possible architectures are given as examples; see Sections 4.2.1-4.2.3). The model is compiled and trained using the high-level tensorflow.keras API. Finally, the development of loss and accuracy during training is plotted, and the fitted model is evaluated on the test data.

WARNING: Note that training with a large training dataset for a large number of epochs is computationally intensive and might easily take a couple of hours on a normal CPU machine. We recommend subsetting the training and testing datasets and/or using an HPC infrastructure.
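The train-test split of Section 4.1 and the recommended subsetting can be sketched as follows; the array sizes, the 80/20 ratio and the subsample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Stand-in for the preprocessed data: N padded reviews of length T.
N, T = 1000, 100
X = rng.integers(0, 5000, size=(N, T))
y = rng.integers(0, 2, size=N)

# 80/20 train-test split (the data was already shuffled in Section 3.2,
# so a simple slice suffices).
split = int(0.8 * N)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Optional: subsample the training set to keep CPU runtimes manageable.
X_train_small, y_train_small = X_train[:200], y_train[:200]

print(X_train.shape, X_test.shape)  # (800, 100) (200, 100)
```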

## 4.2. Define the model

Each of the following subsections defines a distinct model architecture. The user can select and run one of them.

### 4.2.1. Shallow LSTM architecture

This is a shallow RNN with just one LSTM layer. The same architecture was used for the example by Raschka and Mirjalili.
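A minimal sketch of such a shallow LSTM in tensorflow.keras; the vocabulary size, embedding dimension and number of hidden units are illustrative assumptions, not the tutorial's tuned values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Illustrative hyperparameters.
vocab_size, embed_dim, hidden_units = 5000, 128, 64

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim),
    LSTM(hidden_units),               # single recurrent layer
    Dense(1, activation='sigmoid'),   # binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
print(len(model.layers))  # 3
```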

### 4.2.2. Shallow GRU architecture

This is essentially the same shallow RNN as above with a GRU layer instead of the LSTM.
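Sketched with the same illustrative hyperparameters as before, the only change is the recurrent layer:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# Same shallow architecture, with the LSTM layer swapped for a GRU,
# which has fewer parameters per hidden unit.
model = Sequential([
    Embedding(input_dim=5000, output_dim=128),
    GRU(64),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```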

### 4.2.3. Deep LSTM architecture

We can easily deepen our network by stacking a second LSTM layer on top of the first.
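A sketch of the stacked variant (hyperparameters again illustrative); the key detail is that every recurrent layer except the last must return its full sequence of hidden states so the next LSTM receives a 3-D input:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=128),
    LSTM(64, return_sequences=True),  # pass the whole sequence upward
    LSTM(64),                         # final layer returns last state only
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```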

## 4.3. Train the model

WARNING: As mentioned in the tutorial, the following training routine is computationally intensive. We recommend sub-sampling the data in Section 4.1 and/or using an HPC infrastructure. Note that we ran all the machine learning routines presented in this section on the ETH High Performance Computing (HPC) infrastructure Euler, submitting all jobs to a virtual machine consisting of 32 cores with 3072 MB RAM per core (total RAM: 98.304 GB). Therefore, notebook outputs are not available for the subsequent cells.
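The training routine can be sketched as follows on a tiny random stand-in for the preprocessed reviews, so it runs in seconds; the epoch count, batch size and validation split are illustrative, not the settings used on Euler:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Tiny synthetic data standing in for the padded integer sequences.
X_train = np.random.randint(0, 500, size=(64, 20))
y_train = np.random.randint(0, 2, size=64)

model = Sequential([
    Embedding(input_dim=500, output_dim=16),
    LSTM(8),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# history.history holds the per-epoch loss and accuracy values that
# are plotted after training.
history = model.fit(X_train, y_train, epochs=2, batch_size=16,
                    validation_split=0.25, verbose=0)
print(sorted(history.history.keys()))
```

The per-epoch series in `history.history` (training and validation loss/accuracy) are what the plots in this section are drawn from, e.g. with `matplotlib.pyplot.plot`.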

# 5. Final remarks

The above example RNNs are simple architectures where none of the parameters were optimized for performance. In order to further improve the model accuracy, we could, for example:

• play around with the network architecture
(e.g. the depth of the network, the type of layers used, the number of hidden units within a layer, the activation functions used, ...)
• fine-tune the training parameters
(i.e. the number of epochs, batch size, ...)
• perform more elaborate preprocessing on the data
(e.g. excluding stopwords, see also the two other Notebooks and Section 1 of the tutorial)
• use the weights of an already trained embedding for our embedding layer
(either as non-trainable fixed weights or with transfer learning; compare the Notebook `NLP_IMDb_Case_Study_ML.ipynb` and Section 3 of the tutorial)

Finally, note that the size of the dataset is arguably still too small to allow for much improvement over the presented architectures and results.