The Art of Natural Language Processing: Machine Learning for the Case Study

Authors: Andrea Ferrario, Mara Nägelin

Date: February 2020 (updated September 2020)

Notebook to run the machine learning modeling for the Classical and Modern Approaches, as described in the tutorial "The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification".

Table of contents

  1. Getting started with Python and Jupyter Notebook
  2. Import data
  3. Duplicated reviews
  4. Data preprocessing
  5. POS-tagging
  6. Pre-trained word embeddings
  7. Data analytics
    7.1. A quick check of data structure
    7.2. Basic linguistic analysis of movie reviews
  8. Machine learning
    8.1. Adaptive boosting (ADA)
      8.1.1. Bag-of-words
      8.1.2. Bag-of-POS
      8.1.3. Embeddings
    8.2. Random forests (RF)
      8.2.1. Bag-of-words
      8.2.2. Bag-of-POS
      8.2.3. Embeddings
    8.3. Extreme gradient boosting (XGB)
      8.3.1. Bag-of-words
      8.3.2. Bag-of-POS
      8.3.3. Embeddings

1. Getting started with Python and Jupyter Notebook

In this section, Jupyter Notebook and Python settings are initialized. Python code follows the PEP 8 style guide ("PEP" stands for "Python Enhancement Proposal"), with minor deviations to improve readability.
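As an illustration, a settings cell along the following lines could initialize the notebook. The specific options and the seed value are our assumptions, not prescribed by PEP 8 or the tutorial:

```python
import warnings

import numpy as np
import pandas as pd

# display all dataframe columns when inspecting the reviews
pd.set_option('display.max_columns', None)

# silence warnings that clutter long cross-validation runs
warnings.filterwarnings('ignore')

# fix the NumPy seed so sub-sampling and model fits are reproducible
np.random.seed(42)
```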

2. Import data
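A minimal sketch of the import step, assuming the movie reviews have been consolidated into a single CSV file with columns `review` and `sentiment` (the file name and column names are hypothetical; adapt them to your local copy of the data):

```python
import pandas as pd

# hypothetical path and column names; adapt to your local copy of the dataset
df = pd.read_csv('reviews.csv')

print(df.shape)
df.head()
```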

3. Duplicated reviews
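Deduplication can be sketched as follows, reusing the `df` and the `review` column assumed above:

```python
# count exact duplicates on the raw review text, then drop them
n_duplicates = df.duplicated(subset='review').sum()
print(f'{n_duplicates} duplicated reviews')

df = df.drop_duplicates(subset='review').reset_index(drop=True)
```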

4. Data preprocessing
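The sketch below illustrates a typical NLTK preprocessing cell (lowercasing, stripping HTML markup, tokenization, stopword removal and stemming). The exact steps and their order are assumptions; the tutorial's pipeline may differ:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, strip HTML markup and non-letters, remove stopwords, stem."""
    text = re.sub(r'<[^>]+>', ' ', text.lower())  # movie reviews often contain tags such as <br />
    text = re.sub(r'[^a-z]', ' ', text)
    tokens = nltk.word_tokenize(text)
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

df['tokens'] = df['review'].apply(preprocess)
```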

5. POS-tagging
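A minimal POS-tagging sketch with NLTK, applied to the tokenized reviews from the preprocessing step above:

```python
import nltk

nltk.download('averaged_perceptron_tagger')

# tag each tokenized review; nltk.pos_tag returns (token, tag) pairs
# with Penn Treebank tags such as 'NN' or 'VBD'
df['pos'] = df['tokens'].apply(lambda toks: [tag for _, tag in nltk.pos_tag(toks)])
```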

6. Pre-trained word embeddings
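The sketch below builds review-level features by averaging pre-trained word vectors. The specific model loaded from the gensim-data repository is an assumption; the tutorial's choice of embeddings may differ:

```python
import numpy as np
import gensim.downloader as api

# any pre-trained model from the gensim-data repository works here (large download)
word_vectors = api.load('glove-wiki-gigaword-300')

def embed(tokens, dim=300):
    """Average the vectors of all in-vocabulary tokens (zero vector if none)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_emb = np.vstack(df['tokens'].apply(embed).tolist())
```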

7. Data analytics

We reproduce the main data analytics results from Section 6.3 of the tutorial. For simplicity, we use the preprocessed and deduplicated data.

7.1. A quick check of data structure
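A quick structural check might look as follows, assuming the `df` from the import step:

```python
# shape, column types and label balance of the deduplicated data
print(df.shape)
df.info()
print(df['sentiment'].value_counts())
```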

7.2. Basic linguistic analysis of movie reviews
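One basic linguistic statistic, sketched under the assumptions of the preprocessing step above, is the review length in tokens split by sentiment label:

```python
# review length in tokens, summarized per sentiment label
df['n_tokens'] = df['tokens'].str.len()
print(df.groupby('sentiment')['n_tokens'].describe())
```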

8. Machine learning

We replicate the machine learning pipelines from the tutorial, Section 6.4 (Classical and Modern Approaches).

WARNING: as mentioned in the tutorial, the following cross-validation routines are computationally intensive. We recommend sub-sampling the data and/or using HPC infrastructure (setting the parameter n_jobs in GridSearchCV() accordingly). Test runs can also be launched on reduced hyperparameter grids. Note that we ran all the machine learning routines presented in this section on the ETH High Performance Computing (HPC) infrastructure Euler, submitting all jobs to a virtual machine consisting of 32 cores with 3072 MB RAM per core (total RAM: 98.304 GB). Therefore, notebook cell outputs are not available for this section.
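For a quick test run, sub-sampling can be sketched as follows (the sample size is an arbitrary placeholder):

```python
from sklearn.model_selection import train_test_split

# keep a stratified sub-sample for a quick test run; 5,000 is an arbitrary choice
df_small, _ = train_test_split(df, train_size=5_000,
                               stratify=df['sentiment'], random_state=42)
```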

8.1. Adaptive boosting (ADA)

We use the adaptive boosting (ADA) algorithm on top of the NLP pipelines (bag-of-words and bag-of-POS models, and pre-trained word embeddings).

8.1.1. Bag-of-words
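A minimal sketch of an ADA bag-of-words pipeline with cross-validated grid search. The vocabulary size and the hyperparameter values are placeholders for a reduced test run, not the tutorial's grid:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42)

pipe = Pipeline([
    ('bow', CountVectorizer(max_features=10000)),
    ('ada', AdaBoostClassifier(random_state=42)),
])

# deliberately reduced grid; the values are placeholders, not the tutorial's
param_grid = {'ada__n_estimators': [100, 500],
              'ada__learning_rate': [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```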

8.1.2. Bag-of-POS
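Bag-of-POS features can be built by treating each review's tag sequence as a pseudo-document, assuming the `pos` column from the POS-tagging sketch above; the same classifier pipeline then applies with this matrix as input:

```python
from sklearn.feature_extraction.text import CountVectorizer

# join each review's POS tags into a pseudo-document and vectorize as usual
pos_docs = df['pos'].apply(' '.join)
pos_vectorizer = CountVectorizer(lowercase=False)  # tags such as 'NN' are already normalized
X_pos = pos_vectorizer.fit_transform(pos_docs)
```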

8.1.3. Embeddings

8.2. Random forests (RF)

We use the random forests (RF) algorithm on top of the NLP pipelines (bag-of-words and bag-of-POS models, and pre-trained word embeddings).

8.2.1. Bag-of-words
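The RF variant differs from the ADA sketch above only in the classifier and its grid. A minimal sketch, reusing `X_train` and `y_train` from the ADA section and with placeholder hyperparameter values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rf_pipe = Pipeline([
    ('bow', CountVectorizer(max_features=10000)),
    ('rf', RandomForestClassifier(random_state=42)),
])

# placeholder grid, reduced for a test run
rf_grid = {'rf__n_estimators': [200, 500],
           'rf__max_features': ['sqrt', 'log2']}

rf_search = GridSearchCV(rf_pipe, rf_grid, scoring='accuracy', cv=5, n_jobs=-1)
rf_search.fit(X_train, y_train)
```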

8.2.2. Bag-of-POS

8.2.3. Embeddings

8.3. Extreme gradient boosting (XGB)

We use the extreme gradient boosting (XGB) algorithm on top of the NLP pipelines (bag-of-words and bag-of-POS models, and pre-trained word embeddings). If the import of xgboost fails, the cell below can be used to install it.
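```python
# run once in the notebook if `import xgboost` fails;
# installs into the environment of the running kernel
import sys
!{sys.executable} -m pip install xgboost
```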

8.3.1. Bag-of-words
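A minimal sketch of the XGB bag-of-words pipeline, again reusing `X_train` and `y_train` from the ADA section with placeholder hyperparameter values; note that recent xgboost versions require the class labels to be encoded as 0/1 integers:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

xgb_pipe = Pipeline([
    ('bow', CountVectorizer(max_features=10000)),
    ('xgb', XGBClassifier(random_state=42)),  # assumes y_train is encoded as 0/1
])

# placeholder grid, reduced for a test run
xgb_grid = {'xgb__n_estimators': [100, 500],
            'xgb__max_depth': [3, 6]}

xgb_search = GridSearchCV(xgb_pipe, xgb_grid, scoring='accuracy', cv=5, n_jobs=-1)
xgb_search.fit(X_train, y_train)
```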

8.3.2. Bag-of-POS

8.3.3. Embeddings