The Art of Natural Language Processing: NLP Pipeline

Authors: Andrea Ferrario, Mara Nägelin

Date: February 2020 (updated September 2020)

Notebook to test NLP preprocessing pipelines, as described in the tutorial "The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification".

Table of contents

  1. Getting started with Python and Jupyter Notebook
  2. Test sentence
  3. NLP preprocessing pipelines
    3.1. Conversion of text to lowercase
    3.2. Tokenizers
    3.3. Stopwords removal
    3.4. Part-of-speech tagging
    3.5. Stemming and lemmatization

1. Getting started with Python and Jupyter Notebook

In this section, we initialize the Jupyter Notebook and Python settings. The Python code follows the PEP 8 style guide (PEP stands for Python Enhancement Proposal), with minor variations to improve readability.

2. Test sentence

We introduce the test sentence to which the NLP preprocessing steps are applied.

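We then follow the steps of the NLP pipeline in order: conversion to lowercase, tokenization, stopword removal, part-of-speech tagging, and stemming and lemmatization.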

3. NLP preprocessing pipelines

3.1. Conversion of text to lowercase

We convert the test sentence to lowercase.
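
As a minimal sketch of this step, the snippet below lowercases a sentence with Python's built-in `str.lower()`. The sentence itself is a hypothetical stand-in chosen for illustration, not necessarily the one used in the notebook.

```python
# Hypothetical test sentence (an assumption made for this sketch).
test_sentence = "The Striped Bats were hanging on their feet; they weren't moving!"

# str.lower() returns a new string; the original is left unchanged.
lowered_sentence = test_sentence.lower()

print(lowered_sentence)
```

Lowercasing merges variants such as "The" and "the" into one vocabulary entry, at the cost of losing capitalization cues (for example, for proper nouns).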

3.2. Tokenizers
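
To illustrate what a tokenizer does, the sketch below compares plain whitespace splitting with a simple regular-expression tokenizer from the standard library. Both are simplified stand-ins for the library tokenizers; the sentence and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical, already lowercased test sentence (an assumption for this sketch).
sentence = "the striped bats were hanging on their feet; they weren't moving!"

# Whitespace tokenization: punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()

# Regex tokenization: words (optionally with an apostrophe part such as "n't")
# and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace tokenizer yields tokens such as "feet;" and "moving!", while the regex variant separates the punctuation. Library tokenizers typically apply more elaborate rules, so their output can differ from both.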

3.3. Stopwords removal

We now remove stopwords using NLTK, spaCy and scikit-learn.
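
The mechanics of stopword removal can be sketched without any library: tokens that appear in a stopword set are filtered out. The tiny hand-picked set and the token list below are hypothetical and serve only as an illustration; NLTK, spaCy and scikit-learn each ship their own, much larger English lists, and these lists differ from one another.

```python
# Hypothetical tokens, assumed to be lowercased and stripped of punctuation.
tokens = ["the", "striped", "bats", "were", "hanging", "on",
          "their", "feet", "they", "weren't", "moving"]

# Tiny illustrative stopword set (an assumption; library lists are far larger).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on",
             "the", "their", "they", "to", "were", "weren't"}

# Keep only the tokens that are not stopwords, preserving their order.
filtered_tokens = [token for token in tokens if token not in STOPWORDS]

print(filtered_tokens)
```

Because the result depends entirely on the list used, the same sentence can come out differently with each library.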

3.4. Part-of-speech tagging

We perform part-of-speech (POS) tagging using NLTK.

3.5. Stemming and lemmatization

We perform (Porter) stemming and lemmatization on the test sentence, after tokenization.
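
A minimal sketch of the stemming step, assuming NLTK is installed and using its `PorterStemmer` with default settings. The token list is hypothetical. Lemmatization is not shown here, because NLTK's WordNet-based lemmatizer requires the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

# Hypothetical tokens chosen for illustration (an assumption for this sketch).
tokens = ["bats", "hanging", "feet", "studies"]

stemmer = PorterStemmer()

# Stemming strips suffixes by rule; the result need not be a dictionary word.
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

The stemmer leaves the irregular plural "feet" unchanged and reduces "studies" to the non-word "studi". A lemmatizer, by contrast, aims to return dictionary forms.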