
Preprocessing Text Data

This course covers the basic techniques you need to know in order to fit a Natural Language Processing Machine Learning pipeline using scikit-learn, a machine learning library for Python.

Learning Objectives

  • Learn about the two main scikit-learn classes for natural language processing: CountVectorizer and TfidfVectorizer
  • Learn how to create Bag-of-Words (BoW) representations and TF-IDF representations
  • Learn how to create a machine learning pipeline to classify BBC news articles into different categories

Intended Audience

This course is intended for anyone who wishes to understand how NLP works and, more particularly, how to implement it using scikit-learn.


To get the most out of this course, you should already have an understanding of the Python programming language.


Welcome back. In this lecture, we are going to look at how to process raw text data. In general, when dealing with text data, we need to perform rigorous data cleaning in order to extract useful information from the data. For example, if we want to apply a supervised machine learning algorithm to classify a movie review, we need to transform the textual component into something that looks like a numeric feature, otherwise, the model would not be able to ingest the data in that form. It is a similar process we encountered when dealing with categorical variables in the course Building a Machine Learning Pipeline with scikit-learn: part 1. 

In this lecture, we describe the standard preprocessing pipeline for text analytics, which typically consists of three major steps.

  1. Tokenization. Each text is split into individual words based on user-defined rules, which typically include converting all words to lowercase and removing stop words, repeated words, and punctuation.
  2. Lemmatization. Each word is replaced by its dictionary form, e.g. spoken is replaced by speak, and inflections such as third-person forms are normalized as well.
  3. Stemming. Each remaining word is reduced to its root form.

In particular, we'll focus on the process of tokenization. Please note that it is good practice to perform a few operations at this stage of the pipeline. For example, we typically lowercase the text we wish to tokenize and remove unwanted characters, such as non-alphabetic ones. This step is called normalization of the text, and it is vital for keeping our pipeline consistent.
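As a minimal sketch of this normalization step, the function below lowercases a string and strips non-alphabetic characters with a regular expression before splitting on whitespace. Note that normalize is a hypothetical helper written here for illustration, not a scikit-learn function:

```python
import re

def normalize(text):
    # Lowercase the text, as is standard before tokenization
    text = text.lower()
    # Replace every non-alphabetic, non-whitespace character with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace to obtain the individual tokens
    return text.split()

tokens = normalize("The cat is on the sofa!")
# tokens is now ["the", "cat", "is", "on", "the", "sofa"]
```

As we will see below, CountVectorizer performs an equivalent lowercasing and tokenization internally, so in practice we rarely need to write this by hand.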

For the sake of completeness, scikit-learn was not born as an NLP library, and therefore it has a few limitations. One of them is that it does not perform lemmatization by default, so if we need that, we have to use libraries other than scikit-learn. You can plug normalization steps from NLTK or spaCy into the CountVectorizer if you want, but by default scikit-learn only lowercases the tokens.

So, tokenization is the process that turns a string or document into tokens, and it is the first step in preparing a text for NLP. A token is a meaningful unit of text, typically a word, from which we can infer some context. Note that a word such as don't will be split into two tokens after tokenization, that is do and n't, so we need to take care of contractions, capitalization, and special characters when carrying out this important step.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting, and normalization) is called the Bag-of-Words (or BoW) representation.

The scikit-learn CountVectorizer class is a good candidate for producing the Bag-of-Words representation. All the preprocessing steps can be done inside this class by specifying a few of its arguments.

It is a transformer, since it embeds the input into a different dimension. For more information on scikit-learn transformers, please check out the course Building a Machine Learning Pipeline with scikit-learn: part 1 available in our content library.

It is worth showing you a small example before jumping into a real-life application. In particular, let's consider a list of two strings. It can be a list or any other iterable, since the transformation process requires the input to be an iterable.

For example, we create a list made of the string "The cat is on the sofa." and another one, "Authentication is performed Machine to machine." Let's store this object in the variable my_data.

To ingest a list of strings inside a model, we need to convert them into a numerical vector. A possible strategy is to represent a corpus of documents by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We would like to apply the CountVectorizer to normalize the text and perform Bag of Words on the cleaned tokens. 

We import the CountVectorizer from the scikit submodule feature_extraction dot text.

We first initialize the class CountVectorizer, and then we apply the fit_transform method on the list of strings we wish to tokenize. 

CountVectorizer.fit_transform() returns a SciPy Compressed Sparse Row matrix, which has a toarray() method to convert it into a dense matrix.
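The steps above can be sketched as follows, using the two example strings from earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

my_data = ["The cat is on the sofa.",
           "Authentication is performed Machine to machine."]

# Initialize the class, then fit and transform in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(my_data)  # SciPy CSR sparse matrix

# Convert the sparse matrix into a dense NumPy array for inspection
dense = X.toarray()
```

With the default settings, the corpus above yields a 2-by-9 matrix: one row per document and one column per distinct (lowercased) token. Note that "machine" gets a count of 2 in the second row, since lowercasing merges "Machine" and "machine".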

Now, can you spot a potential problem here? Well, the CountVectorizer is taking into account stop words, such as the, is, and to. Why should we remove them? Well, they typically introduce some noise, in the sense that a pattern in a sentence is identified by relevant words, and not by common terms that are used to give a linguistic meaning to a bunch of tokens. For a classic NLP pipeline, we usually remove them.

So we specify the parameter stop_words as equal to “english” and also the argument lowercase as equal to True.

By default, the CountVectorizer makes the tokens lowercase. This is the standard in any ML pipeline (we do not want the token “THE” to be different from its lowercase counterpart), so my advice is not to disable it.

We can check which terms have not been removed as stop words, and we see that the stop words have indeed been removed.

In order to do that, we use the get_feature_names method, and we see that we are getting these words. We also get a warning, since get_feature_names is deprecated in version 1.0 and will be removed in version 1.2. The library suggests using get_feature_names_out instead, so let us use it. Here we go: we get an array with the terms not removed as stop words.

Let's try to apply the same logic to a real dataset. We will use a public dataset composed of 2225 BBC articles, each labeled under one of 5 categories: business, entertainment, politics, sport, or tech. The dataset is broken into 1490 records for training and 735 for testing. You can find it either in our course-specific GitHub repository or from the original source. In this lecture, we just need the training set, and we import it using the pandas read_csv method.

Please be sure you get the correct path where you store the data inside the function.

Let's also print the first five rows of our dataframe.

Since the transformation process requires an iterable, we store the texts into a list, using the series method to_list.
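Put together, the loading step might look like the sketch below. The path and the column names (category, text) are assumptions about how the CSV is laid out; here a small in-memory CSV stands in for the real file so the snippet is self-contained:

```python
import io

import pandas as pd

# Stand-in for pd.read_csv("path/to/bbc_train.csv") - replace the
# StringIO buffer with the actual path where you stored the data
csv_content = (
    "category,text\n"
    "sport,the match ended in a draw\n"
    "tech,the new phone was released"
)
df = pd.read_csv(io.StringIO(csv_content))

# The transformation step expects an iterable, so convert the column
# holding the article bodies to a plain Python list
texts = df["text"].to_list()
```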

We then fit a CountVectorizer by setting lowercase=True and stop_words='english'. We apply the fit_transform method to the list of texts and store the result in a pandas dataframe for readability. We also remove words that appear in fewer than 4 documents using the min_df argument.

We then wrap the output of the vectorizer's fit_transform method on our corpus in a pandas dataframe, specifying the columns using the vectorizer's get_feature_names_out method, and store the result in the variable capital X.

We inspect the first 5 rows with the head method. We have 8312 columns (thanks to the min_df argument; otherwise it would have been 24456 columns), each describing a token in our corpus. The row values describe the number of times that specific token appeared in that text - remember that each single row is a text.

Also, we can set max_features as equal to 1000 so that the vectorizer will build a vocabulary of top 1000 words (by frequency). This means that each text in our dataset will be converted to a vector of size 1000.

How can we check that? Well, let's first print the shape of X.

We then create a new object named vectorizer_fixed, which is the CountVectorizer with the argument max_features equal to 1000. We create a new dataframe called X_fixed, built exactly like X above but using vectorizer_fixed. We see that we have 1000 columns, as expected.

We can try to plot the frequency distribution of the tokens over the corpus. To do so, we sum over each single column and retain the top 20 tokens in descending order for readability. This has been done for you in the next snippet, using the pandas sort_values method with respect to the count column.

We also create the top-20-tokens dataframe by calling head(20) on the terms_distribution variable, and store the result in the variable terms_distribution_top.

We then plot those top 20 tokens using two important data visualization libraries: matplotlib and seaborn. We import them - in particular, just the pyplot submodule from matplotlib as plt, and seaborn as sns. We then initialize a figure and an axes object with the subplots function.

We then assign to the axes a seaborn barplot, with x being equal to terms_distribution_top.index and y equal to terms_distribution_top.top_count.

Also, for readability, we apply the set_xticklabels method so that the tokens' labels are rotated by 45 degrees.
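The plotting steps can be sketched as below. The tokens and counts in terms_distribution_top are hypothetical stand-ins for the real BBC top-20 table, and the non-interactive Agg backend is selected so the snippet runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical top-token counts standing in for the real BBC results
terms_distribution_top = pd.DataFrame(
    {"top_count": [120, 95, 80]},
    index=["said", "year", "people"])

# Initialize a figure and an axes object
fig, ax = plt.subplots(figsize=(8, 4))

# Draw a seaborn barplot of token counts onto the axes
sns.barplot(x=terms_distribution_top.index,
            y=terms_distribution_top.top_count,
            ax=ax)

# Rotate the tick labels by 45 degrees for readability
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
```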

Note that despite its popularity, the Bag-of-Words method has several disadvantages. On the one hand, the ordering of tokens is completely lost, which implies that different sentences might have the same numerical representation. On the other hand, BoW ignores the semantics of the words, that is, how close in meaning two (or more) words are.

BoW can be a great way to determine the significant words in a text, based on the number of times they are used. However, the above frequency matrix does not take into account the importance of each single word within a document. In other words, it tends to give more importance to popular words, and less to contextual words, which might be relevant for language understanding purposes. To solve this issue, we can use the TFIDF matrix, which will be discussed in detail in the next lecture. See you there.

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.