NLP with scikit-learn
This course covers the basic techniques you need to know in order to fit a Natural Language Processing Machine Learning pipeline using scikit-learn, a machine learning library for Python.
- Learn about the two main scikit-learn classes for natural language processing: CountVectorizer and TfidfVectorizer
- Learn how to create Bag-of-Words (BoW) representations and TF-IDF representations
- Learn how to create a machine learning pipeline to classify BBC news articles into different categories
This course is intended for anyone who wishes to understand how NLP works and, more particularly, how to implement it using scikit-learn.
To get the most out of this course, you should already have an understanding of the Python programming language.
Welcome back. In this lecture, we are going to add a fundamental concept to our NLP survival toolkit, which goes under the name of Term Frequency-Inverse Document Frequency matrix. That’s quite a mouthful! We will understand why this is important, and how to apply it using scikit-learn. So let’s get started.
As we’ve seen, Bag of Words can be a great way to determine the most frequent words in a text, based on the number of times they are used in that context. However, it does not take into account the importance of each single word within a document. In other words, it tends to give more importance to popular words, and less to contextual words.
An alternative method is the Term Frequency-Inverse Document Frequency (TF-IDF) matrix, which weights each word by how frequent it is in a document and how rare it is across the corpus. But how does it work? Basically, the weight of a term grows in proportion to its frequency in the document, but is offset by the number of documents that contain it.
So for example, for the term i in document j, we compute its weight using the following formula, where tf_ij denotes the number of occurrences of term i in document j, df_i denotes the number of documents containing term i, and N denotes the total number of documents.
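Written out, the standard form of this weighting is the following (note that scikit-learn's implementation applies a slightly smoothed variant of the IDF term by default):

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\!\left(\frac{N}{\mathrm{df}_i}\right)
```

The logarithm of N over df_i is the inverse document frequency: it shrinks toward zero as a term appears in more and more documents.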
So higher scores are associated with words that are specific to a particular document and mostly used in that context, while lower scores are assigned to words that appear frequently across many documents. In other words, the higher the score, the more relevant the word is for that particular document.
From a practical point of view, we typically use scikit-learn's TfidfVectorizer class to calculate the TF-IDF matrix. Note that the scikit-learn TF-IDF implementation performs, by default, an L2 normalization, which means we divide each row vector by its Euclidean norm. That normalizes each word's weight with respect to the length of the document, so we only count how often a word appears relative to how long the document is.
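To see the effect of that default L2 normalization, here is a minimal sketch (the corpus is a hypothetical stand-in for the list of strings used in the lecture): every row of the resulting matrix has unit Euclidean norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; any list of strings works
corpus = [
    "machine learning with python",
    "a pipeline for machine learning",
    "natural language processing",
]

# norm="l2" is already the default; it is spelled out here for clarity
vectorizer = TfidfVectorizer(norm="l2")
tfidf = vectorizer.fit_transform(corpus)

# Because of the L2 normalization, every row has unit Euclidean norm
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```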
Let's open a Jupyter notebook like the one you see here on my screen.
For simplicity, I have already created a list of strings for illustration purposes. You can find a student notebook version in the course-specific GitHub repository. Let's run this cell.
We import TfidfVectorizer from the scikit-learn submodule feature_extraction.text. We also import pandas as pd, since we are going to convert the results into a pandas DataFrame. We first initialize the vectorizer, passing the string “english” to the stop_words argument, which tells scikit-learn to discard commonly occurring English words. Also, remember that each token is lowercased by default.
We then fit and transform the vectorizer on the dataset, and store the result in the variable tfidf_vect. We then convert tfidf_vect to a dense matrix by applying the todense method, wrap it in a pandas DataFrame, and specify the vectorizer's feature names as the columns of that DataFrame. We’ll call this object tfidf_df.
Let's print the dataframe. Note that unlike the Bag of Words representation we saw in Lecture 2, where we had d-dimensional vectors of discrete counts, the TF-IDF matrix instead contains continuous values, as you can easily see here from a simple print. Now let's try to understand when this model is useful in practice.
There are many applications where the TF-IDF algorithm can be applied in practice. And of those, it’s worth mentioning Information retrieval. TF-IDF can be used to extract patterns within (and between) documents, based on analyzed tokens. This might help, for instance, in a scenario where you have a few descriptions of some items (say books) and you want to group them by the similarity of content. To do this, we obviously use the TF-IDF matrix.
The second relevant application is Terms extraction. Remember that in a TF-IDF matrix, the higher the score associated with a token in a document, the more specific that token is. Hence, we can think of using those scores to extract relevant terms that identify a specific document, so we can very easily use some sort of TF-IDF score computation to extract the keywords from a text.
For the sake of illustration, let's focus first on the information retrieval task, and take the aforementioned list of documents. To compare pairs of documents, we use a pairwise similarity measure; in practice, we perform a matrix multiplication to obtain pairwise similarities between documents.
We perform this matrix multiplication on the TF-IDF matrix itself: I simply take the TF-IDF matrix, multiply it by its transpose, and store the result in the variable pairwise_sim. To obtain a dense array, I apply the todense method to pairwise_sim.
Take the first document. According to the matrix above, document number three is the most similar to it. That makes sense, right? If not just print the dataset and check the first document versus the third: they both speak about machine learning pipelines.
Now, let's focus on key terms extraction.
Let's take the TF-IDF values for the first document and see which terms stand out. We basically take the TF-IDF scores from the first document, and store them in a pandas DataFrame sorted by score.
Don’t worry: this has been done for you in the next snippet.
We now show the results: the term train is the one with the highest score. Indeed, the terms train and test appear only in that document and are therefore quite specific to it. By contrast, terms like machine or learning are common across our corpus, and therefore they have lower scores.
Now, let’s consider the following two strings:
- It’s boring, not fun at all
- It’s fun, not boring at all
For a human being, those two strings obviously mean different things, but from a machine's perspective, they share exactly the same structure! This means a model built on such a representation cannot distinguish their meaning, and therefore our algorithm might suffer from misclassification.
What we want to stress is that the two BoW representations are exactly the same, but the original texts have a different meaning. Hence, BoW representation has a drawback: it can’t grasp the order in which the words are given in the sentence.
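We can verify this directly with scikit-learn's CountVectorizer: the two sentences produce identical count vectors, because Bag of Words discards word order entirely.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It's boring, not fun at all", "It's fun, not boring at all"]

# Plain Bag of Words: word order is discarded, only counts remain
bow = CountVectorizer().fit_transform(docs).toarray()
print(bow)

# The two rows are identical even though the sentences mean opposite things
print(np.array_equal(bow[0], bow[1]))  # True
```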
Luckily, we can capture the impact of a word’s neighborhood by taking into account not just single tokens (i.e. unigrams) but also the counts of pairs (bigrams) or triplets (trigrams) of words that appear next to each other. More generally, sequences of n tokens are known as n-grams.
This can easily be done inside the scikit-learn TfidfVectorizer, where we can change the range of tokens that are considered to be features by changing the ngram_range parameter, which is a tuple consisting of the minimum and the maximum length of the sequences of tokens that we wish to consider.
How can we do that? As before, we call TfidfVectorizer, but now we pass the ngram_range argument equal to the tuple (1, 2). Here we take into account all unigrams and bigrams.
Let's print the tfidf_df dataframe: you can see that we now have bigrams as well, not just unigrams.
Now, let's try to see which kind of terms are now relevant for document 1: this is done for you in the next snippet. A simple print of the first ten rows shows that a few pairs of tokens are very specific to the document, such as "data train" or "train test".
If we print the tail of that dataframe, we see the terms with the lowest scores: those are the terms that are not specific to this particular document.
For most applications, the minimum number of tokens should be one, as single words often capture a lot of meaning. Adding bigrams helps in most cases, but longer sequences might lead to overfitting. As a rule of thumb, the number of possible bigrams grows roughly as the square of the number of unigrams, and the number of possible trigrams as its cube, leading to very large feature spaces.
So to conclude, we have seen two different ways of extending the Bag of Words representation we studied in Lecture 2. We’ve investigated the Term Frequency-Inverse Document Frequency representation, which is used as a feature vector, namely a numerical representation of the corpus, and understood its main applications. Also, we have seen that we can take into account the impact of some words on other words using n-grams. In the next lecture, we are going to apply those concepts in a Machine Learning pipeline. See you there!
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.