NLP with scikit-learn
This course covers the basic techniques you need to know in order to fit a Natural Language Processing Machine Learning pipeline using scikit-learn, a machine learning library for Python.
- Learn about the two main scikit-learn classes for natural language processing: CountVectorizer and TfidfVectorizer
- Learn how to create Bag-of-Words (BoW) representations and TF-IDF representations
- Learn how to create a machine learning pipeline to classify BBC news articles into different categories
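To give a flavor of the two vectorizer classes covered in the course, here is a minimal sketch on a made-up toy corpus (the sentences are illustrative, not from the BBC dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the match ended in a draw",
    "the company reported strong profits",
    "the new phone launches next week",
]

# Bag-of-Words: raw term counts per document.
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)
print(X_counts.shape)  # (3, size of the learned vocabulary)

# TF-IDF: term counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.shape)
```

Both classes use the same default tokenization, so they learn the same vocabulary; they differ only in how each cell of the document-term matrix is weighted.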
This course is intended for anyone who wishes to understand how NLP works and, more particularly, how to implement it using scikit-learn.
To get the most out of this course, you should already have an understanding of the Python programming language.
Welcome back. In this lecture, we will apply our knowledge to a real-life example in order to fit a classifier to text data using scikit-learn.
In particular, we will use the BBC News Dataset we used in Lecture 2.
Recall that the dataset is made up of 2,225 articles, each labeled under one of the following five categories: business, entertainment, politics, sport, or tech.
The dataset is broken into 1,490 records for training and 735 for testing. The goal is to build a system that can accurately classify previously unseen news articles into the right category. This lecture has taken inspiration from a Google blog post. The link is provided in the Useful Links section below if you want to get more info on this topic.
We import the data using the pandas read_csv function. In this demo, I am using a Google Colab notebook, so the data is stored inside my personal Google Drive. If you don't know how to set up this environment, I have written a readme for you in the course-specific GitHub repository.
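A sketch of the loading step is below. In the lecture the CSV is read from Google Drive; here we use an in-memory sample with two assumed columns, Category and Text, so the snippet is self-contained:

```python
import io

import pandas as pd

# Stand-in for pd.read_csv("path/to/bbc_news.csv") on Google Drive;
# column names (Category, Text) are assumptions for illustration.
sample_csv = io.StringIO(
    "Category,Text\n"
    "business,Shares rose sharply after the earnings call\n"
    "sport,The striker scored twice in the final\n"
)
df = pd.read_csv(sample_csv)
print(df.head())
```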
Let's print the first five rows of our dataset using the head method. The first step, before jumping into the machine learning pipeline, is to create a new column inside the dataframe that we call Category_id: this column maps each category value to a unique numerical index. To do so, we use a scikit-learn transformer called LabelEncoder, which basically performs two operations: first, it fits to the labels, learning the mapping from each label value to a numerical index; then, it transforms the data, replacing each original label with its mapped index.
Why should I do this step? Well, the reason is pretty simple. A machine learning algorithm requires its inputs to be numerical, and therefore we need to convert a column of type object - just like in our example - into a numerical one. And since the labels are categorical, we assign a one-to-one mapping by means of the LabelEncoder.
We import the LabelEncoder from the scikit-learn preprocessing submodule.
We then initialize the LabelEncoder, and call it LE, and then we transform the Category column into a new numerical dimension by applying the fit_transform method. We store this new representation inside the column Category_id. We can inspect the results of this transformation: business has been converted to the index 0, whereas tech has been converted to the index 4.
We can obviously get back to strings using the inverse_transform method: in this case, we pass a list of numerical indexes, which are then converted by the transformer into a list of strings. For instance, if we pass the list made of 2, 0 and 4, we get politics (associated to index 2), business (associated to index 0), and tech (associated to index 4).
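The encoding round trip can be sketched as follows; note that LabelEncoder assigns indexes in alphabetical order of the labels:

```python
from sklearn.preprocessing import LabelEncoder

categories = ["business", "entertainment", "politics", "sport", "tech"]

le = LabelEncoder()
ids = le.fit_transform(categories)  # classes are sorted alphabetically
print(dict(zip(categories, ids)))   # business -> 0, ..., tech -> 4

# inverse_transform maps numerical indexes back to the original strings
print(le.inverse_transform([2, 0, 4]))  # ['politics' 'business' 'tech']
```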
Ok, we are now ready to jump into the text classification pipeline. To do so, we split the data into two sets: the set of features, namely the column Text, denoted by X, and the set of labels, namely the column Category_id created with the label encoder, denoted by y.
We split the data into train and test sets, using the submodule model_selection in scikit-learn. We create four objects, namely X_train, X_test, y_train, and y_test by using the train_test_split function, passing the set of features, the target, and the argument test_size as equal to 0.3 - that is, we use 30% of the original data as the test set. We also set random_state to 42 - but you can choose whatever you like.
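The split step can be sketched like this; the toy lists stand in for the Text column and the encoded labels of the real dataframe:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: X would be df["Text"], y would be df["Category_id"].
X = ["doc one", "doc two", "doc three", "doc four", "doc five",
     "doc six", "doc seven", "doc eight", "doc nine", "doc ten"]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```

Fixing random_state makes the split reproducible, so you get the same partition every time you rerun the notebook.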
We are now ready to fit a machine learning pipeline using the scikit-learn Pipeline object. We make the necessary imports here. We import the Pipeline class from the pipeline submodule, but we also import two other important classes: the TfidfVectorizer from the feature_extraction dot text submodule, and also LogisticRegression from the scikit-learn linear_model submodule.
We fit a Pipeline object by passing two steps. First, we need to convert the text data into a feature vector. This is done using the TFIDF representation.
In particular, we specify stop_words equal to the string 'english' in order to remove common English words, such as pronouns. We also specify the argument min_df equal to 5 so that we keep only the terms that appear in at least five different documents. We specify the argument ngram_range equal to (1,2), meaning that we keep track of both unigrams and bigrams, and finally, we set the argument sublinear_tf to True, which replaces the raw term frequency tf with 1 + log(tf), dampening the effect of terms that are repeated many times in the same document. I strongly encourage you to use it, especially when you are working with heterogeneous text data.
The second step is to pass a LogisticRegression as a classifier. If you do not know what a Logistic Regression is, I strongly encourage you to watch the course Building a Machine Learning Pipeline with scikit-learn: part 2.
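Putting the two steps together, a minimal sketch of the pipeline might look like this; the toy documents and labels are made up for illustration, while the vectorizer settings mirror those discussed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        stop_words="english",  # drop common English words
        min_df=5,              # keep terms appearing in >= 5 documents
        ngram_range=(1, 2),    # unigrams and bigrams
        sublinear_tf=True,     # 1 + log(tf) scaling
    )),
    ("clf", LogisticRegression()),
])

# Toy training data: repeated documents so that min_df=5 is satisfied.
docs = ["football match goal score"] * 6 + ["market shares profit economy"] * 6
labels = [0] * 6 + [1] * 6

pipe.fit(docs, labels)
preds = pipe.predict(["football goal", "market profit"])
print(preds)
```

Calling fit on the pipeline fits the vectorizer and the classifier in sequence, and predict automatically applies the same TF-IDF transformation to new text before classifying it.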
Here we go: we then fit the pipeline object to the train data, and then we use the trained model to perform the prediction of labels on the test set. For illustration purposes, let's perform this operation on just the first 5 test items.
We also get the true labels for those five records, and we then print both objects. It seems we are performing pretty well, right? However, this result might be biased by the selection of the first five rows. To validate our results, let's predict using the whole test set, and compute some metrics using the classification report, which is imported from the scikit-learn metrics submodule. This function is useful since it returns the most important metrics - precision, recall, and f1 score - for our test set.
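A sketch of the reporting step, with hypothetical true and predicted label vectors standing in for y_test and the pipeline's predictions:

```python
from sklearn.metrics import classification_report

# Placeholder label vectors; in the lecture these are y_test and
# the output of pipe.predict(X_test).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Prints per-class precision, recall, and f1-score, plus averages.
print(classification_report(y_true, y_pred))
```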
So this is impressive! We are estimating all five categories with high precision, with an incredible recall. We can also show the confusion matrix. To produce it, we use the scikit-learn confusion_matrix function. Since it requires a little bit of manual work, the snippet has been provided down below here for you; let's run it. You can see that we are performing pretty well in all categories. We have misclassified some examples, but that can happen sometimes - some texts might speak of both business and tech with the same meaning.
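The core of that snippet can be sketched as follows, again on hypothetical label vectors; in the lecture the matrix is then rendered as a heatmap, which is the manual work mentioned above:

```python
from sklearn.metrics import confusion_matrix

# Placeholder label vectors standing in for y_test and the predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are true labels, columns are predicted labels; the diagonal
# holds the correctly classified examples.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```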
So that brings us to the end of this lecture. Obviously, you could have performed the TFIDF transformation and the fitting phase separately, that is, outside a Pipeline object. However, using the Pipeline object is much cleaner and less error-prone. We have seen how to build a simple text classification pipeline using scikit-learn, and you can embellish it with more complex models. You are now ready to go into the wild and apply these techniques to your own projects. See you in the next lecture, where I will recap what we've covered in the course.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.