This course covers the basic techniques you need to know in order to fit a machine learning pipeline for Natural Language Processing (NLP) using scikit-learn, a machine learning library for Python.
Learning Objectives
- Learn about the two main scikit-learn classes for natural language processing: CountVectorizer and TfidfVectorizer
- Learn how to create Bag-of-Words (BoW) representations and TF-IDF representations
- Learn how to create a machine learning pipeline to classify BBC news articles into different categories
Intended Audience
This course is intended for anyone who wishes to understand how NLP works and, more particularly, how to implement it using scikit-learn.
Prerequisites
To get the most out of this course, you should already have an understanding of the Python programming language.
Welcome! My name is Andrea Giussani, and I will be your instructor for this course on Natural Language Processing with scikit-learn. This is an introductory course, and I will show you the basic techniques that you need to know to fit a machine learning pipeline that involves text input data.
Text analysis is a major application field for machine learning algorithms. However, raw text, which is just a sequence of symbols, cannot be fed directly into most algorithms, because they expect numerical feature vectors of a fixed size rather than documents of variable length.
Notably, text data comes from very heterogeneous sources. This kind of data is unstructured, and therefore we need to perform specific preprocessing before we can fit a machine learning algorithm to a corpus of documents.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (for example, a word) occurring in the corpus. We will learn how to represent it using scikit-learn.
Also, it is important that we standardize those texts into a machine-friendly format: we want our model to treat words with the same meaning as the same token. Consider the words dog and dogs: strictly speaking, they are different strings, but they refer to the same concept. Likewise, the words produce, produced, and producing should be reduced to the same root, regardless of their grammatical form.
Since you’ve ended up here, it’s safe to say you’re interested in Natural Language Processing pipelines. This course is intended for anyone who wishes to understand the basics of how NLP works, and in particular for anyone who wants to deepen their knowledge of scikit-learn with respect to NLP.
The audience for this course is anyone interested in Machine Learning pipelines, such as Data Scientists, Data Engineers, and Data Analysts. This course will help you understand the main NLP techniques in a real-life scenario, so that you can then reproduce what you learnt for similar tasks in your specific domain.
In this course, you will learn about two main NLP techniques, namely the Bag-of-Words representation and the Term Frequency-Inverse Document Frequency (TF-IDF) representation, which are used to transform an input text source into a consistent numerical representation. You will then apply those techniques to a real-world scenario, using a BBC News dataset, to classify those texts into a set of predefined labels.
The structure of the course goes like this: In Lecture 2 we will explore the Bag-of-Words representation, and understand when to use it. In Lecture 3 we will explore the TF-IDF representation of a corpus of texts, and understand the advantages of using it. In Lecture 4 we will build a machine learning model to perform a classification task. We will conclude in Lecture 5 with a simple recap of the main features covered in this course.
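To give a feel for where the course is heading, here is a minimal sketch of a text classification pipeline. The six training sentences and their labels are invented for this example, and LogisticRegression is only an assumed choice of classifier; the model built later in the course may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hand-made training set (hypothetical, not the BBC dataset).
texts = [
    "striker scored winning goal match",
    "team won championship match",
    "coach praised team after goal",
    "shares fell market profits",
    "company reported quarterly profits",
    "investors sold shares market",
]
labels = ["sport", "sport", "sport", "business", "business", "business"]

# Chain the text vectorizer and the classifier into a single estimator.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Raw text goes in; the pipeline vectorizes it before predicting.
prediction = model.predict(["striker scored goal match"])
print(prediction[0])
```

Wrapping both steps in a Pipeline means the same vocabulary and weights learned during fitting are applied automatically to any new text.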
So I hope you’re excited to get started. I’ll see you in the next lecture!
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.