One of the best ways to learn new programming languages and concepts is to build something. Learning the syntax is always just the first step. After learning the syntax the question that arises tends to be: what should I build? Finding a project to build can be challenging if you don’t already have some problems in mind to solve.
Throughout this course, we’re going to learn more about Python 3 by building a data ingestion process. We’re going to go from setting up a development VM through to deploying the app to a 16-core cloud VM where we can test it. The application is going to allow us to submit articles to a front-end HTTP endpoint where they’ll be enqueued onto a multiprocessing queue. On the back end, a set of worker processes will dequeue the articles, extract named entities from each article, and enqueue the results to be saved. A set of saver processes will dequeue the results and save the records to Cloud Firestore. Both the front end and back end will be run together using supervisord. And we’ll use setuptools to create a software distribution used to deploy the app.
This course is broken up into sprints to give you a real-world development experience, and guide you through each step of building an application with Python.
The source code for the course is available on GitHub.
If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.
Learning Objectives
- Configure a local development environment for an app using a VM
- Implement a data processor that can accept text, extract named entities, and return the results
- Implement a multi-process aware message queue and use Pytest
- Create data models to use as messages to pass on the message queue
- Create the backend for the application
- Create a web endpoint that we can use to enqueue our post models
- Implement a method for running the frontend and backend together
- Run the application using a dataset to see how the system performs under actual server load
Intended Audience
This course is intended for software developers or anyone who already has some experience in building simple apps in Python and wants to move on to something more complex.
Prerequisites
To get the most out of this course, you should be familiar with Python, ideally Python 3, and have some knowledge of Linux and how to use Git.
Hello, and welcome to Sprint two. Our goal for this sprint is to implement a data processor that can accept text, extract named entities, and return the results. And we'll also take the time to create a logger that we're going to use throughout the app.
Let's start by creating the logger. Notice here, I have a directory named ingest. And we have a dunder init file (__init__.py), which makes this into a package. This is where the core functionality for our application is going to reside. Let's add a new file called debugging.py. To give you a sense for how this is going to work: as we go along, I'm going to paste in the template for our code, which will serve as our starting point. Sometimes we'll write the code out right there. And other times I'm just going to paste in the completed code and we can review it.
So let's check out this template. We're importing the logging module and the get_logger function from multiprocessing. There's not much going on here. We have a logger function that is going to return a logging.Logger, which we'll create and expose as the app logger. Let's add in the code for this function. Using get_logger, we'll obtain our logger. We set our logging level, we create a stream handler, and we set our formatting. There are a lot of options available for the formatter. We're going to use the logging level, the time, and the process. This is actually the PID, though you can specify the process name instead. And finally, we'll include the message. Then we need to add the handler to our logger and return it. Okay, so we're going to use this throughout the app to give us some hints about what's happening with our code.
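The steps just described can be sketched roughly as follows. This is a minimal sketch, not the course's exact file: the default level, the precise format string, the name app_logger, and the guard against adding a second handler are all assumptions made for illustration.

```python
import logging
from multiprocessing import get_logger


def logger(level: int = logging.INFO) -> logging.Logger:
    """Return the multiprocessing logger with a stream handler attached."""
    log = get_logger()
    log.setLevel(level)

    # Assumption: skip handler setup if one is already attached, so calling
    # this function twice does not print every message twice.
    if not log.handlers:
        handler = logging.StreamHandler()
        # Level, time, PID, then the message. %(processName)s would print
        # the process name instead of the PID.
        handler.setFormatter(
            logging.Formatter("[%(levelname)s] %(asctime)s | %(process)d | %(message)s")
        )
        log.addHandler(handler)
    return log


# Assumed name for the shared logger that the rest of the app imports.
app_logger = logger()
app_logger.info("logger configured")
```

Because get_logger always returns the same logger object, every module and process that imports this gets messages in one consistent format, with the PID showing which process wrote each line.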
With this done, let's build out the data processor. Let's create a file called processor.py. And I'll paste in our template. The goal of this module is to allow us to extract named entities from text. When thinking about how to accomplish this with Python, I landed on a library named spaCy. There are a lot of options. We could use something such as NLTK, or we could take on the daunting challenge of building this from scratch. However, I found spaCy to be fast, easy to use, and very well documented. spaCy considers itself to be the Ruby on Rails of natural language processing. Now, to me that means it tries to be powerful, well reasoned, and relatively easy to use. It can extract entities out of the box with just a few lines of code, making it a good fit for this project.
Whenever I start developing Python code, my preference is to use functions whenever I don't need to store application state. Our data processor is going to hold on to our spaCy model, so we do have some state there. spaCy uses pre-trained models to process text, and there are different models depending on the language of the text that's being processed. Loading spaCy is going to result in the model being loaded into memory, so we want to reuse our instance. Let's add some logging to indicate that the model is being loaded. And we'll instantiate spaCy by calling spacy.load and passing in the name of the model that we want.
The names of the models are all available in the documentation. I like to have logging indicators both before and after code that might consume a lot of resources such as memory. Now that we have an instance of spaCy, we can use it to extract entities. I'll demo this in a moment. For now, just bear with me. We're going to create a list comprehension that will return the entity text for each entity in doc.ents. The result of this is going to be a list of strings, and we'll pass the list to a Counter, which we'll return.
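As a rough sketch of the class described so far: the class name, the model name en_core_web_sm, the logger name, and the optional nlp parameter are all assumptions. The transcript only says that the model is loaded once with spacy.load, with a log line before and after, and that the entity texts are passed to a Counter. The nlp parameter is added here so the sketch can be tried with a stand-in callable when spaCy is not installed.

```python
import logging
from collections import Counter
from typing import Callable, Optional

log = logging.getLogger("processor")  # stand-in for the app logger

MODEL_NAME = "en_core_web_sm"  # assumption: the transcript does not name the model


class DataProcessor:
    def __init__(self, nlp: Optional[Callable] = None) -> None:
        if nlp is None:
            # Imported lazily so the class can be defined without spaCy present.
            import spacy

            log.info("loading spaCy model %s", MODEL_NAME)
            nlp = spacy.load(MODEL_NAME)
            log.info("spaCy model loaded")
        # Keep the loaded model so that every call reuses the same instance.
        self.nlp = nlp

    def entities(self, text: str) -> Counter:
        """Count how many times each named entity appears in text."""
        doc = self.nlp(text)
        return Counter([ent.text for ent in doc.ents])
```

Holding the model on the instance is the one piece of state that justifies a class here. Everything else could have been a plain function.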
Let's install spaCy with pip so that we can see this in action. Okay, since pip is asking to be updated, let's just update it while we're here. Awesome. Next, we need to install the spaCy model that we specified in the load function. spaCy does not include the models with the source code, in order to keep the size down. However, we can install a model using the spaCy module's download command. We call it using python -m, specifying spacy as our module, and then the download command followed by the name of the model that we want to download.
Okay, with all of this done, let's check out spaCy in an interactive shell. We'll start by importing spacy. And next, we'll load the model. This is the model we just downloaded. Okay, the load function returns a callable. Notice that asserting that nlp is callable does not result in an error. If it wasn't callable, we'd get an error. Here's an example of some text that we're going to process: John has $1,000. Notice the result seems rather underwhelming. It simply echoes our text. Let's use dir and we can see the properties here. Notice we have this one called ents. That's short for entities. This is the property that stores the named entities that were extracted. So appending .ents returns a tuple with John and 1,000.
These entities don't just store text, they also have metadata. The label property of an entity describes the type of entity. If we run this, it'll show us the text and label. Oops, actually, text_ should just be text. Okay, let's fix that. Okay, now we can see that John is a person and the 1,000 relates to money. If I paste in a change to this text, we can see that spaCy correctly identifies Apple as an organization. So this is pretty cool, right? We can extract named entities with labels from text with just a few lines of code. However, not all entities are going to be useful in every context.
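The shell session above boils down to something like the following sketch. The helper name entity_labels and the model name en_core_web_sm are assumptions. The demo function needs spaCy and the model installed, so it is defined but not called here. Note that in spaCy the string label is label_ with a trailing underscore, while text has none, which is the slip corrected in the session.

```python
from typing import Callable, List, Tuple


def entity_labels(nlp: Callable, text: str) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for each entity found in text."""
    doc = nlp(text)  # calling the loaded model returns a processed document
    # ent.text is the matched text; ent.label_ is the label as a string,
    # for example "PERSON", "MONEY", or "ORG".
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo() -> None:
    """Run against a real model; requires spaCy and the model to be installed."""
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    assert callable(nlp)  # passes silently, as in the shell session
    print(entity_labels(nlp, "John has $1,000"))
```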
Entities such as money just really aren't valuable for our use case. So let's filter out some of the entities that we don't need. I'm going to paste in a list here of entities that we want to skip, and I'll also edit this list comprehension to skip them. Now let's take a look at the counter class, which is really just a fancy dictionary that knows how to count similar items. Notice this example here. We processed this text, which makes reference to Apple twice. The list also reflects those two values.
Let's wrap this call in a Counter and see what it will produce. We'll import the Counter class from collections and wrap the last call. Notice it counts two Apples, one John, and one reference to $1,000. So our entities method is going to return a Counter, which tracks the number of times each entity is mentioned in the processed text.
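To make the filtering and counting concrete without needing spaCy installed, here is a small sketch that uses stand-in entity objects. The skip list below is hypothetical, since the transcript does not show which labels the pasted list contains, and count_entities is an illustrative name. MONEY, CARDINAL, and DATE are real spaCy label names, chosen only as examples.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical skip list: labels judged not useful for this app.
SKIP_LABELS = {"MONEY", "CARDINAL", "DATE"}


def count_entities(ents, skip=frozenset()) -> Counter:
    """Count entity texts, ignoring any entity whose label is in skip."""
    return Counter([ent.text for ent in ents if ent.label_ not in skip])


# Stand-ins for the entities spaCy might return for a text that mentions
# Apple twice, John once, and one amount of money.
ents = [
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="John", label_="PERSON"),
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="1,000", label_="MONEY"),
]

unfiltered = count_entities(ents)
filtered = count_entities(ents, skip=SKIP_LABELS)

print(unfiltered["Apple"], unfiltered["1,000"])  # 2 1
print(filtered["Apple"], filtered["1,000"])      # 2 0
```

A Counter returns 0 for a key it has never seen, so callers can look up any entity name without first checking whether it was found.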
Next, let's implement the process method. This method accepts a string and returns a dictionary, which has an entities key that stores our Counter. I chose to return a dictionary here because I envisioned adding further insights from spaCy and wanted a return data structure that I could add to without breaking the existing code.
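The shape of that return value can be sketched as below. In the course this is a method on the data processor. Here it is written as a free function that takes the entity extractor as a parameter, purely so the sketch runs on its own, and the word-counting stand-in is hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict


def process(text: str, entities: Callable[[str], Counter]) -> Dict[str, Any]:
    """Wrap the entity counts in a dictionary under the 'entities' key."""
    return {"entities": entities(text)}


def word_counts(text: str) -> Counter:
    """Stand-in extractor: counts whitespace-separated words, not real entities."""
    return Counter(text.split())


result = process("Apple John Apple", word_counts)
print(result["entities"]["Apple"])  # 2
```

Callers read result["entities"], so a further key could be added to the dictionary later without changing any code that already consumes the result.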
In the end, I chose to just stick with the entities to simplify our overall code base, though I did leave this as a dictionary. We'll make further changes to this data processor later in the course. For now, this is really all we need. So we've created a logger using the get_logger function from the multiprocessing module. We've implemented our data processor using a library called spaCy. We downloaded the English model for spaCy and tested it out in a Python shell. This data processor is going to be the heart of our entire app.
Remember, the purpose of this app is to extract these named entities so that we can use them later. And with spaCy, we were able to do that with just a few lines of code. With that, we've wrapped up sprint two. In the next sprint, we're going to create a message queue that will allow different processes to communicate. So whenever you're ready, I will see you in the next sprint.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.