One of the best ways to learn new programming languages and concepts is to build something. Learning the syntax is always just the first step. Once you've learned the syntax, the question tends to become: what should I build? Finding a project to build can be challenging if you don't already have some problems in mind to solve.
Throughout this course, we're going to learn more about Python 3 by building a data ingestion process. We're going to go from setting up a development VM through to deploying the app to a 16-core cloud VM where we can test it. The application is going to allow us to submit articles to a front-end HTTP endpoint where they'll be enqueued onto a multiprocessing queue. On the back end, a set of worker processes will dequeue the articles, extract named entities from each article, and enqueue the results to be saved. A set of saver processes will dequeue the results and save the records to Cloud Firestore. Both the front and back ends will be run together using supervisord. And we'll use setuptools to create a software distribution used to deploy the app.
This course is broken up into sprints to give you a real-world development experience, and guide you through each step of building an application with Python.
The source code for the course is available on GitHub.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Configure a local development environment for an app using a VM
- Implement a data processor that can accept text, extract named entities, and return the results
- Implement a multi-process aware message queue and test it with pytest
- Create data models to use as messages to pass on the message queue
- Create the backend for the application
- Create a web endpoint that we can use to enqueue our post models
- Implement a method for running the frontend and backend together
- Run the application using a dataset to see how the system performs under actual server load
This course is intended for software developers, or anyone who already has some experience building simple apps in Python and wants to move on to something more complex.
To get the most out of this course, you should be familiar with Python, ideally Python 3, and have some knowledge of Linux and how to use Git.
Hello and welcome! My name is Ben Lambert and I'll be your instructor for this course. This course is intended to help you gain a better understanding of Python by building out an application that's a bit more complex than your standard hello world application.
When thinking about what topics to cover, and how to cover them, I did some searching online to see which topics developers were labeling as advanced. I distilled those results, included some of my own, and came up with a rough set of topics including multiprocessing, magic methods, unit testing, debugging, context managers, et cetera. I wanted to come up with an application that was complex enough to cover these topics. However, I didn't want it to become so contrived that I'm just showing each of these features in its own isolated context. I wanted this to be more like an actual development project.
The past few projects that I've worked on have been rather different from each other, however, they all had the same shape. By that, I mean they were all some form of data ingestion where data was ingested, processed, persisted, and consumed. I searched for public datasets that we could use to make something interesting here and I found a dataset from a Kaggle competition. It's an eight gigabyte CSV file containing articles from different publications over the last several years. It made me curious to know what information we could learn by analyzing the topics that were mentioned in each publication. I decided that I want to extract any named entities from these articles and count them for each publication. And I wanted to see with these counts if any patterns emerged.
Now, when I say named entities, what I mean in this context is just specific people, places, and things. Here's why I found this interesting. Extracting named entities from text is a non-trivial task. It's a CPU-bound problem. Meaning, the CPU is always going to be our bottleneck. And in Python, using multiple CPUs is not a straightforward thing. However, because it will allow us to process more text at the same time, in this case, it's worth doing. I started sketching out the basic process. I envisioned the holistic system as being capable of accepting articles, processing them to extract the named entities, and saving the results to a database. And then, that data could be consumed with a web interface.
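To make the idea of named entity extraction concrete before we get to any real libraries, here's a deliberately naive, illustrative stand-in: it just treats capitalized words that don't start a sentence as candidate entities. Real NER tooling is far more sophisticated than this; everything here is a toy sketch, not the code the course actually builds.

```python
def extract_named_entities(text):
    """Toy stand-in for real named entity recognition.

    Treats any title-cased word that isn't the first word as a
    candidate entity. Real NER uses trained models, not heuristics.
    """
    entities = []
    for i, word in enumerate(text.split()):
        cleaned = word.strip(".,!?;:")
        if i > 0 and cleaned.istitle():
            entities.append(cleaned)
    return entities


print(extract_named_entities(
    "Yesterday Ada Lovelace flew to Paris with Alan Turing."))
# → ['Ada', 'Lovelace', 'Paris', 'Alan', 'Turing']
```

This heuristic obviously misses lowercase entities and flags ordinary capitalized words, which is exactly why the real extraction step is CPU-heavy: proper NER runs a statistical model over every token.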
This course is going to focus on the ingestion and persistence. We'll cover the web consumer in another course. Once I had a rough sense for the system, I started doing some proof-of-concept work. I used different libraries to perform text extraction, and I used a library called Faker to generate some fake data to process so that I could start doing some basic performance benchmarking. Nothing too precise. Just rough numbers on how long entity extraction takes. The extraction process was pretty quick. I was processing fake articles that were around 40 to 50 words. So really, more of a tweet than an article. However, the extraction process usually took less than half a second.
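That kind of rough benchmarking can be sketched with nothing but the standard library. Here, a small random vocabulary stands in for Faker, and a placeholder function stands in for the real extraction step; all the names are hypothetical, and the point is the timing pattern, not the numbers.

```python
import random
import time

# Tiny vocabulary standing in for Faker-generated text.
WORDS = ["data", "python", "process", "London", "queue",
         "server", "Alice", "model", "article", "Berlin"]


def fake_article(num_words=45):
    """Build a 40-to-50-word 'article' from the toy vocabulary."""
    return " ".join(random.choices(WORDS, k=num_words))


def extract(text):
    """Placeholder for the real CPU-bound entity extraction call."""
    return [w for w in text.split() if w.istitle()]


def benchmark(runs=100):
    """Return the mean seconds per article for the extraction step."""
    articles = [fake_article() for _ in range(runs)]
    start = time.perf_counter()
    for article in articles:
        extract(article)
    return (time.perf_counter() - start) / runs


print(f"mean extraction time: {benchmark():.6f}s per article")
```

Using `time.perf_counter` and averaging over many runs smooths out per-call jitter, which is about as much precision as a proof of concept needs.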
The problem remained that this is still running on a single CPU, so that CPU is going to be our bottleneck. So I started testing with the multiprocessing module. Running the extraction in multiple processes allowed the code to utilize multiple cores, and that sped up the process by roughly the number of cores.
Now, multiprocessing doesn't give us all that power for free. Because we're running our code in multiple processes, sharing data is not that straightforward. We don't have easy access to shared memory. This required the addition of a message queue for passing data between processes.
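A minimal sketch of that pattern uses the standard library's `multiprocessing.Queue` to pass work between processes. The function names here are illustrative, not the app's actual code, and the extraction step is a placeholder.

```python
import multiprocessing as mp


def extract_entities(text):
    """Placeholder for the real CPU-bound extraction step."""
    return [w for w in text.split() if w.istitle()]


def worker(input_queue, output_queue):
    """Pull articles off the input queue until a sentinel arrives."""
    while True:
        article = input_queue.get()
        if article is None:  # sentinel: no more work
            break
        output_queue.put(extract_entities(article))


def run_demo(articles, num_workers=2):
    input_queue = mp.Queue()
    output_queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(input_queue, output_queue))
               for _ in range(num_workers)]
    for p in workers:
        p.start()
    for article in articles:
        input_queue.put(article)
    for _ in workers:
        input_queue.put(None)  # one sentinel per worker
    results = [output_queue.get() for _ in articles]
    for p in workers:
        p.join()
    return results


if __name__ == "__main__":
    print(run_demo(["Alice visited Paris"]))
```

Because each worker runs in its own process with its own memory, the queue is the only channel between them; everything put on it is pickled on one side and unpickled on the other, which is part of the overhead multiprocessing charges for its parallelism.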
After a week or so of testing, I landed on this. We have a web frontend protected by an API Key which allows us to submit an article. The frontend takes the article, it makes sure that the fields are not empty, and it adds it to a message queue, and we call this message queue the input queue. The backend creates a set of worker processes which are responsible for taking messages off of the queue and extracting the named entities. They'll cache the results until some number of articles has been processed, and then they'll transform the results and add them to a separate message queue.
We call this message queue the output queue. These worker processes are responsible for performing the CPU-heavy work. The backend also creates a set of saver processes which are responsible for taking messages off of the output queue and saving them to Cloud Firestore. There's not much CPU time required for saving the data, so by having it in its own process, we can separate out the CPU-heavy work of entity extraction from the IO-heavy work of persistence. During this course, we're going to build out this functionality.
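The "cache results until some number of articles has been processed" step is a batching pattern, and it can be sketched in a few lines. The class and method names below are hypothetical, chosen just to show the shape of the technique, not the course's actual implementation.

```python
from collections import Counter


class EntityBatcher:
    """Accumulate per-publication entity counts, flushing every N articles.

    `output` is any callable that receives the batched counts, standing
    in for "put a message on the output queue".
    """

    def __init__(self, flush_every, output):
        self.flush_every = flush_every
        self.output = output
        self.counts = Counter()
        self.processed = 0

    def add(self, publication, entities):
        for entity in entities:
            self.counts[(publication, entity)] += 1
        self.processed += 1
        if self.processed >= self.flush_every:
            self.flush()

    def flush(self):
        if self.counts:
            self.output(dict(self.counts))
        self.counts = Counter()
        self.processed = 0


batches = []
batcher = EntityBatcher(flush_every=2, output=batches.append)
batcher.add("TechDaily", ["Paris"])
batcher.add("TechDaily", ["Paris", "Alice"])
print(batches)
# → [{('TechDaily', 'Paris'): 2, ('TechDaily', 'Alice'): 1}]
```

Batching like this trades a little latency for far fewer messages on the output queue, which keeps the IO-heavy saver processes from being flooded with tiny writes.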
I'm going to walk you through the way that I build apps. I'm not going to use terms such as the right way or the best way. Because, honestly, everything is contextual. So as we go, I'll show you how I build things, as well as my reasoning behind it. When I'm learning something new, I find that simply learning that a concept exists is sometimes enough to fill in some knowledge gaps, and give me specific search terms for when I'm ready to learn more. So, I'm not going to explain every line of code. Hopefully, if you see something new that I don't actually cover in depth, it'll serve as a future search term.
The syllabus for this course is broken up into sprints because that mirrors the rough approach I follow when developing. Here's our breakdown. Sprint One will cover setting up a development VM so that it has Python 3.8. In Sprint Two, we'll be implementing the DataProcessor. In Sprint Three, we'll implement and test the Message Queue. Sprint Four will cover creating the Data Models, as well as creating a Shutdown manager. Sprint Five will cover creating the worker and saver processes in the backend. Sprint Six is going to cover creating the frontend API, and Sprint Seven will configure a Linux daemon to run the front and backend. Sprint Eight will cover deploying the application to a 16 core cloud virtual machine. In Sprint Nine, we're going to test the code and see how it performs under load, and we'll wrap all of this up with a Post Mortem where we'll talk about what worked, what didn't, and some of the lessons that were learned as I've built this out.
Here's what I hope that you'll get out of this course. By the end of the course, you'll be familiar with several useful standard library modules. You'll be familiar with several popular third-party libraries. You'll be familiar with some of the challenges revolving around multiprocessing. You'll be familiar with some Linux tools such as htop and strace, and you'll be able to add several new search terms to your educational road map.
Before starting, you'll want to be familiar with the programming language. Ideally, that will be Python 3. Knowing a bit about Linux will be helpful as well because we're going to spend a lot of time on the command line. Knowing how to use Git will be helpful because that's where the completed code for this course resides. And other than that, as long as you're willing to fill in any knowledge gaps as you go along, you should be ready.
All of the source code for this course is available to download. You can check the course description for the links. I had so much fun creating this course! I hope that it is even a fraction as much fun to watch as it was to create. So, if you're interested in getting our development virtual machine configured, then I will see you in the first sprint!
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.