One of the best ways to learn new programming languages and concepts is to build something. Learning the syntax is always just the first step. After learning the syntax the question that arises tends to be: what should I build? Finding a project to build can be challenging if you don’t already have some problems in mind to solve.
Throughout this course, we’re going to learn more about Python 3 by building a data ingestion process. We’re going to go from setting up a development VM through to deploying the app to a 16-core cloud VM where we can test. The application is going to allow us to submit articles to a front-end HTTP endpoint where they’ll be enqueued onto a multiprocessing queue. On the back end, a set of worker processes will dequeue the articles, extract named entities from each article, and enqueue the results to be saved. A set of saver processes will dequeue the results and save the records to Cloud Firestore. Both the front end and back end will be run together using supervisord, and we’ll use setuptools to create a software distribution used to deploy the app.
This course is broken up into sprints to give you a real-world development experience, and guide you through each step of building an application with Python.
The source code for the course is available on GitHub.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Configure a local development environment for an app using a VM
- Implement a data processor that can accept text, extract named entities, and return the results
- Implement a multi-process aware message queue and use Pytest
- Create data models to use as messages to pass on the message queue
- Create the backend for the application
- Create a web endpoint that we can use to enqueue our post models
- Implement a method for running the frontend and backend together
- Run the application using a dataset to see how the system performs under actual server load
This course is intended for software developers or anyone who already has some experience of building simple apps in Python and wants to move on to something more complex.
To get the most out of this course, you should be familiar with Python, ideally Python 3, and have some knowledge of Linux and how to use Git.
Hello, and welcome to sprint four. In this sprint, we're going to create the data models that we'll use as messages to pass on our message queue. We'll create a shutdown watcher and we'll stub out the persistence functions. In the previous sprint, we created a message queue named QueueWrapper. It enables us to pass messages between processes. Messages can be anything that can be pickled. Our front end is going to allow messages to be submitted to a web endpoint.
So I wanted to be able to validate the data before enqueuing. For this part, we're using a library called Pydantic to implement the data models. Pydantic uses Python's type hints to allow for data validation. Recently for Python apps, I've been using Pydantic for data model definition because it's minimal but very powerful: it uses type hints, which are part of Python now, so we don't need to learn a module-specific way of doing things.
Let's create a file for our models called models.py and I'll paste in our template. Let's install Pydantic. Great. Pydantic's BaseModel allows us to define the properties of our model. For this app, the front end is going to put a post object onto the input queue. A worker is going to take a post from the queue, process the content using our data processor, and return a processed post. Post has two properties: a required content property, which is a string, and a required publication, also a string. These are required because we haven't provided a default value to use if none is given. The processed post will also have a publication. It'll have an entities property that will store a counter, and we'll set a default, which allows this property to be optional.
Recall that our data processor creates a counter with all of the extracted entities; that's what this is going to hold. We also have an integer property here to track the number of articles. Now, this may seem confusing at first: the post model represents a single article, while the processed post represents one or more articles for a given provider. Later on, this will allow us to merge the results into one object. If this doesn't make sense right now, don't worry about it. It will later. For now, this is all we need, so we'll loop back on this functionality in another sprint.
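The two models described above might look roughly like the sketch below. It is not the course's actual models.py: the name `article_count` and both default values are assumptions, and a plain `Dict[str, int]` stands in for the counter so the sketch stays simple.

```python
from typing import Dict

from pydantic import BaseModel, ValidationError


class Post(BaseModel):
    """A single article submitted to the front end."""
    content: str       # required: no default is given
    publication: str   # required: no default is given


class ProcessedPost(BaseModel):
    """Entity counts for one or more articles from one publication."""
    publication: str
    # Assumption: a plain dict of counts stands in for the counter here.
    entities: Dict[str, int] = {}
    # Hypothetical name for the "number of articles" property.
    article_count: int = 0


# A valid post passes validation.
post = Post(content="Some article text.", publication="example")
print(post.publication)

# Leaving out a required property is rejected before anything is enqueued.
try:
    Post(content="Some article text.")
except ValidationError as error:
    print(type(error).__name__)
```

Because `entities` and `article_count` have defaults, a processed post can be created from just a publication and filled in later.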
Okay, with our data models as complete as we need them for now, we can move on to the next component on the list, which is going to be our shutdown watcher. So let's talk about what that is and why we need it. The backend component that we're going to work on in the next sprint is going to be responsible for managing our worker processes. It's going to be the parent process and our workers are going to be its children. All of the real work is done by the children here. It's the parent's job to simply stay running until it receives an OS signal telling it to shut down. So we need a method to keep that parent process running, and that's what this shutdown watcher is going to do. I chose to make this a context manager because I find it helps to remove some of the ambiguity when reading it in code. I'll explain why later, after we implement it.
Let's start by creating a Boolean property, which will determine if the code should continue to loop, or if it should stop. Next, we can use the signal library to register the OS signals that we listen for. We're going to listen for SIGTERM and SIGINT, and if we receive either of those, we're going to call the exit function. Notice, we're not calling the exit method here, we're just passing in a reference to it, that the signal handler is going to call for us when it receives one of these signals. To make a class into a context manager, we need to implement the magic methods, dunder enter and dunder exit.
By the way, dunder just stands for double underscore. These allow us to perform some basic setup and tear down. Notice here in the comment, it shows us the basic usage for this class when it's complete. On this line here, the with keyword kicks off the dunder enter method. Because we're returning self, it also allows us to use the as keyword so that we can make a reference to the watcher itself. The dunder exit method is going to call our exit method. So nothing fancy here. And the exit method is going to be responsible for just setting the should continue value to false.
Now let's implement our core functionality here, which is the serve forever method. While should continue is true, we're going to sleep for 0.1 seconds. Okay, let's circle back and talk about context managers. This functionality could be implemented in many different ways. We could have a function called serve forever, which would block until it's signaled to stop. The end result would really be the same thing. I chose a context manager here because I can look at it and intuit that no code outside of its scope that follows it is going to run until it has completed. So for me, it's just a little easier to intuit.
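Putting those pieces together, a shutdown watcher along these lines could be sketched as follows. The attribute name `should_continue` and the method names are inferred from the narration, so treat them as assumptions instead of the course's exact code.

```python
import signal
import time


class ShutdownWatcher:
    """Block until the process receives SIGTERM or SIGINT.

    Usage:
        with ShutdownWatcher() as watcher:
            watcher.serve_forever()
    """

    def __init__(self):
        # Controls whether serve_forever keeps looping.
        self.should_continue = True
        # Pass a reference to self.exit; it is called later, on signal delivery.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self.exit)

    def __enter__(self):
        # Returning self is what makes "with ... as watcher" work.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.exit()

    def exit(self, *args):
        # Receives (signum, frame) when invoked as a signal handler.
        self.should_continue = False

    def serve_forever(self):
        while self.should_continue:
            time.sleep(0.1)
```

Note that `signal.signal` can only be called from the main thread, which fits the parent-process role described here.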
Alright, when we use this shutdown watcher, we can keep our application up and running until the operating system signals it to stop. Let's test this out with pytest. We'll create a file called shutdown_watcher_test.py, and I'm going to paste in our code. So we have a couple of things here. We have our fixture, which is going to create our shutdown watcher, and we have our test. It's just the one test that is going to run twice, once for each of these parametrized signals. Pytest is going to match the string value here, sig, to the argument name. The test starts by asserting that should continue is true for new instances of shutdown watcher. By design, no code after the shutdown watcher starts serving forever is going to run. That means we need to schedule sending our signal to our process ahead of time. The schedule library makes that pretty easy.
After 0.1 seconds, this lambda function is going to run, and it uses os.kill to send a signal to our process. So we enter the shutdown watcher's context and, just before calling serve forever, we first run the scheduler, which will wait 0.1 seconds before sending a signal. And then we call serve forever. As soon as the signal is received, the shutdown watcher stops blocking, and we can assert that should continue is now false. Let's run this test to help visualize. Previously we ran pytest and allowed it to find our tests for us; this time let's just specify this one file. Okay, a bit underwhelming: two tests ran, a hundred percent passed. Let's add a verbose flag and actually see what was run.
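A test shaped like the one just described could be sketched as below. This is a guess at the structure, not the course's file: it assumes the scheduling is done with the standard library's `sched` module, and it includes a compact stand-in for the watcher so the file runs on its own.

```python
import os
import sched
import signal
import time

import pytest


class ShutdownWatcher:
    """Compact stand-in for the class under test, so the sketch runs alone."""

    def __init__(self):
        self.should_continue = True
        signal.signal(signal.SIGTERM, self.exit)
        signal.signal(signal.SIGINT, self.exit)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.exit()

    def exit(self, *args):
        self.should_continue = False

    def serve_forever(self):
        while self.should_continue:
            time.sleep(0.1)


@pytest.fixture
def watcher():
    return ShutdownWatcher()


# The "sig" string is matched to the test's sig argument, one run per value.
@pytest.mark.parametrize("sig", [signal.SIGINT, signal.SIGTERM])
def test_shutdown_on_signal(watcher, sig):
    assert watcher.should_continue is True

    # Assumption: the standard library's sched module does the scheduling.
    scheduler = sched.scheduler(time.time, time.sleep)
    scheduler.enter(0.1, 1, lambda: os.kill(os.getpid(), sig))

    with watcher as w:
        scheduler.run()      # blocks about 0.1 seconds, then sends the signal
        w.serve_forever()    # returns once the handler has cleared the flag
        assert w.should_continue is False
```

Running `pytest -v` against a file like this lists the same test function once per signal.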
Notice that it's the same function, just with different signals. If we were to add another signal to the test parameters, we'd get to see this fail. Let's add another signal. This one seems fine enough, SIGILL. Okay, and let's run this. Notice this error references an illegal instruction. Scrolling up, we can see that the first two tests passed and then we get this crash. Our shutdown watcher doesn't listen for SIGILL, so when the process received it, it did whatever it was supposed to do by default. If we modify the shutdown watcher to include SIGILL and run this test again, notice how all of the tests pass, because now our app knows how to actually handle that type of event.
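The difference between a handled and an unhandled signal can be seen in isolation with a small experiment. This sketch is separate from the course code: it starts a child interpreter that sends SIGILL to itself, once with no handler and once with a do-nothing handler, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program: optionally install a handler, then send SIGILL to itself.
CHILD = """
import os, signal, sys
if sys.argv[1] == "handle":
    signal.signal(signal.SIGILL, lambda signum, frame: None)
os.kill(os.getpid(), signal.SIGILL)
"""


def run_child(mode):
    """Run the child in a separate interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", CHILD, mode]).returncode


unhandled = run_child("default")
handled = run_child("handle")

# On POSIX, a negative return code means the child was killed by that signal.
print(unhandled == -signal.SIGILL)
# With a handler registered, the child survives the signal and exits normally.
print(handled == 0)
```

The unhandled run is what the crashed test run looked like from the operating system's point of view: the default action for SIGILL terminates the process.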
Alright, let's remove these references to SIGILL and we'll run our tests again. Alright, so everything passed. Let's run it again from the ingest directory so that we get all of our tests. And we can see that we have six tests and that they're all passing. Awesome, so that is two tasks done for this sprint. Our final task is going to be to stub out our persistence functions so that we can use them in the next sprint. The file we're going to use is called persistence.py, and we're going to use Cloud Firestore for database persistence.
So let's install that module: we'll run pip install google-cloud-firestore. Okay, great. Let's run through these functions. Persistence no op is just a dummy function that we can call in our development environment when we really don't wanna write to the database; this just makes testing a bit easier. The get client method is going to return a new Firestore client. Because this code is going to run inside of Google Cloud in production, we don't need to actually make any configuration changes to the client. It will use the credentials of the service account, which are used by Compute Engine. We're actually gonna implement these in another sprint, but for now we have the stubs that we can use to call these functions to test that our integration works.
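As a rough idea of what two of these stubs could look like, here is a minimal sketch. The function names `persist_no_op` and `get_client` are guesses based on the narration, and the lazy import is a choice made for this sketch so the no-op path runs without the library installed.

```python
def persist_no_op(*args, **kwargs):
    """Accept any arguments and write nothing; handy in development."""
    return None


def get_client():
    """Return a new Firestore client.

    With no arguments, the client picks up the environment's default
    credentials, such as a Compute Engine service account.
    """
    # Imported lazily (an assumption of this sketch, not the course's layout).
    from google.cloud import firestore

    return firestore.Client()


# In development we can call the no-op wherever a real save would go.
print(persist_no_op({"publication": "example"}))
```

Swapping the no-op for a real persistence function later should not require changes to the calling code, as long as both accept the same arguments.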
With that done, we have some data models that we'll be able to use as our messages. We have a shutdown watcher that will block until it receives a SIGTERM or SIGINT. And we have our persistence functions stubbed out so that we can use them in our further testing. In the next sprint, we're gonna put all these components together that we've created so far to implement the backend. So whenever you're ready to do some further building, I will see you in the next sprint.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.