One of the best ways to learn new programming languages and concepts is to build something. Learning the syntax is always just the first step. After learning the syntax the question that arises tends to be: what should I build? Finding a project to build can be challenging if you don’t already have some problems in mind to solve.
Throughout this course, we’re going to learn more about Python 3 by building a data ingestion process. We’re going to go from setting up a development VM through to deploying the app to a 16-core cloud VM where we can test it. The application is going to allow us to submit articles to a front-end HTTP endpoint where they’ll be enqueued onto a multiprocessing queue. On the back-end, a set of worker processes will dequeue the articles, extract named entities from each article, and enqueue the results to be saved. A set of saver processes will dequeue the results and save the records to Cloud Firestore. Both the front-end and back-end will be run together using supervisord. And we’ll use setuptools to create a software distribution used to deploy the app.
This course is broken up into sprints to give you a real-world development experience, and guide you through each step of building an application with Python.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Configure a local development environment for an app using a VM
- Implement a data processor that can accept text, extract named entities, and return the results
- Implement a multi-process aware message queue and use Pytest
- Create data models to use as messages to pass on the message queue
- Create the backend for the application
- Create a web endpoint that we can use to enqueue our post models
- Implement a method for running the frontend and backend together
- Run the application using a dataset to see how the system performs under actual server load
This course is intended for software developers or anyone who already has some experience of building simple apps in Python and wants to move on to something more complex.
To get the most out of this course, you should be familiar with Python, ideally Python 3, and have some knowledge of Linux and how to use Git.
Hello and welcome to Sprint 9. In this sprint, we're going to see how our system performs holistically under actual server load. In order to do that, we need a method to download our dataset as well as to upload its records to the front-end. Let's create a new package called simulator, and it's gonna have a dunder-init file. Let's create two more files, one is gonna be called download.py and the other is gonna be called upload.py. Our downloader is going to use the standard library to download and extract a zip file, which contains an eight-gigabyte CSV file. For our downloader here, we don't have a whole lot of code. We download the file, we extract it to a temp directory. And that's it. Once we have our dataset, we can use the uploader to iterate over the CSV file and send that data to our front-end. For our uploader, I'm using a couple of libraries here that I find are worth knowing. The first is called Typer. Typer allows us to easily create command line apps, and we'll see that in action shortly. The other library is called httpx. This is similar to the popular requests library, only it also provides an async client. Reviewing the code, at the top, I'm adjusting the csv.field_size_limit. Some of the articles in our dataset are going to exceed the default size, so this works around that. get_data opens the CSV file and yields a dictionary with the content and publication for each row. upload_to_uri uses an async client to submit the data. I've disabled SSL verification and enabled HTTP/2 to make things a bit faster. Also, I've set our API key here. Calling get_data returns a generator. And by calling next, we can get rid of the CSV file's header row. Then we just loop for some number of records and post the articles. runner is responsible for kicking this off as an async process, as well as setting some defaults. And our main function uses Typer as an entry point. If we deployed this now, it wouldn't work. We still have to add our dependencies to setup.py.
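The CSV handling described above can be sketched with the standard library alone. This is only an illustration, not the course's actual upload.py: the column positions and the helper name `first_records` are assumptions, and the HTTP posting step is left out.

```python
import csv
import itertools
from typing import Dict, Iterator, List

# Raise the per-field limit (131072 characters by default) so that very long
# articles do not raise csv.Error. 2**31 - 1 fits in a C long on every platform.
csv.field_size_limit(2**31 - 1)

# Hypothetical column positions; the real dataset's layout may differ.
PUBLICATION_COLUMN = 0
CONTENT_COLUMN = 1


def get_data(path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one dict per CSV row, including the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            yield {
                "publication": row[PUBLICATION_COLUMN],
                "content": row[CONTENT_COLUMN],
            }


def first_records(path: str, count: int) -> List[Dict[str, str]]:
    """Skip the header row, then take at most `count` records."""
    rows = get_data(path)
    next(rows, None)  # the first item is the header row, so discard it
    return list(itertools.islice(rows, count))
```

Because get_data is a generator, only the rows that are actually consumed get read from disk, which matters for a file of this size.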
So I'm gonna paste these in here and I'll clean up the formatting. Okay, let's add the entry points as well. Okay, this is going to allow us to easily run our downloader and our uploader. The entry point for our downloader will be called getdataset. And it's gonna map to the download_and_extract function from the simulator package and the download module. The entry point for uploaddataset maps to simulator.upload and it calls main. Okay, let's build and deploy these changes. I'm lazy about making files executable, so I have a tendency to call bash and just pass in the script. And I'm only mentioning it in case you were wondering why I've not made them executable. Okay, let's deploy this from my host OS, and now let's switch to the cloud VM. So we now have an entry point, it's called getdataset and it will download our data. This is gonna take a while to download a roughly three-gigabyte file, and it's gonna take even longer to extract it. Okay, with this done, we have a CSV file, it's eight gigs, and it contains our dataset. Let's use supervisord to start up our application. And we'll specify the configuration file, which is in the directory that the app is deployed to. And using supervisorctl, we can check the status. And it looks like they're both running, they have the same uptime. So neither of them appears to be crashing. Looking at the logs, we can see that the workers have loaded spaCy. Let's check out the supervisor configuration file. Okay, we have 14 worker processes. We have 15 saver processes and a cache size of 100. Just keep these settings in mind as we send some data through. So supervisor is running our front-end and back-end applications. We've used getdataset to download our dataset and extract it. Next, let's use the uploaddataset entry point to submit some articles from our CSV file. Check this out. If we call uploaddataset and pass in the help flag, notice we get this great usage info.
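The two entry points described above might look roughly like this in setup.py. The command names and targets come from the narration; the helper `split_entry_point` is hypothetical and exists only to show how each "name = module:function" string is read.

```python
# Sketch of the entry_points argument for setuptools.setup() in setup.py.
ENTRY_POINTS = {
    "console_scripts": [
        # command name = package.module:function
        "getdataset = simulator.download:download_and_extract",
        "uploaddataset = simulator.upload:main",
    ]
}


def split_entry_point(spec: str) -> tuple:
    """Split 'name = module:function' into its three parts (illustration only)."""
    name, _, target = spec.partition("=")
    module, _, function = target.strip().partition(":")
    return name.strip(), module, function


for spec in ENTRY_POINTS["console_scripts"]:
    command, module, function = split_entry_point(spec)
    print(f"{command} -> {module}.{function}()")
```

In a real setup.py the dictionary would be passed along as `setup(..., entry_points=ENTRY_POINTS)`, and installing the distribution would then create the getdataset and uploaddataset commands.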
We can set a URI, or we can use the default, and we can set the record count or use its default. So where did all of this come from? This functionality comes from the Typer library. It uses type hints and keyword argument defaults to automatically generate this usage info. Now there's also a lot more functionality that it provides that we're not making use of; however, I find this to be a really nice way to get command line interfaces up and running. Let's run this, and we're going to submit 1000 results. And if we just keep checking our logs here, there, you can see that we have some records. So this means the system is working. Let's submit another 1000 results, only this time, we're gonna run it in the background. Using htop, we can get a better sense of what's happening. Let's use tree view, and we're gonna filter for Python 3.8. Okay, we have 16 CPUs, and most of them are at 100% utilization. We have 14 worker processes, which are all performing CPU bound work. So 14 of our 16 CPUs are basically being fully used by our worker processes. As soon as the workers have finished processing those 1000 records, the utilization is just gonna drop off, then the savers are going to kick in and they're gonna start writing to the database. The savers are going to use some CPU cycles, but really not much. Saving to a database is an IO problem, so most of the savers' time is spent waiting while data is sent. Here it is. Notice these CPUs just dropped off to mostly single digit utilization. Now these savers are gonna keep running until they've saved everything. And then they're gonna go back to waiting for more data from the queue. If we check Firestore in the browser, we can see that we do have data. If we drill into one of these, we can see data is actually flowing through. We can use the UI to sort this by count. And currently at the top of the list is the name of the publication with the most records. Makes sense. US is mentioned 40 times, and so on.
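The idea behind the usage info shown earlier in this passage, building command line options from type hints and keyword defaults, can be illustrated with the standard library. This is not Typer's implementation or API; it swaps in argparse and inspect to show the principle, and the `main` function with its default URI and count is made up for the example.

```python
import argparse
import inspect
from typing import Callable


def main(uri: str = "https://localhost/post", count: int = 1000) -> None:
    """Hypothetical uploader entry point; the defaults are made up."""
    print(f"would upload {count} records to {uri}")


def build_parser(func: Callable) -> argparse.ArgumentParser:
    """Derive one --option per keyword argument from hints and defaults."""
    parser = argparse.ArgumentParser(prog=func.__name__, description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        parser.add_argument(
            f"--{name}",
            type=param.annotation,  # the type hint converts the raw string
            default=param.default,  # the keyword default becomes the CLI default
            help=f"{param.annotation.__name__}, default: {param.default}",
        )
    return parser


parser = build_parser(main)
args = parser.parse_args(["--count", "5"])
main(**vars(args))
```

Calling `parser.format_help()` returns usage text that lists both options with their types and defaults, which is the same kind of information the help flag surfaces for the uploader.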
Let's delete this data and then start again with a clean slate. Also, let's shut down the daemon and then check the logs for any errors. Okay, it seems that we don't have any errors, though we do have some log statements in places where we really don't need them. We don't need to log every message, so let's clean that up. We'll just remove these references and save the file. Okay, currently we're running with a cache size of 100. Let's remove this setting in our config file and allow the back-end to use its default value of 25,000. Okay, so now we need to deploy this. We're gonna build it. And while that's building, let's see if Firestore is done deleting our data. Okay, it is. That's awesome. And now I'm gonna switch to my host OS so I can deploy this. And it's just gonna take a moment. And here it is. Now, switching back to the cloud VM. And I just wanna double check the logs, I wanna make sure I didn't miss anything. Okay, they look fine. If we run htop, we can filter by Python 3.8, and we can see there's nothing running. That's what we expected. I just like to double check. Now, let's start it back up using supervisord. And let's check its status. Okay, so it's been running long enough that it should be stable. Now let's submit 100,000 records and then see if we can get a peek into what our processes are doing. Notice most of the CPUs are fully used. We have supervisord running as the parent process, and all of our ingestion processes have a status of S. This indicates that they're sleeping. That clues us in that these are our saver processes. Notice they have a CPU time of zero since all they're doing is sleeping. Scrolling down, notice the R here, which indicates that these are running processes. They all have roughly the same CPU time as well. These are our processes that are using most of our CPU.
These two processes here, gunicorn and our uploader, are the reason why we're only running 14 worker processes rather than 15, because I wanted to make sure we had some spare CPU cycles to actually connect in and interact with the server. So, because we know how this app functions, it allows us to piece together little clues like this that allow us to better understand what it's doing. However, just these glimpses alone don't really tell us what's happening. We can use a tool called strace to get a better sense of what a process is doing. And I just wanna make sure strace is installed. And it seems it already was, great! Let's run htop with sudo so that we can inspect these processes. We find one of our worker processes, and it doesn't matter which. And with it highlighted, we press S and that's gonna be for strace. Notice, we get this glimpse of what's happening here. It's a bit low level, but even without knowing exactly what's happening, we can infer quite a bit. We can see that some read function is being called and that it's referencing ingest.models, and this bit here seems like a byte size. So what I've gathered is these are our posts that our worker is reading from our input queue. So here we can see our workers are grabbing data from the queue and processing it. Let's look at a different process. Let's trace gunicorn this time so that we can see what it's doing. So we can see that it's waiting for requests. And we even see bits of the raw HTTP request that's being written. Okay, so strace is probably not gonna be part of your day-to-day tool set; however, it is a really nice tool to know when you find yourself trying to figure out what a process that seems like it's sitting idle is doing. It's gonna take a little while to process all of these records, so I'll check in with you periodically. Here we are 10 minutes later, still going. It's been almost another 10 minutes, and we're still going, though from my previous experiences, it should be done rather soon.
And in fact, there it is. Okay, perfect timing. The CPU usage dropped so our workers are done. So now you might expect that the saver processes are going to kick in. Notice, though, we're not seeing any activity at all. The savers just aren't running. Here's why. We're using our default cache size, which is set to 25,000 records. We just submitted 100,000 records, but our cache is not being flushed. The reason is, the cache is per worker. We processed 100,000 records, but we divided that over 14 processes. So it seems none of them have reached 25,000 records to kick off the flush. If you recall, the app is designed to flush the cache whenever it's shut down. So let's shut down supervisor and let's check the logs. Okay, notice the mentions here? It is flushing the cache. Using htop, we can see that the saver processes are now running. And if we jump into the browser and hit reload, we do have some data. We've saved over 10,000 articles for Vice, and drilling into the results, it seems that the UI maybe is having a hard time keeping up. Honestly, that's okay. We're gonna explore the data in another course. For now, the fact that the data is flowing successfully through the system is what matters. Let's go back and see if our saver processes are still running. It seems that they are. And again, good timing. They've just wrapped up. Now, here's where things get fun. We told supervisor to shut down our app. It just finished saving the data to the database. So why are these processes still alive? Look, they're just sleeping. They're not even doing anything. What we've done here is we've orphaned these processes. Looking at the logs, we can see that the cache was flushed and the worker processes shut down. We know they're worker processes because they share the same process IDs as the processes mentioned here, and we know this log message only comes from our worker processes. Now, before we do anything else, we need to clean up these orphaned processes.
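The reason the savers stayed idle can be reproduced in a few lines. This is a sketch under assumptions, not the app's actual cache: the class `WorkerCache` and its methods are hypothetical, and the records are spread perfectly evenly across workers, which a real queue would only approximate.

```python
from typing import Callable, List


class WorkerCache:
    """Hypothetical per-worker buffer that flushes once it holds `size` items."""

    def __init__(self, size: int, on_flush: Callable[[List[str]], None]) -> None:
        self.size = size
        self.on_flush = on_flush
        self.items: List[str] = []

    def add(self, item: str) -> None:
        self.items.append(item)
        if len(self.items) >= self.size:
            self.close()

    def close(self) -> None:
        """Flush whatever is buffered; also meant to be called on shutdown."""
        if self.items:
            self.on_flush(self.items)
            self.items = []


flushed_batches: List[int] = []
workers = [
    WorkerCache(25_000, lambda batch: flushed_batches.append(len(batch)))
    for _ in range(14)
]

# Spread 100,000 records evenly across the 14 workers (an idealization).
for i in range(100_000):
    workers[i % 14].add(f"record-{i}")

print(len(flushed_batches))                # 0: no worker reached 25,000
print(max(len(w.items) for w in workers))  # 7143 records at most per worker

for worker in workers:
    worker.close()                         # shutting down triggers the flush
print(sum(flushed_batches))                # 100000
```

With roughly 7,143 records per worker, no individual cache gets anywhere near the 25,000 threshold, so nothing reaches the savers until shutdown forces the flush.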
We can use pgrep to confirm the PIDs of these orphans. Yep, there they are. We can see that there are some orphans running. And we can use pkill to terminate them. I believe pkill sends a SIGTERM by default, though you could also change the signal. Okay, let's just make sure that they're gone. And we're back to our baseline of 32 processes, so everything's shut down. All right, our goal for this sprint was to test this out in a production-like environment. We were able to use our downloader to get the dataset, we used the uploader to submit 100,000 records, and we watched the data end up in Firestore. So we've had some successes for sure; however, since we have these orphaned processes, we're pretty sure that we have a bug as well. In our next lesson, we're gonna have our post-mortem. And in that, we're gonna talk about why these processes were orphaned, we'll talk about some of the lessons that we've learned throughout the course thus far, and we'll talk about some next steps. So whenever you're ready to wrap up, I will see you in the final lesson.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.