Post Mortem
Difficulty: Advanced
Duration: 1h 57m
Students: 1828
Ratings: 4.2/5
Description

One of the best ways to learn new programming languages and concepts is to build something. Learning the syntax is only the first step; after that, the question tends to be: what should I build? Finding a project to build can be challenging if you don't already have some problems in mind to solve.

Throughout this course, we're going to learn more about Python 3 by building a data ingestion process. We're going to go from setting up a development VM through to deploying the app to a 16-core cloud VM where we can test it. The application is going to allow us to submit articles to a front-end HTTP endpoint where they'll be enqueued onto a multiprocessing queue. On the back end, a set of worker processes will dequeue the articles, extract named entities from each article, and enqueue the results to be saved. A set of saver processes will dequeue the results and save the records to Cloud Firestore. The front end and back end will be run together using supervisord, and we'll use setuptools to create a software distribution used to deploy the app.
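To make the moving parts concrete, here is a rough, simplified sketch of the worker/saver pattern described above. The function names and the stubbed extraction and save steps are placeholders for illustration, not the course's actual code.

```python
import multiprocessing as mp


def extract_entities(article):
    # Stand-in for real named-entity extraction (the course uses spaCy).
    return {"entities": article.split()[:3]}


def save(entities):
    # Stand-in for persistence (the course saves to Cloud Firestore).
    print("saved:", entities)


def worker(input_queue, output_queue):
    # Dequeue articles, extract entities, enqueue the results.
    for article in iter(input_queue.get, None):   # None is a stop signal
        output_queue.put(extract_entities(article))


def saver(output_queue):
    # Dequeue results and save them.
    for entities in iter(output_queue.get, None):
        save(entities)


if __name__ == "__main__":
    input_queue, output_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(input_queue, output_queue)) for _ in range(2)]
    savers = [mp.Process(target=saver, args=(output_queue,)) for _ in range(2)]
    for proc in workers + savers:
        proc.start()

    input_queue.put("An example article mentioning Python and Firestore.")

    for _ in workers:
        input_queue.put(None)      # stop the workers...
    for proc in workers:
        proc.join()
    for _ in savers:
        output_queue.put(None)     # ...then stop the savers
    for proc in savers:
        proc.join()
```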

This course is broken up into sprints to give you a real-world development experience, and guide you through each step of building an application with Python.

The source code for the course is available on GitHub.

If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.

Learning Objectives

  • Configure a local development environment for an app using a VM
  • Implement a data processor that can accept text, extract named entities, and return the results
  • Implement a multi-process aware message queue and use pytest
  • Create data models to use as messages to pass on the message queue
  • Create the backend for the application
  • Create a web endpoint that we can use to enqueue our post models
  • Implement a method for running the frontend and backend together
  • Run the application using a dataset to see how the system performs under actual server load

Intended Audience

This course is intended for software developers or anyone who already has some experience in building simple apps in Python and wants to move on to something more complex.

Prerequisites

To get the most out of this course, you should be familiar with Python, ideally Python 3, and have some knowledge of Linux and how to use Git.

Transcript

Hello, and welcome to our post mortem. I want to take this lesson to review what we've built. In no particular order, I want to talk about some of the lessons we've learned throughout, some bugs, and some next steps.

Throughout this course, we've built a basic data ingestion process that extracts named entities from text and saves the results to a Firestore database. Because the extraction process is a CPU-bound problem, we used Python's multiprocessing library to run the extraction in separate OS processes. However, because of the added level of complexity, we needed a way for our processes to share data. To solve this, we used a multiprocessing Queue as the base for our own queue wrapper, which allows us to drain the queue.

When designing the app, I wanted to break out the front end from the back end because it allows them to be built and managed independently. For example, if I wanted to get rid of the HTTP server and only ever read from a file, we could read from a file and enqueue that directly without having any of the FastAPI components. Before we could easily separate them, however, we needed to be able to very simply enqueue messages from a separate process. If we had to implement an HTTP endpoint in the back end just to enqueue that data, then we might as well use that as our front end. So we needed a simple way to access the queue from the front end. We solved this using a multiprocessing Manager to make the input queue accessible from other processes. Once the input queue was accessible over the network, it made it really easy to implement the front end however we chose.

I chose FastAPI because I found it very easy to use, it was well documented, and it was quite fast. I also really like its dependency injection. I tend to like writing code that accepts dependencies rather than instantiating them directly; I find it tends to make code easier to test. So that aspect of FastAPI resonated with me. I also liked that it uses Pydantic for data validation, because it means we don't need to learn a lot of framework-specific ways of doing things. Instead, we use Python 3's type hints. When I build out apps these days with Python 3, I try to use type hints whenever they remove some ambiguity. In the case of data validation, type hints are a great way to make it clear exactly what's expected. Now, I don't use them for everything, as you saw. However, for frameworks such as FastAPI that are built around them, I do find they're worth the effort.

As I mentioned during the course, I really was not happy with the way I ended up connecting to the queue from the front end. I wanted to avoid creating a connection for every HTTP request, and I just didn't find a clean way to manage those connections, so this area really requires some refactoring. To secure the front end we used an API key. For this demo, I hard-coded that value, which is fine in a demo. If this was a production app, I'd have passed it into the app; maybe it would be an environment variable if there was only ever going to be the one, otherwise I'd want to use something that allows for key rotation.

We used Gunicorn to kick off a Uvicorn worker that runs our app. Gunicorn is production-ready and well documented, which makes it a natural choice. To perform our entity extraction we used a library called spaCy. I found spaCy to be very easy to use, especially compared to other libraries.
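As a rough illustration of the multiprocessing Manager approach described above, here is a minimal sketch that exposes an input queue over the network using the standard library's BaseManager. The names, port, and auth key are placeholders rather than the course's actual code, and each function is meant to run in its own process.

```python
from multiprocessing.managers import BaseManager
import queue


class QueueManager(BaseManager):
    pass


def serve_input_queue(port=50000, authkey=b"change-me"):
    # Backend side: register a queue with the manager and serve it on a port
    # so other processes (such as the web frontend) can enqueue work remotely.
    input_queue = queue.Queue()
    QueueManager.register("get_input_queue", callable=lambda: input_queue)
    manager = QueueManager(address=("", port), authkey=authkey)
    server = manager.get_server()
    server.serve_forever()  # blocks for the lifetime of the backend


def connect_to_input_queue(host="localhost", port=50000, authkey=b"change-me"):
    # Frontend side: connect to the running manager and return a proxy
    # that behaves like a normal queue (put/get).
    QueueManager.register("get_input_queue")
    manager = QueueManager(address=(host, port), authkey=authkey)
    manager.connect()
    return manager.get_input_queue()
```

With something like this in place, the frontend can call connect_to_input_queue().put(post) without knowing anything about the backend's worker processes.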
spaCy uses pre-trained machine learning models to process the text, which means those models need to be downloaded separately. We used Pydantic to create our data models, and we used a Counter to store the extracted entities. Counters are part of the standard library and, as you saw, they're basically just fancy dictionaries that know how to count. One of the nice features is that we can merge them together with the plus-equals operator.

For our database persistence we used Cloud Firestore. Each publication document has an article count as well as a collection to store the entities that were extracted. The entities are in their own documents, each containing a word and a count property. Most of the development that I did locally used the no-persist flag. However, early on, I did download the service account key outside of my source code directory and set an environment variable to point Firestore to it, and I found that useful when initially setting up the cloud environment.

The back end of this system is where most of our work is done in this app. It's also one of the components that haven't been fully tested. We used pytest to test the queue and the shutdown watcher; however, we don't have full coverage here. In fact, as we found in the last sprint, we do have a bug somewhere that resulted in orphaned processes. This circles back to the introduction, where I said that when you solve a problem with multiprocessing, you now have two problems: multithreading and multiprocessing introduce additional complexity. That doesn't mean we shouldn't use them, though it does mean we need to make better use of unit tests so that we can make sure everything behaves the way we expect.

The orphaned processes are likely caused, in this case, by our Supervisor configuration, which gives a process 60 seconds to close gracefully before it attempts to kill the parent frontend and backend processes. When I shut down Supervisor, it triggered the saver processes, which ran for longer than 60 seconds. I'm guessing that at that point it killed the parent process and orphaned the children. Again, this is why testing is so important: once we start integrating different components, we start to see unexpected interactions. In our case, we were able to process the data and even save it to the database. And if I had shut down Supervisord after those records had been persisted to the database, we never would have even seen the bug. It's really easy to get into a specific groove where we don't trigger errors in testing because we're running with very specific parameters.

Another example of that is the worker process. I found that after we finished building it up, I had forgotten to reset the cache count after flushing. The end result would be that we'd flush when the cache hit the cache size for the first time; however, we wouldn't do it again until we shut down, and this would have just kept building the cache up until we eventually ran out of memory. Again, this is why testing is so important for production-level apps.

I used argparse in the backend to accept the command-line arguments for configuring the backend; I wanted to show how to get command-line arguments using the standard library. When it came to the uploader, I chose to use Typer because I find it easier to use, but I did want you to see the difference between the two, and the difference in the amount of code. We used Python's setuptools to create the distribution for our app, and that included entry points for our backend, the uploader, and the downloader.
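As a rough illustration of the setuptools packaging described above, here is a minimal setup.py sketch with console-script entry points. The package, module, and command names are hypothetical placeholders, not the course's actual distribution.

```python
from setuptools import setup, find_packages

setup(
    name="ingest",                      # hypothetical distribution name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # e.g. "fastapi", "pydantic", "spacy", "google-cloud-firestore"
    ],
    entry_points={
        "console_scripts": [
            # command = package.module:function  (all names are placeholders)
            "ingest-backend = ingest.backend:main",
            "ingest-upload = ingest.uploader:main",
            "ingest-download = ingest.downloader:main",
        ],
    },
)
```

After installing the distribution (for example with pip install .), each entry point becomes a command on the PATH that calls the named function.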
We used Supervisor to daemonize our frontend and backend processes, which allowed us to start and stop them as a single entity. Now, no code is perfect, and the more code that exists, the more likely we are to have bugs. The worker and saver processes are in need of testing to make sure that they function correctly. Having orphaned processes could make it difficult to restart the daemon in the future. Imagine if our backend process binds our input queue to port 50000, and then that process is orphaned while we're shutting things down. When we attempt to start the application again, it's going to raise an exception because that port is already in use.

When I develop, I like to develop in sprints, and I like to continually evaluate what's working and what's not. It's at those milestones that I like to re-evaluate everything holistically. The last sprint was a milestone: we got our minimum viable product into a production-like environment for testing. Testing at scale helps find bugs that are really tough to reproduce in development, and this is why I like to get my code base into a production-like environment as early as possible.

Now, what's next? If this was a project for a customer, at this stage I'd distill all of this information and create a plan for the next set of sprints. I'd move testing to the top of the list, starting with tests to ensure we don't have any orphaned processes. I'd spend some time working on that connector in the front end to make it more resilient. I'd also allocate a sprint for code review and cleanup: I'd go through all of the code and try to remove anything that's not required, then go back through again and attempt to rewrite anything that I find difficult to understand. The goal there is to make the code clearer for the developers who will be maintaining it.

Now, with all of that data having flowed through the system, we really haven't spent any time reviewing the data that we extracted. The reason is that we're going to do that in another course, where we build a web frontend for this data set.

Okay, so what's next for you? If you've enjoyed working on this so far and you want to keep going, I recommend adding a test suite. I find that writing tests for my code teaches me more about the code than anything else, so if you're looking to keep going, that's a great exercise.

All right, we've built a lot together throughout this app. However, that is going to wrap up this lesson, and with it the course. Hopefully developing this from start to finish with me has been valuable to you. I've had a lot of fun making it, and I hope you've enjoyed it. Thank you so very much for watching, and I will see you in another course.
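As a concrete starting point for the test suite recommended above, here is a minimal pytest-style sketch that checks a worker-like process actually exits when asked to stop, so it can't be left orphaned. The worker here is a stand-in, not the course's real worker or saver processes.

```python
import multiprocessing as mp
import time


def fake_worker(stop_event):
    # Stand-in for a worker/saver loop: run until asked to stop.
    while not stop_event.is_set():
        time.sleep(0.05)


def test_worker_shuts_down_cleanly():
    stop_event = mp.Event()
    proc = mp.Process(target=fake_worker, args=(stop_event,))
    proc.start()

    stop_event.set()
    proc.join(timeout=5)

    # If the process is still alive here, it would be left orphaned.
    assert not proc.is_alive()
```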

Lectures

Course Introduction - Sprint 1 - Sprint 2 - Sprint 3 - Sprint 4 - Sprint 5 - Part One - Sprint 5 - Part Two - Sprint 6 - Sprint 7 - Sprint 8 - Sprint 9

About the Author
Students: 100719
Labs: 37
Courses: 44
Learning Paths: 58

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.