Exploring the Data Access Layer

This course is the second in a two-part series on how to build an application in Python. In the first course, we built a data ingestion process that extracted named entities from articles across a few different publications. We extracted named entities from around 100,000 articles and saved the results into Cloud Firestore. In this second course, we'll explore the codebase for a web application used to visualize those results.

We'll kick off the course by checking out some quality of life changes implemented while developing this app. That includes a custom bash theme, a replacement debugger, a debugger command for starting an IPython shell, and pytest plugins. After that, we're going to review the data access layer and its accompanying tests. That's going to include multiple implementations of each data access service. Then we'll check out Python's web application standard.

Next, we'll review the web application layer and its accompanying tests. That's going to include a fast web application framework, custom middleware, request hooks, and application configuration. After that, we're going to review the presentation layer, including a Vue.js app and materialize CSS. Finally, we're going to run the app locally and trace some requests through the application using the debugger.

If you have any feedback relating to this course, feel free to contact us.

Learning Objectives

  • Implement a few developer quality of life changes 
  • Implement a testable data access layer 
  • Understand how a Python web app operates 
  • Understand how to build and test a more complex web app
  • Understand how to use ipdb and IPython
  • Enhance your knowledge of the Python programming language

Intended Audience

This course is intended for software developers or anyone who wants to learn more about building apps with Python.


  • Before taking this course, please make sure you have taken the first course in this two-part series: Building a Python Application: Course One
  • You should also have an understanding of Python 3, Linux CLI, HTML/JS, and Git


The source code for the course is available on GitHub.


Hello and welcome! In this lesson, we're going to be exploring the data access layer for our application. This application has two data storage requirements. We need read access to the dataset in Firestore, and we need write access to blob storage, which we'll use for storing word cloud images.

The blob storage that I'm going to use for this application is Cloud Storage. We're going to use the Google Cloud Python SDK to access both Firestore and Google Cloud Storage. Both exist on the Python Package Index.

Let's talk about our development and testing strategy for this application. The backend of this application is broken out into two parts. We have our data access layer, which is responsible for getting data from Cloud Firestore, and it's also responsible for saving images to blob storage. The web application layer, which is responsible for accepting HTTP requests, and returning data by using the data access layer.

The web application layer has a dependency on the data access layer. And these dependencies impact the way that we test. Unit tests are meant to validate the assumptions we make about the discrete building blocks that make up our application. By that, I mean testing functions, methods, and in general, the mechanics of the classes that we implement, in order to prove that they behave the way we assume they do.

Unit tests are intended to be run during development to give us a level of assurance in the codebase. That means they should run quickly enough that they don't become a bottleneck. Now, typically to do that, we use fake dependencies. For example, rather than using an actual database implementation, we might use a mock version.

The data access layer for this application follows this approach. It includes fake implementations of classes that rely on external services. By passing dependencies as arguments, we can easily change the underlying implementation. In the data access layer, the DataStorage class requires a database client to be passed in.

In the web server layer, the web resources require data storage objects to be passed in. Check out this class here. It's called NoOpDataStorage and it represents what we need the DataStorage class to be. Specifically, it implements a publications method, which in this case, simply yields 10 Publication objects. The data models for this app are extremely basic. We're just using named tuples.

So, publications yields Publication tuples. The word_counts method yields WordCount tuples. And the frequencies method returns a dictionary. Having this NoOp version adds some more code to our code base. However, I think that in this case, the code is worth the effort. This class demonstrates the minimum required functionality for the DataStorage class.
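To make the shape of this concrete, here's a minimal sketch of what a NoOp data storage class like the one described above might look like. The class and method names come from the transcript; the field names on the tuples and the exact URL format are assumptions for illustration.

```python
from typing import Dict, Generator, NamedTuple, Optional, Tuple


# The data models are just named tuples, roughly like these.
class Publication(NamedTuple):
    name: str
    count: int
    image_url: str


class WordCount(NamedTuple):
    word: str
    count: int


class NoOpDataStorage:
    """Fake data storage: hard-coded data, no external services."""

    def publications(self, bucket_name: str = "") -> Generator[Publication, None, None]:
        # Yield ten fake publications, using the loop index as the count.
        for i in range(10):
            yield Publication(
                name=f"publication-{i}",
                count=i,
                image_url=f"/images/publication-{i}.png",
            )

    def word_counts(
        self,
        publication: str,
        limit: int = 10,
        checkpoint: Optional[Tuple[str, int]] = None,
    ) -> Generator[WordCount, None, None]:
        # Yield fake word counts, ignoring the checkpoint in the NoOp version.
        for i in range(limit):
            yield WordCount(word=f"word-{i}", count=i)

    def frequencies(self, publication: str, limit: int = 10) -> Dict[str, int]:
        # Reshape word counts into the dict format the word cloud library accepts.
        return {wc.word: wc.count for wc in self.word_counts(publication, limit)}
```

Because the web layer only ever calls these three methods, any class with the same methods, arguments, and return types can be swapped in.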

If this code works as we expect, we can use it as a blueprint for creating an implementation that uses Firestore, or really, any other data source we might want. Besides serving as a blueprint, the NoOp version also can be used for local development and testing, by passing it as a dependency to the web layer.

Looking at the Firestore implementation, notice we have the same methods. These are basically an implicit interface. As long as these methods exist and they follow the same arguments and return the same types, we can change the underlying implementation. The constructor for this version accepts a Firestore client.

The publications method returns all of the publication documents in the publications collection. It uses the document ID as the name, it gets the count from the document's count property, and it generates a URL path and image name based on the document ID. We'll check this function out in a little bit.

The word_counts method accepts a publication, a record limit, and a checkpoint, and it returns a generator. The checkpoint argument is a tuple containing a word and count. Firestore allows us to specify properties that will serve as a checkpoint, which lets us use this as a paging mechanism.
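The checkpoint idea is easy to miss in passing, so here's a hypothetical, in-memory illustration of the paging mechanism. In the real Firestore version the database does this server-side (via an ordered query that starts after the checkpoint document); this sketch just emulates that behavior over a plain list.

```python
from typing import Iterable, Iterator, Optional, Tuple


def paged_word_counts(
    records: Iterable[Tuple[str, int]],
    limit: int = 3,
    checkpoint: Optional[Tuple[str, int]] = None,
) -> Iterator[Tuple[str, int]]:
    """Yield up to `limit` records that come after the checkpoint.

    A checkpoint of None means "start at the beginning"; otherwise we skip
    everything up to and including the (word, count) checkpoint record.
    """
    found = checkpoint is None
    emitted = 0
    for word, count in records:
        if not found:
            if (word, count) == checkpoint:
                found = True  # resume on the next record
            continue
        if emitted >= limit:
            break
        yield word, count
        emitted += 1
```

Calling it again with the last record you received as the checkpoint gives you the next page.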

At the end, notice, this is basically the same as the NoOp version. The only difference is we're getting actual data from Firestore rather than generating fake versions. This frequencies method here is just a convenience method to allow us to turn the data into the shape we need. This is one of the formats accepted by the Word Cloud library.

So we can use this to generate the format we need in a single call. These two implementations, they follow the same interface. The difference being that the Firestore version accepts a client object and it fetches data from that client. Because these are intended to behave the same way, we need to test them with the same tests.

Let's review the tests for these classes. The tests here that accept a data_storage argument, these are our data storage tests. The data_storage argument is a function scoped fixture. Meaning the fixture is run for each test. This fixture is interesting in that these params allow us to specify multiple data storage classes. 

By specifying these two classes, Pytest will run each test twice, once for the NoOp version, and once for the Firestore version. This request argument is passed in by Pytest, and we can access the parameters through the param property. Since our parameters are classes, we can treat the param property as such and instantiate it.
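A parameterized fixture along these lines might look like the following. The class names match the transcript; their bodies here are placeholders, since the point is the fixture mechanics: `request.param` is whichever class is currently being tested, and we instantiate it.

```python
import pytest


# Placeholder implementations -- the real classes live in the data access layer.
class NoOpDataStorage:
    def __init__(self, client):
        self.client = client  # accepted but unused by the fake


class FirestoreDataStorage:
    def __init__(self, client):
        self.client = client  # the real version would query this client


@pytest.fixture(params=[NoOpDataStorage, FirestoreDataStorage])
def data_storage(request):
    # request.param is one of the classes above; instantiating it here
    # means every test that uses this fixture runs once per implementation.
    return request.param(client=object())


def test_has_client(data_storage):
    assert data_storage.client is not None
```

Function scope is the pytest default, so the fixture body runs fresh for every test, for every parameter.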

So, each of the functions that have a data_storage argument are going to be run twice, once for NoOp, and once for Firestore. To make it easier, we're passing the client into both versions, even though the NoOp version is just not going to use it. The client here is a fake version of Google's implementation of the Firestore client.

It implements just the methods that we're using in the DataStorage class. Since Google's library is a fluent interface, we're imitating that by returning self for most of these functions. The idea for this class is to have a minimal version that roughly behaves the same way that Google's does. This version here is kind of at the edge of where I feel that it's worth developing for myself.

If we needed to use a lot more functionality from the client, I'd start looking at a more complete open source Mock. By the way, I'm using this syntax here to allow any arguments and keyword arguments to be passed in, and the underscores here are just supposed to imply that we're not actually using them.
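A minimal fake fluent client in that spirit might look like this. This is a hypothetical sketch, not Google's API: the method names mirror the kinds of calls a Firestore query chain makes, most methods return `self` to imitate the fluent style, and the `*_args, **_kwargs` underscores signal that the arguments are deliberately ignored.

```python
class FakeFirestoreClient:
    """Minimal stand-in for a fluent database client (illustrative only)."""

    def __init__(self):
        # Four hard-coded documents, matching the test data described above.
        self._documents = [{"count": i} for i in range(4)]

    def collection(self, *_args, **_kwargs):
        return self  # fluent: allow client.collection(...).order_by(...)

    def order_by(self, *_args, **_kwargs):
        return self

    def limit(self, *_args, **_kwargs):
        return self

    def start_after(self, *_args, **_kwargs):
        return self

    def stream(self, *_args, **_kwargs):
        # The end of the chain actually produces data.
        return iter(self._documents)
```

Any chain of calls ends the same way, which is all the DataStorage class needs for its tests.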

Alright, so the NoOp version is really only ever going to return our hard coded 10 results. The Firestore version is going to get its data from the client, and if both of these behave the same way when tested, it gives us a bit more confidence in using both versions. Which means if they behave the same way, we can put more trust into our NoOp version for local development.

The test_publications function iterates over the publications and asserts that the count matches our index. This works because our NoOp version loops from zero to nine, and it uses its loop index for each count. The Firestore version gets its data from the client, which has four hard coded values ranging from zero to three, and we're also asserting that the name and image_URL are what we expect.

Recall that the publications method accepts a string argument. That's what we're testing here. It's the same test as above, only it's asserting that the image_URL includes the provided bucket name. Testing word counts is similar. We loop over them and assert that the values are what we expect.

Testing the checkpoint argument of the word counts is a bit more involved, however, all it's doing is requesting only records starting after the supplied word and count. This will tell us if we're actually getting the next record in the list or not. Running this shows that all of our tests are passing, and we can see that it's running both versions of our data source.

Let's review the rest of the functionality in this module. There are two implementations of BlobStorage. Again, one fake, one real. These are both pretty basic, they provide a save method. The actual implementation is going to save the image's bytes to Cloud Storage as a png file.
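The fake half of that pair can be as small as this sketch. The class name and the idea of an in-memory store are assumptions for illustration; the real implementation would wrap a Cloud Storage bucket and upload the PNG bytes instead.

```python
class NoOpBlobStorage:
    """Fake blob storage: keeps saved blobs in memory for tests."""

    def __init__(self):
        self.saved = {}

    def save(self, name: str, data: bytes) -> None:
        # No network, no disk -- just remember what was "uploaded".
        self.saved[name] = data


# The Cloud Storage version would follow the same one-method interface,
# roughly (not runnable here without the google-cloud-storage client):
#
#   class CloudBlobStorage:
#       def __init__(self, client, bucket_name):
#           self.bucket = client.bucket(bucket_name)
#
#       def save(self, name, data):
#           self.bucket.blob(f"{name}.png").upload_from_string(
#               data, content_type="image/png")
```

As with DataStorage, the shared `save` signature is the implicit interface that lets the web layer accept either one.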

Image generation is done with the WordCloud library and is handled with the generate_word_cloud function. It accepts a dictionary containing the frequencies, and an optional format, height and width. The return value depends on the format string and includes options for bytes, raw and image. Bytes returns the image as bytes, raw returns a WordCloud object, and image returns a PIL image.

The tests for this function only test the return value data types. The function for converting the PIL image into bytes uses the IO module and passes a BytesIO object to the images save method, and this allows save to write to our buffered IO rather than to disk, and then it returns the bytes. The test for this generates a black square that's 128-by-128, and then it asserts that the results of converting it into bytes matches what we've previously generated up here.
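The BytesIO trick is worth seeing in isolation. This sketch works with anything that has a PIL-style `save(fp, format=...)` method; the stub used in the test below stands in for a real image so the example stays self-contained.

```python
import io


def to_bytes(image, fmt: str = "PNG") -> bytes:
    """Serialize a PIL-style image to bytes without touching disk.

    Passing a BytesIO object as the file argument makes save() write
    into the in-memory buffer instead of a file on disk.
    """
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return buffer.getvalue()
```

With a real PIL image, `to_bytes(Image.new("RGB", (128, 128)))` would return the PNG-encoded bytes that the blob storage layer uploads.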

The get_client function accepts a string and returns a client. The types are either db or blob, and if neither is given, we use the default of db, and then we raise for any other type. The tests for this function just check that the types match. Using this decorator, we can specify that multiple arguments are passed in here as underscore in and underscore out.
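Putting those two pieces together, here's a hypothetical sketch of a `get_client`-style factory and a parametrized test for it. The fake client classes are placeholders (the real function would return Firestore and Cloud Storage clients), but the dispatch-with-default-and-raise shape and the `_in`/`_out` parametrize pattern match what's described above.

```python
import pytest


class FakeDbClient:
    pass


class FakeBlobClient:
    pass


def get_client(client_type: str = "db"):
    # Dispatch on the client type, defaulting to "db"; anything else raises.
    clients = {"db": FakeDbClient, "blob": FakeBlobClient}
    try:
        return clients[client_type]()
    except KeyError:
        raise ValueError(f"unknown client type: {client_type!r}")


# pytest matches the comma-separated names with the argument names below;
# the names themselves don't matter, the underscores just read as "in/out".
@pytest.mark.parametrize("_in, _out", [
    ("db", FakeDbClient),
    ("blob", FakeBlobClient),
])
def test_get_client(_in, _out):
    assert isinstance(get_client(_in), _out)
```

Each tuple in the parametrize list becomes its own test case, so one function covers both client types.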

The names aren't important, those are just the names I chose, and Pytest is going to match these with the comma separated names here. The image_URL_path function accepts a publication id, which is the name of the publication, in this case, and then it takes an optional path. Then it MD5 hashes the id, and while MD5 is no longer considered secure, that really won't matter in this use case.

We jump through a few hoops here to try and ensure that the value passed in for the bucket name is correctly added to the URL so that we can append it to our base image URL later. If we look at the test, you can see some of the concerns. I want to ensure that we can specify a storage bucket where the images reside.

The first version of this that I wrote, it was just too basic, and it broke if I wasn't paying careful attention to the forward slashes. So I wanted to make this version a little bit more robust, and so with these tests, I can prove that this function builds the URL the way that I expect.
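One way to get that slash-robustness looks like this. This is a hypothetical sketch, not the course's exact function: it MD5-hashes the publication id for a stable file name (uniqueness is all we need here, not cryptographic strength) and strips stray slashes before joining, so callers can pass the path with or without them.

```python
import hashlib


def image_url_path(publication_id: str, path: str = "") -> str:
    """Build the URL path for a publication's word cloud image.

    The optional path (e.g. a bucket name) is normalized so that
    "bucket", "/bucket", and "/bucket/" all produce the same URL.
    """
    name = hashlib.md5(publication_id.encode("utf-8")).hexdigest()
    # Strip surrounding slashes from each part, then join with exactly one.
    parts = [p.strip("/") for p in (path, f"{name}.png") if p]
    return "/".join(parts)
```

Tests can then assert the properties that matter: the result is deterministic, contains no doubled slashes, and includes the bucket when one is given.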

Okay, we've covered a lot in this lesson. Let's summarize before moving on to the next one. This app has two data sources. It has data in Cloud Firestore and it has data in Cloud Storage. Firestore holds the publication and entity data, and Cloud Storage is where we store the word cloud images.

We have two versions of our DataStorage and two versions of BlobStorage. Each has a simplistic version which doesn't rely on external data sources, and then there's a version that does. The NoOp versions demonstrate the most basic requirements for the class, while doubling as a development and test implementation, and each of the actual implementations expects to be passed a data access client. This allows us to pass in either a real or fake client, which allows us to test both with and without access to the production data sources.

Okay, in our next lesson, we're going to review the Python web application standard. So, whenever you're ready to keep going, I will see you in the next lesson!


Course Introduction - Quality of Life for Developers - What Is It That We're Building? - The Web Server Gateway Interface - Exploring the Web Application Layer - Exploring the Front End Code - Running the Web App - Summary / Next Steps

About the Author
Learning Paths

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.