I can clearly remember that moment when our VP of Engineering came to me saying, “Our data and reporting are a mess.” At that moment, we admitted that we had to improve how we manage and organize data and, even more importantly, how we provide it to our customers.
You know, when a company turns the corner and in a really short amount of time goes from being a start-up to becoming a really important player doing business with big companies, everything needs to be evolved, transformed, and improved. This is fair, and everyone would love to be involved in such interesting processes, including me. But (there is always a but) each new refactoring, each new improvement brings challenges, study, and risks to be taken into account, and this was true when I started to analyze and design a new reporting system, handling both the data — in terms of moving and transformation — and the system to provide them to our customers.
Chapter 1 – What the…?
The first thing I addressed when I started the analysis was the way we stored reporting data and how we were getting the data to build dashboards and reports. Basically, all the information — even the semi-aggregated data — was stored inside the main database together with the operational data. Moreover, the software modules in charge of getting the data and building reports and dashboards were part of the main backend system.
In a scenario where there aren’t many concurrent users and the number of records is not in the hundreds of thousands, this approach — while not ideal — is not necessarily wrong.
Fortunately for us, our number of concurrent users increases every day, and together with them the amount of data we host. This means that we need to completely change our approach to data and business intelligence in general.
Honestly, since the beginning, I started designing the new architecture following a canonical approach, with these components in mind:
- A read-only database replica from which to fetch the relevant raw data. This avoids overwhelming the main database architecture with heavy operations like big queries and exports of numerous records.
- A tool to orchestrate and execute ETLs.
- A brand new database to host all the data about reporting and dashboards.
- A new microservice to provide APIs in order to get data for dashboards and to build reports from raw exports.
With that in mind, I started to outline the architecture in collaboration with my colleagues.
Chapter 2 – Each job requires its own tools
The need to set up a read replica as the database to stress with exports and long queries was quickly and easily accepted. After some budgeting considerations, our infrastructure team proceeded to implement this.
With the read replica in my backpack, I moved to the next friend: the tool to run and orchestrate ETLs.
To find the best fit for our needs, I asked our Data Engineer, Alessandro, for help. He did a great analysis on the available alternatives, and together we went through the shortlist. Basically, we had to choose among tools addressing the problem from different perspectives and having different “souls” or core foundations (e.g., cloud-native, service or bundled application, declarative or coding-based, etc.).
In our view the best fit was, and actually still is, Prefect: an open-source workflow management system that’s Python-based. (I didn’t mention it earlier, but Python is our most-used backend programming language.)
The infrastructure and the data teams tried several different configurations to integrate Prefect into our ecosystem, ultimately landing on the following setup:
- A Prefect server running on ECS tasks (it is made of several components)
- A PostgreSQL database dedicated to Prefect
- A Prefect Agent listening for jobs to run; for each job run, it spawns an ECS task executing the desired Prefect flow
- Some Jenkins pipelines to manage the platform and deploy the flows on the server
Nice — we put in place a modern system to orchestrate and schedule flows. What came next? The ETLs, of course!
During the ideation phase, I imagined moving data from the source database to the reporting one, implementing a really old-fashioned module based on stored procedures: hard to code, a nightmare to maintain, but efficient, reliable, and data-focused by definition.
Luckily for me, Alessandro came to me with a proposal that changed the game: Why not use dbt? Honestly, I wasn’t familiar with dbt, so I had to study a bit. Doing the free course they offer at dbt, I learned about this wonderful tool that solves the data moving and transformation problem in an elegant and ridiculously simple way.
What you need to do is to define your models in terms of SQL queries, and with the help of some templating feature leveraging jinja, wire them building implicitly how they depend on each other.
Once the tool runs, it automatically builds the dependencies graph and processes the models one by one in the right order, so at the end, the final tables are built and filled with the desired data.
The key concept is that all the transformations are expressed using an elegant declarative approach, simply writing SQL queries.
Moreover, the tool allows you to define different kinds of “persistence” for the models you define:
- Table: Each time the model generates a brand new table, that replaces the existing one.
- Incremental: Performs inserts and updates on an existing table; if it does not exist, then the first run creates it.
- View: The model creates a regular view in the database instead of a table.
- Ephemeral: When executing, the tool will get the SQL code of the model and will merge with the code of the model using it. This is useful to increase reusability and code readability.
I encourage you to check out this tool and the features it offers since I can’t list them all here.
So, wrapping up, we defined a read replica database, picked an orchestration system for ETLs, and found a nice tool to concisely write our data flows. At that point, I was almost ready to ask about the configuration of the database for the new business intelligence platform. But talking to other colleagues, we thought to leverage another project running in parallel with this initiative, so we decided to use a Redshift cluster as the destination for our data.
Chapter 3 – ETL vs ELT
You know Redshift is powerful, even if it has some limitations in terms of interoperability with other databases, but I have to say that it offers something that really simplifies our data flow: federated queries.
Using this feature, we could hook up our PostgreSQL read replica into Redshift, which then provided it as it was simply a local schema of the data warehouse itself.
This could seem like a small thing. After all, data have been moved between databases since the first one was implemented. But this helped us to move from an ETL approach in favor of the ELT one.
To remind you of the difference between them:
- ETL flows first perform the data Extraction, apply Transformation logic and then Load them into the destination.
- ELT flows, instead, perform an Extraction to Load a copy of the original data into the destination where the Transformation then is applied.
Both of them have pros and cons, but between the option of leveraging a cloud-native data warehouse like Redshift and a declarative tool based on SQL like dbt, the second approach is dramatically more natural and easy to implement, especially if federated queries fit the scenario.
Working in this scenario simplifies the work a lot because the end-to-end process is clear and the responsibilities are well-defined and segregated:
- Prefect is in charge of just taking care of the flows scheduling and their internal and external orchestration.
- Redshift is in charge of playing the role of a single and global place for data, abstracting the other origin data sources as local schemas.
- dbt flows are in charge of moving data from source tables into a staging area, getting the staged records, and applying the required transformations to build the final tables (dimensions, relations, facts, etc.). The flows do not care at all where the source tables really are because of the federated queries.
At this point, I would start to wonder more about performance. By having the federated database on the same VPC as the redshift cluster, this configuration is really performant. For example, one of our flows that transfers approximately eight million records, transforms them, and applies some joins and aggregations producing a couple of fact tables of a similar cardinality, takes around five minutes end-to-end. The great thing is that refreshing data with this approach is a piece of cake because the final users experience zero downtime; dbt seamlessly replaces the old version of a final table with the new one.
Chapter 4 – Let’s use our new wonderful data!
As I mentioned earlier in this post, the first design of the whole system architecture included a dedicated, brand new microservice to provide data inside our ecosystem, capable of efficiently building exports and performing data retrieval using the Redshift cluster. Well, as we say in Italy, “appetite comes with eating,” and after considering the many things we could do with our new data platform, we thought we could hook up our data warehouse with a modern and feature-rich BI platform.
As a software engineer, I was a bit sad realizing that the chance to build a new application from scratch was fading away. But my experience told me it was a good idea in terms of timing, features, and quality we could deliver to our customers using an existing mature product.
Next, we went through a series (luckily not too long) of PoCs evaluating some modern BI platforms trying to understand if one could fit our needs. It was an interesting phase of the project that gave us the chance to see different approaches and technologies in the BI space. After a couple of months, we stepped into the platform that we ultimately picked: ThoughtSpot.
We have been impressed by their approach to BI. The main idea they put in place is to let the user get what they need directly from data by a search-based interaction. The user can literally get the data by using a search bar, and while typing, the tool is capable of generating the queries and detecting the best way to represent the retrieved data building stunning pinboards (aka dashboards).
We saw in ThoughtSpot an evolution in Business Intelligence, and planned to integrate the platform, potentially using all the possibilities it offers:
- API calls to export data
- Embedding of pinboards
- Embedding the search experience in our platform
Right now we are in the first project phase, leveraging the first two options. In the next project phase, we hope to enable our users to free dive into their data of interest inside their Cloud Academy space.
- Prefect – https://www.prefect.io/
- dbt – https://www.getdbt.com/
- Amazon Redshift Query Federation – https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html