I can clearly remember that moment when our VP of Engineering came to me saying, “Our data and reporting are a mess.” At that moment, we admitted that we had to improve how we manage and organize data and, even more importantly, how we provide it to our customers.
You know, when a company turns the corner and, in a short amount of time, goes from being a start-up to being a serious player doing business with big companies, everything needs to evolve, transform, and improve. That is fair, and everyone would love to be involved in such interesting processes, me included. But (there is always a but) every refactoring and every improvement brings challenges, study, and risks to take into account, and that was true when I started to analyze and design a new reporting system, covering both the data (how it is moved and transformed) and the system that provides it to our customers.
The first thing I addressed when I started the analysis was the way we stored reporting data and how we were getting the data to build dashboards and reports. Basically, all the information — even the semi-aggregated data — was stored inside the main database together with the operational data. Moreover, the software modules in charge of getting the data and building reports and dashboards were part of the main backend system.
In a scenario where there aren’t many concurrent users and the number of records is not in the hundreds of thousands, this approach — while not ideal — is not necessarily wrong.
Fortunately for us, our number of concurrent users grows every day, and so does the amount of data we host. This meant we needed to completely rethink our approach to data and business intelligence in general.
From the beginning, I designed the new architecture following a canonical approach, with these components in mind:
With that in mind, I started to outline the architecture in collaboration with my colleagues.
The need to set up a read replica as the database to stress with exports and long queries was quickly and easily accepted. After some budgeting considerations, our infrastructure team proceeded to implement this.
With the read replica secured, I moved on to the next piece of the puzzle: the tool to run and orchestrate ETLs.
To find the best fit for our needs, I asked our Data Engineer, Alessandro, for help. He did a great analysis of the available alternatives, and together we went through the shortlist. Basically, we had to choose among tools addressing the problem from different perspectives and with different “souls,” or core foundations (e.g., cloud-native or not, standalone service or bundled application, declarative or code-based, etc.).
In our view the best fit was, and actually still is, Prefect: an open-source workflow management system that’s Python-based. (I didn’t mention it earlier, but Python is our most-used backend programming language.)
The infrastructure and the data teams tried several different configurations to integrate Prefect into our ecosystem, ultimately landing on the following setup:
Nice — we put in place a modern system to orchestrate and schedule flows. What came next? The ETLs, of course!
During the ideation phase, I imagined moving data from the source database to the reporting one, implementing a really old-fashioned module based on stored procedures: hard to code, a nightmare to maintain, but efficient, reliable, and data-focused by definition.
Luckily for me, Alessandro came to me with a proposal that changed the game: why not use dbt? Honestly, I wasn’t familiar with dbt, so I had to study a bit. By taking the free course dbt offers, I learned about this wonderful tool, which solves the data movement and transformation problem in an elegant and ridiculously simple way.
All you need to do is define your models as SQL queries and, with the help of dbt’s Jinja-based templating (most notably the ref() macro), wire them together so that the dependencies between them are declared implicitly.
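To give you a feel for it, here is what a dbt model looks like: a plain SELECT statement plus a couple of Jinja macros. The model and column names below are hypothetical, not from our actual project:

```sql
-- models/fct_orders.sql (hypothetical model)
-- ref() declares a dependency on another model: dbt resolves it to the
-- right schema and table, and uses these calls to build the graph.
{{ config(materialized='table') }}

select
    o.order_id,
    o.customer_id,
    c.country,
    o.amount
from {{ ref('stg_orders') }} as o
join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```

Because the dependencies live inside the queries themselves, there is no separate pipeline definition to keep in sync with the SQL.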
Once the tool runs, it automatically builds the dependency graph and processes the models one by one in the right order, so that at the end, the final tables are built and filled with the desired data.
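Conceptually, this ordering step is a topological sort over the model graph. A toy sketch in plain Python (model names are made up), using the standard library’s graphlib, shows the idea:

```python
# A toy version of what dbt does under the hood: resolve model
# dependencies into a valid execution order.
from graphlib import TopologicalSorter

# Each model maps to the set of models it depends on
# (in dbt, these edges come from the ref() calls inside each model).
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_orders": {"stg_orders", "stg_customers"},
}

run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)  # sources first, then staging models, then the fact table
```

dbt does much more than this, of course (parallelism, retries, partial runs), but the core guarantee is the same: no model runs before the models it references.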
The key concept is that all the transformations are expressed using an elegant declarative approach, simply writing SQL queries.
Moreover, the tool allows you to define different kinds of “persistence” for the models you define (dbt calls these materializations): a model can be built as a view, as a physical table, as an incremental table (only new or changed records are processed on each run), or as an ephemeral model that is inlined as a common table expression into the models that reference it.
I encourage you to check out this tool and the features it offers since I can’t list them all here.
So, wrapping up, we defined a read replica database, picked an orchestration system for ETLs, and found a nice tool to concisely write our data flows. At that point, I was almost ready to ask about the configuration of the database for the new business intelligence platform. But after talking with other colleagues, we decided to leverage another project running in parallel with this initiative and use a Redshift cluster as the destination for our data.
Redshift is powerful, even if it has some limitations in terms of interoperability with other databases, but I have to say that it offers one feature that really simplified our data flow: federated queries.
Using this feature, we could hook our PostgreSQL read replica up to Redshift, which then exposed it as if it were simply a local schema of the data warehouse itself.
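To give an idea of what this looks like, a federated schema is registered in Redshift with a single statement; everything below (schema names, endpoint, ARNs) is a placeholder, not our real configuration:

```sql
-- Expose the PostgreSQL read replica as a schema inside Redshift.
CREATE EXTERNAL SCHEMA replica_src
FROM POSTGRES
DATABASE 'app_db' SCHEMA 'public'
URI 'read-replica.example.internal' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-federated-role'
SECRET_ARN 'arn:aws:secretsmanager:eu-west-1:111122223333:secret:replica-credentials';

-- From this point on, replica tables are queryable like local ones:
SELECT COUNT(*) FROM replica_src.orders;
```

Credentials live in AWS Secrets Manager and are fetched via the IAM role, so no passwords end up in the SQL itself.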
This could seem like a small thing. After all, data has been moved between databases since the first one was implemented. But it helped us move from an ETL approach to an ELT one.
To remind you of the difference between them:

- ETL (extract, transform, load): data is extracted from the source, transformed by an intermediate processing layer, and only then loaded into the destination.
- ELT (extract, load, transform): raw data is loaded into the destination first and transformed there, leveraging the computing power of the data warehouse itself.
Both of them have pros and cons, but with a cloud-native data warehouse like Redshift and a declarative, SQL-based tool like dbt, the ELT approach is dramatically more natural and easier to implement, especially when federated queries fit the scenario.
Working in this scenario simplifies the work a lot because the end-to-end process is clear and the responsibilities are well-defined and segregated:
At this point, you might wonder about performance. With the federated database in the same VPC as the Redshift cluster, this configuration is really performant. For example, one of our flows, which transfers approximately eight million records, transforms them, and applies some joins and aggregations to produce a couple of fact tables of similar cardinality, takes around five minutes end-to-end. The great thing is that refreshing data with this approach is a piece of cake, because end users experience zero downtime: dbt seamlessly replaces the old version of a final table with the new one.
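For the curious, the zero-downtime refresh works because the new table is built under a temporary name and then swapped in via renames. A simplified sketch of the idea (the actual DDL dbt generates varies by adapter, and the table names here are hypothetical):

```sql
begin;
-- 1. Build the new version under a temporary name
create table analytics.fct_orders__dbt_tmp as
    select * from analytics.stg_orders;  -- stands in for the model's SELECT
-- 2. Swap names so readers never see a missing table
alter table analytics.fct_orders rename to fct_orders__dbt_backup;
alter table analytics.fct_orders__dbt_tmp rename to fct_orders;
commit;
-- 3. Clean up the previous version
drop table analytics.fct_orders__dbt_backup;
```

Since the renames are metadata operations inside a transaction, queries either see the old table or the new one, never an intermediate state.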
As I mentioned earlier in this post, the first design of the whole system architecture included a dedicated, brand new microservice to provide data inside our ecosystem, capable of efficiently building exports and performing data retrieval using the Redshift cluster. Well, as we say in Italy, “appetite comes with eating,” and after considering the many things we could do with our new data platform, we thought we could hook up our data warehouse with a modern and feature-rich BI platform.
As a software engineer, I was a bit sad realizing that the chance to build a new application from scratch was fading away. But my experience told me it was a good idea in terms of timing, features, and quality we could deliver to our customers using an existing mature product.
Next, we went through a (luckily not too long) series of PoCs evaluating some modern BI platforms, trying to understand whether one could fit our needs. It was an interesting phase of the project that gave us the chance to see different approaches and technologies in the BI space. After a couple of months, we found the platform that we ultimately picked: ThoughtSpot.
We were impressed by their approach to BI. The main idea they put in place is to let users get what they need directly from the data through a search-based interaction. Users can literally retrieve the data using a search bar, and while they type, the tool generates the queries and detects the best way to represent the retrieved data, building stunning pinboards (ThoughtSpot’s take on dashboards).
We saw in ThoughtSpot an evolution in Business Intelligence, and planned to integrate the platform, potentially using all the possibilities it offers:
Right now we are in the first project phase, leveraging the first two options. In the next project phase, we hope to enable our users to free dive into their data of interest inside their Cloud Academy space.