How Do We Transform and Model Data at Cloud Academy?
“Data is the new gold”: a common phrase over the last few years. For all organizations, data and information have become crucial to making good decisions for the future and having a clear understanding of how they’re making progress — or otherwise.
At Cloud Academy, we strive to make data-informed decisions. That’s why we decided to invest in building a stack that can help us be as data-driven as possible.
Business users constantly leverage our data to monitor platform performance, build and extract reports, and see how our customers use the Cloud Academy platform. We also provide data to our enterprise customers so they can keep track of their platform usage.
Before modeling and using data, we need to extract useful data in order to create valuable information. We have two primary data sources:
The first step of modeling and using data is extracting it from different sources and putting it into a single place where it can be accessed: the Data Warehouse.
Once the data is stored in this analytical database, we can run queries and retrieve helpful information for our internal users and customers alike.
The Data Warehouse is logically split into two main parts: a staging area, where raw and pre-processed data land right after extraction, and a curated area, where transformed, ready-to-use data is exposed to business users.
To extract data from our sources, our team built data transformation pipelines: we write the pipeline logic in Python and orchestrate the pipelines with Prefect.
In some cases, raw data is extracted and placed in the staging area as-is; in others, a few transformations need to be applied first, producing pre-processed data.
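To make this concrete, here is a minimal sketch of what such a pipeline can look like, assuming Prefect 2.x-style flows and tasks; the source data, task names, and loading step are illustrative placeholders rather than our actual pipelines.

```python
# Minimal extraction pipeline sketch (Prefect 2.x-style syntax assumed).
# The data source, task names, and staging destination are hypothetical.
from prefect import flow, task


@task
def extract_raw_events() -> list[dict]:
    # In a real pipeline this would call an API or read from a source database.
    return [
        {"user_id": 1, "event": "course_started"},
        {"user_id": 2, "event": "course_completed"},
    ]


@task
def preprocess(events: list[dict]) -> list[dict]:
    # Optional light transformations before landing the data in staging.
    return [{**e, "event": e["event"].upper()} for e in events]


@task
def load_to_staging(events: list[dict]) -> None:
    # In a real pipeline this would write to the staging area of the Data Warehouse.
    print(f"Loaded {len(events)} records into staging")


@flow(name="events-extraction")
def events_extraction_flow():
    raw = extract_raw_events()
    cleaned = preprocess(raw)
    load_to_staging(cleaned)


if __name__ == "__main__":
    events_extraction_flow()
```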
As soon as raw and pre-processed data are in the staging area of the Data Warehouse, we can apply data transformations and data modeling. To do so, we use dbt (data build tool), a powerful tool that lets engineers model and transform data directly in the Data Warehouse.
dbt lets us declare desired transformations by defining SQL scripts, known as dbt models. Each model represents a new step in the transformation of data from the staging area to the curated area.
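As an illustration, a dbt model is just a SQL file (with optional Jinja) selecting from upstream models; the example below is a hypothetical sketch in which the table and column names are invented, showing staging data being cleaned up and promoted toward the curated area.

```sql
-- models/curated/dim_users.sql (hypothetical model; names are illustrative)
-- Reads from a staging model and produces a cleaned, curated table.
with stg_users as (

    select * from {{ ref('stg_users') }}

)

select
    user_id,
    lower(email)             as email,
    cast(created_at as date) as signup_date
from stg_users
where user_id is not null
```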
While performing the transformations, we consider a few different model categories:
Data quality reflects how consistent data is and how well it is structured to solve problems or serve specific purposes.
dbt is a great tool for this because it also lets us run tests on both the source data and the data we produce. Several native tests are available out of the box (such as unique, not_null, accepted_values, and relationships, described in the dbt documentation), and they can easily be applied right where you define your data structure. Of course, you can also define custom tests if none of the native ones fit your testing scenario.
Here’s an example of dbt test usage:
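The sketch below shows how these tests are typically declared alongside the model definitions in a schema.yml file; the model and column names are illustrative, not our actual schema.

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: stg_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'professional', 'enterprise']
```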
By leveraging dbt tests, we ensure that the data we provide to our business users always goes through quality and consistency checks.
After the data extraction and transformation process is completed, data is available in the curated area of the Data Warehouse.
Business users can leverage this data for reports and insights. The most common scenario is building dashboards on top of it with a Business Intelligence tool. In some situations, reports need to be extracted on demand, so we provide them by running custom queries against the Data Warehouse.
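For instance, such an ad-hoc report extraction might boil down to a simple aggregate query over a curated table; the schema and column names here are hypothetical.

```sql
-- Hypothetical ad-hoc report: monthly active users from the curated area
select
    date_trunc('month', last_login_at) as month,
    count(distinct user_id)            as monthly_active_users
from curated.dim_users
group by 1
order by 1;
```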