How Do We Transform and Model Data at Cloud Academy?
“Data is the new gold” has become a common refrain over the last few years. For every organization, data and information are crucial to making good decisions for the future and to understanding clearly whether they’re making progress.
At Cloud Academy, we strive to make data-informed decisions. That’s why we decided to invest in building a stack that can help us be as data-driven as possible.
Business users constantly leverage our data in order to monitor platform performance, build and extract reports, and see how our customers use the Cloud Academy platform. We also provide data to our enterprise customers, which lets them keep track of their platform usage.
Where do we get data?
Before modeling and using data, we need to extract useful data in order to create valuable information. We have two primary data sources:
- Internal: Provided by our operational systems that expose the core data of the Cloud Academy platform, like labs and course sessions.
- External: Provided by external services and platforms that we use to collect data about events not strictly related to the Cloud Academy platform, like currency exchange rates.
The data extraction process
The first step of modeling and using data is extracting it from different sources and putting it in a central repository where it can be accessed: the Data Warehouse.
Once information’s stored in this analytical database, we can perform queries and retrieve helpful information for our internal users and customers alike.
The Data Warehouse is logically split into two main parts:
- Staging Area: This is where raw data (extracted from sources as-is) and pre-processed data (extracted from sources and then lightly transformed, for example to apply common formats) are stored. This data is not in its final form, so it is not used by end users.
- Curated Area: This is the area where the transformed and modeled data is stored. We model data by using the dimensional modeling technique following the Kimball approach. This data is pulled directly by end users through SQL queries or through Business Intelligence tools.
To extract data from sources, our team built data transformation pipelines. We use the programming language Python to create pipeline logic, then we use Prefect to orchestrate the pipelines.
In some cases, raw data is extracted and placed in the staging area as-is; in others, a few transformations are applied first, producing the pre-processed data.
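As an illustration, the extract-and-stage step described above can be sketched in plain Python. The function and field names here are hypothetical, and the Prefect orchestration layer (task and flow decorators, scheduling, retries) is omitted for brevity:

```python
from datetime import datetime, timezone

# Hypothetical sketch: extract raw session records from an operational
# source, then apply one light pre-processing step (normalizing the
# timestamp format) before the rows land in the staging area.

def extract_sessions(source_rows):
    """Pull rows from an operational source as-is (raw data)."""
    return list(source_rows)

def preprocess_sessions(rows):
    """Apply a common format: epoch seconds -> ISO-8601 UTC timestamps."""
    processed = []
    for row in rows:
        processed.append({
            **row,
            "started_at": datetime.fromtimestamp(
                row["started_at"], tz=timezone.utc
            ).isoformat(),
        })
    return processed

raw = extract_sessions([{"session_id": 1, "started_at": 0}])
staged = preprocess_sessions(raw)
```

In a real pipeline, each function would be wrapped as a Prefect task so the orchestrator handles scheduling and failure handling.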
What’s next in the data transformation process?
As soon as raw and pre-processed data are in the staging area of the Data Warehouse, we can apply data transformations and data modeling. To do so, we use dbt (data build tool), a powerful resource that allows engineers to work and model data in the Data Warehouse.
dbt lets us declare desired transformations by defining SQL scripts, known as dbt models. Each model represents a new step in the transformation of data from the staging area to the curated area.
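Concretely, a dbt model is just a SELECT statement in a `.sql` file: dbt builds the dependency graph through its `ref()` macro and materializes each model as a table or view. The model and column names below are hypothetical, for illustration only:

```sql
-- models/intermediate/int_course_sessions.sql (hypothetical model)
-- Selects from a staging model via dbt's ref() macro; dbt resolves
-- the dependency and materializes this step in the warehouse.
select
    session_id,
    user_id,
    course_id,
    started_at
from {{ ref('stg_course_sessions') }}
where session_id is not null
```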
How are dbt models organized?
While performing the transformations, we consider a few different model categories:
- Staging Models: Initial models that represent raw and pre-processed data extracted through Prefect pipelines.
- Intermediate Models: Models that take data from staging models or from intermediate models (if multiple levels are defined).
- Marts Models: Models that take data from the intermediate models and represent the tables in the curated area. Marts models are usually dimensions (containing partially denormalized data about entities) and facts (containing normalized data about events that occurred).
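As a sketch of this layering (all model and column names here are hypothetical, not our actual models), a fact model in the marts layer typically selects measurable events from an intermediate model and joins in the keys of the related dimensions:

```sql
-- models/marts/fct_course_sessions.sql (hypothetical model)
-- One row per session event, carrying foreign keys to the
-- dimensions plus the measurable values, per Kimball-style design.
select
    s.session_id,
    u.user_key,
    s.started_at,
    s.duration_seconds
from {{ ref('int_course_sessions') }} as s
join {{ ref('dim_users') }} as u
    on s.user_id = u.user_id
```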
Data quality with dbt
Data quality describes how consistent data is and how well it is structured to solve problems or serve specific purposes.
dbt is a great tool for this because it also enables us to perform tests on both source data and the data we produce. Several native tests are available out of the box (they’re listed in the dbt documentation) and can easily be used and integrated where you define your data structures. Of course, you can also define custom tests if none of the available ones fit your testing scenario.
Here’s an example of dbt test usage:
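The snippet below is an illustrative schema file (the model and column names are invented for the example); `unique`, `not_null`, and `accepted_values` are among dbt’s built-in generic tests:

```yaml
# models/staging/schema.yml (hypothetical file)
version: 2

models:
  - name: stg_course_sessions
    columns:
      - name: session_id
        tests:
          - unique      # no duplicate sessions
          - not_null    # every row must have an id
      - name: status
        tests:
          - accepted_values:
              values: ['started', 'completed', 'abandoned']
```

When `dbt test` runs, each declared test compiles to a query that returns the offending rows, and the test fails if any are found.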
By leveraging dbt tests, we ensure the data we provide to our business users always passes quality and consistency checks.
Finally, using data!
After the data extraction and transformation process is completed, data is available in the curated area of the Data Warehouse.
Business users can leverage this data to get reports and insights. The most common scenario involves leveraging data through a Business Intelligence tool to build dashboards. In some situations, reports need to be extracted, so we provide them by performing custom queries on the Data Warehouse.
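For example, a report extraction might boil down to a simple aggregate query over the curated area (the table and column names below are hypothetical):

```sql
-- Hypothetical report: completed course sessions per month,
-- joining a fact table to a date dimension in the curated area.
select
    d.year_month,
    count(*) as completed_sessions
from fct_course_sessions as f
join dim_dates as d
    on f.date_key = d.date_key
where f.status = 'completed'
group by d.year_month
order by d.year_month
```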