Handling Slowly Changing Dimensions With Azure Synapse Analytics Pipelines
The course is part of this learning path
In this course, we’ll explain the concept of slowly changing dimensions and the different approaches to dealing with them. Then we’ll show you how to create a data flow in Azure Synapse Pipelines to update slowly changing dimensions. Even if you’re not particularly interested in how to handle slowly changing dimensions, the demos in this course will give you a good idea of how to create data flows in either Azure Synapse Pipelines or Azure Data Factory since they both use the same interface.
- Understand various methods for handling slowly changing dimensions in databases
- Create a data flow in Azure Synapse Pipelines that updates slowly changing dimensions in a database table
- Anyone who would like to learn about slowly changing dimensions in databases
- Anyone who needs to know how to create data flows in Azure Synapse Analytics Pipelines and Azure Data Factory
- Some knowledge of Azure Synapse Analytics, specifically dedicated SQL pools and Synapse Pipelines (take Introduction to Azure Synapse Analytics if you’re not familiar with this Microsoft service)
- Experience with SQL (not mandatory)
The GitHub repo for this course can be found here.
Now I’m going to walk you through an example from Microsoft’s website. In this example, we’re going to create a data flow in Synapse Analytics for a Type 1 SCD. This means we’ll only be updating fields with the latest values and won’t be keeping a history of previous values. We’ll actually be doing a bit more than is required for a Type 1 SCD, but I’ll tell you more about that in a minute.
We’re going to use a dedicated SQL pool as both the source and the destination. In real life, you’d probably use a different source, such as Azure SQL Database, but this way the demo will be simpler. I should also point out that you could create this data flow in Azure Data Factory instead of Synapse Analytics because they share the same interface, but since we’re using a dedicated SQL pool for the data, it’s nice to be able to do everything in Synapse Analytics instead of having to use two different Azure services.
Here’s an overview of the steps we’ll be performing:
- First, we’ll need to create a source table and a destination table.
- Then, we’ll create a data flow, which provides a graphical way to define data transformations without writing any code.
- Next, we’ll add the source table and destination table to the data flow.
- Finally, we’ll add the transformations we need to the data flow.
Here’s an overview of the transformations we’re going to create.
First, for each row in the source stream, we’ll create a hash of all of the data fields in the row. We need this hash key so we can easily see if the row has any new data that doesn’t exist in the destination table. In the next step, for each source row, we’ll check if any of the rows in the destination table have the same hash key. If the same hash key exists, then this row doesn’t have any new data, and we can discard it.
Next, for each row in the source stream, we’ll use the CustomerID to lookup the matching row in the destination table and then add the columns from the destination table to the row. This means we’ll have lots of duplicate columns because every column in the source table is also in the destination table. That’s okay, though, because we’re going to filter out all of the duplicate columns in a later step. So why do we even need to do this step? Because we need to add the InsertedDate and ModifiedDate columns from the destination table.
In the next step, we fill in the InsertedDate and ModifiedDate fields. If a row in the source stream has a CustomerID that isn’t found in the destination table, then it’s a new record, so we’ll put the current date and time in the InsertedDate field and the ModifiedDate field. As I mentioned earlier, this isn’t necessary for a Type 1 SCD, but it would be helpful if we were doing a Type 3 SCD. So even though we’re not going to do a full Type 3, this will give you an idea of how to do these particular aspects of a Type 3 implementation.
If a row in the destination table has the same CustomerID but a different hash key, then at least one of the data fields has a different value, so we’ll need to update the row. When we do that, we’re also going to put the current date and time in the ModifiedTime field. We won’t fill in the InsertedDate field, though, because that will already contain the date and time when this record was first added to the table.
Next, we have to tell it to allow upserts to the destination table, which means that if a row doesn’t already exist, then it will be added, but if it does already exist, then it will be updated.
In the final step, we’ll write the data to the destination table.
Okay, to see how to create this entire data flow, go to the next video.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).