Handling Slowly Changing Dimensions With Azure Synapse Analytics Pipelines
The course is part of this learning path
In this course, we’ll explain the concept of slowly changing dimensions and the different approaches to dealing with them. Then we’ll show you how to create a data flow in Azure Synapse Pipelines to update slowly changing dimensions. Even if you’re not particularly interested in how to handle slowly changing dimensions, the demos in this course will give you a good idea of how to create data flows in either Azure Synapse Pipelines or Azure Data Factory since they both use the same interface.
- Understand various methods for handling slowly changing dimensions in databases
- Create a data flow in Azure Synapse Pipelines that updates slowly changing dimensions in a database table
- Anyone who would like to learn about slowly changing dimensions in databases
- Anyone who needs to know how to create data flows in Azure Synapse Analytics Pipelines and Azure Data Factory
- Some knowledge of Azure Synapse Analytics, specifically dedicated SQL pools and Synapse Pipelines (take Introduction to Azure Synapse Analytics if you’re not familiar with this Microsoft service)
- Experience with SQL (not mandatory)
The GitHub repo for this course can be found here.
Suppose you have a customer database, and the values in certain columns can change occasionally. For example, if a customer moved to a new location, then their address would change. This is sometimes referred to as a slowly changing dimension. Once a value has been changed, you wouldn’t normally be able to see the previous value for that field unless you resorted to looking at a database backup.
There are several ways of handling this situation, and it’s important enough that there are names for the different methods. Type 1 SCD means that the old values are not saved, so the database only shows the latest values.
If you do want to keep track of the different versions of the data, the most common way to do it is called Type 2 SCD. With this method, you store a new copy of a record whenever one of its fields changes. To make this work, you have to add some extra columns. The StartDate and EndDate columns say when this particular version of the record was valid. The IsCurrent column says whether this is the latest version of the record.
One issue is that you need to have a different key for each of the versions of a record. But since they all have the same key, such as customer ID, you need to come up with a new one called a surrogate key that’s unique for each version.
Then there’s Type 3 SCD. Instead of creating an entirely new record when a field changes, you have two columns for a field: the old version and the current version. You can either store the original value or the previous value in the old column. You’ll also need a column that says when the value was last modified. One problem with this method is that you only have two versions of a field’s value in the database, so you’re not going to have all of the changes. Another problem is that it’s only suitable for one or two fields. You wouldn’t want to have multiple columns for every field or you’d end up with a huge number of columns.
Type 6 SCD is a combination of Type 2 and Type 3. It adds a new record whenever the field changes, and it also has extra columns to keep track of the previous version of the field. With this method, you can get the current version and one of the previous versions by retrieving a single record instead of two records, so if that’s important to you, then it might be worth all of the extra rows and columns.
There are a few other types as well, such as Type 4 SCD, but the ones I’ve covered here should give you a good understanding of how to handle slowly changing dimensions. And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).