Handling Slowly Changing Dimensions With Azure Synapse Analytics Pipelines
The course is part of these learning paths
In this course, we’ll explain the concept of slowly changing dimensions and the different approaches to dealing with them. Then we’ll show you how to create a data flow in Azure Synapse Pipelines to update slowly changing dimensions. Even if you’re not particularly interested in how to handle slowly changing dimensions, the demos in this course will give you a good idea of how to create data flows in either Azure Synapse Pipelines or Azure Data Factory since they both use the same interface.
- Understand various methods for handling slowly changing dimensions in databases
- Create a data flow in Azure Synapse Pipelines that updates slowly changing dimensions in a database table
- Anyone who would like to learn about slowly changing dimensions in databases
- Anyone who needs to know how to create data flows in Azure Synapse Analytics Pipelines and Azure Data Factory
- Some knowledge of Azure Synapse Analytics, specifically dedicated SQL pools and Synapse Pipelines (take Introduction to Azure Synapse Analytics if you’re not familiar with this Microsoft service)
- Experience with SQL (not mandatory)
The GitHub repo for this course can be found here.
Let’s do a quick review of what you’ve learned.
A slowly changing dimension is a column in a dimension table where the values change infrequently, such as an address or phone number.
There are several ways of handling this situation. Type 1 SCD means that the old values are not saved, so the database only shows the latest values.
With Type 2 SCD, you store a new copy of a record whenever one of its fields changes. To make this work, you have to add StartDate, EndDate, and IsCurrent columns. You also need to create a surrogate key that’s unique for each version of a record.
With Type 3 SCD, instead of creating a new record when a field changes, you have two columns for a field: the old version and the current version. You can either store the original value or the previous value in the old column. You’ll also need a column that says when the value was last modified.
Type 6 is a combination of Type 2 and Type 3. It adds a new record whenever the field changes, and it also has extra columns to keep track of the previous version of the field. With this method, you can get the current version and one of the previous versions by retrieving a single record instead of two records.
In Azure Synapse Pipelines or Azure Data Factory, a data flow provides a graphical way to define data transformations without writing any code.
If you enable “Data flow debug”, it’ll start a Spark cluster that you can use for testing.
Schema drift is when the source dataset’s schema changes after you create the data flow. When you enable “Allow schema drift”, it will read in all of the column names every time the data flow is executed and allow all of the new or changed columns to pass through the data flow.
Some of the transformations you can add to a data flow include:
- Derived Column, which lets you create or modify a column,
- Exists, which checks whether specific data exists or doesn’t exist in a data stream,
- Lookup, which adds columns from a second stream to a data stream,
- Alter Row, which lets you set a policy regarding which data updates are allowed, and
- Sink, which specifies a destination for writing data.
If you’re writing data to a dedicated SQL pool, selecting the “Enable staging” option can improve performance.
To run a data flow, you need to create a pipeline containing the data flow and run the pipeline. Before you can run a pipeline, you need to publish everything related to it, including datasets, scripts, the data flow, and the pipeline itself. Then you can run the pipeline by creating a trigger. You can watch a pipeline run in the Monitor section of the interface.
Please give this course a rating, and if you have any questions or comments, please let us know. Thanks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).