Intro and Synapse Analytics
The course is part of this learning path
This course explains some additional topics you should make sure you understand before taking the DP-201 exam, including:
- Synapse Analytics
- Table Storage
If you're preparing to write Microsoft's DP-201 exam, then there are some additional topics you should make sure you understand first. These topics deal with three services. Synapse Analytics, SQL Database, and Table Storage.
Let's start with Azure Synapse Analytics (formerly known as SQL Data Warehouse).
When you're loading data into staging tables, you should use a round-robin distribution method. This'll require a bit of explanation.
Tables in Synapse Analytics are actually spread out across 60 data distributions. This is why queries are so fast on this service—they're massively parallelized. When you run a query, it spawns 60 queries that each run on one data distribution.
To make this work efficiently, you have to decide how the data will be distributed. This is also known as sharding. Synapse Analytics offers three choices: round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it's the fastest distribution type for loading data into a staging table. It doesn't perform any optimization.
Hash-distributed tables are a bit more complicated. The idea is that you designate one of the columns as the hash key. Then the hash function uses the value in this column to determine which data distribution to store a particular row on. As long as you choose a hash key that's appropriate for the most commonly run queries on this table, then query performance will be much better than it would be with a round-robin table.
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics' massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.
Here are some characteristics of a good distribution column:
- It has many unique values so the rows will be spread out over the 60 distributions.
- It's frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don't have frequent joins, then choose a column that's often in GROUP BY clauses.
- It's not used in WHERE clauses, as this would limit query matches to only a few distributions.
In contrast to a hash-distributed table, a replicated table is really simple. The entire table gets stored on each of the 60 data distributions. That seems to defeat the purpose of dividing data into 60 distributions, doesn't it, so why would you ever use it? Well, if a relatively small dimension table is frequently used in joins and aggregations, then it will be much more efficient to have it on every distribution.
Okay, the next topic is disaster recovery. Azure takes snapshots of your Synapse Analytics data warehouse throughout the day. Then if you need to restore your data warehouse to a previous state, you can simply choose which of the automatic restore points you want to go back to.
In some cases, you may want to create your own user-defined restore points in addition to the automatic restore points. For example, you may want to take a snapshot before you make a large number of changes to your data.
Both the automatic and user-defined restore points are stored in the primary region where your data warehouse runs, so if the primary region is down, you won't be able to access any of those restore points. To ensure that you will still be able to bring up your data warehouse in another region, Azure makes a geo-redundant backup once a day to a paired data center. If disaster strikes, you can use that backup to restore a copy of your data warehouse to any region that supports Synapse Analytics.
And that's it for Synapse Analytics.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).