Intro and Synapse Analytics
The course is part of this learning path
This course explains some additional topics you should make sure you understand before taking the DP-200 Exam. These topics deal with four different services:
- SQL Database
- Synapse Analytics
- Data Factory
If you're preparing to write Microsoft's DP-200 exam, then there are some additional topics you should make sure you understand first. These topics deal with four different services: SQL Database, Synapse Analytics, Databricks, and Data Factory. Let's start with Azure Synapse Analytics, formerly known as SQL Data Warehouse.
The fastest way to load data into Synapse Analytics is to use Polybase. It lets you read data from external sources using T-SQL. There are quite a few steps involved, though. Here's what you need to do. Convert your data into structured text files, such as CSV or Parquet, and put the files in either Blob storage or Data Lake Storage. Create external tables by using these three T-SQL commands in this order: CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE.
Load the data into a staging table in Synapse Analytics. This is a best practice so you can deal with data loading issues without affecting production tables. Insert the data into production tables. When you're loading data into staging tables, you should use a round-robin distribution method. This'll require a bit of explanation. Tables in Synapse Analytics are actually spread out across 60 data distributions. This is why queries are so fast on this service, they're massively parallelized. When you run a query, it spawns 60 queries that each run on one data distribution. To make this work efficiently, you have to decide how the data will be distributed. This is also known as sharding.
Synapse Analytics offers three choices, round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it's the fastest distribution type for loading data into a staging table; it doesn't perform any optimization.
Hash-distributed tables are a bit more complicated. The idea is that you designate one of the columns as the hash key. Then the hash function uses the value in this column to determine which data distribution to store a particular row on. As long as you choose a hash key that's appropriate for the most commonly run queries on this table, then query performance will be much better than it would be with a round-robin table.
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics' massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.
Here are some characteristics of a good distribution column. It has many unique values so the rows will be spread out over the 60 distributions. It's frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don't have frequent joins, then choose a column that's often in GROUP BY clauses. It's not used in WHERE clauses, as this would limit query matches to only a few distributions.
In contrast to a hash-distributed table, a replicated table is really simple. The entire table gets stored on each of the 60 data distributions. That seems to defeat the purpose of dividing data into 60 distributions, doesn't it? So why would you ever use it? Well, if a relatively small dimension table is frequently used in joins and aggregations, then it will be much more efficient to have it on every distribution. And that's it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).