This course explains how to optimize dedicated SQL pools in Azure Synapse Analytics using different data distribution methods, partitioning, and query optimization. It also gives a quick overview of the security and disaster recovery features of dedicated SQL pools.
Learning Objectives
- Know when to use round-robin, hash-distributed, and replicated distributions
- Know when to partition a table
- Optimize queries using statistics, caching, and log investigation
- Understand the security and disaster recovery features of dedicated SQL pools
Intended Audience
- Anyone who would like to learn advanced features of dedicated SQL pools in Azure Synapse Analytics
Prerequisites
- Some knowledge of Azure Synapse Analytics (take Introduction to Azure Synapse Analytics if you’re not familiar with this Microsoft service)
- Experience with SQL (helpful but not mandatory)
Before we get into specific optimization techniques, let’s go over some basics of table types and schemas in a data warehouse. If you have any background in relational databases, then you’ll be familiar with tables, which are organized in rows and columns. A table’s schema defines what type of data is in each column. A database’s schema defines the relationship between the tables in it. A data warehouse normally has multiple schemas because it contains data from multiple databases.
In a data warehouse, the simplest type of schema is called a star schema. It’s called that because it looks a little bit like a star. It has a fact table in the middle and dimension tables around it. A fact table usually contains either individual events or summaries. For example, a sales database might have a fact table with either one record for each sales transaction or the total value of all sales transactions in a given period of time.
Dimension tables give more details about the items in the fact table. For example, if each sales transaction in the fact table contains a product ID, then there would be a product dimension table that describes that particular product in more detail, including things like brand, model number, and product category.
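To make this concrete, here’s a minimal T-SQL sketch of a star schema along those lines. The table and column names (FactSales, DimProduct, and so on) are hypothetical examples rather than anything from a real system:

```sql
-- Hypothetical fact table: one row per sales transaction
CREATE TABLE dbo.FactSales
(
    SalesId      INT           NOT NULL,
    ProductId    INT           NOT NULL,   -- links to the product dimension
    SalesDate    DATE          NOT NULL,
    Quantity     INT           NOT NULL,
    TotalAmount  DECIMAL(18,2) NOT NULL
);

-- Hypothetical dimension table: describes each product in more detail
CREATE TABLE dbo.DimProduct
(
    ProductId        INT           NOT NULL,
    Brand            NVARCHAR(100) NULL,
    ModelNumber      NVARCHAR(50)  NULL,
    ProductCategory  NVARCHAR(100) NULL
);
```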
One type of table that isn’t part of a star schema is called an integration table. This is generally a table that sits in between a data source and the data warehouse. For example, a common practice when loading data into a dedicated SQL pool is to first load the data into a staging table, perform some transformations on that data, and then load it into the SQL pool.
The fastest way to load data into Synapse Analytics is to use PolyBase. It lets you read data from external sources using T-SQL. There are quite a few steps involved, though. Here’s what you need to do:
- Convert your data into structured files, such as delimited text (CSV) or Parquet, and put the files in either Blob storage or Data Lake Storage.
- Create external tables by using these three T-SQL commands, in this order (a sketch of all of these steps follows the list):
  - CREATE EXTERNAL DATA SOURCE
  - CREATE EXTERNAL FILE FORMAT
  - CREATE EXTERNAL TABLE
- Load the data into a staging table in Synapse Analytics. This is a best practice so you can deal with data loading issues without affecting production tables.
- Insert the data into production tables.
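Here’s a hedged T-SQL sketch of those steps. The storage location, file layout, schema, and table names are placeholder assumptions, and depending on your environment the data source may also need options such as TYPE or CREDENTIAL, so treat this as an outline rather than a copy-and-paste script:

```sql
-- 1. Register the external storage location (URL is a placeholder)
CREATE EXTERNAL DATA SOURCE SalesDataLake
WITH (
    LOCATION = 'abfss://sales@mydatalake.dfs.core.windows.net'
);

-- 2. Describe the format of the source files (CSV with a header row here)
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- 3. Define an external table over the files
CREATE EXTERNAL TABLE ext.Sales
(
    SalesId      INT,
    ProductId    INT,
    SalesDate    DATE,
    Quantity     INT,
    TotalAmount  DECIMAL(18,2)
)
WITH (
    LOCATION = '/raw/sales/',
    DATA_SOURCE = SalesDataLake,
    FILE_FORMAT = CsvFormat
);

-- 4. Load into a round-robin staging table with CTAS
CREATE TABLE stg.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM ext.Sales;

-- 5. After any transformations, insert into the production table
INSERT INTO dbo.FactSales (SalesId, ProductId, SalesDate, Quantity, TotalAmount)
SELECT SalesId, ProductId, SalesDate, Quantity, TotalAmount
FROM stg.Sales;
```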
When you’re loading data into staging tables, you should use a round-robin distribution method. This’ll require a bit of explanation.
Tables in Synapse Analytics are actually spread out across 60 data distributions. This is why queries are so fast on this service--they’re massively parallelized. When you run a query, it spawns 60 queries that each run on one data distribution.
To make this work efficiently, you have to decide how the data will be distributed. This is also known as sharding. Synapse Analytics offers three choices: round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern: rows are spread evenly across the data distributions without any optimization, which is why it’s the fastest distribution type for loading data into a staging table.
Hash-distributed tables are a bit more complicated. The idea is that you designate one of the columns as the hash key. Then the hash function uses the value in this column to determine which data distribution to store a particular row on. As long as you choose a hash key that’s appropriate for the most commonly run queries on this table, then query performance will be much better than it would be with a round-robin table.
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics’ massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.
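Going back to the hypothetical FactSales table from earlier, here’s a hedged sketch of creating it as a hash-distributed table. The choice of ProductId as the distribution column is just an assumption for illustration; in practice you’d pick a column based on the guidelines below:

```sql
-- Hypothetical fact table, hash-distributed on ProductId
CREATE TABLE dbo.FactSales
(
    SalesId      INT           NOT NULL,
    ProductId    INT           NOT NULL,
    SalesDate    DATE          NOT NULL,
    Quantity     INT           NOT NULL,
    TotalAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductId),   -- rows with the same ProductId land on the same distribution
    CLUSTERED COLUMNSTORE INDEX       -- typical storage for large fact tables
);
```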
Here are some characteristics of a good distribution column:
- It has many unique values so the rows will be spread out over the 60 distributions.
- It’s frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don’t have frequent joins, then choose a column that’s often in GROUP BY clauses.
- It’s not frequently used in WHERE clauses. Filtering on the distribution column would limit a query to only a few distributions, so it couldn’t take advantage of running across all 60 in parallel.
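If you want to check how evenly a distribution column is actually spreading the rows, dedicated SQL pools have a DBCC command that reports rows and space per distribution. The table name below is the hypothetical one from the earlier sketches:

```sql
-- Shows row counts and space used on each of the 60 distributions;
-- large differences in row counts indicate skew (a hot spot)
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");
```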
In contrast to a hash-distributed table, a replicated table is really simple. The entire table gets stored on each of the 60 data distributions. That might seem to defeat the purpose of dividing data into 60 distributions, so why would you ever use it? Well, if a relatively small dimension table is frequently used in joins and aggregations, then it’s much more efficient to have a copy of it on every distribution.
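Here’s the hypothetical product dimension table from earlier, created as a replicated table so that every distribution has a full copy available for joins. As before, the names are assumptions for illustration:

```sql
-- Small dimension table, replicated to all 60 distributions
CREATE TABLE dbo.DimProduct
(
    ProductId        INT           NOT NULL,
    Brand            NVARCHAR(100) NULL,
    ModelNumber      NVARCHAR(50)  NULL,
    ProductCategory  NVARCHAR(100) NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (ProductId)   -- a clustered index is common for small dimension tables
);
```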
To summarize, here are the recommended distribution methods for the three types of tables. Fact tables should be hash-distributed. Dimension tables should be replicated if they’re small or hash-distributed if they’re large. Staging tables should use round-robin distribution.
In addition to distributing a table, you can also partition it by date range. For example, you could partition your sales data by month. That is, each month’s data would be in a separate partition. The biggest benefit of doing this is that you could use something called partition switching.
Let’s say you only keep five years’ worth of sales data in your SQL pool, and at the end of every month, you add the data for the latest month and delete the data for the oldest month. If you have a huge number of rows to add and remove, doing that row by row can take a long time and could run into problems partway through. With partition switching, you can load the latest month’s data into a staging table with the same structure, switch that partition into the production table, and switch the oldest partition out, which is far faster than inserting and deleting individual rows.
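Here’s a hedged sketch of what that rolling-window pattern might look like in T-SQL. The boundary dates, partition numbers, and table names are assumptions for illustration, and the staging and archive tables have to match the production table’s structure and partition scheme for the switches to succeed:

```sql
-- Fact table partitioned by month on SalesDate (boundary values are examples)
CREATE TABLE dbo.FactSales
(
    SalesId      INT           NOT NULL,
    ProductId    INT           NOT NULL,
    SalesDate    DATE          NOT NULL,
    Quantity     INT           NOT NULL,
    TotalAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SalesDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-02-01', '2023-03-01'))
);

-- Switch the newly loaded month from the staging table into production.
-- Partition numbers are placeholders for whichever month is involved.
ALTER TABLE dbo.FactSales_Staging SWITCH PARTITION 2
    TO dbo.FactSales PARTITION 2;

-- Switch the oldest month out to an archive table, then drop or truncate it.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1
    TO dbo.FactSales_Archive PARTITION 1;
```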
And that’s it for distribution and partitioning.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).