Optimizing Dedicated SQL Pools in Azure Synapse Analytics
The course is part of these learning paths
This course explains how to optimize dedicated SQL pools in Azure Synapse Analytics using different data distribution methods, partitioning, and query optimization. It also gives a quick overview of the security and disaster recovery features of dedicated SQL pools.
- Know when to use round-robin, hash-distributed, and replicated distributions
- Know when to partition a table
- Optimize queries using statistics, caching, and log investigation
- Understand the security and disaster recovery features of dedicated SQL pools
- Anyone who would like to learn advanced features of dedicated SQL pools in Azure Synapse Analytics
- Some knowledge of Azure Synapse Analytics (take Introduction to Azure Synapse Analytics if you’re not familiar with this Microsoft service)
- Experience with SQL (not mandatory)
Since data warehouses contain massive amounts of data, query execution needs to be optimized to reduce both the time queries take and the resources they consume. Before running a query, the SQL pool’s query optimizer decides which query plan to use. For example, it will estimate how many rows will be returned by the query, and it will use one query plan if it’s a small number of rows and a different query plan if it’s a large number of rows.
In order to make the right decision, it needs to know your data pretty well. The way it does that is by generating statistics about important columns, such as columns that are used to join two tables together. Fortunately, SQL pools generate statistics automatically by default. However, these statistics are only generated when the SQL pool sees that statistics it needs for an incoming query are missing. So, if the statistics aren’t already there, the query will be slower the first time it’s run because the statistics have to be generated. This is why it’s important to generate statistics ahead of time, if possible.
Also, even if statistics have already been created, they will become out-of-date as new data gets added. One approach is to update the statistics whenever new data is loaded. However, it would probably be excessive to update the statistics on every column, so you should focus on the ones that are important for query performance, such as date columns and columns that are used in JOIN, GROUP BY, ORDER BY, or DISTINCT clauses in SQL queries.
If a query takes much longer than you expect, then you can investigate it by querying the logs. You need to have the VIEW DATABASE STATE permission to do this. Even if you don’t know of a particular query that took a long time, you can query the logs to find the ones that took the longest time to execute.
Once you know which query you want to investigate, you can retrieve both the SQL statement that was used and the query plan that was executed. You can also find out how many rows came from each distribution. If a large number of rows came from one distribution, this could indicate that you have a hot spot. This is also known as data skew. To fix data skew, use a more appropriate column as the hash key so the rows are distributed more evenly.
One feature that dramatically speeds up queries is caching. If you enable result set caching, then the results from all queries (with a few exceptions) get cached. If you or someone else runs the same query again, and the data hasn’t changed in the meantime, then the results are retrieved from the cache instead of from executing the query. Not only does this make the query lightning-fast, but it’s also “free” because it doesn’t use one of your concurrency slots to execute the query.
The maximum size of the cache is one terabyte per database, which is a lot, but it would still fill up eventually if it wasn’t managed. When the cache is almost full, the SQL pool removes results that haven’t been retrieved recently or that are no longer valid due to data changes. Even if the cache is not close to being full, result sets that haven’t been retrieved for 48 hours are removed.
One last point is that if you’re planning on running queries that will return large amounts of data, then you should disable result set caching on the database before running them.
And that’s it for query optimization.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).