


Optimizing Dedicated SQL Pools in Azure Synapse Analytics

The course is part of these learning paths

DP-203 Exam Preparation: Data Engineering on Microsoft Azure
DP-201 Exam Preparation: Designing an Azure Data Solution

This course explains how to optimize dedicated SQL pools in Azure Synapse Analytics using different data distribution methods, partitioning, and query optimization. It also gives a quick overview of the security and disaster recovery features of dedicated SQL pools.

Learning Objectives

  • Know when to use round-robin, hash-distributed, and replicated distributions
  • Know when to partition a table
  • Optimize queries using statistics, caching, and log investigation
  • Understand the security and disaster recovery features of dedicated SQL pools

Intended Audience

  • Anyone who would like to learn advanced features of dedicated SQL pools in Azure Synapse Analytics



Let’s do a quick review of what you learned.

In a data warehouse, the simplest type of schema is a star schema, which has a fact table in the middle and dimension tables around it. A fact table usually contains either individual events or summaries. Dimension tables give more details about the items in the fact table.

An integration table is a table that sits between a data source and the data warehouse. One example is a staging table. PolyBase lets you read data from external sources into a staging table using T-SQL: you define an external data source, a file format, and an external table, and then load the data into the staging table.




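The specific commands shown in the course aren't reproduced in this transcript, but a typical PolyBase setup looks roughly like the following sketch. All of the names, the storage location, and the file layout are illustrative, and the credential setup is omitted:

```sql
-- Point at the external storage account (credential setup omitted for brevity).
CREATE EXTERNAL DATA SOURCE SalesStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@examplestorage.dfs.core.windows.net'
);

-- Describe how the source files are formatted.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- Expose the files as a queryable external table.
CREATE EXTERNAL TABLE ext.Sales (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(10, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = SalesStorage,
    FILE_FORMAT = CsvFormat
);

-- Load the external data into a round-robin staging table with CTAS.
CREATE TABLE stg.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Sales;
```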
Tables in dedicated SQL pools are spread out across 60 data distributions. There are three distribution methods. In a round-robin table, rows are distributed evenly across the data distributions. In a replicated table, the entire table gets stored on each of the 60 data distributions. It should only be used for small tables.

In a hash-distributed table, you designate one of the columns as the hash key. Then the hash function uses the value in this column to determine which data distribution to store a particular row on. Here are some characteristics of a good distribution column:

  • It has many unique values.
  • It’s frequently used in JOIN or GROUP BY clauses.
  • It’s not frequently used in WHERE clauses, because filtering on the distribution column can concentrate a query’s work on just a few distributions.

The default type of index for tables in SQL pools is a clustered columnstore index, and this type of table should usually use the hash distribution method.

In summary, fact tables should be hash-distributed; dimension tables should be replicated if they’re small or hash-distributed if they’re large; and staging tables should use round-robin distribution.
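These three choices are all made in the WITH clause of CREATE TABLE. The table and column names below are illustrative:

```sql
-- Fact table: hash-distributed, with the default clustered columnstore index.
CREATE TABLE dbo.FactSales (
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(10, 2)
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Small dimension table: replicated to all 60 distributions.
CREATE TABLE dbo.DimProduct (
    ProductId   INT,
    ProductName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);

-- Staging table: round-robin, stored as a heap for fast loading.
CREATE TABLE stg.Sales (
    SaleId     INT,
    CustomerId INT,
    Amount     DECIMAL(10, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
```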

In addition to distributing a table, you can also partition it by date range. The biggest benefit of doing this is that you can use partition switching, which lets you replace an old partition with a new one very quickly.
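As a rough sketch of what that looks like (the table names, boundary dates, and partition number are illustrative, and a switch requires both tables to have identical schemas and partition boundaries):

```sql
-- Partition the fact table by month on the date column.
CREATE TABLE dbo.FactSales (
    SaleId     INT,
    CustomerId INT,
    SaleDate   DATE,
    Amount     DECIMAL(10, 2)
)
WITH (
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01'))
);

-- Swap freshly loaded data into partition 2 almost instantly;
-- TRUNCATE_TARGET discards whatever the target partition held before.
ALTER TABLE stg.FactSales_New SWITCH PARTITION 2
    TO dbo.FactSales PARTITION 2
    WITH (TRUNCATE_TARGET = ON);
```

Because a switch is a metadata operation rather than a data copy, it completes quickly even on very large partitions.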

Before running a query, the SQL pool’s query optimizer decides which query plan to use. In order to choose the right plan, it needs to have statistics about important columns. SQL pools generate statistics automatically by default when they see that statistics needed for an incoming query are missing.

When you update statistics after new data is loaded, focus on the columns that matter most for query performance, such as date columns and columns used in JOIN, GROUP BY, ORDER BY, or DISTINCT clauses.
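Statistics are created and refreshed with standard T-SQL. A minimal sketch, assuming a hypothetical FactSales table with a SaleDate column:

```sql
-- Create single-column statistics on a column the optimizer relies on.
CREATE STATISTICS stats_SaleDate ON dbo.FactSales (SaleDate);

-- Refresh just that statistics object after a load...
UPDATE STATISTICS dbo.FactSales (stats_SaleDate);

-- ...or refresh every statistics object on the table.
UPDATE STATISTICS dbo.FactSales;
```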

You can investigate a query that takes a long time by retrieving its SQL statement, query plan, and the number of rows that came from each distribution. If a large number of rows came from one distribution, it might be caused by data skew. To fix data skew, use a more appropriate column as the hash key so the rows are distributed more evenly.
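The dedicated SQL pool's dynamic management views expose this information. A sketch of that investigation, where the request ID 'QID1234' is a placeholder you would replace with a real ID from the first query:

```sql
-- Find recent long-running queries and their SQL statements.
SELECT TOP 10 request_id, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

-- Look at the distributed plan steps for one suspect request.
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
ORDER BY step_index;

-- See how many rows each distribution handled in a given step;
-- one distribution with far more rows than the rest suggests data skew.
SELECT distribution_id, row_count
FROM sys.dm_pdw_sql_requests
WHERE request_id = 'QID1234' AND step_index = 2
ORDER BY row_count DESC;
```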

If you enable result set caching, then the results from most queries will be put in the cache, and they’ll be retrieved when someone runs the same query again, assuming the data hasn’t changed in the meantime.
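Result set caching is turned on per database, from the master database, and each query's cache behavior can be checked afterward. MySqlPool is a placeholder name:

```sql
-- Run against master: enable result set caching for the pool.
ALTER DATABASE MySqlPool SET RESULT_SET_CACHING ON;

-- Later, check whether recent queries were served from the cache.
SELECT request_id, command, result_cache_hit
FROM sys.dm_pdw_exec_requests;
```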

Data Discovery & Classification scans your database looking for sensitive data and lets you classify it as confidential, after which you can use database auditing to monitor access to it. Dynamic Data Masking will obscure some of the information in a particular column when it’s retrieved in a query. The Vulnerability Assessment service scans your database looking for potential security issues. Advanced Threat Protection looks for unusual attempts to access or exploit databases.

Transparent Data Encryption is used to encrypt the entire database, including the log files and the backups.

Row-level security lets you restrict each user’s or group’s access only to the rows they should be able to see in a table. Column-level security lets you restrict access to specific columns. 
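Row-level security is implemented with a predicate function and a security policy, while column-level security is just a column-scoped GRANT. A minimal sketch, assuming a hypothetical FactSales table with a SalesRep column holding the owning user's name:

```sql
-- Predicate function: a row is visible only to the sales rep it belongs to.
CREATE FUNCTION dbo.fn_SalesFilter (@SalesRep AS NVARCHAR(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed WHERE @SalesRep = USER_NAME();
GO

-- Bind the predicate to the table as a filter.
CREATE SECURITY POLICY dbo.SalesPolicy
ADD FILTER PREDICATE dbo.fn_SalesFilter(SalesRep) ON dbo.FactSales
WITH (STATE = ON);
GO

-- Column-level security: grant access to specific columns only.
GRANT SELECT ON dbo.FactSales (SaleId, Amount) TO AnalystRole;
```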

If you need to restore a SQL pool to a previous state, you can choose one of the automatic restore points or one of your own user-defined restore points to go back to.

Azure makes a geo-redundant backup once a day to a paired data center in a different region from the one where your SQL pool resides.

To learn more about Azure Synapse Analytics, you can take another one of our courses or labs, or you can read Microsoft’s documentation.

Please give this course a rating, and if you have any questions or comments, please let us know. Thanks!

About the Author
Guy Hummel
Azure and Google Cloud Content Lead
Learning Paths

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).