Amazon Redshift is a cloud-native data warehouse from AWS. It has a Massively Parallel Processing framework that automatically distributes data and the query load across every node available in a cluster. This course explains how Redshift distributes table data, how keys are used inside tables, and the importance of distribution styles.
Learning Objectives
- Understand the key concepts of data distribution
- Learn about the three types of distribution styles
- Understand the difference between distribution keys and sort keys
Intended Audience
This course is intended for database administrators or anyone who wants to enhance their knowledge of Amazon Redshift.
Prerequisites
To get the most from this course, you should have a basic understanding of Amazon Redshift.
An Amazon Redshift cluster is a set of one or more compute nodes.
Each node in the cluster has its own operating system, dedicated memory, and dedicated disk storage.
If there are two or more compute nodes in the cluster, Redshift creates a Leader node. This node manages the distribution of data and query processing tasks to the compute nodes. It also manages communication between client applications and the compute nodes.
Client applications only interact with the leader node. It is impossible to communicate, directly, with the compute nodes.
The compute nodes provide resources to execute queries. Once complete, they send the results back to the leader node. The leader node takes the responses from the compute nodes, combines them as needed, and returns the results to the client that made the request.
The disk storage for a compute node is divided into slices. The number of slices per node depends on the node size of the cluster.
When data is loaded into a table, Amazon Redshift distributes the rows of the table to each of the node slices according to the table's distribution style.
Before running a query, the Redshift creates a query plan on the leader node. This plan interprets the query to determine an estimate of the time and resources needed for processing. It also determines where the blocks of data need to be in order to efficiently run the query.
Data is then physically moved, or redistributed, while the query runs.
Redistribution can involve different tasks. It could send specific rows to nodes for joining or it could broadcast an entire table to all of the nodes.
Be aware that data redistribution can account for a substantial portion of the cost of a query.
The network traffic redistribution generates can affect other database operations and slow overall system performance.
When creating tables, it's important to minimize the need to redistribute data for queries.
When data is loaded into a table, Redshift automatically distributes the table's rows to the compute nodes and slices according to the distribution style that was chosen when the table was created.
This distribution style has two primary goals.
One goal is to ensure workloads run uniformly among the nodes in the cluster.
Uneven distribution--also called data distribution skew--forces some nodes to do more work than others and impacts query performance.
The other goal is to minimize data movement.
If the rows that participate in joins or aggregates are already on the same nodes as the joining rows in other tables, the optimizer doesn't need to redistribute much data.
The distribution strategy has important consequences for query performance, storage requirements, data loading, and maintenance.
By choosing the best distribution style for each table, data distribution will be balanced and this will significantly improve overall system performance.
Determining the amount of data skew is outside the scope of this course.
However, to view the distribution style of a table, query the PG_CLASS_INFO view or the SVV_TABLE_INFO view.
The output will vary as it depends on the table structure. However, using that sample query, the results could look like this.
The default schema in Redshift is named PUBLIC. It has read and write access for all users.
While schemas are out of the scope of this course, be aware that it is possible to have more than one and that users should have access to only those schemas they need.
Consult the Redshift documentation for more information.
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.