Amazon Redshift Distribution Styles
Amazon Redshift is a cloud-native data warehouse from AWS. It has a Massively Parallel Processing framework that automatically distributes data and the query load across every node available in a cluster. This course explains how Redshift distributes table data, how keys are used inside tables, and the importance of distribution styles.
- Understand the key concepts of data distribution
- Learn about the three types of distribution styles
- Understand the difference between distribution keys and sort keys
This course is intended for database administrators or anyone who wants to enhance their knowledge of Amazon Redshift.
To get the most from this course, you should have a basic understanding of Amazon Redshift.
The thing to remember is that the distribution style determines where data gets physically stored across the available nodes.
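Queries that filter or join on the column that decides where data is stored can be highly performant, because each node works on its own local data instead of pulling rows from other nodes.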
This is not a magic wand with Redshift, as there will always be queries that work against a distribution style.
The goal is that, through careful design and constant revision, most queries work with Redshift's distribution rather than against it.
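To make the idea of physical placement more concrete, here is a minimal Python sketch of how rows might land on nodes under EVEN, KEY, and ALL distribution. It is a simplified model, not Redshift's implementation: the function `place_rows`, the four-node cluster, the sample `orders` rows, and the use of `zlib.crc32` as the hash are all assumptions made for illustration. Redshift's real hash function is internal, and it actually distributes rows to slices within each node.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def place_rows(rows, style, key=None, num_nodes=NUM_NODES):
    """Return {node_index: [rows]} for a simplified distribution style."""
    nodes = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin: rows are dealt out regardless of their values.
            nodes[i % num_nodes].append(row)
        elif style == "KEY":
            # Rows with the same key value always land on the same node.
            # crc32 is a stand-in; Redshift's real hash is internal.
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            nodes[h % num_nodes].append(row)
        elif style == "ALL":
            # Every node keeps a full copy of the table.
            for n in range(num_nodes):
                nodes[n].append(row)
        else:
            raise ValueError(f"unknown style: {style}")
    return dict(nodes)


# Hypothetical sample data: 12 orders spread over 3 customers.
orders = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

for style in ("EVEN", "KEY", "ALL"):
    placement = place_rows(orders, style, key="customer_id")
    counts = {node: len(rows) for node, rows in sorted(placement.items())}
    print(style, counts)
```

In this model, EVEN gives every node the same number of rows, ALL gives every node the whole table, and KEY groups rows by `customer_id`. With only three distinct key values and four nodes, at least one node stays empty under KEY, which hints at why the choice of key matters.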
Here are some things to remember about distribution styles.
The most important consideration is to pick KEY distribution only if the key column is actually going to be used in queries or joins.
A table can have only one distribution key, so you need to be sure that queries benefit from using it.
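One way to see why the key should match the join column is to count how many matching row pairs end up on different nodes. The sketch below is a toy model under stated assumptions: `node_for`, `cross_node_matches`, the four-node cluster, and the `customers` and `orders` tables are hypothetical, and `zlib.crc32` again stands in for Redshift's internal hash.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(value, num_nodes=NUM_NODES):
    # Stand-in hash; Redshift's actual hashing is internal.
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def cross_node_matches(left, right, left_key, right_key, join_col):
    """Count joined row pairs whose two rows sit on different nodes."""
    moved = 0
    for l_row in left:
        for r_row in right:
            if l_row[join_col] == r_row[join_col]:
                if node_for(l_row[left_key]) != node_for(r_row[right_key]):
                    moved += 1
    return moved


# Hypothetical sample tables.
customers = [{"customer_id": c, "region": c % 2} for c in range(8)]
orders = [{"order_id": o, "customer_id": o % 8} for o in range(40)]

# Both tables distributed on the join column: every match is local.
local = cross_node_matches(
    customers, orders, "customer_id", "customer_id", "customer_id"
)
print(local)  # 0

# Orders distributed on order_id instead: some matches span nodes.
spread = cross_node_matches(
    customers, orders, "customer_id", "order_id", "customer_id"
)
print(spread)
```

When both tables share the join column as their distribution key, identical values hash to the same node, so the count is zero by construction. Distributing `orders` on `order_id` leaves matching rows scattered, which in a real cluster would mean moving data between nodes before the join can run.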
To illustrate the impact of skew, consider this table containing product inventory. The product ID is the distribution key.
The leader node will not return its results until all of the compute nodes have completed their processing.
Nodes 3 and 4 will complete much faster than 1 and 2. Node 1 is the bottleneck.
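The bottleneck effect can be sketched with a little arithmetic. The row counts and the scan rate below are hypothetical numbers chosen only to mirror the situation described here (node 1 holding far more rows than nodes 3 and 4), and the model assumes every node processes rows at the same rate.

```python
# Hypothetical row counts per compute node for a skewed table.
rows_per_node = {1: 700_000, 2: 200_000, 3: 50_000, 4: 50_000}
ROWS_PER_SECOND = 100_000  # assumed identical scan rate on every node


def node_times(rows_by_node, rate=ROWS_PER_SECOND):
    """Seconds each node needs to process its share of the rows."""
    return {node: rows / rate for node, rows in rows_by_node.items()}


def query_time(rows_by_node, rate=ROWS_PER_SECOND):
    # The leader waits for every node, so the slowest node sets the pace.
    return max(node_times(rows_by_node, rate).values())


def skew_ratio(rows_by_node):
    # Largest node's row count relative to a perfectly even split.
    average = sum(rows_by_node.values()) / len(rows_by_node)
    return max(rows_by_node.values()) / average


total = sum(rows_per_node.values())
balanced = {node: total // len(rows_per_node) for node in rows_per_node}

print(query_time(rows_per_node))  # 7.0 seconds, set by node 1
print(query_time(balanced))       # 2.5 seconds with an even split
print(skew_ratio(rows_per_node))
```

Under these assumptions the same one million rows take 7.0 seconds when skewed and 2.5 seconds when balanced, even though nodes 3 and 4 finish in half a second and then sit idle.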
The distribution style is about maximizing the efficiency of a query.
If you are unsure about what distribution style to use, you are not alone. It's probably why AWS introduced the distribution style of AUTO.
Remember that each node has its own CPU and storage.
In the best-case scenario, where data is evenly distributed in both volume and variety across the available nodes, Redshift will automatically divide the workload to process the query.
In a worst-case scenario, the same thing happens. Redshift automatically divides the workload to process the query.
My point is that sometimes it is not worth investing the time and effort needed to optimize a table for a single query.
However, in the long term, time, effort, and money can be saved using the appropriate distribution style or styles.
That's it for this course. For Cloud Academy, I'm Stephen Cole.
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.