Amazon Redshift is a cloud-native data warehouse from AWS. It has a Massively Parallel Processing framework that automatically distributes data and the query load across every node available in a cluster. This course explains how Redshift distributes table data, how keys are used inside tables, and the importance of distribution styles.
- Understand the key concepts of data distribution
- Learn about the three types of distribution styles
- Understand the difference between distribution keys and sort keys
This course is intended for database administrators or anyone who wants to enhance their knowledge of Amazon Redshift.
To get the most from this course, you should have a basic understanding of Amazon Redshift.
Redshift tables can be spread across multiple compute nodes. One way to improve query performance is to define the appropriate distribution style for each table.
Amazon Redshift has three types of distribution styles for tables: EVEN, KEY, and ALL. Technically, there's a fourth. However, the fourth one requires understanding the first three.
What, then, is a distribution style?
In relational databases, table reads can be made more performant by defining and using an index.
Think of it like an index in a book. If you know what page a subject is on, you can turn to it quickly without having to search.
While an index is like having another table in a database, Amazon Redshift takes a more physical approach to the problem of improving queries.
Instead of having an index a the beginning of a book, each subject becomes its own book.
This is achieved by using a distribution style. The distribution styles along with an optional KEY value, tell Redshift how to spread the tables between the compute nodes.
In this type of distribution, the leader node distributes data to the node slices in a round-robin fashion.
This makes for the most even distribution across the Redshift nodes.
It is appropriate when the table is not going to be joined with any other tables.
The EVEN distribution is good for tables that have a large number of unique values. If every row is different there is no logical way to distribute the table.
As far as a use case, tables that are going to be connected to Business Intelligence tools like Tableau--without any type of filtering--are well suited to this distribution style.
With a KEY distribution, rows are distributed according to the values in a single column.
Redshift stores rows with the same key on the same node slice. The goal is to keep similar data close to improve performance.
If there is are other tables that share the same key, Redshift will attempt to put these tables on the same slice.
This is called colocation and it can drastically improve performance. The tables will need little--if any--redistribution. Not only does this save time, it reduces the amount of network bandwidth consumed. Bandwidth that could be used by other processes and queries.
However, depending on the size of the Redshift cluster and the amount of available storage it might not be possible to save tables that have the same KEY on the same slice.
If colocation is not possible, rows with the same values are distributed to other nodes in round-robin style.
KEY distribution is best when a table has multiple rows with a single identifier and this key value will be used for queries and joins.
If a KEY is heavily skewed--where one KEY value has a thousand rows but another value has a million--do not use the KEY distribution.
This is also true if the KEY has blank or NULL values. This would put these rows on the same node and cause a bad skew.
Remember, the Leader node returns the results of a query after all of the nodes have returned their data. The slice with the largest workload becomes a bottleneck.
Finally, before choosing to use a KEY distribution style, consider how the table is going to be used. If the table will not be used with queries that reference the distribution KEY or with table joins, then choose another distribution style.
When a table has a distribution style of ALL, the leader node copies the table to every available node in the cluster.
The ALL distribution ensures that every row is colocated for every join operation involving the table.
Essentially, using a distribution style of ALL copies the entire table to the first slice in every node.
In a Redshift cluster with four nodes, a table would be copied four times and, somewhat obviously, this means it takes up four times as much space.
Having a copy of the data on every node means that queries are fast but this speed comes at a cost. Loading, updating, deleting, and inserting data in these tables takes more time.
ALL distribution is appropriate only for static or relatively slow-moving tables that are used in joins or as lookup tables.
In general, because the cost of redistributing small tables during a query is low, there is no significant benefit to using a distribution style of ALL for them. A small table is less than 500 kilobytes.
Besides small tables, there are a couple of use cases where the ALL distribution should be avoided.
Do not use the ALL distribution for:
Tables that will be used once.
Tables are updated frequently.
Tables have 10 million or more rows.
Thank about that last one.
Do you really want to explain to anyone why a 10 million row table has been copied to every node?
Earlier, I said there are three distribution types; EVEN, KEY, and ALL. I added that there is--technically--a fourth type.
Though, it is less a fourth type and more like a magician doing a card trick.
Spoiler alert: Magic tricks aren't real. The playing cards--and other props--have to follow the same laws of physics that the rest of the world does.
The same is true of this fourth distribution type. It is called AUTO and is really a combination of the distribution types ALL and EVEN.
In 2019, AWS updated Amazon Redshift to make it possible for Redshift to automatically assign an optimal distribution style based on the size of the table data.
This feature was designed to provide better query performance and storage space utilization across nodes without needing to be configured.
When using the AUTO distribution, Redshift will initially assign ALL distribution to a new table.
When the table grows past a threshold, it automatically changes to an EVEN distribution.
However, it cannot do a KEY distribution.
When changing from ALL to EVEN distribution, storage utilization might be changed slightly. This distribution happens in the background and does not affect the performance queries or transactions taking place on the table.
If you are unsure of whether or not to use an ALL or EVEN distribution and do not need KEY, set the distribution style to AUTO or simply omit the distribution style. When omitted, the distribution style is the default.
This means that the AUTO distribution style is not really a distribution at all.
It's a bit of magic from AWS--or sleight-of-hand if you prefer--to help people manage key distributions. It's distribution styles for people that don't like figure out distribution styles.
As a final thought about the AUTO distribution, do not use it if you intend to use the table to do joins.
If you are going to do joins, you need a KEY distribution.
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.