This course covers how to load your data efficiently using the COPY command in Amazon Redshift. You'll learn how to use it to load a single large file or multiple files into Amazon Redshift, the different features the COPY command has to offer, and what data upserting is.
Learning Objectives
- Understand the essentials of the COPY command in Redshift and its features
- Learn how to load a single large file or multiple files into Redshift
- Learn what upserting is and why it's important
Intended Audience
This course is intended for people who want to learn more about how to load their data efficiently using the COPY command.
Prerequisites
To get the most from this course, you should have a basic understanding of Amazon Redshift and Data Analytics concepts.
By using the COPY command, Redshift will automatically engage its built-in parallelism. Additionally, and this might seem counterintuitive, use only one COPY command even when working with multiple files or sources. I cannot stress this point enough.
Amazon Redshift automatically parallelizes data ingestion when using a single COPY command with multiple sources. Sources include both files and data streams. If, for example, I had three files in three different S3 buckets, using a single COPY command lets the leader node manage the data distribution between the compute nodes and available slices.
Basically, by using COPY to bulk load data into a table, Redshift optimizes the cluster's resources and produces the fastest possible throughput. The data is spread equally between the cluster's compute nodes. This includes sorting the rows and distributing data across node slices.
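As a rough sketch of that idea, assuming the split files share a common key prefix in a single hypothetical bucket, one COPY command picks up all of them and loads them in parallel (files spread across different buckets need a manifest, covered later). The table, bucket, and role names here are placeholders, not values from the course:

```sql
-- One COPY command, many files: every object whose key starts with the
-- 'sales/part-' prefix is loaded in parallel across the compute slices.
-- Table, bucket, and role names are placeholders.
COPY sales
FROM 's3://example-sales-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole';
```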
Sorting and vacuuming Redshift tables is outside the scope of this course. However, it's worth mentioning that loading data in sort key order reduces the need to vacuum. The vacuum process re-sorts rows and reclaims space left behind by deleted rows. Please consult the documentation for more information.
Amazon Redshift is designed to store and query large datasets. Large is a relative term that has multiple meanings based on context. For Redshift, large can mean datasets measured in petabytes. A petabyte is 1,024 terabytes, or 1,048,576 gigabytes. For perspective, a dual-layer Blu-ray disc holds 50 gigabytes. To store a petabyte, you'd need about 21 thousand of them.
Luckily, Amazon S3 can manage such large volumes of data. It is used to collect and stage data from multiple sources before running a COPY command. There are ways to make loading data efficient and reduce errors. If datasets span multiple files, use a manifest file to ensure that Redshift loads only the data required and that the data is loaded only once. A manifest can also be used to load files that have different prefixes or are from different buckets.
Consider this COPY command example. It has the three required elements: the table name, the data source, and the authorization method. The first line specifies the destination, the table that will store the data. The second line is the source, where the data currently exists. The third line specifies the IAM role to use when loading the data.
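A minimal sketch of those three elements, again using a placeholder table, bucket, and IAM role ARN rather than values from the course, might look like this:

```sql
-- Minimal COPY sketch: destination table, S3 source, and IAM role authorization.
-- Table, bucket, and role names are placeholders.
COPY sales
FROM 's3://example-sales-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole';
```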
If anyone accidentally puts a file in that sales bucket, the COPY command will try to load it. Outside of corrupting a production database, what could go wrong?
Likewise, if someone fails to put a file in that S3 bucket, the command will run but the load will be incomplete. Worse than that, you won't know that the data is missing.
If you are trying to avoid a Resume Building Event, using a manifest file ensures that only the required data is loaded and that all of it is accounted for. Here's the same example using a manifest file. The FROM clause specifies the manifest file, and the last line tells the COPY command that there's a manifest file. If you forget that last line, Redshift will treat the manifest file itself as the data source and, obviously, the copy will fail.
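A hedged sketch of that version, reusing the placeholder names from above and pointing FROM at a hypothetical manifest object:

```sql
-- Same placeholder names as above; the MANIFEST keyword tells COPY that the
-- FROM location is a manifest file listing the actual data files to load.
COPY sales
FROM 's3://example-sales-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole'
MANIFEST;
```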
Here's an example of a manifest file. It's written in JSON and loads data from three different buckets. The mandatory flag indicates whether the COPY command should terminate if that file does not exist.
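The exact files aren't shown in this transcript, but a manifest of that shape generally looks like the following sketch, with placeholder bucket and object names; each entry has a url and a mandatory flag:

```json
{
  "entries": [
    {"url": "s3://example-bucket-one/sales/part-0000.gz", "mandatory": true},
    {"url": "s3://example-bucket-two/sales/part-0001.gz", "mandatory": true},
    {"url": "s3://example-bucket-three/sales/part-0002.gz", "mandatory": false}
  ]
}
```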
Unwanted objects in these buckets are ignored because they're not in the manifest file. To engage the parallelism of Redshift, it is better to load data from multiple files instead of a single large one.
If the data is in a single file, Redshift is forced to perform a serialized load. Split large files into smaller files of roughly equal size. For optimal performance, the ideal file size is between 1 and 125 megabytes after compression. Additionally, to optimize the performance of the data load, consider the number of files being loaded. The total number of files should be a multiple of the number of slices in the cluster. This ensures that the load is balanced across the entire cluster.
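To put a number on that, and assuming you can query Redshift's system views, the STV_SLICES view reports one row per slice, so a quick count gives the multiple to aim for:

```sql
-- Count the slices in the cluster; the number of load files should be a
-- multiple of this value so every slice receives roughly the same work.
SELECT COUNT(*) AS slice_count
FROM stv_slices;
```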
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to Amazon Web Services certifications. He loves teaching and believes that there are no shortcuts to certification, but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.