Loading Data into Amazon Redshift
This course covers how to load your data efficiently using the COPY command in Amazon Redshift. You'll learn how to load a single large file or multiple files into Amazon Redshift, the features the COPY command has to offer, and what data upserting is.
- Understand the essentials of the COPY command in Redshift and its features
- Learn how to load single or multiple large files into Redshift
- Learn what upserting is and why it's important
This course is intended for people who want to learn how to load their data efficiently using the COPY command.
To get the most from this course, you should have a basic understanding of Amazon Redshift and Data Analytics concepts.
Redshift's COPY command has a number of built-in features that make loading data more flexible:
- The source data can be in multiple formats from multiple data sources.
- It can control access to loading data.
- It will manage data transformations.
- It manages the load operation.
The formats include plain text files, compressed files, and encrypted files. While the COPY command has some intelligence about these files, it only goes so far. For example, it's possible to specify what type of delimiter is used to separate data. Please consult the documentation for details about the various input formats.
The COPY command requires only three parameters: a table name, a data source, and authorization to access the data. When copying Linux or Windows files from the command line, you specify the source first and then the destination. The Redshift COPY command reverses this: it starts with the destination table, followed by the source. That table must already exist in the database.
The COPY command cannot dynamically add tables. Once copied, the data is appended to the existing rows. Earlier, I mentioned that the COPY command has some limited intelligence. By default, COPY inserts field values into the target table's columns in the same order that the fields occur in the data files. If the default column order will not work, it is possible to specify a column list to map source data fields to the target columns.
When loading from DynamoDB, however, the column order does not matter. The COPY command matches attribute names in the DynamoDB table to column names in the Redshift table. Please see the documentation for more information.
The most common source for loading data into Amazon Redshift seems to be S3. Data can also be loaded from an Amazon EMR Cluster, DynamoDB, an EC2 instance, or remote hosts that are accessible using SSH. To copy from S3, the FROM parameter looks like this.
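A minimal sketch of an S3 load, using hypothetical table, bucket, and role names:

```sql
-- Load every file under the "data" prefix in mybucket into the sales table.
-- The table name, bucket, and role ARN are placeholders.
COPY sales
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```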
In this example, the COPY command loads all the files in the data folder inside mybucket. To load data from an Amazon EMR cluster the FROM parameter looks like this. Here, the Redshift table is loaded with TAB delimited data from lzop-compressed files. This is specified using the delimiter parameter. For more information about compression types and delimiters, please consult the AWS documentation.
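The EMR load described above might look like this; the cluster ID, output path, and role ARN are hypothetical:

```sql
-- "part-*" matches the part files produced by the EMR job.
-- DELIMITER '\t' handles the TAB-delimited data; LZOP handles the compression.
COPY sales
FROM 'emr://j-SAMPLE2B500FC/myoutput/part-*'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '\t'
LZOP;
```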
You can probably see the pattern. To use DynamoDB, the FROM statement starts with dynamodb://
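A DynamoDB load might be sketched like this (table names and role ARN are placeholders):

```sql
-- READRATIO caps the percentage of the DynamoDB table's provisioned
-- read capacity the load is allowed to consume.
COPY favoritemovies
FROM 'dynamodb://ProductCatalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;
```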
Using SSH is a little more complicated. The FROM statement specifies an SSH manifest file, and the COPY command must also include the SSH keyword. If it is missing, the COPY command will assume it is loading a text file from S3 and will fail.
The manifest file is a text file in JSON format that Redshift uses to connect to one or more hosts. It lists the SSH host endpoints and specifies the commands that will be executed on the hosts to return data to Amazon Redshift.
Optionally, include the host public key, the login user name, and a mandatory flag for each entry. For more information about the various options available using SSH, please refer to the AWS documentation.
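A manifest for a single host might look like this; the endpoint, command, key, and user name are placeholders:

```json
{
  "entries": [
    {
      "endpoint": "ec2-198-51-100-1.compute-1.amazonaws.com",
      "command": "cat /home/ec2-user/data/sales.txt",
      "mandatory": true,
      "publickey": "<public key contents>",
      "username": "ec2-user"
    }
  ]
}
```

The `command` is what Redshift runs on the remote host; its output on standard out becomes the data to load.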
Access for the COPY command can be managed in two ways: using roles or an IAM user's Access Key ID and Secret Access Key. When using an IAM user's Access Key ID and Secret Access Key, that user must be authorized to access the source data. To use the COPY command with an Access Key ID, include the credentials parameter. However, AWS discourages using an Access Key ID and Secret Access Key for anything other than an individual's personal use.
It is far too easy for these credentials to become outdated or--far worse--a security breach. If your day starts with the phrase, "It looks like someone's credentials were uploaded to Github," it is not going to be a good day for anyone.
To avoid this, use what AWS recommends for nearly all types of access: roles. It is also possible to specify an IAM role using the credentials parameter. For more information about IAM permissions with Redshift and the COPY command, please refer to the documentation.
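Specifying a role through the credentials parameter might look like this (the role ARN is a placeholder; the role must be attached to the cluster and permitted to read the source data):

```sql
-- aws_iam_role inside CREDENTIALS is an alternative to the IAM_ROLE parameter.
COPY sales
FROM 's3://mybucket/data/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole';
```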
As it loads the destination table, COPY will attempt to implicitly convert the strings in the source data to the data type of the target column. There are several types of conversion that can be explicitly specified. Please consult the documentation to learn about how they work.
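For example, date and timestamp parsing can be controlled explicitly; this sketch uses hypothetical names and the permissive 'auto' setting:

```sql
-- DATEFORMAT and TIMEFORMAT tell COPY how to parse source strings into
-- DATE and TIMESTAMP columns; 'auto' recognizes most common formats.
COPY sales
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DATEFORMAT 'auto'
TIMEFORMAT 'auto';
```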
There are also ways to change the default behavior of the data load operation. These behaviors include compression analysis, compression encodings, the maximum number of errors before causing a load to fail, testing the load, and table optimization after the load is completed. Again, please check the documentation for the updated list and examples of how these parameters work.
However, one of the operations is worth pointing out here: NOLOAD. Use NOLOAD to check the validity of the data file without actually loading the data. It's similar to doing a dry run. Use it to ensure the data file loads without error before doing the data load.
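A dry run of the earlier S3 load would simply add the NOLOAD keyword (names remain placeholders):

```sql
-- NOLOAD parses and validates the files but writes no rows to the table.
COPY sales
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
NOLOAD;
```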
As one last topic, I want to move away from the COPY command and address the idea of doing upserts.
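Redshift has no single upsert statement; a common pattern, sketched here with hypothetical table names and a hypothetical `id` key column, is to COPY into a staging table, delete the matching rows from the target, and insert the staged rows, all in one transaction:

```sql
BEGIN;

-- Stage the incoming data in a temporary table with the same structure.
CREATE TEMP TABLE stage (LIKE sales);
COPY stage
FROM 's3://mybucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';

-- Remove target rows that are about to be replaced...
DELETE FROM sales
USING stage
WHERE sales.id = stage.id;

-- ...then insert both the updated and the brand-new rows.
INSERT INTO sales
SELECT * FROM stage;

COMMIT;
```

Running the delete and insert inside one transaction keeps readers from ever seeing the table with the old rows removed but the new rows not yet inserted.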
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.