Loading Data into Amazon Redshift
The course is part of these learning paths
This course covers how to load your data efficiently using the COPY command in Amazon Redshift. You'll learn how it can be used to load multiple or a single large file into Amazon Redshift, as well as the different features that the COPY command has to offer and what data upserting is.
- Understand the essentials of the COPY command in Redshift and its features
- Learn how to load single or multiple large files into Redshift
- Learn what upserting is and why it's important
This course is intended for people that want to learn more about how to load their data efficiently using the COPY command.
To get the most from this course, you should have a basic understanding of Amazon Redshift and Data Analytics concepts.
When I first saw the word upsert, I thought it was a typographical error.
It turns out that upsert--a portmanteau of Update and Insert--is indeed a thing and an important concept to understand.
I've explained how the COPY command appends new data to the end of a table without modifying existing rows.
This is great for adding new data.
However, what happens when a table needs to have existing data updated as well as new data added at the same time?
The COPY command only appends data to existing tables. It does not have any built-in logic to check for existing records.
To make things more complicated, it's possible--by design--to have duplicate data in a Redshift database.
For relational databases the opposite is true; using primary keys ensures that data is not duplicated.
However, in Amazon Redshift, while tables can have primary keys they are NOT enforced.
Put simply, in Amazon Redshift, primary keys are ignored and loading data with the COPY command can--and often does--result in duplicates.
To prevent data duplication, perform an UPSERT operation programmatically using SQL.
An UPSERT merges data by updating existing records and adding new ones.
Some relational databases support a single UPSERT command. Redshift does not. It must be done programmatically using temporary staging tables.
The process works like this...
- Create a temporary table configured like the destination table in Redshift.
- COPY the records to this staging table.
- JOIN the staging table with the target table.
- UPDATE the existing data in the Redshift table.
- INSERT new data to the Redshift table.
- Drop the staging table.
The trick--if you can call it that--is to do this all at once with a single COMMIT to Redshift.
To illustrate, I've created some simple SQL that walks the process with a simple SALES table.
While SQL convention is to use all capital letters, it is not required.
To aid with some of the visuals and distinguish between the permanent Redshift table and the temporary staging table, I've kept the SQL statement and Redshift tables in UPPERCASE and the temporary tables lowercase.
This is what it looks like.
Let's go through this SQL code together.
The double-dash (--) is a single-line comment. Anything after them is ignored.
The BEGIN keyword starts a statement block.
SQL statement blocks are used when a number of statements need to be treated as a one transaction with a single commit.
Statement blocks are sometimes called a batch.
The SQL statements run but are not committed to Redshift until either an END or COMMIT statement is reached.
Inside the statement block, the statement...
CREATE TEMPORARY TABLE temp_sales (LIKE SALES);
...does exactly what it sounds like it is doing.
It creates a temporary table called temp_sales.
The LIKE statement means it inherits its schema information from the SALES table. This includes the distribution key and sort key attributes.
The COPY statement loads data into the staging table. In this example, it's being loaded from S3.
The UPDATE statement compares temp_sales to SALES.
This statement includes an alias. Aliases are optional. The AS keyword is optional too.
UPDATE SALES AS S
...is equivalent to…
UPDATE SALES S
I like having the AS keyword because it makes the SQL more readable.
Someday, someone is probably going to look at your code to update it, so why not make their life a little easier?
The UPDATE statement compares the productid fields in the two tables and then updates the matching records in the Redshift SALES table.
Then, the INSERT statement does a LEFT JOIN to make some magic happen.
The LEFT JOIN returns all the rows from the left table and the matching rows from the right table.
The magic needed is to find the data in the staging table that does not exist in the Redshift table.
It's a fancy SQL-based Venn diagram. Rather than the overlap, I need to know what's missing.
After the JOIN, if there is no match, the result is NULL.
Using those NULL fields, the INSERT operation adds the new data from the staging table.
As the saying goes:
Any sufficiently advanced technology is indistinguishable from magic.
--Arthur C. Clarke
When the END statement is reached, the statement block is closed and changes are committed to Redshift as a single transaction.
If it fails, Redshift will automatically roll back the transaction to maintain the integrity of the database.
Also, I want to point out that, while the merge operation is a single transaction, the creating and dropping of the temporary staging table is separate from it.
Sorting and vacuuming are expensive operations. I mention this fact because they should be done only when needed.
As it happens, if you use an UPSERT method to COPY new data into a table, the table needs to be sorted.
Since UPSERT performs a DELETE, it may leave the table in an unsorted state. When a record is deleted from a Redshift table, it is only marked for deleting, it is not automatically removed.
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.