This course covers how to load your data efficiently using the COPY command in Amazon Redshift. You'll learn how to use it to load a single large file or multiple files into Amazon Redshift, the different features the COPY command has to offer, and what data upserting is.
Learning Objectives
- Understand the essentials of the COPY command in Redshift and its features
- Learn how to load a single large file or multiple files into Redshift
- Learn what upserting is and why it's important
Intended Audience
This course is intended for people who want to learn more about how to load their data efficiently using the COPY command.
Prerequisites
To get the most from this course, you should have a basic understanding of Amazon Redshift and data analytics concepts.
Amazon Redshift is a fully managed, cloud-native data warehouse from AWS designed for large-scale data storage and analysis.
Redshift's origins are in Postgres, but it has been heavily modified to handle big-data workloads. It is a columnar database with a shared-nothing architecture that does massively parallel processing.
Columnar databases are incredibly efficient at processing data that needs to be tabulated or totaled.
For example, inside Redshift, if you need to get the totals from a year's worth of sales data, those values already sit together in a single column, ready to be processed.
Redshift can execute a query against that column by selecting the starting and ending rows and then, quite literally, doing the math.
In a row-based database, this same operation is much slower because the engine has to read every full row, extract the needed value from each one, hold that data in memory, and then do the calculation.
This might not be a problem if there are 100 rows. However, when there are thousands of rows, or even millions, this type of query can be prohibitively expensive in terms of both time and money.
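To make that concrete, here is a minimal sketch of such an aggregate query, assuming a hypothetical sales table with sale_date and amount columns; a columnar engine like Redshift only needs to scan the columns named in the query, not entire rows:

```sql
-- Hypothetical schema: sales(sale_id, sale_date, amount, ...).
-- A columnar engine reads only the amount and sale_date columns to answer this.
SELECT SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31';
```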
When it comes to data entry, even though Redshift looks and feels like a relational database, it should not be treated like one. It is for large-scale data analytics and not individual transactions.
For example, when adding entire datasets to Amazon Redshift, the SQL INSERT statement is inefficient and should not be used.
Instead, to do any sort of bulk loading, use the COPY command. There are a number of reasons why this is the most appropriate way to get data into Redshift.
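As a sketch of what that looks like, the following COPY statement bulk loads CSV files from Amazon S3 into a table; the table name, bucket, and IAM role ARN are hypothetical placeholders:

```sql
-- Bulk load every file under the given S3 prefix into the sales table.
-- Bucket name and role ARN below are placeholders, not real resources.
COPY sales
FROM 's3://my-example-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;
```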
While the COPY command is usually the best way to get data into Redshift, there are times that the INSERT statement is appropriate. I'll get to that a little later in this course.
The reason that the COPY command is used to bulk load data is related to Redshift's architecture.
Redshift is a massively parallel processing database. A leader node evaluates queries and then distributes the work equally between the available compute nodes.
Simply inserting data a record or two at a time prevents Redshift from engaging its parallelism.
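For illustration, this is the record-at-a-time pattern being described; each statement is a separate round trip handled serially rather than spread across slices (the table and values are hypothetical):

```sql
-- Anti-pattern for bulk loading: one serial write per row.
INSERT INTO sales (sale_id, sale_date, amount) VALUES (1, '2024-01-15', 19.99);
INSERT INTO sales (sale_id, sale_date, amount) VALUES (2, '2024-01-15', 4.50);
```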
Diving deeper into how this parallelism works and why it's important, Redshift has a shared-nothing architecture. Each compute node is subdivided into slices and each slice has one or more dedicated CPU cores.
While node slices are outside the scope of this course, be aware that the number of slices per node depends on the type of node and, for efficient processing, each slice should do an equal amount of work.
When all the data is loaded from one single large file, or from multiple files that vary widely in size, it is possible that some slices will do more work than others.
This is a problem because the load as a whole only runs as fast as the slowest slice.
Using multiple INSERT statements puts too much load on a single node slice and, as a result, slows the entire cluster.
This is why, to load data into Amazon Redshift, it is best to use the COPY command.
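A common way to keep the slices evenly loaded is to split the data into multiple compressed files of roughly equal size, ideally a multiple of the number of slices in the cluster, and point COPY at their common key prefix. A minimal sketch, with hypothetical file and bucket names:

```sql
-- Files such as part-0001.csv.gz ... part-0016.csv.gz share the prefix below,
-- so COPY distributes them across the slices and loads them in parallel.
COPY sales
FROM 's3://my-example-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
GZIP
FORMAT AS CSV;
```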
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that while there are no shortcuts to certification, it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.