Managing Cold Data

This course covers Amazon Redshift Spectrum, including what it is, what it does, how it works, and some points to take into consideration when using Redshift Spectrum.

Learning Objectives

  • How to manage cold data in Redshift using Amazon S3
  • What Amazon Redshift Spectrum is and what it does
  • How Spectrum queries work
  • Which data formats Spectrum supports
  • How to optimize files for use with Spectrum
  • Considerations to keep in mind when using Redshift Spectrum

Intended Audience

This course is intended for people who want to learn more about Amazon Redshift Spectrum and how it can be used to perform SQL queries on data stored in Amazon S3.


To get the most from this course, you should have a basic understanding of Amazon Redshift, Amazon Athena, AWS Glue, and data analytics concepts.


Managing cold data. Amazon Redshift is a cloud-native, petabyte-scale data warehouse from AWS. A petabyte is 1,000 terabytes, and Redshift can store up to two petabytes of raw data. A pair of issues comes with this much storage. One is simply conceptualizing that much data. The other, and probably the more practical problem, is dealing with cold data; that is, data that is infrequently accessed.

One way to deal with cold data is to simply delete it. Problem solved, right? Well, maybe. But if that data is needed, even once or twice a year, deleting it from the cluster will create a new problem. How will you restore it when you need it? Keeping it in the cluster, however, is expensive. I'm not sure that I know of anyone who wants to pay for cluster space for data that is rarely used and grows in size every year. To address this issue, AWS launched Redshift Spectrum in 2017.

Spectrum, as part of Amazon Redshift, uses SQL to query data that is stored as files in Amazon S3. More than that, it can do SQL joins between S3 data and tables on the Redshift cluster. It's hot and cold data working seamlessly together. As Arthur C. Clarke put it, "Any sufficiently advanced technology is indistinguishable from magic." To me, this is nothing short of magical. Putting data in S3 and treating it like a native Redshift table saves space on Redshift clusters and reduces storage costs. As a bonus, with more available space and less overhead from data distribution and networking, overall query performance improves.
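To make this concrete, here is a minimal sketch of what that workflow can look like. The schema, database, table, bucket, and IAM role names below are all hypothetical, and the example assumes the cluster has an IAM role with access to S3 and the AWS Glue Data Catalog.

```sql
-- Register an external schema backed by the AWS Glue Data Catalog.
-- The database name and IAM role ARN are placeholders.
CREATE EXTERNAL SCHEMA spectrum_archive
FROM DATA CATALOG
DATABASE 'sales_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files sitting in S3.
CREATE EXTERNAL TABLE spectrum_archive.orders_2018 (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_total  DECIMAL(10,2),
    order_date   DATE
)
STORED AS PARQUET
LOCATION 's3://my-archive-bucket/orders/2018/';

-- Join cold data in S3 with a hot table on the cluster.
SELECT c.customer_name,
       SUM(o.order_total) AS total_2018
FROM spectrum_archive.orders_2018 AS o
JOIN customers AS c
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
```

Once the external table is defined, it can be queried with the same SQL as any local table; only the `LOCATION` clause and external schema reveal that the data actually lives in S3.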

About the Author

Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that while there are no shortcuts to certification, it is possible to find the right path and course of study.

Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.

Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.

In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.