Amazon Redshift Spectrum
This course covers Amazon Redshift Spectrum, including what it is, what it does, how it works, and some points to take into consideration when using Redshift Spectrum.
- How to manage cold data in Redshift using Amazon S3
- What Amazon Redshift Spectrum is and does
- How Spectrum Queries work
- Supported data formats of Spectrum
- File optimization using Spectrum
- Amazon Redshift Spectrum Considerations
This course is intended for people who want to learn more about Amazon Redshift Spectrum and how it can be used to perform SQL queries on data stored in Amazon S3.
To get the most from this course, you should have a basic understanding of Amazon Redshift, Amazon Athena, AWS Glue, and data analytics concepts.
Spectrum Internals. Amazon Redshift's query processing engine works the same way for both internal and external tables. Redshift Spectrum queries are similar to Amazon Athena queries. The main difference is that Athena is fully serverless and skips the data warehouse entirely. In contrast, Spectrum is used as part of Amazon Redshift to perform complex data analytics and aggregations. The fact that Spectrum works with Redshift is probably its primary value proposition.
For people who are already running workloads on Redshift, Spectrum can expand the amount of data they can query to exabytes without needing to change or update their tools. Redshift Spectrum supports a number of different structured and semi-structured file formats. Depending on the format used, it is possible to do split reads. This means Spectrum can distribute the processing of a file across multiple independent requests, instead of having to read the entire file in a single request.
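The idea of a split read can be sketched in a few lines of Python. This is only an illustration of dividing one file into independent byte ranges. The function names and the 128 MB split size are assumptions made for the example, and the boundaries Spectrum actually uses depend on the file format.

```python
def split_ranges(file_size: int, split_size: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets that cover a file of file_size bytes."""
    if file_size < 0 or split_size <= 0:
        raise ValueError("file_size must be >= 0 and split_size must be > 0")
    return [
        (start, min(start + split_size, file_size) - 1)
        for start in range(0, file_size, split_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format one split as an HTTP Range header value (both offsets inclusive)."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    MB = 1024 * 1024
    # Hypothetical 300 MB file read in 128 MB splits: each range could be
    # fetched and processed by a separate request.
    for start, end in split_ranges(file_size=300 * MB, split_size=128 * MB):
        print(range_header(start, end))
```

Each range is independent of the others, which is what lets separate requests work on the same file at the same time.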
AWS recommends using a columnar format, like Apache Parquet or Apache ORC, when storing data in S3. Then, when reading data from S3, Redshift queries retrieve only the columns they need. Avoiding scans of unneeded columns saves on cost. Spectrum nodes are really just Redshift clusters hidden from view. They use the same massively parallel processing to perform queries.
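To see why a columnar layout reduces the amount of data scanned, here is a toy model in Python. It is not Parquet or ORC. It uses JSON-encoded sizes as a stand-in for bytes on disk, and the sample rows and function names are made up for the illustration.

```python
import json

# Hypothetical sample data, invented for this sketch.
ROWS = [
    {"order_id": 1, "customer": "alice", "region": "us-west-2", "amount": 19.99},
    {"order_id": 2, "customer": "bob", "region": "eu-west-1", "amount": 5.0},
    {"order_id": 3, "customer": "carol", "region": "us-east-1", "amount": 42.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def encoded_size(value):
    """Stand-in for storage size: length of the JSON encoding in bytes."""
    return len(json.dumps(value).encode("utf-8"))


def bytes_scanned_row_layout(rows):
    """In a row layout, every row is read in full, whatever columns the query needs."""
    return sum(encoded_size(row) for row in rows)


def bytes_scanned_columnar(columns, needed):
    """In a columnar layout, only the requested columns are read."""
    return sum(encoded_size(columns[name]) for name in needed)


if __name__ == "__main__":
    columns = to_columns(ROWS)
    # A query that only sums the "amount" column.
    print("row layout:", bytes_scanned_row_layout(ROWS))
    print("columnar:  ", bytes_scanned_columnar(columns, ["amount"]))
```

The more columns a table has and the fewer a query touches, the larger the gap between the two numbers becomes.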
There are two ways to optimize data saved in S3 for this parallel processing. First, use multiple files. If the file format or compression method does not support reading in parallel, break large files into smaller ones. AWS recommends file sizes between 64 megabytes and one gigabyte. Keep the file sizes consistent. This allows Redshift to distribute the workload evenly. If one node has to do more work because it was given a much larger file, the other nodes have to wait until it finishes before they can return the results. So that one node becomes a bottleneck.
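The effect of uneven file sizes can be shown with a small simulation. This is a sketch under stated assumptions: the greedy assignment of files to workers is a simple model chosen for the example, not how Redshift actually schedules work, and the function names and the 256 MB default target are hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def plan_even_files(total_bytes, target_bytes=256 * MB):
    """Split total_bytes into files of near-identical size, none larger than target_bytes."""
    if total_bytes <= 0 or target_bytes <= 0:
        raise ValueError("total_bytes and target_bytes must be positive")
    count = math.ceil(total_bytes / target_bytes)
    base, extra = divmod(total_bytes, count)
    # The first `extra` files carry one additional byte so the sizes sum exactly.
    return [base + 1 if i < extra else base for i in range(count)]


def slowest_worker_load(file_sizes, workers):
    """Assign each file (largest first) to the least-loaded worker; return the heaviest load."""
    loads = [0] * workers
    for size in sorted(file_sizes, reverse=True):
        loads[loads.index(min(loads))] += size
    return max(loads)


if __name__ == "__main__":
    even = plan_even_files(8 * GB, target_bytes=1 * GB)
    skewed = [5 * GB, 1 * GB, 1 * GB, 1 * GB]  # same total, one oversized file
    print("even files, heaviest worker (GB):  ", slowest_worker_load(even, 4) / GB)
    print("skewed files, heaviest worker (GB):", slowest_worker_load(skewed, 4) / GB)
```

Both data sets hold the same 8 GB, but with the skewed set one worker ends up with the 5 GB file while the others sit idle, which is the bottleneck described above.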
The other way is to use compression. Redshift Spectrum supports three types of compression: gzip, bzip2, and Snappy. Consult the documentation for more information on the file formats and compression types used by Redshift Spectrum. Redshift Spectrum can also transparently decrypt two types of encrypted data. The first is server-side encryption with an AES-256 key managed by Amazon S3 (SSE-S3). The second is server-side encryption with keys managed by the AWS Key Management Service (SSE-KMS). Spectrum does not support S3 client-side encryption.
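As a rough illustration of the compression point, the following Python sketch compresses a small, made-up CSV payload with gzip and bzip2 from the standard library and compares the sizes. Snappy is left out because it is not part of the Python standard library. The sample data and function names are assumptions for the example, and real compression ratios depend entirely on the data.

```python
import bz2
import gzip


def make_csv(rows=5000):
    """Build a hypothetical CSV payload as bytes (invented data for this sketch)."""
    lines = ["order_id,region,amount"]
    for i in range(rows):
        lines.append(f"{i},us-west-2,{i % 100}.00")
    return "\n".join(lines).encode("utf-8")


def compressed_sizes(data):
    """Return the size in bytes of the raw payload and of each compressed form."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(make_csv()).items():
        print(f"{name:>5}: {size} bytes")
```

Smaller objects mean less data transferred from S3 per query, at the cost of the CPU time needed to decompress them.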
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.