This course covers Amazon Redshift Spectrum, including what it is, what it does, how it works, and some points to take into consideration when using Redshift Spectrum.
- How to manage cold data in Redshift using Amazon S3
- What Amazon Redshift Spectrum is and does
- How Spectrum queries work
- Data formats supported by Spectrum
- How to optimize files for Spectrum
- Amazon Redshift Spectrum Considerations
This course is intended for people who want to learn more about Amazon Redshift Spectrum and how it can be used to perform SQL queries on data stored in Amazon S3.
To get the most from this course, you should have a basic understanding of Amazon Redshift, Amazon Athena, AWS Glue, and data analytics concepts.
Overview of Amazon Redshift Spectrum. How does Redshift Spectrum work? It almost seems like magic. Essentially, the data stored in S3 is formatted like a Redshift table and cataloged with something like AWS Glue. That's the high-level explanation of what happens. Amazon Redshift Spectrum nodes are dedicated Amazon Redshift servers managed by AWS that are independent of customer-provisioned clusters. Because of this, Redshift Spectrum queries use much less of a cluster's processing capacity than other queries: the compute-intensive work is pushed down to the Spectrum nodes.
Based on the demands of a query, Redshift Spectrum can potentially scale out to thousands of Spectrum nodes to take advantage of Redshift's massively parallel processing architecture. Redshift Spectrum tables are created by defining the structure of external files and then registering them as tables in an external data catalog. You can think of them as external tables. The data catalog can be AWS Glue, the data catalog that comes with Amazon Athena, or an Apache Hive metastore.
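To make the registration step concrete, here is a minimal sketch of the Redshift DDL involved. The schema name, Glue database, IAM role ARN, bucket path, and column definitions are all hypothetical placeholders, not values from the course.

```sql
-- Register an external schema backed by an AWS Glue data catalog.
-- (Role ARN, database, and names below are illustrative only.)
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'sales_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table whose data lives in S3, not on the cluster.
CREATE EXTERNAL TABLE spectrum_demo.sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    sale_time   TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sales/';
```

Once the table is registered this way, it appears in the external catalog and is visible to any tool connected to that catalog, not just the cluster that ran the DDL.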
External tables can be created and managed either by using Redshift's data definition language (DDL) commands or with any other tool that can connect to the external data catalog. Changes to the external data catalog are immediately available to Amazon Redshift. Optionally, external tables can be partitioned on one or more columns, and it's generally a good idea to partition data. When Redshift executes a query, it creates a query plan.
If the data is partitioned, the query plan will know what data it needs and, just as importantly, what data it can skip. This means defining partitions as part of the external table can improve performance. After the external tables have been defined, they can be queried and joined like any other Redshift table. Keep in mind that these tables are read-only; update operations are not possible. Redshift Spectrum tables can be added to multiple Amazon Redshift clusters, which means the same data stored in S3 can be queried by any Redshift cluster in the same AWS Region. When the Amazon S3 files are updated, the data is immediately available to query.
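The partitioning and join behavior described above can be sketched as follows. All table names, columns, dates, and S3 paths are hypothetical; the `users` table stands in for an ordinary local Redshift table.

```sql
-- A partitioned external table: the partition column (event_date)
-- is part of the S3 path layout, not stored in the data files.
CREATE EXTERNAL TABLE spectrum_demo.events (
    event_id BIGINT,
    user_id  BIGINT,
    payload  VARCHAR(256)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';

-- Each partition is registered explicitly (or discovered via a
-- Glue crawler) before it can be queried.
ALTER TABLE spectrum_demo.events
ADD PARTITION (event_date = '2023-01-15')
LOCATION 's3://example-bucket/events/event_date=2023-01-15/';

-- Filtering on the partition column lets the query plan skip every
-- other partition's files in S3. The external table joins with the
-- local Redshift table "users" like any other table.
SELECT u.user_name, COUNT(*) AS event_count
FROM spectrum_demo.events e
JOIN users u ON u.user_id = e.user_id
WHERE e.event_date = '2023-01-15'
GROUP BY u.user_name;
```

Note that the external table is read-only: there is no `INSERT`, `UPDATE`, or `DELETE` against it from Redshift; new data arrives by writing files to S3 and, if needed, adding partitions.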
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.