The course is part of this learning path
This course covers Amazon Redshift Spectrum, including what it is, what it does, how it works, and some points to take into consideration when using Redshift Spectrum.
Learning Objectives
- How to manage cold data in Redshift using Amazon S3
- What Amazon Redshift Spectrum is and does
- How Spectrum Queries work
- Supported data formats of Spectrum
- File optimization using Spectrum
- Amazon Redshift Spectrum Considerations
Intended Audience
This course is intended for people that want to learn more about Amazon Redshift Spectrum and how it can be used to perform SQL queries on data stored in Amazon S3.
Prerequisites
To get the most from this course, you should have a basic understanding of Amazon Redshift, Amazon Athena, AWS Glue, and data analytics concepts.
Amazon Redshift Spectrum Considerations. There are some things to remember when using Redshift Spectrum. I've mentioned a couple of them already, but they are important enough to be repeated. The Amazon Redshift cluster and the S3 bucket holding data must be in the same AWS region. When a Redshift cluster is configured to use Enhanced VPC Routing it will impact how security is implemented. This is because Enhanced VPC Routing forces specific traffic to go through a VPC instead of Amazon's network, and Redshift Spectrum runs on AWS-managed resources.
Access can be controlled to data in an Amazon S3 bucket by using a bucket policy attached to the bucket and using an IAM role attached to the cluster. Redshift Spectrum cannot access data stored in Amazon S3 buckets that use a bucket policy that restricts access only to specified VPC endpoints. Instead, use a bucket policy that restricts access only to specific principals, such as a specific AWS account or specific users.
For the IAM role that is granted access to the bucket, use a trust relationship that allows the role to be assumed only by the Amazon Redshift service principal. Then, when attached to a cluster, the role can be used only in the context of Amazon Redshift and can't be shared outside of the cluster. One benefit of using Amazon Redshift Enhanced VPC Routing is that all copy and unload traffic is logged in VPC flow logs. Traffic originating from Redshift Spectrum to Amazon S3 doesn't pass through a VPC, so it is not logged in the VPC flow logs.
When Redshift Spectrum accesses data in Amazon S3, it performs these operations in the context of the AWS account and respective role privileges. Log and audit Amazon S3 access using server access logging in AWS CloudTrail and Amazon S3. One last consideration regarding Enhanced VPC Routing involves the data catalog. Redshift Spectrum can access data catalogs in AWS Glue or Amazon Athena. To enable access to AWS Glue or Athena, configure a VPC with an Internet gateway, NAT gateway, or a VPC endpoint. Then, configure the VPC security groups to allow outbound traffic to the public endpoints for AWS Glue and Athena.
There are two things to be aware of here. One is that when using a VPC interface endpoint, communication between a VPC and AWS Glue is sent through the AWS network. The other is that endpoints are not free. It is impossible to perform UPDATE or DELETE operations on external tables. After all, they are objects stored in S3. However, there is a workaround. Create a new external table and insert only the values needed into it. The only way to control user permissions on an external table is using an AWS Glue Data Catalog that is enabled for AWS Lake Formation.
If access control is required, consider granting or revoking permissions on the external schema. It will have a similar effect of controlling user permissions. Also, regarding permissions, in order to use Redshift Spectrum, database users must have permission to create temporary tables in the database. There are a couple of ways to optimize query performance. I think one of the easiest ways is to use a columnar file format. This prevents Spectrum nodes from doing unnecessary scans.
Even though Redshift Spectrum nodes are queried using standard SQL, there are some commands, filters, and aggregations that will not work on them. If they're part of a query the query planner will not include the data stored in S3. For example, avoid the SQL commands, Distinct and Order By. Spectrum-friendly things are, Group By clauses, comparison conditions, and pattern matching conditions such as Like, and aggregate functions such as, Count, Sum, Average, Min and Max.
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.