Amazon Redshift is a fully managed petabyte-scale cloud data warehouse service offered by Amazon Web Services. It removes the overhead of months of efforts required in setting up the data warehouse and managing the hardware and software associated with it.
In this series of posts, we will be setting up a Redshift cluster, ingest some volume of data and play around with it. We will also take a look at some of the advanced options available such as understanding query plan to improve performance, workload management, cluster re-sizing, integration with other AWS Services.
Image courtesy: Amazon Web Services
Redshift based Cloud Data Warehouse Architecture
Let’s begin with a brief introduction of the Redshift architecture.
- Leader Node – the leader node parses the query, develops the query execution plan and distributes it to the compute nodes. The Leader Node is provisioned automatically by the service and is not billed
- Compute Node – this is the node that stores data and executes the query. Each Compute Node has its down compute, memory and storage
- Client Applications – client applications can be the standard ETL, BI and analytics tools
- Internal Networking – All the nodes are internally connected through a 10g network enabling faster data transfer between the nodes. The compute nodes are also not exposed to client applications. Client applications always talk to the Leader Node.
Here are some key features of Amazon Redshift:
In row-wise database storage (typically used in OLTP databases), data blocks store values sequentially for consecutive columns that make up a single row. This works for OLTP applications where most transactions read/write most of the columns in a row. Amazon Redshift employs columnar storage where data blocks store values of a single column of multiple rows. This means that reading the same number of column field values for the same number of records requires less I/O operations when compared to row-wise storage. This provides increased I/O performance and savings in storage space.
Redshift employs a Massively Parallel Processing (MPP) architecture that can distribute SQL operations across all available resources (nodes) resulting in very high query performance. A Redshift cluster comprises of a Leader Node automatically provisioned whenever there is more than one compute node. The leader node parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
The number of nodes in a Redshift cluster can be dynamically changed through the AWS Management Console or the API. We can add more nodes to the cluster for increased performance or if we need more storage. We can start with a single 160GB DW2. Large node and scale all the way up to a petabyte. During the scaling activity, the cluster is placed in a read-only mode and all the data is copied to a new cluster. Once the new cluster is fully operational, the old cluster is terminated and this process is entirely transparent to the clients. During this activity, the query performance can be slower.
Data stored in Redshift is automatically (by default) compressed. Compressed data reduce disk usage and data is uncompressed after loading it into memory during query execution. Since Redshift employs columnar storage, Redshift can apply appropriate compression encodings that are tied to the column type.
Redshift comes with loads of security features including:
- Virtual Private Cloud: You can launch Redshift within VPC and control access to the cluster through the virtual networking environment
- Encryption: Data stored in Redshift can be encrypted. This can be configured when creating the tables in Redshift
- SSL: To encrypt connections between clients and Redshift, SSL encryption can be used
- Data in transit encryption: Redshift uses hardware accelerated SSL while connecting to Amazon S3 or DynamoDB (during import, export, backup)
From backups to monitoring to applying patches to upgrades, Redshift is fully managed by AWS. Data stored in Redshift is replicated in all the cluster nodes and automatically backed up as Snapshots and stored (for a user-defined time period) in S3. Redshift continuously monitors the health of the cluster and automatically re-replicates data from failed drives and replaces nodes as necessary.
New on Cloud Academy: Red Hat, Agile, OWASP Labs, Amazon SageMaker Lab, Linux Command Line Lab, SQL, Git Labs, Scrum Master, Azure Architects Lab, and Much More
Happy New Year! We hope you're ready to kick your training in overdrive in 2020 because we have a ton of new content for you. Not only do we have a bunch of new courses, hands-on labs, and lab challenges on AWS, Azure, and Google Cloud, but we also have three new courses on Red Hat, th...
Cloud Academy’s Blog Digest: Azure Best Practices, 6 Reasons You Should Get AWS Certified, Google Cloud Certification Prep, and more
Happy Holidays from Cloud Academy We hope you have a wonderful holiday season filled with family, friends, and plenty of food. Here at Cloud Academy, we are thankful for our amazing customer like you. Since this time of year can be stressful, we’re sharing a few of our latest article...
Google Cloud Platform Certification: Preparation and Prerequisites
Google Cloud Platform (GCP) has evolved from being a niche player to a serious competitor to Amazon Web Services and Microsoft Azure. In 2019, research firm Gartner placed Google in the Leaders quadrant in its Magic Quadrant for Cloud Infrastructure as a Service for the second consecuti...
New Lab Challenges: Push Your Skills to the Next Level
Build hands-on experience using real accounts on AWS, Azure, Google Cloud Platform, and more Meaningful cloud skills require more than book knowledge. Hands-on experience is required to translate knowledge into real-world results. We see this time and time again in studies about how pe...
New on Cloud Academy: AWS Solution Architect Lab Challenge, Azure Hands-on Labs, Foundation Certificate in Cyber Security, and Much More
Now that Thanksgiving is over and the craziness of Black Friday has died down, it's now time for the busiest season of the year. Whether you're a last-minute shopper or you already have your shopping done, the holidays bring so much more excitement than any other time of year. Since our...
Understanding Enterprise Cloud Migration
What is enterprise cloud migration? Cloud migration is about moving your data, applications, and even infrastructure from your on-premises computers or infrastructure to a virtual pool of on-demand, shared resources that offer compute, storage, and network services at scale. Why d...
6 Reasons Why You Should Get an AWS Certification This Year
In the past decade, the rise of cloud computing has been undeniable. Businesses of all sizes are moving their infrastructure and applications to the cloud. This is partly because the cloud allows businesses and their employees to access important information from just about anywhere. ...
AWS Regions and Availability Zones: The Simplest Explanation You Will Ever Find Around
The basics of AWS Regions and Availability Zones We’re going to treat this article as a sort of AWS 101 — it’ll be a quick primer on AWS Regions and Availability Zones that will be useful for understanding the basics of how AWS infrastructure is organized. We’ll define each section,...
Application Load Balancer vs. Classic Load Balancer
What is an Elastic Load Balancer? This post covers basics of what an Elastic Load Balancer is, and two of its examples: Application Load Balancers and Classic Load Balancers. For additional information — including a comparison that explains Network Load Balancers — check out our post o...
Advantages and Disadvantages of Microservices Architecture
What are microservices? Let's start our discussion by setting a foundation of what microservices are. Microservices are a way of breaking large software projects into loosely coupled modules, which communicate with each other through simple Application Programming Interfaces (APIs). ...
Kubernetes Services: AWS vs. Azure vs. Google Cloud
Kubernetes is a popular open-source container orchestration platform that allows us to deploy and manage multi-container applications at scale. Businesses are rapidly adopting this revolutionary technology to modernize their applications. Cloud service providers — such as Amazon Web Ser...
AWS Internet of Things (IoT): The 3 Services You Need to Know
The Internet of Things (IoT) embeds technology into any physical thing to enable never-before-seen levels of connectivity. IoT is revolutionizing industries and creating many new market opportunities. Cloud services play an important role in enabling deployment of IoT solutions that min...