This course provides an introduction to Amazon Redshift. As well as gaining a theoretical understanding of the service, you will watch a guided demonstration showing you how to create an Amazon Redshift cluster.
- Understand the fundamentals of Amazon Redshift
- Learn about the architecture of the service
- Learn how to create an Amazon Redshift cluster
This course is intended for people who want to learn more about Amazon Redshift, the fastest and most widely used cloud data warehouse provided by AWS.
To get the most from this course, you should have a basic understanding of Data Analytics concepts.
Hello, and welcome to this lecture, where I will look at Amazon Redshift. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse. It is designed for high-performance analysis of information, is capable of storing and processing petabytes of data, and provides access to this data through your existing business intelligence tools using standard SQL. It operates as a relational database management system and is therefore compatible with other RDBMS applications. Redshift itself is based upon PostgreSQL 8.0.2, but it contains a number of differences from PostgreSQL. These differences are out of scope for this course, but for more information, please refer to the documentation here.
A data warehouse is used to consolidate data from multiple sources so that you can run business intelligence tools across your data. This helps you identify actionable business information, which can then be used to direct and drive your organization to make effective, data-driven decisions to the benefit of your company.
As a result, using a data warehouse is a very effective way to manage your reporting and data analysis at scale. A data warehouse, by its very nature, needs to be able to store huge amounts of data, and that data may be subjected to different data operations such as data cleansing, which, as an example, may identify, correct, replace, or remove incomplete records from a table or recordset.
This can be expanded upon by the need to perform an extract, transform, and load job, or ETL job. This is the common paradigm by which data from multiple systems is combined into a single database, data store, or warehouse for legacy storage or analytics.
Extraction is the process of retrieving data from one or more sources, such as online systems, brick-and-mortar stores, legacy data, Salesforce data, and many others. After retrieving the data, the ETL job does the compute work that loads it into a staging area and prepares it for the next phase.
Transformation is the process of mapping, reformatting, conforming, adding meaning, and more, to prepare the data in a way that is more easily consumed. One example of this is a computation where currency amounts are converted from US dollars to euros.
Loading involves successfully inserting the transformed data into the target database, data store, or in this case, a data warehouse. All of this work is processed in what business intelligence developers call an ETL job.
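To make the three ETL phases concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the source records, the `orders` table, and the fixed `USD_TO_EUR` rate are all made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical fixed exchange rate, for illustration only.
USD_TO_EUR = 0.90


def extract():
    """Extract: pull raw records from two made-up sources."""
    online_orders = [
        {"order_id": 1, "amount_usd": "100.00"},
        {"order_id": 2, "amount_usd": "250.50"},
    ]
    store_orders = [
        {"order_id": 3, "amount_usd": "40.00"},
        {"order_id": 4, "amount_usd": None},  # incomplete record
    ]
    return online_orders + store_orders


def transform(rows):
    """Transform: drop incomplete records and convert USD to EUR."""
    cleaned = []
    for row in rows:
        if row["amount_usd"] is None:
            continue  # data cleansing: skip records with no amount
        amount_eur = round(float(row["amount_usd"]) * USD_TO_EUR, 2)
        cleaned.append((row["order_id"], amount_eur))
    return cleaned


def load(conn, rows):
    """Load: insert the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_eur REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


def run_etl_job():
    """Run the three phases in order against an in-memory stand-in warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


conn = run_etl_job()
loaded = conn.execute(
    "SELECT order_id, amount_eur FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

In a real pipeline the load step would target the warehouse itself, and the staging area would usually be separate storage rather than Python lists.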
Now we have an understanding of what Amazon Redshift is. Let's move on to looking at the architecture of the service and the components that it is built upon.
Let me start with clusters and nodes. A cluster can be considered the main or core component of the Amazon Redshift service. Every cluster runs its own Redshift engine, which will contain at least one database. As the name implies, a cluster is effectively a grouping of another component, namely compute nodes.
Each cluster will contain at least one compute node. However, if the cluster is provisioned with more than one compute node, then Amazon Redshift will add another component called a leader node.
Compute nodes all contain their own quantity of CPU, attached storage, and memory, and there are different node types that offer different levels of performance. For example, the following RA3 node types. Also, as you can see here, the dense compute node types.
The leader node of the cluster has the role of coordinating communication between the compute nodes in your cluster and the external applications accessing your Redshift data warehouse. So the leader node is essentially the gateway into your cluster from your applications. When external applications query the data in your warehouse, the leader node will create execution plans containing code to return the required results from the database.
If the query from the external application references tables associated with the compute nodes, then this code is then distributed to the compute nodes in the cluster to obtain the required data, which is then sent back to the leader node. If the query does not reference tables stored on the compute nodes, then the query will run on the leader node only.
Each compute node itself is also split into slices, known as node slices. A node slice is simply a partition of a compute node, across which the node's memory and disk space are split. Each node slice then processes operations given by the leader node, and parallel operations can be performed across all slices and all nodes at once for the same query. As I mentioned previously, compute nodes can have different capacities, and these capacities determine how many slices each compute node can be split into.
When creating a table, it is possible to distribute the rows of that table across different node slices based upon how the distribution key is defined for the table. For a deeper understanding of how to select the best distribution style, please see the following link here.
When your Amazon Redshift database is created, you will of course connect to it using your applications. Typically, these applications will be the analytics and business intelligence tools that you're running within your organization. Communication between your BI applications and Redshift will use industry-standard Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) drivers for PostgreSQL.
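As a rough illustration of key-based distribution, the Python sketch below assigns rows to node slices by hashing one column. Everything here is an assumption made for the example: the node and slice counts, the `customer_id` column, and the use of CRC32, which is only a stand-in and not the hash function Redshift uses.

```python
import zlib
from collections import defaultdict

NODES = 2            # hypothetical cluster size
SLICES_PER_NODE = 2  # hypothetical; real slice counts depend on the node type
TOTAL_SLICES = NODES * SLICES_PER_NODE


def slice_for(key, total_slices=TOTAL_SLICES):
    """Map a distribution key value to a slice using a stable stand-in hash."""
    return zlib.crc32(str(key).encode("utf-8")) % total_slices


def distribute(rows, dist_key):
    """Group rows by the slice that their distribution key hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[dist_key])].append(row)
    return slices


rows = [{"customer_id": i % 5, "amount": 10 * i} for i in range(20)]
placement = distribute(rows, "customer_id")

for slice_id in sorted(placement):
    node_id = slice_id // SLICES_PER_NODE
    print(f"node {node_id}, slice {slice_id}: {len(placement[slice_id])} rows")
```

The property worth noticing is that rows sharing the same key value always land on the same slice, which is why the choice of key affects how evenly the work is spread.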
The performance that Amazon Redshift can generate is a huge benefit to many organizations. In fact, at the time of writing this course, AWS currently boasts that it's three times faster than other cloud data warehouses.
From a query perspective, Amazon Redshift has a number of features to return results quickly and effectively. Let's take a look at a few of them.
Firstly, massively parallel processing. As highlighted in the previous section, rows from tables are distributed across different node slices and nodes. This allows the leader node to generate execution plans that distribute queries from external applications across multiple compute nodes at once, allowing them to work together to generate the end result, which is then aggregated by the leader node.
Columnar data storage. This is used as a way of reducing the number of times the database has to perform disk I/O, which helps to enhance query performance. Reducing the data retrievals from the disk means there is more memory capacity to carry out in-memory processing of the query results.
Result caching. Caching in general is a great way to implement a level of optimization. Result caching helps to reduce the time it takes to carry out queries by caching some results of the queries in the memory of the leader node in a cluster. As a result, when a query is submitted, the leader node will check its own cached copy of the results, and if a successful match is found, the cached results are used instead of executing another query on your Redshift cluster.
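The fan-out and aggregate pattern described here can be sketched in a few lines of Python. This is only a toy model: the function names are hypothetical, threads stand in for node slices, and a simple sum stands in for a real query.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_partial_sum(slice_rows):
    """Work done by one node slice: sum only its own share of the rows."""
    return sum(slice_rows)


def leader_total(all_slices):
    """Leader node role: fan the work out, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=len(all_slices)) as pool:
        partials = list(pool.map(slice_partial_sum, all_slices))
    return sum(partials), partials


amounts = list(range(1, 101))
# Spread the rows across four hypothetical slices, round-robin.
slices = [amounts[i::4] for i in range(4)]

total, partials = leader_total(slices)
print(total, partials)
```

Each slice only touches its own rows, and the leader only combines the small partial results, which is the essence of why adding slices and nodes can shorten query time.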
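The check-the-cache-first behaviour can be sketched as follows. The `LeaderNode` class, its methods, and the whitespace-normalised cache key are all hypothetical choices for this example; the sketch also assumes cached results have to be dropped when the underlying data changes, which is modelled by an explicit `invalidate` call.

```python
class LeaderNode:
    """Toy leader node that caches query results in its own memory."""

    def __init__(self, run_on_cluster):
        self._run_on_cluster = run_on_cluster  # callable that executes the query
        self._result_cache = {}
        self.cluster_executions = 0

    def query(self, sql):
        # Normalise whitespace so trivially different spellings share a key.
        key = " ".join(sql.split())
        if key in self._result_cache:
            return self._result_cache[key]  # cache hit: skip the cluster
        self.cluster_executions += 1
        result = self._run_on_cluster(sql)
        self._result_cache[key] = result
        return result

    def invalidate(self):
        """Assumption for this sketch: clear the cache when data changes."""
        self._result_cache.clear()


# A made-up stand-in for the compute nodes.
leader = LeaderNode(lambda sql: [("row_count", 3)])

first = leader.query("SELECT COUNT(*) FROM orders")
second = leader.query("SELECT  COUNT(*)  FROM orders")
print(first == second, leader.cluster_executions)
```

The second call is answered from the leader's memory, so the compute nodes do no work for it.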
Amazon Redshift also integrates with Amazon CloudWatch, allowing you to monitor the performance of your physical resources, such as CPU utilization and throughput. In addition to this, Redshift also generates query and load performance data that enables you to track overall database performance. Any data relating to query and load performance is only accessible from within the Redshift console itself and not Amazon CloudWatch.
During the creation of your Redshift cluster, you can optionally select up to 10 different IAM roles to associate with your cluster. This allows you to grant the Amazon Redshift principal, redshift.amazonaws.com, access to other services on your behalf, for example, Amazon S3, where you might have a data lake. Accessing data within S3 requires a set of credentials to authorize Redshift's access to S3, and the best way to provide them is by using an IAM role. Therefore, if you intend to perform actions such as this when using your Amazon Redshift cluster, you need to consider which access you need and what roles you will need to create.
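As a sketch of what such a role involves, the Python below builds a trust policy that lets the redshift.amazonaws.com principal assume a role, and checks a list of role ARNs against the limit of 10 mentioned above. The helper names and the example ARN are hypothetical; the permissions policy that would actually grant access to S3 is not shown.

```python
import json

MAX_ROLES_PER_CLUSTER = 10  # limit stated in this lecture


def redshift_trust_policy():
    """Trust policy allowing the Redshift service principal to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def validate_role_arns(role_arns):
    """Reject role lists that exceed the per-cluster limit."""
    if len(role_arns) > MAX_ROLES_PER_CLUSTER:
        raise ValueError(
            f"{len(role_arns)} roles given, limit is {MAX_ROLES_PER_CLUSTER}"
        )
    return list(role_arns)


policy_json = json.dumps(redshift_trust_policy(), indent=2)
print(policy_json)

# Hypothetical role ARN, for illustration only.
roles = validate_role_arns(
    ["arn:aws:iam::123456789012:role/ExampleRedshiftS3ReadRole"]
)
```

Working out the roles before cluster creation means they can simply be selected during setup rather than attached afterwards.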
To learn more about IAM and roles, please see our existing course here. In the next lecture, I want to show you how to create a new Redshift cluster.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.