In this course, we outline the key storage options for big data solutions. We examine data access and retrieval patterns, and some of the use cases that suit particular data patterns, such as evaluating mechanisms for the capture, update, and retrieval of catalog entries. We learn how to determine appropriate data structures and storage formats, and how to determine and optimize the operational characteristics of a big data storage solution.
Learning Objectives
- Recognize and explain big data access and retrieval patterns.
- Recognize and explain appropriate data structure and storage formats.
- Recognize and explain the operational characteristics of a Big Data storage solution.
Intended Audience
This course is intended for students looking to increase their knowledge of the AWS storage options available for Big Data solutions.
Prerequisites
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud storage solutions. Our courses on AWS Storage Fundamentals and AWS Database Fundamentals will give you a solid foundation for this course.
Updates
Amazon Aurora is now MySQL and PostgreSQL-compatible.
And welcome to Big Data on AWS. Today we're gonna look at storing data with Amazon Redshift. At the end of this module you will be able to describe in detail how Amazon Redshift can be used to store data within a big data solution. In previous modules, we've covered Amazon S3, DynamoDB, and Amazon RDS, and now we're gonna look at how Amazon Redshift works and how you can use it in some big data scenarios.
Amazon Redshift is primarily designed to store data and provide access to this data via a myriad of business intelligence and query tools. But you can also process data while it's contained within Amazon Redshift via the use of third-party ETL tools, SQL commands executed manually within Amazon Redshift, or via user-defined functions written in Python. When processing transformations in Amazon Redshift, it is far more efficient to undertake what is termed ELT, where we extract, load, and then transform the data within the database. This means that the transformations are executed within Redshift itself, compared to ETL, which is where we extract the data, transform it in third-party tools or engines, and then load the data into Amazon Redshift.
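To make the ELT idea concrete, here is a minimal sketch, assuming the raw records have already been landed in a hypothetical staging_sales table; the transformation then runs entirely inside Amazon Redshift:

```sql
-- ELT: the data is already loaded, so the transform runs inside Redshift itself.
-- staging_sales and sales_summary are hypothetical tables used for illustration.
INSERT INTO sales_summary (sale_year, total_amount)
SELECT EXTRACT(year FROM sale_date) AS sale_year,
       SUM(amount)                  AS total_amount
FROM staging_sales
GROUP BY EXTRACT(year FROM sale_date);
```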
When choosing a big data storage solution from within the AWS service offerings, it's important to determine whether the data sources we are primarily storing contain structured, semi-structured, or unstructured data. This will typically drive the decision on which AWS service is the best for the data pattern or use case. Amazon Redshift is primarily designed to manage structured data. As well as storing data, Amazon Redshift is also able to process and transform the data within the database.
When choosing a big data processing solution from within the available AWS service offerings, it's important to determine whether you need the latency of response from the process to be in seconds, minutes, or hours. This will typically drive the decision on which AWS service is best for that processing pattern or use case. Amazon Redshift is primarily designed to deliver batch-oriented processing. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse, designed to store and process large volumes of data and provide access to this data using your existing business intelligence tools. Amazon Redshift is based on a clustered architecture, often referred to as a grid architecture.
The core container of an Amazon Redshift data warehouse is called a cluster. A cluster is composed of one or more compute nodes. Compute nodes are the worker nodes; they execute the SQL code that transforms or presents the data. If you have two or more compute nodes, an additional leader node is automatically introduced into your cluster. Your applications will then only communicate with the leader node, and the leader node compiles and distributes the query code to the applicable compute nodes, to be executed where required. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation, the results of which are then returned to your applications.
This approach allows you to leverage the increased compute power that is gained by adding additional compute nodes to your cluster. Adding additional compute nodes to a cluster is called horizontal scaling, compared to making each compute node larger by adding additional CPUs, memory, and disk, which is called vertical scaling. These compute nodes are transparent to your application. Your application never sees them. In a multi-node cluster, only the leader node communicates with the compute nodes. Each compute node is partitioned into slices.
Each slice is allocated a portion of the compute node's memory and disk. You can think of a slice like a virtual compute node. The leader node manages the distribution of a subset of data to each slice, and also manages allocating queries to the relevant slice which holds the data. This allows queries to be run in parallel on a compute node, reducing the latency by which a result is returned, and more effectively leveraging the CPU, memory, and data available to each compute node. The number of slices available is automatically determined by the size and type of the Redshift compute node you provision. For example, for dense storage compute nodes, the extra large node has two slices per node, while the eight extra large has 16 slices per node. This is because the eight extra large obviously has more CPU, memory, and disk available, so it can be divided into a larger number of slices.
When you create a data table, you can optionally specify one column in this table as the distribution key. When you load data into the table, the rows are distributed to the slices according to the distribution key that is defined. So for example if you have a distribution key based on year, then the data for each year will be distributed across different node slices.
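Carrying on with the year example, a table defined along the following lines (a hypothetical sketch, not a definition from the course) would have its rows hashed on the sale_year column and spread across the slices:

```sql
-- Rows are distributed to slices by hashing the distribution key column
CREATE TABLE sales_by_year (
  sale_year INTEGER,
  region    VARCHAR(32),
  amount    DECIMAL(12,2)
)
DISTKEY (sale_year);
```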
Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries more efficiently, meaning you get better performance. When you run a cluster with at least two compute nodes, data on each node will always be mirrored on disks on another node, and this reduces the risk of incurring data loss if a node becomes unavailable. The mirroring does not consume any of the usable disk capacity on the secondary node. So for example, if you provision a compute node with two terabytes of disk available, any data that is mirrored on this node from another compute node does not reduce the two terabytes available in any way. You can also have a single node Redshift cluster.
Of course this cluster will have no data duplication built in as it is a single node, so Amazon recommends a minimum of two nodes for production environments. An Amazon Redshift cluster can contain one or many databases. Databases are a way of logically grouping tables, which hold data and can be queried. So for example, you might create a database to hold human resources data, and another database to hold finance data. Some queries are distributed and executed on the compute nodes, and other queries must be executed exclusively on the leader node. This is because some SQL functions are supported only on the leader node, and are not supported on the compute nodes.
A query that references only catalog tables (the tables that start with a PG prefix, such as PG_TABLE_DEF), or that does not reference any tables, runs exclusively on the leader node. An example of that is the CURRENT_SCHEMA function. When a query references user-created tables or system tables (tables that have an STL or STV prefix, and system views that have an SVL or SVV prefix), the leader node will pass these queries to the compute nodes for execution. So effectively, if you're querying data, that executes on the compute nodes. If you're querying system metadata, then it executes on the leader node.
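To make that routing concrete, here are a few illustrative queries; the sales table is a hypothetical user table:

```sql
-- No table referenced: runs exclusively on the leader node
SELECT CURRENT_SCHEMA();

-- References only a PG catalog table: also leader-node only
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'public';

-- References a user table: the leader node distributes this query to the compute nodes
SELECT COUNT(*) FROM sales;
```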
Remember, queries that use a leader node function must execute exclusively on the leader node, not on the compute nodes, or you will receive an error, and nobody likes errors. So let's have a look at how you provision an Amazon Redshift cluster. As you can see, the first choice is to select your node type, which determines the number of virtual CPUs, the amount of memory, and the volume of storage that will be provisioned.
The next choice is the number of nodes in the cluster. And that's pretty much it. Amazon makes standing up your Redshift cluster as easy as any other Amazon service. A number of different types and sizing options exist for your Redshift nodes. There are two types of Redshift nodes: the dense storage node types start with ds, and are optimized for storing large volumes of data. The dense compute node types, you guessed it, start with dc, and are optimized for faster access to smaller volumes of data. The dense storage node types are optimized for large data workloads and use standard hard disk drives for storage. Within the ds node types, there are four different categories. You can choose between the extra large and the eight extra large nodes. The decision on which to use is driven by the amount of disk you need to have available: two terabytes for the extra large, or 16 terabytes for the eight extra large.
There is a limit to how many nodes you can attach to a cluster, which we will discuss soon, so please note the total capacity limit in the table. If you're going to exceed 64 terabytes of storage, then you will need to use the eight extra large nodes. Also note that unlike other services such as EC2, there are no node sizes between extra large and eight extra large; you only have the two choices. After disk requirements, the other variable that you need to use to determine which node size to use is the amount of virtual CPUs and memory required. The ds2 options have more virtual CPUs and memory allocated than the ds1 options, so they will have a higher level of performance, but of course, they also cost more. The dense compute dc nodes are optimized for performance as they use solid state drives, SSDs, which give a much higher level of performance due to increased I/O throughput. But as they are using SSDs, they provide a much smaller storage size compared to the HDD nodes.
For example, the ds extra large has two terabytes of disk per node available, while the dc extra large only has 164 gigabytes. This typically means you will not use the dc nodes in a production big data environment, as the data volumes will typically exceed this capacity. The node type that you choose depends heavily on the amount of data you are loading into Amazon Redshift, the complexity of the queries and operations that you are running, and the response time needs of the downstream applications that are accessing this data.
You can of course transition from a smaller node size to a larger node size in the future if required, one of the many benefits of AWS's elastic scalability. One important thing to note is that the minimum number of nodes for a cluster using eight extra large nodes is two nodes. You cannot provision a single node Amazon Redshift cluster if you are using the eight extra large node size. Amazon Redshift achieves extremely fast query execution by employing these performance features. Massively parallel processing, or MPP, is the grid capability we discussed earlier, where you can add additional compute nodes to horizontally scale the environment.
This ability to spread the data and the query execution across multiple nodes results in improved performance. Amazon Redshift uses a columnar storage approach when storing the database table information. This columnar approach results in reduced disk I/O requests and reduces the amount of data that is read from the disk when queried. Less data being read means more queries can be executed purely in memory on the servers, which improves the query performance. To get the best out of this columnar approach, it is important that the query processor is able to rapidly filter out a large subset of the data blocks, which is why setting the best distribution key is so important.
Data is compressed on disk, which both reduces the storage requirements and means that less data needs to be read into memory. Once the compressed data is read into memory, it is automatically uncompressed when accessed by a query. Again, this effective use of memory results in a higher level of performance. You can apply compression encodings manually, but Amazon strongly recommends you use the COPY command to load data, and let it analyze and apply the compression algorithms automatically. By default, the COPY command applies automatic compression, and it balances overall performance when it chooses the compression encodings.
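A first load into an empty table might look something like the sketch below, where the table, bucket, and IAM role names are placeholders; COPY samples the incoming data and chooses the column encodings itself:

```sql
-- With COMPUPDATE ON, COPY analyzes a sample of the incoming data and
-- applies compression encodings to the empty target table automatically.
COPY sales
FROM 's3://my-bucket/sales/initial/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
COMPUPDATE ON;
```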
The COPY command requires enough rows in the data, at least 100,000 rows per slice, to generate a meaningful sample on which to base the automatic compression analysis. This process is only performed on loading of an empty table. Once the analysis completes, the compression encodings are set for the table, and when you add more data to the table later, the appended rows are compressed using the existing encodings. Amazon Redshift has a built-in query execution engine which includes a query optimizer that is MPP and columnar aware. This query optimizer provides significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, sub-queries, and aggregation.
The query optimizer both rewrites the query when necessary and generates a query plan to ensure it's executed with the best performance. The query plan specifies execution options such as join types, join order, aggregation options, and data distribution requirements. The leader node distributes compiled code to the compute nodes in the cluster. Compiling the query eliminates the overhead associated with running an interpreter on the compute nodes, and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster. Be careful though: the execution engine compiles different code for the JDBC and ODBC connection protocols, so two client applications using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code. When users run queries in Amazon Redshift, the queries are routed to query queues.
This is a common approach for MPP styles of data storage. Unlike standard RDBMS data storage platforms, which are designed to receive and immediately execute a large number of small queries from a large number of users at any one time, MPP or grid-style databases like Amazon Redshift receive the query and then manage its execution by the use of queues. Queries are effectively racked and stacked in a queue until resources are available for the query to be executed. This may result in latency between when the query is submitted and when it is executed.
One of the hard limits within Amazon Redshift is a maximum of 50 user-defined query queues. Each query queue can be configured to run up to 50 queries concurrently, but the maximum total concurrency level for all user-defined query queues is 50. This means that you are effectively limited to a maximum of 50 concurrent queries at any one time. Due to this, you will sometimes want to manage the priority of queries so that important queries do not get stuck behind slower running or less important queries. You do this by creating multiple query queues, and then maintaining the properties of these queues to help manage their priorities. You can configure workload management (WLM) properties for each query queue to specify the way that memory is allocated amongst the slots, how queries can be routed to specific queues at run time, and when to cancel long-running queries. You can configure the following for each query queue. The concurrency level: queries in a queue run concurrently until they reach the concurrency level defined for the queue.
Subsequent queries then wait in the queue until a previous query completes. As a best practice, Amazon recommends using a concurrency level of 15 or lower for a queue to reduce contention for system resources. User groups. You can assign a set of user groups to a queue by specifying each user group name, or by using wildcards. When a member of a listed user group runs a query, that query runs in the corresponding queue. Query groups. You can assign a query group to a queue, which is simply a label.
When submitting a query, you assign the query group label, and the query will run in the corresponding queue. By default, each user-defined queue is allocated an equal proportion of the memory that is available. So if you have four user-defined query queues, they will each have 25% of the available memory allocated. You can override this to hard-code the percentage of memory that is allocated. You can also limit the amount of time that queries in a given WLM queue are permitted to run for.
The timeout parameter specifies the amount of time in milliseconds that Amazon Redshift waits for a query to execute before canceling the query. The timeout is based on query execution time, and doesn't include the time spent waiting in a queue. Of course, canceling a query in the middle of its execution is never a pleasant thing for a user to experience, so it's probably better to route their queries to a dedicated long-running query queue instead. By default, Amazon Redshift configures a superuser queue and a default user queue.
The default queue is initially configured to run five queries concurrently. You can change the concurrency, the timeout, and the memory allocation properties for the default queue, but you cannot specify user groups or query groups for the default queue. Any queries that are not routed to other queues run in the default queue.
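As a small illustration of query group routing, and assuming a queue has been configured with a query group label of 'reports', a session could direct its queries like this:

```sql
-- Route subsequent queries in this session to the queue labelled 'reports'
SET query_group TO 'reports';

SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Revert to default queue routing for the rest of the session
RESET query_group;
```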
There are a number of tuning options you can manage within Amazon Redshift to improve or tailor performance based on the type of data you are dealing with. These include distribution styles, sort keys, and compression encodings. Defining a distribution style determines on which slices a table's data is stored. It's similar to partitioning in a standard relational database, but not exactly the same. When you create a table, you designate one of three distribution styles: even, key, or all.
Even will do a round robin distribution of data. Even distribution is the default distribution style. Key requires a single column to be defined as a distribution key. On load, Amazon Redshift hashes each distribution key column value and routes the same hash to the same slice consistently. And all distribution stores a full copy of the table on the first slice of each node. All distribution multiplies the storage required by the number of nodes in a cluster, and so it takes much longer to load, update, or insert data into multiple tables. All distribution is appropriate only for relatively slow moving tables, that is, tables that are not updated frequently or extensively. Small dimension tables do not benefit significantly from all distribution, because the cost of redistribution is low, and the data is readily available on every node.
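The table definitions below sketch the three styles against hypothetical fact, dimension, and log tables; they are illustrative only:

```sql
-- KEY: co-locate fact rows that share a customer on the same slice
CREATE TABLE sales_fact (
  customer_id INTEGER,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: copy a small, slow-moving lookup table to the first slice of every node
CREATE TABLE country_dim (
  country_code CHAR(2),
  country_name VARCHAR(64)
)
DISTSTYLE ALL;

-- EVEN: round robin distribution when neither KEY nor ALL is appropriate
CREATE TABLE audit_log (
  logged_at DATE,
  message   VARCHAR(256)
)
DISTSTYLE EVEN;
```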
From a performance point of view, generally it's best to strive for key whenever appropriate. Choose all in the scenarios where it makes sense, and key doesn't. Only choose even when neither key nor all is appropriate. Defining a table with a sort key results in a physical ordering of data within each slice. Amazon Redshift allows for a table to be defined with compound sort keys, interleaved sort keys, or no sort keys. Each of these styles of sort keys is useful for certain table access patterns.
You can define one or more of a table's columns as sort keys. A compound key is made up of all the columns listed in the sort key definition, in the order they are listed. An interleaved sort key gives equal weight to each column, or a subset of columns, in the sort key. No sort key means what it says: there is no sort key defined. A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order. The performance benefits of compound sort keys decrease when queries depend only on secondary sort columns, without referencing the primary columns of the sort key. An interleaved sort key gives equal weight to each column in the sort key, so queries can use any subset of the columns that make up the sort key, in any order.
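The two sort key styles are declared like this; the tables and columns are hypothetical:

```sql
-- Compound sort key: efficient when predicates filter on a prefix (sale_date, then region)
CREATE TABLE events_compound (
  sale_date DATE,
  region    VARCHAR(32),
  amount    DECIMAL(12,2)
)
COMPOUND SORTKEY (sale_date, region);

-- Interleaved sort key: equal weight to each column, so either column filters efficiently
CREATE TABLE events_interleaved (
  sale_date DATE,
  region    VARCHAR(32),
  amount    DECIMAL(12,2)
)
INTERLEAVED SORTKEY (sale_date, region);
```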
In practice, a compound sort key is more appropriate for over 90% of workloads. An interleaved sort is more effective with large tables. Data compression in database systems isn't new. However, historically it was used to reduce the data footprint rather than to boost performance, because of the expensive decompression overhead. In Amazon Redshift, using column encodings translates to both a data footprint reduction and a performance gain, because the cost of decompression is negligible relative to the reduction in disk I/O.
The COPY command performs automatic compression analysis before data ingestion. This process first samples a subset of the data to be loaded and then performs a brute force compression comparison. To do so, it compresses a small set of data in each possible way, one column at a time. After all possible encodings are performed, COPY makes a comparison. Then whatever encoding results in the most significant compression is defined as the encoding for that column. You can execute the ANALYZE COMPRESSION SQL command to sample data from an existing table and do a brute force compression comparison, similar to the process carried out by COPY's automatic compression analysis. You can then manually set the compression encoding based on your analysis of the results.
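For example, you might sample an existing table and then bake the chosen encodings into a new table definition; the table and the encodings shown are illustrative, not recommendations:

```sql
-- Report suggested encodings, column by column, based on a sample of the table
ANALYZE COMPRESSION sales;

-- Apply encodings manually when defining a table
CREATE TABLE sales_encoded (
  sale_id   BIGINT        ENCODE delta,
  region    VARCHAR(32)   ENCODE bytedict,
  amount    DECIMAL(12,2) ENCODE lzo
);
```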
Within an Amazon Redshift table, each column can be specified with an encoding that is used to compress the values within each block. In general, compression should be used for almost every column within an Amazon Redshift cluster. You need to load your data into Amazon Redshift before you can query it. Amazon Redshift provides a command called COPY, which is designed for high performance loading of data. Amazon have ensured that the COPY command can be used to quickly load data from other Amazon big data services. You can load data into Amazon Redshift from a range of Amazon big data services, including Amazon S3, Amazon DynamoDB, Amazon EMR, and AWS Data Pipeline.
You can also connect to Amazon Redshift using ODBC or JDBC and issue INSERT SQL commands to insert the data. Note this approach is slower than using the COPY command, since COPY is optimized to load data in parallel to each compute node, while SQL INSERT statements load via the single leader node. As I mentioned, Amazon Redshift integrates with the other AWS services to enable you to move, transform, and load your data quickly, reliably, and securely. First, moving data between Amazon Redshift and Amazon S3: Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. You can also use parallel processing to export data from Amazon Redshift to multiple data files on Amazon S3. Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. You can also use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters or Amazon EC2 instances.
The COPY command connects to the remote hosts using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections, enabling you to load large volumes of data in parallel. You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and execute recurring jobs without having to write your own complex data transfer or transformation logic. For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift.
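The two COPY variants below show the general shape of these loads; the bucket, DynamoDB table, and IAM role names are placeholders:

```sql
-- Load gzip-compressed, pipe-delimited files that share an S3 prefix, in parallel
COPY sales
FROM 's3://my-bucket/sales/2017/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- Load from a single DynamoDB table, consuming at most 50% of its provisioned read capacity
COPY sales
FROM 'dynamodb://SalesTable'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;
```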
When performing the data load, you should compress the files to be loaded wherever possible. Amazon Redshift supports both GZIP and LZO compression. It's more efficient to load a large number of small files than one large one, and the ideal file count is a multiple of the slice count. The number of slices per node depends on the node size of the cluster. By ensuring you have an equal number of files per slice, you know that COPY execution will evenly use cluster resources and complete as quickly as possible. When deciding which approach to take, the key is to determine where you want to process the data.
For example, if you're storing log files in S3 and then loading them into Amazon Redshift, will you transform the semi-structured log files into structured tables as part of the loading process? Or will you load the log records as a single column, and then post-process them within Amazon Redshift to apply the structure? Or will you load the log files into Amazon DynamoDB or Amazon EMR first to define the structure, and then use the pre-processed data to load structured data into Amazon Redshift?
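As a sketch of the second option, raw log lines could be landed in a single-column staging table and then parsed inside Redshift; the tables, and the assumption of a space-delimited log format, are purely for illustration:

```sql
-- Staging table holding one raw log line per row (populated earlier with COPY)
CREATE TABLE raw_logs (
  log_line VARCHAR(65535)
);

-- Post-process inside Redshift: pull fields out of the raw line into a structured table
INSERT INTO access_events (client_ip, request_path)
SELECT SPLIT_PART(log_line, ' ', 1),
       SPLIT_PART(log_line, ' ', 7)
FROM raw_logs;
```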
This means that you can continue to use your third-party ETL and query tools to load data into, and query data from, Amazon Redshift. If you do not currently have a business intelligence query tool, you can download a free third-party SQL client tool, such as SQL Workbench/J, which can connect to an Amazon Redshift cluster via JDBC. What you need to be a bit cognizant of is that unlike traditional RDBMSs, which are optimized to take a large number of small, concurrent queries and return responses quickly, Amazon Redshift is designed to handle large volumes of data and large queries. So as we have discussed, Amazon Redshift queues the queries until a slot in the query queue becomes available.
This queue-based paradigm means that the response behavior provided to the query tool is very different from a traditional RDBMS or OLTP engine. Amazon Redshift is based on a relational database management system (RDBMS), so it's compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, Amazon Redshift is optimized for high performance analysis and reporting of very large data sets. Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your big data applications. While online transaction processing (OLTP) applications typically store data in rows, as we have discussed, Amazon Redshift stores data in a columnar format. Some PostgreSQL features that are suited to small-scale OLTP processing, such as secondary indexes and efficient single row data manipulation operations, have been omitted to improve performance.
So Amazon Redshift is not an ideal service for transaction-oriented solutions. Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints by the query planner, and they should be declared in your loading or processing process if practical. A number of SQL features are implemented differently in Amazon Redshift compared to PostgreSQL.
For example, the CREATE TABLE command does not support tablespaces or table partitioning. The ALTER TABLE command does not support the ALTER COLUMN action. Details of the SQL features that differ are of course documented in the online Amazon Redshift documentation. Several data types that are available in PostgreSQL are also not available in Amazon Redshift. These include arrays, some date/time types, JSON, and XML. Also, several PostgreSQL functions and operators are not available, such as IS DISTINCT FROM.
A number of features that you would typically find in a traditional data warehouse repository are also not available, including stored procedures, triggers, and sequences. These are not typically a problem when architecting a big data solution, as there are a number of other AWS services available to provide similar capabilities. But if you were, for example, looking to lift and shift a legacy data warehouse to Amazon Redshift, you should be cognizant of the impact of these differences.
There are a number of limits within the Amazon Redshift service you need to be aware of. Amazon Redshift has both default quotas and hard limits on the service, and both of these are important. The default quota is listed in the left-hand table, as set by Amazon globally. You can request that Amazon increases these quotas for a specific account, and a specific region by submitting the required increase form. A hard limit is a limit that is inherent in the Redshift service and cannot be exceeded.
The hard limits are listed in the right-hand table, and the majority of these limitations are inherited from the underlying PostgreSQL core that Amazon Redshift is based upon. One of the hard limits to watch out for is the 9,900 limit on tables. This limit is calculated including both permanent and temporary tables; views are not included in the calculation. This limit means you need to be careful when loading data sources that have large volumes of tables, which is common in a big data scenario. There are a number of use cases where Amazon Redshift is the perfect storage solution, and a number where an alternative Amazon solution would potentially provide a better fit.
Both Amazon Redshift and Amazon RDS enable you to run traditional relational databases in the cloud while offloading database administration. Customers will typically use Amazon RDS databases for both online transactional processing and for reporting and analysis. Amazon Redshift harnesses the scale and resources of multiple nodes and uses a variety of optimizations to provide orders-of-magnitude improvements over traditional databases for analytic and reporting workloads against very large data sets. Amazon Redshift provides an excellent scale-out option as your data and query complexity grows, compared to Amazon RDS. Amazon Redshift is ideal for large volumes of structured data that you want to persist and query using standard SQL and your existing BI tools. Alternatively, Amazon EMR is ideal for processing and transforming unstructured or semi-structured data to bring it into Amazon Redshift, and is also a much better option for data sets that are relatively transitory, or not stored for long-term use.
If you need to run advanced analytical algorithms, then Amazon EMR or Amazon Machine Learning are better solutions than Amazon Redshift. However, Amazon Redshift does allow you to define scalar user-defined functions, called UDFs. You can create custom scalar functions in Python 2.7, and with these scalar UDFs you can perform analytics that were previously impossible.
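A scalar UDF follows the shape below; this is a minimal, hypothetical example rather than anything from the course material:

```sql
-- A scalar UDF written in Python 2.7; once created it is called like any built-in SQL function
CREATE FUNCTION f_fahrenheit_to_celsius (temp_f FLOAT)
RETURNS FLOAT
STABLE
AS $$
    return (temp_f - 32) * 5.0 / 9.0
$$ LANGUAGE plpythonu;

-- Example usage
SELECT f_fahrenheit_to_celsius(212);
```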
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However you're more likely to see him partake of a good long black or an even better craft beer.