Managing Data Warehouse with Redshift


Using EC2 for Databases
Managing Data Warehouse with Redshift

Databases are among the most used applications in the cloud (or anywhere else, for that matter). Managing data is exactly what computers were invented for, so it should come as no surprise that a great deal of attention is focused on the many different Database Management Systems and Data Management tools that are available.

This second part of our mini series covering AWS databases is about some of Amazon's advanced solutions. You will learn more about Redshift, the solution for massive petabyte-scale data warehouse, Elasticache, the Redis- and Memcached-based solution for in-memory cache, and SimpleDB, an easy alternative for NoSQL databases.

Who should take this course

For this beginner course, you'll require no special prerequisites. Nevertheless, some experience with Databases and at least a basic knowledge of the related jargon might be helpful. If you are completely new to the cloud, you might benefit from our introduction to cloud computing course. You might also find the AWS general introduction course interesting if you are not yet that familiar with the AWS cloud platform.

If you want to test your knowledge of the basic topics covered by this course, we strongly suggest you take our quiz questions. And of course, the first part of this course is a must if you want to learn more about the two major DB services on AWS: RDS and DynamoDB


Amazon Redshift is a fast, fully managed, petabyte scale data warehouse service that makes it simple and cost effective to officially analyze all your data using your existing business intelligence tools. Amazon Redshift is an SQLdata warehouse and uses industry standard ODBC and JDBC connections, and PostgreSQL drivers. AWS recommends Amazon Redshift for customers who need high performance at scale as data and query complexity grows, desire to prevent reporting and analytic processing from interfering with the performance of OLTP workloads, have large volumes of structured data to persist in query using standard SQL and existing BI tools, desire to offload the administrative burden of running one's own data warehouse and dealing with set up, durability, monitoring, scaling, and patching. The Amazon Redshift service manages all of the work of setting up, operating, and scaling a data warehouse. These tasks include provision capacity, monitoring, and backing up the cluster, and applying patches and upgrades to the Amazon Redshift engine. An Amazon Redshift data warehouse is a collection of computing resources called nodes which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. An Amazon Redshift cluster is a set of nodes which consists of a leader node and one or more compute nodes. The type and number of compute nodes that you need depends on the size of your data, the number of queries you will execute, and the query execution performance that you need. There are two types of nodes. Dense storage, and dense compute. The dense storage computes are recommended for substantial data storage needs, while dense compute node types are optimized for performance and intensive workloads. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. The node limit determines how many nodes are available to the cluster.

Initially, there is a limit of 16 nodes per account in each region. Some node sizes allow single node clusters while other node sizes require multi node clusters. If you have a single node cluster, the node is shared for leader compute node functionality. If you have a multi node cluster, the leader node is separate for the compute node or nodes. A leader node receives queries from client applications, parses the queries, and develops execution plans which are an ordered set of steps to process these queries. The leader node then coordinates the parallel execution of these plans with the compute node, aggregates the intermediate results from these nodes, and finally returns the results back to the client applications. Compute nodes execute the steps specified in the execution plans and transmit data among themselves to execute these queries. As mentioned before, the number of nodes that you choose depends on the size of your data set and your desire queries performance.

Amazon Redshift creates one database when you provision a cluster. This is the database you use to load data and run queries on your data. You can create additional databases as needed by running an SQL command. When you provision a cluster, you specify a master user who has access to all of the databases that are created within that cluster. This master user is a super user who is the only user with access to the database initially though this user can create additional super users and other users. Amazon Redshift uses parameter groups to define the behavior of all databases in a cluster such as, date, presentation style, and floating point precision.

If you don't specify a parameter group when you provision your cluster, Amazon Redshift associates a default parameter group with the cluster. For our first step, we want to create a cluster in Amazon Redshift. You need to sign in to the AWS management console and open the Amazon Redshift console in the main menu and select the region in which you want to create the cluster. For this example, we choose EU, Ireland. Then, in the left navigation pane, click Clusters and then click Launch Cluster. On the Cluster Details page, in the Cluster Identifier box, type a unique name for your cluster.

As you can see, we typed "Test cluster". In the database name box, type a name for the first database that will be created for your cluster. If you leave this box empty the database will be named "Dev".

Now you need to define the port of your database. In the Database Port box, type a database port through which you'll connect to the database instance. The port number must be included in the connection string when opening JDBC or ODBC connections to the database in the cluster. The port number must fall in the range of 1150 to 65535. The default is 5439. You cannot change the port after launching the cluster so make sure that you have an open port number in your firewall so you can connect from SQL client tools to the database in the cluster. In Master Username and the Master User Password boxes define a username and a password to use for connecting to your database. You can go to the next page now by clicking on Continue. In the Node Configuration page you need to configure your nodes. From the Node Type list you can select a node according to your needs. For each one, by choosing CPU Memory Storage and IO Performance, information will be shown in the corresponding node that you select.

For this example we choose DW2.large. Next, choose the type of cluster that you want to use between single node and multi node from the Cluster Type list. Single mode clusters consist of a single node which performs both leader and compute functions. But by choosing multi node, compute nodes store your data and execute you queries. In addition to your compute nodes, as mentioned before, a leader node will be added to your cluster free of charge. The leader node is the access point for ODBC or JDBC and generates the query plans executed on the compute notes.

In this example, the type of cluster is a single node, so we have just one compute node. On the Additional Configuration page you can provide the additional configuration details. In the first part of this page you have Cluster Parameter Group. When you provision an Amazon Redshift cluster, you associate a parameter group with it. The parameters in the group apply to all of the databases you create in the cluster. The Amazon Redshift engine provides a default parameter group which has preset values for the parameters. You cannot change the values in the default parameters group. If your application requires different parameter values, you can create your own parameter group. Creating a parameter group is independent of creating clusters.

You can create a group before you create a cluster. You can associate a parameter group with a cluster when you create it or you can accept the default parameter group. As you can see in this example, a default parameter group will be associated with the cluster. Then, you can choose to encrypt all data within the cluster and add backups at a small cost to performance. In this case, if you have a hardware security module, or HSM, and Client's Certificate configured in Amazon Redshift, you can use them instead of Amazon Redshift. In the Configuration Network Options section of the page, you'll see different options depending on your AWS account which determines the type of platform the cluster uses.

If you have a default VPC in the region you selected, you'll use the EC2-PC platform to launch your cluster, as you can see here. On the other hand, you can choose Not NVPC option and then in that case, you need to define the EC2 availability zone and cluster security groups. As you can see, our AWS account has a default VPC which has been chosen according to this VPC. The default cluster subnet group is used as well. If you want the cluster to be accessible from the public internet, choose "Yes" for Publically Accessible Parameter. Then you need to define a public IP address for your cluster's VPC but if you want it to be accessible only from within your private VPC network, choose "No". Also, you can define your preferred availability zone the cluster will be created in from the Availability Zone list. We do not yet have any preference so we choose No Preference. Optionally, you can associate your cluster with one or more of your security groups by choosing them from the VPC Security Groups list. You can modify the part later by creating a new VPC security group and assigning it to your cluster. We'll do that in the next step, as you'll see. Under Optionally Create a Basic Alarm For This Cluster, set Create Cloud Watch Alarm to "Yes" if you want to create an alarm that monitors the disk usage of your cluster and then, in Disk Usage Threshold, select a percentage that you want the alarm to trigger at when that amount of average disk usage is reached or exceeded. In Use Existing Topic, select "No" if you want to create a new Amazon SNS topic for this alarm. So, in Topic, edit the default name if necessary. Under Recipients, type the email addresses for any recipients who should receive the notification when the alarm triggers or select "Yes" if you want to select an existing Amazon SNS topic for this alarm.

As you can see, we create a new Amazon SNS topic for this cluster. By clicking on continue, on the Review page, you can see the specification of the cluster that you want to launch. Remember that some cluster properties such as database port and master username cannot be modified later.

If you need to change them, use the back button to change them now. If everything is correct you can click on Launch Cluster. A confirmation page appears and the cluster will take a few minutes to finish. Click Close to go to the list of clusters. On the Clusters page, click the cluster that you just launched and review the cluster status information. Make sure that the cluster status is "Available" and the Database Healthy is set to "Healthy" before you try to connect to the database later. So, you've launched your Amazon Redshift cluster but to connect to it we need to configure a security group to authorize access. If you launched your cluster in the EC2VPC platform, you should configure a VPC security group. And if it was launched in the EC2 classic platform, you need to configure an Amazon Redshift security group. As you'll probably remember, in this example we launched our cluster in the EC2VPC platform and used the default VPC security group so to have to correct access to our cluster, we need to configure the security group. For this reason, in the Amazon Redshift console, in the Navigation plane, click Clusters and choose the cluster which you created before. We click on Test Cluster to open it and make sure that we are on the configuration tab. Under Cluster Properties for VPC Security Groups, click View VPC Security Groups. By doing this procedure, you'll go to the Amazon VPC console Security Groups panel. Here, you can modify the default VPC security group or create a new security group. In this example, we'll create a new VPC security group for our Amazon Redshift by clicking on Create Security Group.

It is possible that you saw this procedure in another course but here we're going to explain it one more time. You need to define a name for your new security group in the Group Name Box. Also, if you want, you can provide a description for that in the description box. Then, from the VPC list, you should select your preferred VPC as you can see on the screen. Click "Yes Create" to create a new VPC security group. Now, from the list of security groups, choose the newly created VPC security group and from its details panel click on "Inbound Rules" tab. Then click on "Edit" to modify its inbound rules. For the Type parameter, choose Custom TCP Rule and for the Port Range parameter, type the same port number that you used when you launched the cluster. The default port for Amazon Redshift is 5439 and if you remember, we used this one when we launched the cluster. For source parameter, type Using this is not recommended for anything other than demonstration purposes because it allows access from any computer on the internet.

In a real environment, you would create inbound rules based on your own network settings. Now click on Save. Now you need to assign this new VPC security group to your cluster by modifying it. So, open Amazon Redshift console again and in the Navigation pane click Clusters and then click the cluster that you want to modify.

On the Configuration tab of the Cluster Details page, click Cluster, and then click Modify. Now, in the Modify Cluster window, in VPC Security Groups, choose the new VPC security group that you want to create and then click Modify. Now you can connect to your cluster. Now, we want to connect to our cluster by using an SQL client tool and run a simple query to test the connection.

You can use most SQL client tools that are compatible PostgreSQL, but for this example we'll use the SQL Workbench/J Client. Note that after installing the SQL Workbench, you need to download a JDBC driver that will enable SQL Workbench to connect to your cluster. Currently, Amazon Redshift recommends version 8JDBC driver.

First of all, we need to get the connection strength. For this step in the Amazon Redshift console, in the Navigation Pane, click Clusters. Then we choose our cluster, Test Cluster, and click on the Configuration tab. In this tab, under Cluster Database Properties, you can see the JDBC URL of the cluster. Copy this address. Remember that the end point for your cluster is not available until the cluster is created and in the available state. Now we launch SQL Workbench on our client computer. Click File and then click Connect Window. In the Select Connection Profile window, in the first box, type a name for the connection profile such as Test Cluster. For the Driver box, select PostgreSQL, org. PostgreSQL.driver. If you're using SQL WorkbenchJ for the first time, you might see a message asking you to edit the driver definition. If so, click "Yes" in the dialog box and then Manage Drivers dialog box in the Library Box, type or browse to the path of the driver file you downloaded, as you can see on the screen. Paste the JDBC URL that you copied in the previous step into the URL box then type your username and password of the master account that you specified when you created the cluster in the specific boxes. Then select the Auto-Commit checkbox and finally click okay. If you're having trouble connecting you might be having a problem with your firewall configuration. The port you determined in the perquisites must be open in the firewall to allow the connection between the client tool and the database in the cluster.

To test your connection in the Statement One window, type Select Current User and execute the query. You should see the username of the master account as the result. Okay, up to now we have a database and we are connected to it. Now we'll create some tables in the database, upload date to the tables, and try a query. For your convenience, the sample data will as available in Amazon S3 buckets. We will choose a bucket in the same region we created our cluster. To copy the sample data, we'll need AWS account credentials, that is an access key ID and secret access key. Only authenticated users can access this data. First of all, we need to create tables. This code creates table in the database. We write and execute it in SQL WorkbenchJ. Now we can copy the sample data from Amazon S3 to our database. We use this code and execute it in our SQL WorkbenchJ but you must replace the "Your Access Key ID" and "Your Secret Access Key" with your own credentials and "Region Specific Bucket Name" with the name of a sample bucket in the same region as you cluster. In this table, you have the name of all Amazon S3 sample buckets according to their regions.

Now we try some example queries by this code where we executed in SQL WorkbenchJ. After execution of this code, we can go to Amazon Redshift console to review the queries we executed. For this step, from the Cluster list in the right pane, choose "Test Cluster" then click on Queries tab.

As you can see, the console displays a list of queries we executed over a time period we specify. By default, the console displays queries that have executed in the last 24 hours including currently executing queries. In the list of queries, by selecting a query, we can find out more about it.

The information appears in a new Query tab.

About the Author

Computer Engineer and Cloud Expert