Big Data Storage
Course two of the Big Data Specialty learning path focuses on storage. In this course, we outline the key storage options for big data solutions. We determine data access and retrieval patterns, and some of the use cases that suit particular data patterns such as evaluating mechanisms for capture, update, and retrieval of catalog entries. We learn how to determine appropriate data structure and storage formats, and how to determine and optimize the operational characteristics of a Big Data storage solution.
Amazon Aurora is now MySQL and PostgreSQL-compatible.
- Recognize and explain big data access and retrieval patterns.
- Recognize and explain appropriate data structure and storage formats.
- Recognize and explain the operational characteristics of a Big Data storage solution.
This course is intended for students looking to increase their knowledge of the AWS storage options available for Big Data solutions.
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud storage solutions. Our courses on AWS storage fundamentals and AWS database fundamentals will give you a solid foundation for taking this present course.
This Course Includes
- Over 90 minutes of high-definition video.
- Real-Life Scenarios using AWS Reference Architecture
What You'll Learn
- Course Intro: What to expect from this course.
- Amazon DynamoDB: How you can use Amazon DynamoDB in Big Data scenarios.
- Amazon DynamoDB Reference Architecture: A real-life model using DynamoDB
- Amazon Relational Database Service: A look at how Amazon RDS works and how you can use it in Big Data scenarios.
- Amazon Relational Database Service Reference Architecture: A real-life model using RDS.
- Amazon Redshift: An overview of how Amazon Redshift works and how you can use it in Big Data scenarios.
- Amazon Redshift Reference Architecture: A real-life model using Redshift.
Welcome to Big Data on AWS, Storing Data on Amazon RDS. At the end of this module, you'll be able to describe in detail how Amazon RDS can be used to store data within a big data solution. In the previous modules, we've covered DynamoDB. Now we're gonna look at how Amazon RDS works and how you can use it in big data scenarios. Amazon RDS is primarily designed to store data and service access to this data to a myriad of applications, business intelligence, and query tools.
Amazon RDS provides the options to utilize a number of different relational database engines which we will outline later in this module. The key to Amazon RDS is that it provides a relational database capability which provides some benefits and some limitations when applied in the big data space. Given its relational database foundation, Amazon RDS is inherently focused on providing storage for transactional style applications and for reporting applications that access smaller data volumes and simpler data structures when compared to other Amazon big data services.
When choosing a big data storage solution from within the available Amazon service offerings, it's important to determine whether the data sources we are storing data from primarily contain structured, semi-structured, or unstructured data. This will typically drive the decision on which Amazon big data service is the best for that data pattern or use case. Amazon RDS is primarily designed to manage structured data. As well as storing data, Amazon RDS is also able to process and transform the data within the database. When choosing a big data processing solution from within the available Amazon service offerings, it is important to determine whether you need the latency of your response from the process to be in sub-seconds, seconds, minutes, or hours.
This will typically drive the decision on which AWS service is the best for that processing pattern or use case. Amazon RDS is primarily designed to deliver transaction orientated storage and processing, but it's important to remember it is not designed to process very large volumes of data. Amazon RDS allows you to easily set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business. It is primarily designed to be used as a transactional database underpinning applications, but it also provides a flexible and managed repository for reporting data where storage volumes are below six terabytes. Amazon RDS also has the ability to create read replicas that allow you to offload reporting from the update database instance, enabling you to scale out your reporting environment.
Amazon RDS is based on a platform-as-a-service style architecture where you determine the size of the capacity you require and the architecture and components are automatically provisioned and configured for you. You have no need or ability to change the way these architectural components are deployed. The core container of an Amazon RDS environment is called a DB instance. When you provision a DB instance you choose the database engine you wish to utilize.
For example, MySQL or Oracle. We will discuss the options for these database engines in a few minutes. The provisioned DB instance, with its selected database engine, is then composed of a compute capability and a storage capability. When you provision an Amazon EC2 instance, you get CPU, memory, storage, and IOPS all bundled together. With Amazon RDS, these are split apart so that you can scale them independently. So, for example, if you need more CPU, less disk performance, or more storage, you can easily allocate them separately within the DB instance you provision.
The CPU and memory configuration for a DB instance is still predefined, like an EC2 instance, in that you cannot independently select the CPU and memory being allocated. The size of the compute capability provisioned is defined in terms of the number of virtual CPUs and the amount of memory allocated. You choose this combination by choosing a DB instance class, for example, an M4 large, when you provision the DB instance. These DB instance classes are very similar to the EC2 instance types and there is often a consistent ratio between memory and virtual CPU.
One thing to watch out for is that not all DB instance classes are available for all database engines. For each DB instance, you also need to select the associated storage capacity. Each DB instance class has a minimum and a maximum storage requirement for the DB instances that are created from it. The Amazon RDS service uses dedicated Amazon Elastic Block Store, or EBS, as the underlying storage. This is the same storage system that's used by EC2 instances.
The Amazon RDS DB instance can contain one or many databases. Databases are a way of logically grouping tables which hold data and can be queried. So for example, you might create a database to hold human resources data and another database to hold finance data. So, let's have a look at how you provision an Amazon RDS DB instance. As I've already mentioned, when you provision your Amazon RDS DB instance, the first thing you need to do is to select the specific database engine that is used within that database instance. Amazon RDS provides a number of database engines that are designed to store and manage relational data.
The Amazon Aurora database engine is a MySQL-compatible relational database engine developed by Amazon. The popular open source MariaDB, MySQL, and PostgreSQL relational databases are also available as database engines. The Microsoft SQL Server database engine uses the SQL Server relational database developed by Microsoft. You can deploy SQL Server 2008 R2, 2012, 2014, and 2016 versions, as well as the Express, Web, Standard, or Enterprise editions. The Oracle database engine uses the Oracle relational database developed by Oracle. You can deploy Oracle 11g or 12c, as well as Oracle Standard or Enterprise editions. Once you have selected the database engine you then choose the DB instance class, which determines the amount of virtual CPU and memory you have allocated to the DB instance.
You can also choose the storage type and the amount of storage you require. While there are a number of components under the covers for Amazon RDS architecture, you can see how Amazon has made it simple to get started as always. When you buy an EC2 instance, you get CPU, memory, storage, and IOPS bundled together. With Amazon RDS, these are split apart so you can scale them independently. So for example, if you need more CPU and less IOPS or more storage, you can easily allocate them. Amazon RDS allows you to influence the level of performance you achieve by selection of the virtual CPU, memory, and IOPS that underpin your RDS service. The compute and memory capacity of a DB instance is determined by its DB instance class. DB instance classes are similar to the EC2 instance types. The DB instance class you need depends on your processing power and memory requirements.
There are DB instance classes that support both burstable database access and sustained access. Amazon's recommended good practice for RDS performance is to allocate enough RAM so that your working set resides almost completely in memory. IOPS stands for input/output operations per second. IOPS can be compared to the revolutions per minute of a car engine, also known as RPM, where the bigger the number, the faster the expected performance.
So, for example, a car with an engine rotating at 2,000 RPM is typically going slower than a car with an engine rotating at 6,000 RPM. The same concept applies with IOPS. The bigger the number, then in theory, the greater the level of performance. Amazon RDS allows you to influence the level of performance you achieve by the selection of IOPS that underpin your RDS service. Amazon RDS allows you to provision one of three storage types: general purpose SSD, provisioned IOPS, and magnetic.
Magnetic storage is available for backwards compatibility, but Amazon recommends general purpose or provisioned IOPS for any new services you provision. Provisioned IOPS is designed to support high performance transactional workloads. For database workloads with moderate IO requirements, use the general purpose storage. For very small database workloads with infrequent IO you could use the magnetic storage. Amazon RDS general purpose storage is suitable for a broad range of database workloads that have moderate IO requirements. With a baseline of three IOPS per gigabyte and the ability to burst up to 3,000 IOPS, this storage option provides predictable performance to meet the needs of most applications.
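The general purpose baseline quoted above is simple arithmetic, and a minimal sketch makes it concrete. The function name and the sample sizes are illustrative only; the rate of three IOPS per gigabyte and the 3,000 IOPS burst figure are the ones given in this module and may differ in current AWS documentation.

```python
BASELINE_IOPS_PER_GB = 3      # baseline rate quoted in this module
BURST_CEILING_IOPS = 3000     # burst figure quoted in this module


def gp_baseline_iops(storage_gb, iops_per_gb=BASELINE_IOPS_PER_GB):
    """Return the baseline IOPS for a general purpose volume of the given size."""
    if storage_gb <= 0:
        raise ValueError("storage_gb must be positive")
    return storage_gb * iops_per_gb


# Illustrative volume sizes (hypothetical, not from the course).
for size_gb in (100, 500, 1000):
    baseline = gp_baseline_iops(size_gb)
    below_burst = baseline < BURST_CEILING_IOPS
    print(f"{size_gb} GB -> baseline {baseline} IOPS, below burst ceiling: {below_burst}")
```

Smaller volumes have a baseline well under the burst figure, which is why bursting matters most for them.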
Amazon RDS provisioned IOPS is an SSD-backed storage option designed to give fast, predictable, and consistent IOPS performance. With provisioned IOPS, you specify an IOPS rate when creating a DB instance, and Amazon RDS provisions the IOPS rate for the lifetime of the DB instance. Amazon RDS provisioned IOPS is optimized for IO intensive transactional database workloads and is the most applicable option for big data services, as the large volumes of data often require a large level of disk performance. Depending on the amount of storage you provision, Amazon RDS will automatically stripe your data across multiple EBS volumes to improve the IOPS performance.
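As a rough sketch of how these choices come together at provisioning time, the helper below assembles the engine, DB instance class, storage size, and optional IOPS rate into one set of parameters. The helper name and the sample values are hypothetical. The key names are intended to mirror the parameters of boto3's `create_db_instance` call as I understand them, so check them against the current boto3 documentation before relying on them.

```python
def build_db_instance_params(identifier, engine, instance_class, storage_gb, iops=None):
    """Assemble provisioning parameters for a DB instance (illustrative helper)."""
    params = {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,                  # for example "mysql"
        "DBInstanceClass": instance_class,  # for example "db.m4.large"
        "AllocatedStorage": storage_gb,     # size in gigabytes
    }
    if iops is not None:
        # Provisioned IOPS storage: the IOPS rate is fixed when the instance is created.
        params["StorageType"] = "io1"
        params["Iops"] = iops
    else:
        # General purpose SSD storage.
        params["StorageType"] = "gp2"
    return params


params = build_db_instance_params("reporting-db", "mysql", "db.m4.large", 200, iops=1000)
print(params)

# With AWS credentials configured, the parameters could be passed on like this
# (master user credentials and networking options are deliberately left out here):
# import boto3
# boto3.client("rds").create_db_instance(**params)
```

Keeping the parameter assembly separate from the API call makes it easy to review the chosen combination before anything is provisioned.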
If you use provisioned IOPS storage to gain a high level of performance, you'll need to use one of the M4, M3, R3, or M2 DB instance classes. These instance classes are optimized for provisioned IOPS storage. There's no point paying for extra IOPS and not getting the full value of that investment.
The ratio of IOPS to storage for your DB instances, where storage is measured in gigabytes, should be between three to one and 10 to one, so let's have a look at what that means. The IOPS to storage ratio also affects the performance of your Amazon RDS environment. Amazon recommends the ratio of IOPS to storage in gigabytes for your DB instances should be between three to one and 10 to one, and in fact, when provisioning your DB instance you'll be forced to provision IOPS that are in at least the three to one ratio to the size of storage you provision. For example, you could start by provisioning an Oracle database instance with 1,000 IOPS and 200 gigabytes of storage, achieving a ratio of five to one. Then you could scale your IOPS to 2,000 to improve performance, retaining the current 200 gigabytes of storage, which will result in a ratio of 10 to one.
Then, you could add some more storage capacity, increasing it from 200 gigabytes to 500 gigabytes but retaining the current 2,000 IOPS, which will reduce the ratio to four to one. Finally, you can increase this instance to the maximum for an Oracle DB instance of 30,000 IOPS with six terabytes of storage, which will result in a ratio of five to one. Be aware that SQL Server database engines have different maximum IOPS and storage limits. There are a number of ways you can scale your Amazon RDS environment out to help manage the constraints that are inherent when using RDS with typical big data volumes.
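The ratios in the worked example above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and six terabytes is treated as 6,000 gigabytes so that the final step matches the five to one figure quoted.

```python
def iops_to_storage_ratio(iops, storage_gb):
    """Return provisioned IOPS divided by allocated storage in gigabytes."""
    return iops / storage_gb


def ratio_in_recommended_range(iops, storage_gb, low=3.0, high=10.0):
    """Check the ratio against the 3:1 to 10:1 range described in this module."""
    return low <= iops_to_storage_ratio(iops, storage_gb) <= high


# (IOPS, storage in GB) for each step of the worked example.
steps = [(1000, 200), (2000, 200), (2000, 500), (30000, 6000)]
for iops, storage_gb in steps:
    ratio = iops_to_storage_ratio(iops, storage_gb)
    ok = ratio_in_recommended_range(iops, storage_gb)
    print(f"{iops} IOPS / {storage_gb} GB = {ratio:g}:1, within range: {ok}")
```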
You can implement a data partitioning strategy. You can get high availability with a primary instance and a synchronous secondary instance that you can fail over to when problems occur. You can also use MySQL, MariaDB, or PostgreSQL read replicas to increase read scaling. If your application requires more compute resources than the largest DB instance class or more storage than the maximum allocation, you can implement partitioning, thereby spreading your data across multiple DB instances. RDS does not really offer anything to help you with this, so your partitioning approach and the management of this capability will need to be implemented and managed within your application layer.
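Because the partitioning logic has to live in the application layer, one common approach is to hash a partition key and use the result to pick a DB instance. The sketch below shows that idea under stated assumptions: the endpoint names are hypothetical, and hash-based routing is just one possible strategy, not something the course prescribes.

```python
import hashlib


def pick_instance(partition_key, endpoints):
    """Map a partition key to one DB instance endpoint using a stable hash."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    # A cryptographic hash gives the same result on every run and every machine,
    # unlike Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[index]


# Hypothetical endpoints for three DB instances holding different partitions.
endpoints = [
    "orders-part-0.example.internal",
    "orders-part-1.example.internal",
    "orders-part-2.example.internal",
]

for customer_id in ("customer-1001", "customer-1002", "customer-1003"):
    print(customer_id, "->", pick_instance(customer_id, endpoints))
```

Note that with simple modulo routing, adding or removing an instance changes where most keys land, so existing data would need to be moved as part of any resize.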
Amazon RDS provides high availability and failover support for DB instances using Multi-AZ deployments. Multi-AZ deployments for Oracle, PostgreSQL, MySQL, and MariaDB use Amazon technology to replicate the data, while SQL Server DB instances use SQL Server mirroring. In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different availability zone. In the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another availability zone if you have enabled Multi-AZ.
The failover mechanism automatically changes the DNS records of the DB instance to point to the standby instance. Note the synchronous standby replica capability is not a scaling solution for read-only scenarios. You cannot use a standby replica to serve read traffic. It is effectively a warm standby. To have active access to this data you will need to use read replicas. If you are using the MySQL, MariaDB, or PostgreSQL database engines for your DB instance, Amazon RDS can use the built-in replication functionality of these engines to create a special type of DB instance called a read replica. Updates made to the source DB instance are asynchronously copied to the read replica.
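Since a read replica can serve read traffic while the source DB instance handles updates, an application needs some way to decide where each statement goes. The following is a minimal sketch of that routing decision, assuming hypothetical endpoint names and a deliberately simple rule (statements starting with SELECT count as reads); a real application would likely need a more careful classification.

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to read replicas and everything else to the source."""

    def __init__(self, source_endpoint, replica_endpoints):
        self.source_endpoint = source_endpoint
        # Cycle through the replicas so reads are spread evenly (round robin).
        self._replicas = itertools.cycle(replica_endpoints) if replica_endpoints else None

    def endpoint_for(self, sql):
        """Return the endpoint that should receive the given SQL statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.source_endpoint


# Hypothetical endpoints for one source instance and two read replicas.
router = ReadWriteRouter(
    "source.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("UPDATE orders SET status = 'shipped'"))
```

If no replicas are configured, every statement falls back to the source instance, so the same code works before and after replicas are added.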
You can reduce the load on your source DB instance by routing read queries from your applications to the read replica. Using the read replica capability means you can elastically scale out for read-heavy database workloads. If you use the MySQL database engine, Amazon RDS allows you to add table indexes directly to the read replicas without those indexes being present on the master database, meaning you can tune your read replicas for high-volume read behavior while leaving your master database tuned for high-volume load behavior. Securing access to your Amazon RDS data is important.
Security groups are used to control the traffic that has access in and out of a DB instance. Three types of security groups are used with Amazon RDS: DB security groups, VPC security groups, and EC2 security groups. DB security groups allow access from EC2 security groups in your AWS account or other accounts. You do not need to specify a destination port number when you create DB security group rules. The port number defined for the DB instance is used as the destination port number for all rules defined for the DB security group. VPC security groups allow access from other VPC security groups in your VPC only. When you create rules for your VPC security group that allow access to the instances in your VPC, you must specify a port for each range of addresses that the rule allows access for. AWS recommends that you run your database instances in an Amazon VPC, which allows you to isolate your database in your own virtual network and connect to on-premises IT infrastructure using industry standard encrypted IPSec VPNs.
You can configure firewall settings and control network access to your database instances. By default, network access is turned off to a DB instance, and you then specify security rules in a security group that enables access from a range of IP addresses or EC2 security groups. Amazon RDS allows you to encrypt your databases using keys you manage through the AWS Key Management Service, or KMS, to ensure data is encrypted at rest.
On a database instance running with Amazon RDS encryption, data stored at rest in the underlying storage is encrypted, as are its automated backups, its read replicas, and its snapshots. Amazon RDS also supports transparent data encryption in SQL Server and Oracle. Transparent data encryption in Oracle is integrated with AWS CloudHSM, which allows you to securely generate, store, and manage your cryptographic keys in a single tenant hardware security module appliance within the AWS cloud. Amazon RDS supports the use of SSL to secure data in transit. You can use SSL from your application to encrypt a connection to your database. Each DB engine has its own process for implementing SSL. Amazon RDS is integrated with the AWS Identity and Access Management solution, IAM, and provides the ability to control the actions that users can take on a specific Amazon RDS resource, from database instances through to snapshots, parameter groups, and option groups. You can also tag your Amazon RDS resources and control the actions that your IAM users and groups can take on groups of resources that have the same tag and associated value.
For example, you can configure your IAM rules to ensure developers are able to modify development database instances, but only database administrators can make changes to production database instances. You need to load your data into Amazon RDS before you can access it or query it. You can connect to Amazon RDS using ODBC or JDBC and issue standard SQL commands to insert the data. This means you can use a third-party ETL tool such as Talend, or your current ETL tool if you have one, to load data into your RDS DB instance. You can also use the AWS Database Migration Service.
This helps migrate the data from your on-premises database to the target Amazon RDS DB instance. You can undertake a like-for-like database migration where the source and target database engines are the same or are compatible, like Oracle to Amazon RDS for Oracle. Since the schema structure, data types, and database code are compatible between the source and the target database, this kind of migration is a one-step process. Or you can change the database engine being used as part of the database migration, for example, from Oracle to Amazon RDS for MariaDB. In this scenario, this is a two-step process.
First, we use the AWS Schema Conversion Tool to convert the source schema and code to match that of the target database, and then we use the AWS Database Migration Service to migrate data from the source database to the target database. There are also a number of database engine specific load options. For example, Amazon RDS supports native backup and restore for Microsoft SQL Server databases using the full backup files, the .bak files.
You can create a full backup of your on-premises SQL Server database and store it on S3, and then restore the backup files into an existing RDS DB instance running SQL Server. You can connect to Amazon RDS using industry standard ODBC or JDBC connections. The database engine that the Amazon RDS DB instance is based upon enables this capability. It's one of the benefits you gain from Amazon utilizing open and mature relational database engines as the core of its RDS service.
This means that you can continue to use your third-party ETL, query, and reporting tools to load data into and to query data from Amazon RDS. You can also continue to use the database engine specific tools that you currently use. For example, you can use SQL*Plus to connect to your Oracle DB instance in Amazon RDS or SQL Server Management Studio to connect to the SQL Server database engine. If you're connecting to the database from a client machine you will need to ensure the required ports are open in your client-side firewall. The port will be different depending on what database engine you have selected for your database instance. One area to be cognizant of is that, as a traditional relational database engine, RDS is optimized to take a large number of small concurrent queries and return responses quickly, rather than to handle large volumes of data and large queries.
This transactional RDBMS-based paradigm means the response behavior provided to the query is very different from a service such as Amazon Redshift or Amazon EMR. There are a number of limits within the Amazon RDS service you need to be aware of. When these limits have been reached, attempts to add additional resources will fail with an exception. The default limits listed in the tables are set by AWS and are enforced per AWS region. So for example, you can create 40 DB instances in the US East 1 region and 40 DB instances in the US East 2 region.
One of the other key things to note is that Amazon RDS does not provide shell access to your database instances. It also restricts access to certain system procedures and tables that require advanced privileges, so you should always test that the database features and capabilities you are using are still enabled in Amazon RDS if you are migrating from an on-premises database environment. One of the other limits to be aware of is the limit on the maximum storage that is available within a DB instance.
These limits are listed in the table. Please note that SQL Server has a lower storage limit than the other database engines. You are able to provision your DB instance with a small storage capacity and increase the Amazon RDS storage available as required up to these limits. You can modify a DB instance to use additional storage and you can convert to a different storage type after it's been provisioned. The exception is the SQL Server database engine, which cannot have its storage capacity or its storage type changed, due to the limitations of the striped storage attached to a Windows Server environment for this database engine. If you want to exceed these storage limits you will need to implement a partitioning strategy to store more data.
Also, keep in mind you're also limited to 100 terabytes of storage for Amazon RDS within a region. These limits mean Amazon RDS is not ideal as your primary big data storage service when large volumes of data are expected to be persisted, but Amazon RDS is often used as a service component in an end-to-end Amazon big data solution. There are a number of big data use cases where Amazon RDS is the perfect storage solution and a number where an alternate Amazon solution would potentially provide a better solution.
Amazon RDS is perfect for running transactional relational database applications in the cloud while offloading database administration. Customers use Amazon RDS databases for both online transactional processing, OLTP, and for reporting and analysis purposes. Amazon RDS provides great performance for transactional queries and, as we have outlined, you can scale this performance up and down with different database instance types and IOPS settings, but if you need a very low latency response to queries then Amazon DynamoDB is a better solution. Amazon RDS is ideal for storing structured data that you want to persist and query using standard SQL and your existing BI tools. Amazon EMR is ideal for processing and transforming unstructured or semi-structured data and also is a much better option for data sets that are relatively transitory and not stored for long-term use. If you need to run analytics algorithms, then Amazon EMR or Amazon Machine Learning are better solutions than Amazon RDS.
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.