
RDS backup and restore and self-healing capabilities



Course Description

In this course, you'll gain a solid understanding of the key concepts for Domains One and Seven of the AWS Solutions Architect Professional certification: High Availability, Scalability and Business Continuity. 

Course Objectives

By the end of this course, you'll have the tools and knowledge you need to successfully accomplish the following requirements for this domain, including:

  • Demonstrate ability to architect the appropriate level of availability based on stakeholder requirements.
  • Demonstrate ability to implement DR for systems based on RPO and RTO.
  • Determine appropriate use of multi-Availability Zones vs. multi-Region architectures.
  • Demonstrate ability to implement self-healing capabilities.
  • Demonstrate ability to implement the most appropriate data storage scaling architecture.
  • High Availability vs. Fault Tolerance.
  • Scalability and Elasticity.

Intended Audience

This course is intended for students seeking to acquire the AWS Solutions Architect Professional certification. It is necessary to have acquired the Associate level of this certification. You should also have at least two years of real-world experience developing AWS architectures. 


As stated previously, you will need to have completed the AWS Solutions Architect Associate certification, and we recommend reviewing the relevant learning path in order to be well-prepared for the material in this one. 

This Course Includes

  • 1 hour and 13 minutes of high-definition video.
  • Expert-led instruction and exploration of important concepts. 
  • Coverage of critical concepts for Domain One and Domain Seven of the AWS Solutions Architect - Professional certification exam.

What You Will Learn

  1. Designing a backup and recovery solution.
  2. Implementing DR based on RTO/RPO.
  3. RDS backup and restore and self-healing capabilities.
  4. Points to remember for the exam.

Amazon RDS is a key service in designing highly available, self-healing services. In this lesson, we'll explore some of the backup, recovery, and replication features of RDS that are relevant to the Solutions Architect Professional exam. AWS has many options for databases: you can run your own database on an EC2 instance, or use one of the managed database options provided by the Amazon Relational Database Service.
If you are running your own database on an EC2 instance, you can back up data to files using native tools, or create a snapshot of the volumes containing the data using EBS snapshot-based protection. For databases that are built on RAID sets of Amazon EBS volumes, you can use database replica backups: you remove the burden of backups on the primary database by creating a read replica. That's an up-to-date copy of the database that runs on a separate Amazon EC2 instance. The replica database instance can be created using multiple disks similar to the source, or the data can be consolidated to a single EBS volume. You can then use EBS snapshot protection to snapshot the EBS volumes. This approach is popular for large databases that are required to run 24/7.
Amazon Relational Database Service offers a reliable infrastructure for running your database in multiple Availability Zones. Amazon RDS will keep your databases up to date with the latest patches, and you can optionally control when your instance is patched. CloudWatch metrics offer detailed monitoring for RDS. RDS also makes read replicas possible. RDS can notify you via email or SMS of database events through Amazon SNS, that's the Simple Notification Service. So you can use the AWS Management Console or the Amazon RDS APIs to subscribe to over 40 different database events associated with your database instances. The automated backup feature of RDS enables point-in-time recovery for your database instances.
Amazon RDS will back up your database and transaction logs, and store both for a user-specified retention period. This allows you to restore your database instance to any second during your retention period, up to the last five minutes. Your automatic backup retention period can be configured to up to 35 days.
Database snapshots are another benefit. DB snapshots are user-initiated backups of your instance, stored in S3, and they're kept until you explicitly delete them. You can create a new instance from a database snapshot. Although database snapshots serve operationally as full backups, you're only billed for incremental storage use.
The other great benefit is Multi-AZ deployments. Amazon RDS Multi-AZ deployments provide enhanced availability and durability for database instances. When you provision a Multi-AZ database instance, Amazon RDS synchronously replicates your data to a standby instance in a different Availability Zone. Another benefit is automatic host replacement: Amazon RDS will automatically replace the compute instance powering your deployment in the event of a hardware failure.
With a Single-AZ database failure, a user-initiated point-in-time restore operation is required to restore the database. This operation can take several hours to complete, and any data updates that occurred after the latest restorable time will not be available. So Multi-AZ DBs enhance your durability and your availability. Multi-AZ deployments for the PostgreSQL, MySQL, and Oracle engines utilize synchronous physical replication to keep data on the standby up to date with the primary. If a storage volume on your primary fails in a Multi-AZ deployment, Amazon RDS automatically initiates a failover to the standby. So if you have enabled Multi-AZ, Amazon RDS automatically switches to a standby replica in another Availability Zone in the event of an outage of your DB instance.
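As a rough sketch of the point-in-time recovery window described in this part of the lesson, the following Python models which restore targets fall inside the window. The function names are hypothetical, and the model is a simplification: it assumes backups have been running for the whole retention period and that the latest restorable time (a value RDS reports per instance) trails the current time by about five minutes.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_DAYS = 35  # upper limit on automated backup retention quoted in the lesson


def restorable_window(now, retention_days, latest_restorable_time):
    """Return (earliest, latest) restore times under the simplified model.

    Assumption: backups have existed for the full retention period, so the
    earliest restorable time is simply `now - retention_days`.
    """
    if not 0 < retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention_days must be between 1 and 35")
    earliest = now - timedelta(days=retention_days)
    return earliest, latest_restorable_time


def can_restore_to(target, now, retention_days, latest_restorable_time):
    """True if `target` lies inside the restorable window."""
    earliest, latest = restorable_window(now, retention_days, latest_restorable_time)
    return earliest <= target <= latest


now = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
latest = now - timedelta(minutes=5)  # assumed lag of the latest restorable time

three_days_ago_ok = can_restore_to(now - timedelta(days=3), now, 7, latest)
one_minute_ago_ok = can_restore_to(now - timedelta(minutes=1), now, 7, latest)
print(three_days_ago_ok)   # True: inside the 7-day window
print(one_minute_ago_ok)   # False: newer than the latest restorable time
```

Updates made after the latest restorable time fall outside the window, which is why they are lost in a Single-AZ restore.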
The time it takes for the failover to complete depends on the database activity and other conditions at the time the primary DB instance becomes unavailable. Failover times are typically 60-120 seconds. However, large transactions or a lengthy recovery process can increase the failover time. When the failover is complete, it can take additional time for the RDS console to reflect the new Availability Zone.
When failover occurs, the failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance: Amazon RDS changes the CNAME for your DB instance to point to the standby, which is in turn promoted to become the new primary. The primary DB instance switches over automatically if the Availability Zone becomes unavailable, if the primary DB instance fails, if the DB instance server type is changed, or if the operating system of the DB instance is undergoing software patching. You can also initiate a manual failover of a DB instance using the reboot-with-failover option: select it when rebooting your instance from the AWS Management Console, or, if you're using the API, use the RebootDBInstance API call.
The creation of the standby, the synchronous replication, and the failover itself are all handled automatically. RDS is a managed service. Your standby is automatically provisioned in a different Availability Zone of the same region as your primary DB instance. This means you can't select the Availability Zone your standby is deployed into, or alter the number of standbys available. RDS provisions one dedicated standby per primary DB instance. Most important is to remember that the standby cannot be configured to accept database connections, so you can't use it as a read replica.
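The manual failover mentioned above can be sketched as a small helper that builds the request parameters for the RebootDBInstance call. This is only an illustration: the helper name and the instance identifier `example-db` are hypothetical, and the block deliberately stops short of calling AWS so that it runs without credentials.

```python
def reboot_with_failover_params(db_instance_identifier, multi_az):
    """Build parameters for a RebootDBInstance request that forces a failover.

    A forced failover only makes sense when a standby exists, so the helper
    rejects Single-AZ instances up front.
    """
    if not multi_az:
        raise ValueError("a forced failover requires a Multi-AZ deployment")
    return {
        "DBInstanceIdentifier": db_instance_identifier,
        "ForceFailover": True,  # reboot by failing over to the standby
    }


params = reboot_with_failover_params("example-db", multi_az=True)
print(params)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("rds").reboot_db_instance(**params)
```

Because the CNAME is repointed during the failover, clients keep using the same endpoint name and only need to reconnect.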
You can create a new read replica, and when you do this, you can select which AZ the read replica is created in. So you can add another layer of durability to your design, which is ideal for read-heavy database workloads like news sites or publishing sites. Keep in mind that you need to enable automatic backups on your DB instance before adding any read replicas. You can do this by setting the backup retention period to a value other than zero. Backups must remain enabled for read replicas to work.
If you're running a Multi-AZ deployment, automatic backups and DB snapshots are taken from the standby to avoid I/O suspension on the primary. You still might experience increased I/O latency for one or two minutes during backups on both Single-AZ and Multi-AZ deployments. Initiating a restore operation, whether a point-in-time restore or a restore from a DB snapshot, works the same with Multi-AZ deployments as with standard Single-AZ deployments. New DB instance deployments can be created with either the RestoreDBInstanceFromDBSnapshot API call or the RestoreDBInstanceToPointInTime API call. Any new DB instance deployment can be either standard or Multi-AZ, regardless of whether the source backup was initiated on a standard or Multi-AZ deployment.
Another reason for using Multi-AZ deployments is to reduce the impact of planned maintenance and backups. In the case of system upgrades, like OS patching or DB instance scaling, these operations are applied first on the standby, prior to the automatic failover. As a result, your availability impact is only the time required for the automatic failover to complete. Unlike Single-AZ deployments, I/O activity is not suspended on your primary during backup for Multi-AZ deployments with the MySQL, Oracle, and PostgreSQL engines.
Although the backup is taken from the standby, keep in mind that you still might experience elevated latencies for a few minutes during backups for Multi-AZ deployments.
Amazon RDS allows you to encrypt your databases using keys you manage through AWS Key Management Service, or KMS. On a database instance running with Amazon RDS encryption, data stored at rest in the underlying storage is encrypted, as are its automated backups, read replicas, and snapshots. RDS is integrated with AWS Identity and Access Management, and gives you the ability to control the actions that your AWS IAM users and groups can take on specific Amazon RDS resources. That can be anything from database instances through to snapshots, parameter groups, and even option groups. You can also tag your Amazon RDS resources, and control the actions that your IAM users and groups can take on groups of resources that have the same tag.
Storage types with RDS include General Purpose SSD and Provisioned IOPS SSD. General Purpose SSD is solid-state-drive-backed storage delivering a consistent baseline of three IOPS per provisioned gigabyte. It also provides the ability to burst up to 3,000 IOPS, so it's suitable for a broad range of database workloads. Provisioned IOPS SSD storage is designed to deliver fast, predictable, and consistent I/O performance for larger database workloads.
Another key thing to remember about RDS is the maintenance window. RDS performs maintenance on RDS resources for you; it's a managed service. Required patching is automatically scheduled for patches that are related to security and instance reliability. So if there's a patch that needs to be applied to an Oracle database or a SQL Server database, and it affects security or instance reliability, AWS will do that immediately.
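As a quick illustration of the General Purpose SSD figures quoted earlier (a baseline of three IOPS per provisioned gigabyte, bursting up to 3,000 IOPS), the arithmetic can be sketched as follows. This uses only the two numbers from the lesson and ignores any service-side minimums or maximums, so treat it as a back-of-the-envelope model and not a sizing tool.

```python
BASELINE_IOPS_PER_GB = 3  # baseline figure quoted in the lesson
BURST_IOPS = 3000         # burst ceiling quoted in the lesson


def baseline_iops(allocated_gb):
    """Baseline IOPS for a General Purpose SSD volume of the given size."""
    if allocated_gb <= 0:
        raise ValueError("allocated_gb must be positive")
    return allocated_gb * BASELINE_IOPS_PER_GB


def peak_iops(allocated_gb):
    """Highest IOPS the volume can reach: the burst level for smaller
    volumes, or the baseline once it exceeds the burst level."""
    return max(baseline_iops(allocated_gb), BURST_IOPS)


for size_gb in (100, 500, 2000):
    print(size_gb, baseline_iops(size_gb), peak_iops(size_gb))
# 100 300 3000
# 500 1500 3000
# 2000 6000 6000
```

In this model a small volume depends on bursting for peak load, which is one reason larger or steadier workloads are steered toward Provisioned IOPS storage.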
For other types of patches, if you don't specify a preferred weekly maintenance window when you create your DB instance, a 30-minute default value is assigned. Some maintenance items require that Amazon RDS take your DB instance offline for a short time. If you want to change when maintenance is performed on your behalf, you can do so by modifying your DB instance in the Management Console, or by using the ModifyDBInstance API. Each of your DB instances can have a different preferred maintenance window. Changes to a DB instance can occur when you manually modify a DB instance, such as when you upgrade the DB engine version, or when Amazon RDS performs maintenance on an instance.
So how does that work in Multi-AZ environments? Running a DB instance as a Multi-AZ deployment reduces the impact of a maintenance event. RDS will conduct maintenance using the following steps: perform maintenance on the standby first, promote the standby to primary, and then perform maintenance on the old primary, which becomes the new standby.
Now, for DB engine upgrades, you can choose to upgrade a DB instance when a new DB engine version is supported by Amazon RDS. Each DB engine has different criteria for upgrading an instance, and for which DB engine versions are supported. When you modify the database engine version for your DB instance in a Multi-AZ deployment, RDS upgrades both the primary and secondary DB instances at the same time. In that case, the database engine for the entire Multi-AZ deployment is shut down during the upgrade.
All right, so Amazon RDS best practices. Always monitor memory, CPU, and storage usage. CloudWatch can be set up to notify you when usage patterns change, or when you approach the capacity of your deployment, so that you can maintain system performance and availability. Enable automatic backups, and set the backup window to occur during the daily low in write IOPS, if you have one.
Scale up your DB instances when you're approaching storage capacity limits. You should aim to have some buffer in storage and memory to accommodate unforeseen increases in demand from your applications.
On a MySQL DB, do not create more than ten thousand tables using Provisioned IOPS, or a thousand tables using standard storage. Large numbers of tables will significantly increase database recovery time after a failover or database crash. Also on MySQL DBs, avoid letting tables grow too large. Underlying file system constraints restrict the maximum size of a MySQL table to two terabytes. So instead of having one large table, partition your tables so that file sizes are well under the two-terabyte limit.
If your database workload requires more I/O than you have provisioned, recovery after a failover or database failure will be slower. So how do you increase the I/O capacity of a DB instance? Here are a few options. First, you can migrate to a DB instance class with higher I/O capacity. Second, you can convert from standard storage to Provisioned IOPS storage, and use a DB instance class that is optimized for Provisioned IOPS. Third, if you're already using Provisioned IOPS storage, provision additional throughput capacity.
Also, if your client application is caching the DNS data of your DB instances, set a time-to-live value of less than 30 seconds. Caching the DNS data for an extended time can lead to connection failures if your application tries to connect to an IP address that is no longer in service after a failover.
So, a few RDS security best practices. First off, don't use the AWS root credentials to manage RDS resources. Use AWS IAM accounts to control access to RDS API actions, especially actions that create, modify, or delete RDS resources. Assign an individual IAM account to each person who manages RDS resources. Grant each user the minimum set of permissions required to perform his or her duties.
And use IAM groups to effectively manage permissions for multiple users. And remember to rotate your IAM credentials regularly.
Let's review a sample question. Your company's on-premises content management system has the following architecture: an application tier of Java code on a JBoss application server; a database tier of Oracle Database, regularly backed up to Amazon Simple Storage Service, S3, using the Oracle RMAN backup utility; and static content stored on a 512-gigabyte gateway-stored Storage Gateway volume attached to the application server via an iSCSI interface. Which AWS-based disaster recovery strategy will give you the best RTO? So we're looking for the way to recover as quickly as possible.
Now let's work from the bottom to the top for a change with these options. Option D: deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Restore the static content from an AWS Storage Gateway-VTL running on Amazon EC2. Gateway-VTL stands for Gateway-Virtual Tape Library. The first part of this option reads okay: yes, we can install Oracle on EC2 using a BYO license, and restoring the Oracle DB via RMAN is a viable way to restore the Oracle database if required. I can't see any constraint with using a JBoss server on EC2. However, the option to restore the static content from VTL is a bit questionable. Yes, you can run the Storage Gateway VM appliance on EC2 when deploying the Virtual Tape Library and gateway-cached versions of AWS Storage Gateway. And yes, if we got it up and running for the static content, it would present itself as an iSCSI interface, but only as a virtual tape! A Gateway-VTL exposes several tape drives and a media changer, referred to collectively as VTL devices, as iSCSI targets.
The VTL interface lets you leverage your existing tape-based backup application infrastructure to store data on virtual tape cartridges that you create on your Gateway-VTL. Each Gateway-VTL is preconfigured with a media changer and tape drives, which are available to your existing client backup applications as iSCSI devices. You add tape cartridges as you need to archive your data. Now, we were told that the static content is currently stored on premises, on a 512-gigabyte gateway-stored Storage Gateway volume. So restoring the static content from the VTL interface is not really a match. If we assume nothing on premises will change in this new AWS design they're proposing, how will we access the gateway volume from the new Gateway-VTL? More importantly, even if there was a way to do that, as there probably is, it just isn't going to be fast enough. So I'm not liking this option.
Let's go to option C. Deploy the Oracle database and the JBoss app server on EC2. Yep. Restore the RMAN Oracle backups from Amazon S3. Sounding good. Restore the static content by attaching an AWS Storage Gateway running on Amazon EC2 as an iSCSI volume to the JBoss EC2 server. Okay, so option C has the same setup as option D, except this time we're given the option to attach a gateway-stored iSCSI volume to the JBoss server. First off, it is possible to do this. The iSCSI interface presents itself to either Linux or Windows as a mappable drive; however, it's not quite that easy. While all this is potentially possible, it's not a very practical way of mapping the volume and getting our system back to its working state as quickly as possible. So in theory you could, but it's a very clunky way to return a system to its working state, even if the timing were guaranteed to be lightning fast.
Okay, so let's have a look at option B. Deploy the Oracle database on RDS. Mmm-hmm, interesting. This is the first option we've had that proposes using Oracle on RDS.
Normally that would be a good option, because straight away we get a managed service, which has automatic backups, etcetera, and of course may give us more durability in how the Oracle database is run. And creating an EBS snapshot for the static content is a nice, easy way to attach our static content, so I love that. It's certainly a lot less cumbersome than trying to map the iSCSI drive proposed in option C, which I'm not saying is impossible, but it's not as streamlined as what is proposed for option B. The problem with option B is that RDS for Oracle doesn't currently support RMAN. So we can't use the RDS stack for this scenario, and we're going to have to skip past option B, unfortunately!
Option A: deploy the Oracle database and the JBoss app server on EC2. Okay. Restore the RMAN Oracle backups from Amazon S3. Good. Like it. Generate an EBS volume of static content from the Storage Gateway and attach it to the JBoss EC2 server. Okay, now this looks much better. RMAN is going to work, and restoring from Amazon S3 will be quite quick. And taking an EBS snapshot and attaching it as a volume is just a far simpler way of restoring that static content. All in all, out of the options we have, I think option A looks like the best way to do this and achieve the lowest possible recovery time objective.
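Returning to the DNS caching best practice from this lesson (keep the time-to-live under 30 seconds so clients pick up the new primary after a failover), here is a minimal sketch of a client-side resolver cache. The class name, the hostname `mydb.example.internal`, and the IP addresses are all hypothetical, and the demonstration uses a fake resolver and clock so it runs without network access.

```python
import socket
import time


class ShortTtlResolver:
    """Caches hostname lookups, but never for 30 seconds or longer."""

    def __init__(self, ttl_seconds=20, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        if not 0 < ttl_seconds < 30:
            raise ValueError("ttl_seconds must be positive and under 30")
        self._ttl = ttl_seconds
        self._resolve = resolve   # function mapping hostname -> IP string
        self._clock = clock       # function returning seconds as a float
        self._cache = {}          # hostname -> (address, time cached)

    def lookup(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]      # entry is still fresh
        address = self._resolve(hostname)
        self._cache[hostname] = (address, now)
        return address


# Fake resolver: first answer is the old primary, second is the promoted standby.
answers = iter(["10.0.1.15", "10.0.2.27"])
fake_time = [0.0]
resolver = ShortTtlResolver(
    ttl_seconds=20,
    resolve=lambda host: next(answers),
    clock=lambda: fake_time[0],
)

first = resolver.lookup("mydb.example.internal")   # resolved at t=0
fake_time[0] = 10.0
second = resolver.lookup("mydb.example.internal")  # served from cache
fake_time[0] = 25.0
third = resolver.lookup("mydb.example.internal")   # cache expired, re-resolved
print(first, second, third)  # 10.0.1.15 10.0.1.15 10.0.2.27
```

The short TTL bounds how long a client can keep connecting to an address that is no longer in service after the CNAME changes.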

About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.