Advanced High Availability
In this course, you'll gain a solid understanding of the key concepts for Domains One and Seven of the AWS Solutions Architect Professional certification: High Availability, Scalability and Business Continuity.
By the end of this course, you'll have the tools and knowledge you need to successfully accomplish the following requirements for this domain, including:
- Demonstrate ability to architect the appropriate level of availability based on stakeholder requirements.
- Demonstrate ability to implement DR for systems based on RPO and RTO.
- Determine appropriate use of multi-Availability Zones vs. multi-Region architectures.
- Demonstrate ability to implement self-healing capabilities.
- Demonstrate ability to implement the most appropriate data storage scaling architecture.
- High Availability vs. Fault Tolerance.
- Scalability and Elasticity.
This course is intended for students seeking to acquire the AWS Solutions Architect Professional certification. It is necessary to have acquired the Associate level of this certification. You should also have at least two years of real-world experience developing AWS architectures.
As stated previously, you will need to have completed the AWS Solutions Architect Associate certification, and we recommend reviewing the relevant learning path in order to be well-prepared for the material in this one.
This Course Includes
- 1 hour and 13 minutes of high-definition video.
- Expert-led instruction and exploration of important concepts.
- Coverage of critical concepts for Domain One and Domain Seven of the AWS Solutions Architect - Professional certification exam.
What You Will Learn
- Designing a backup and recovery solution.
- Implementing DR based on RTO/RPO.
- RDS backup and restore, and self-healing capabilities.
- Points to remember for the exam.
A company typically decides on an acceptable business continuity plan based on the financial impact to the business when systems are unavailable. The company determines that financial impact by considering many factors, such as the loss of business and the damage to its reputation due to downtime or a lack of systems availability. The common metrics for business continuity are the Recovery Time Objective and the Recovery Point Objective, so let's delve into these two concepts.
The Recovery Time Objective, or RTO, is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement. For example, if a disaster occurs at 12 noon and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8 PM.
The Recovery Point Objective, or RPO, is the acceptable amount of data loss measured in time. So just to confuse everyone, it is also a time value, but the two are quite different concepts. For example, if that disaster occurred at 12 noon and the RPO is one hour, the system should recover all data that was in the system before 11 AM. The data loss will span only one hour, between 11 AM and 12 PM.
So they're quite different, aren't they? The Recovery Point Objective asks how much recent data we can afford to lose if there is an outage. For a highly transactional business, that's going to be extremely low. Even an hour of data loss, if you're dealing with customer transactions, is not going to be acceptable. And that's going to impact how we design our systems to be as highly available and as fault-tolerant as possible for a business like that.
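The arithmetic in the two examples above can be checked with a few lines of Python. This is only a sketch of the definitions: the function name `recovery_targets` and the calendar date are made up for illustration, while the noon disaster, the eight-hour RTO and the one-hour RPO come from the examples.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_time, rto, rpo):
    """Return (restore_deadline, recovery_point) for a disruption.

    restore_deadline: latest time the service may be back at its agreed level.
    recovery_point:   data written before this time must be recoverable.
    """
    restore_deadline = disaster_time + rto
    recovery_point = disaster_time - rpo
    return restore_deadline, recovery_point


# Hypothetical date; only the time of day matters for the example.
disaster = datetime(2024, 1, 15, 12, 0)
deadline, recovery_point = recovery_targets(
    disaster,
    rto=timedelta(hours=8),   # RTO from the first example
    rpo=timedelta(hours=1),   # RPO from the second example
)

print("Service restored by:", deadline.strftime("%H:%M"))            # 20:00
print("Data recoverable up to:", recovery_point.strftime("%H:%M"))   # 11:00
```

Note that the RTO is added to the disaster time while the RPO is subtracted from it, which is one way to keep the two time values apart.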
Another scenario might be that the business can absorb some data loss but does need to have the systems up and running again as soon as possible, so the RTO might be the priority. Part of your business continuity planning needs to be defining what the priority is. Is it the Recovery Time Objective, i.e. how quickly we can get the system back up and running again so it can answer queries and requests and be fully functional? Or is it the Recovery Point Objective, where we must be able to go back to the last possible point in time without any data loss?
There are a number of different scenarios that we can apply in AWS to help meet our RPOs and RTOs, and the first one is what we call backup and restore. With backup and restore, data is stored as a virtual tape library using AWS Storage Gateway or another network appliance of a similar nature. We can use AWS Import/Export to ship large archives when setting up a backup and restore scenario. Then in a disaster, archives are recovered from Amazon S3 and restored as if we were using a virtual tape.
Now we need to select the appropriate tools and methods to back up our data into AWS. There are three things to keep in mind. First, ensure that you have an appropriate retention policy for this data. How long are we going to keep these virtual tape archives? Is it six months, a year, five years? What are the commercial and compliance requirements? Second, ensure that the appropriate security measures are in place for this data, including encryption and access policies. Can we guarantee that where it's being stored is going to be suitably secure? And third, we need to make sure that we regularly test the recovery of the data and the restoration of the system.
All right, so the second potential design is what we call pilot light.
In pilot light, data is mirrored and the environment is scripted as a template, which can be built out and scaled in the unlikely event of a disaster. There are a few steps that we need to go through to make pilot light work. First, we set up Amazon EC2 instances to replicate or mirror our data. Second, we ensure that we have all supporting custom software packages available in AWS. That can be quite an operational overhead, making sure that the latest versions of all the custom software packages we need for our environment are available in AWS. Third, we need to create and maintain Amazon Machine Images of the key servers where fast recovery is required. Fourth, we need to regularly run these servers, test them and apply any software updates and configuration changes, to ensure that they match our current production environment in the event of a disaster. And fifth, we need to consider automating the provisioning of AWS services as much as possible with CloudFormation.
So what does that look like in our recovery phase? In the unlikely event of a disaster, to recover the remainder of our environment around our pilot light, we can start our systems from the Amazon Machine Images on the appropriate instance types. For our dynamic data servers, we can resize them to handle production volumes as needed, or add capacity accordingly. Horizontal scaling is often the most cost-effective and scalable approach to adding capacity to the pilot light system. As an example, we can add more web servers at peak times during the day. However, we can also choose larger Amazon EC2 instance types and thus scale vertically for applications such as our relational databases and file storage. Any required DNS updates can be done in parallel.
Okay, so the third scenario we can implement is what we call warm standby.
Warm standby is, as it says, essentially ready to go, with all key services running in the most minimal possible way. These are our key steps for preparation. First, we set up our Amazon EC2 instances to replicate or mirror data. Second, we create and maintain Amazon Machine Images as required. Third, we run our application using a minimum footprint of Amazon EC2 instances or AWS infrastructure, so it's basically the bare minimum that we can get by with. And fourth, we patch and update software and configuration files in line with our live environment. So we're essentially running a smaller version of our full production environment. Then during our recovery phase, in the case of failure of the production system, the standby environment will be scaled up for production load, and the DNS records will be changed to route all traffic to the AWS environment.
Now our fourth potential scenario is what we call multi-site. With multi-site, we set up our AWS environment to duplicate our production environment, so essentially we've got a mirror of our production running in AWS. First, we set up DNS weighting, or a similar traffic routing technology if we're not using Route 53, to distribute incoming requests to both sites. We also configure automated failover to reroute traffic away from the affected site in the event of an outage. Then in our recovery phase, traffic is cut over to the AWS infrastructure by updating the DNS record in Route 53, and all traffic and supporting data queries are supported by the AWS infrastructure. Multi-site is usually the preferred scenario where recovery time and recovery point are the priorities and cost is not the main constraint; in that case it would be the ideal choice.
Okay, so one key thing when we're running any of our scenarios is to ensure that we test the recovered data.
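To make the multi-site routing idea more concrete, here is a small local simulation of weighted routing with health-based failover. It is a sketch only and does not call Route 53 or any AWS API: the function `pick_site`, the site names and the 50/50 weights are all assumptions made up for illustration.

```python
import random


def pick_site(weights, healthy, rng=random):
    """Choose a site in proportion to its weight, skipping unhealthy sites."""
    candidates = {
        site: weight
        for site, weight in weights.items()
        if healthy.get(site, False) and weight > 0
    }
    if not candidates:
        raise RuntimeError("no healthy site available")
    sites = list(candidates)
    return rng.choices(sites, weights=[candidates[s] for s in sites], k=1)[0]


# Hypothetical sites and weights: requests are split evenly in normal operation.
weights = {"on_premises": 50, "aws": 50}

# Normal operation: both sites receive traffic.
normal = [pick_site(weights, {"on_premises": True, "aws": True}) for _ in range(1000)]

# Outage at the primary site: every request is rerouted to AWS.
failover = [pick_site(weights, {"on_premises": False, "aws": True}) for _ in range(1000)]

print("normal:", {site: normal.count(site) for site in weights})
print("failover:", {site: failover.count(site) for site in weights})
```

In a real deployment the health status would come from health checks rather than a hand-written dictionary, and DNS caching means clients do not switch over instantly.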
Once we restore our primary site to a working state, we then need to return to normal service, which is often referred to as the failback process. Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back without loss of data.
Here are the steps, first for backup and restore. First, we freeze the data changes on the DR site. Second, we take a backup. Third, we restore the backup to the primary site. Fourth, we re-point users to the primary site. And fifth, we unfreeze the changes.
With pilot light, warm standby and multi-site, first we establish reverse mirroring and replication from the DR site back to the primary site. Second, once the primary site has caught up with the changes, we freeze data changes to the DR site. Third, we re-point users to the primary site. And finally, we unfreeze the changes.
So a sample question here reads: an ERP application, which stands for enterprise resource planning, is deployed in multiple Availability Zones in a single region. In the event of failure, the RTO must be less than three hours, and the RPO is 15 minutes. The customer realizes that data corruption occurred roughly one and a half hours ago. Which disaster recovery strategy can be used to achieve this RTO and RPO in the event of this kind of failure?
Okay, so this is a sample question provided by AWS to help us prepare. The first thing we want to do is highlight the key requirements that are given to us. First off, we have an application. There are no dependencies defined. For example, there's no mention of specific AWS services such as Amazon RDS, AWS Direct Connect or AWS Storage Gateway. While we are told it's an ERP application, we can probably safely assume for now that it's a standalone service. The service runs in multiple Availability Zones in a single region.
These points may become relevant if, say, we are asked to improve performance or reduce cost, so let's keep them in the back of our minds for now. The RPO, or Recovery Point Objective, is 15 minutes, and the Recovery Time Objective is three hours. But before we dive any deeper, let's ensure we understand the question so we can evaluate the options we're being provided. The question is asking us which DR strategy can be used to achieve this RTO and RPO in the event of this kind of failure.
If we go back to our disaster recovery white paper, or our own little disaster recovery wall chart, we might recall we have roughly four approaches to disaster recovery: backup and restore, pilot light, warm standby and multi-site. Now this customer experienced an issue roughly an hour and a half ago, so this question is, I'm thinking, more about point-in-time recovery. That Recovery Point Objective is low. It's red-light low, not sirens-blaring low. It's achievable, but we need a reasonably granular backup strategy to provide that type of recovery point. Regular point-in-time snapshots should work fine here for backup, as long as the restore process can be completed in time to meet that three-hour Recovery Time Objective. So ideally, we'll be looking for archives of, say, under a terabyte. Otherwise, we might struggle to restore an entire system from an archive with this low an RTO.
In this scenario, achieving the RPO and the RTO is my primary consideration right now. How do we ensure that, within three hours, we can go back to a point in time no more than 15 minutes before that corruption occurred? How we manage our backups is crucial to this scenario. We'll recall that synchronous replication in a Multi-AZ database deployment is a good solution for recovery time, as it replicates our DB. However, we still need a backup strategy to recover a system to a point in time. So back to our options.
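A point-in-time restore from a full backup plus transaction logs can be sketched as follows. This is an illustration only: the function `plan_restore`, the timestamps, and the assumed schedule of hourly full backups with logs shipped every five minutes are made up for the example. Only the figures of corruption about one and a half hours ago and a 15-minute RPO come from the question.

```python
from datetime import datetime, timedelta


def plan_restore(target, backups, log_times):
    """Pick the latest full backup at or before target, plus the logs to replay.

    Returns (base_backup, logs_to_replay, recovered_to).
    """
    eligible = [b for b in backups if b <= target]
    if not eligible:
        raise ValueError("no full backup precedes the requested point in time")
    base = max(eligible)
    replay = sorted(t for t in log_times if base < t <= target)
    recovered_to = replay[-1] if replay else base
    return base, replay, recovered_to


# Hypothetical timeline: it is noon now and corruption happened at 10:30.
start = datetime(2024, 1, 15, 8, 0)
corruption = datetime(2024, 1, 15, 10, 30)
backups = [start + timedelta(hours=i) for i in range(5)]            # 08:00 .. 12:00
log_times = [start + timedelta(minutes=5 * i) for i in range(1, 49)]  # 08:05 .. 12:00

# Restore to just before the corruption so the bad writes are not replayed.
target = corruption - timedelta(minutes=1)
base, replay, recovered_to = plan_restore(target, backups, log_times)

data_loss = corruption - recovered_to
print("base backup:", base.strftime("%H:%M"))
print("logs replayed:", len(replay))
print("data lost before corruption:", data_loss)
print("within 15-minute RPO:", data_loss <= timedelta(minutes=15))
```

With this assumed schedule, the restore starts from the 10:00 backup and replays the logs up to 10:25, so the five-minute log interval rather than the hourly backup interval determines how much data is lost.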
Our point-in-time recovery is dependent on our backup strategy, so I think we can eliminate option B from this scenario. It's not going to help us with this point-in-time recovery. Multi-AZ databases significantly improve the availability of the system. However, they don't necessarily make it any easier to recover to a point in time, because synchronous replication copies the corrupted data to the standby as well. As a managed service, Amazon RDS enables automatic backups, which can restore to within five minutes of the last data write. That is a brilliant feature and well worth deploying if you're using RDS. But Multi-AZ deployments don't provide any magic bullet for how we restore to a specific point in time.
So hang on, didn't we just say that RDS has automated backups? Yes, it does. So let's make sure we're clear on the difference between automated backups and snapshots, and why the difference between them might be relevant for this scenario. The automated backup feature in Amazon RDS enables point-in-time recovery of your database instance. When automated backups are turned on for your DB instance, Amazon RDS automatically performs a full daily snapshot of your data during your preferred backup window, and captures transaction logs as updates to your DB instance are made. When you initiate a point-in-time recovery, transaction logs are applied to the most appropriate daily backup in order to restore your DB instance to the specific point in time you requested.
Amazon RDS retains backups of a database instance for a limited, user-specified period of time, which is called the retention period. By default, that is one day, but you can set it for up to 35 days. You can initiate a point-in-time restore and specify any second during your retention period, up to the latest restorable time. And you can use the DescribeDBInstances API to return the latest restorable time for your DB instances, which is typically within the last five minutes.
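As a rough sketch of that API lookup in Python, the helpers below read the `LatestRestorableTime` field from a DescribeDBInstances response and check whether a requested restore time is covered. The helper names and the instance identifier `erp-db` are made up for illustration. The `fetch_and_check` function assumes that boto3 is installed and that AWS credentials are configured, and it imports boto3 lazily so the pure helpers can be tried without either.

```python
from datetime import datetime, timedelta, timezone


def latest_restorable_time(response):
    """Extract LatestRestorableTime from a DescribeDBInstances response dict."""
    instances = response.get("DBInstances", [])
    if not instances:
        raise LookupError("no DB instance in response")
    # Assumption: the field may be missing when automated backups are disabled.
    return instances[0].get("LatestRestorableTime")


def can_restore_to(response, restore_time):
    """True if restore_time is not later than the latest restorable time."""
    latest = latest_restorable_time(response)
    return latest is not None and restore_time <= latest


def fetch_and_check(instance_id, restore_time):
    """Query RDS for one instance; needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    rds = boto3.client("rds")
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    return can_restore_to(response, restore_time)


# Offline illustration with a hand-built response (not real API output).
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
sample = {
    "DBInstances": [
        {
            "DBInstanceIdentifier": "erp-db",  # hypothetical identifier
            "LatestRestorableTime": now - timedelta(minutes=5),
        }
    ]
}
wanted = now - timedelta(minutes=91)  # just before corruption 90 minutes ago
print(can_restore_to(sample, wanted))
```

This check covers only the upper bound. The restore time must also fall inside the retention period, which the sketch does not verify.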
Alternatively, you can find the latest restorable time for a DB instance by selecting it in the AWS Management Console and looking in the Description tab of the lower panel of the console.
DB snapshots, on the other hand, are user initiated and enable you to back up your DB instance to a known state as frequently as you wish, and then restore to that specific state at any time. DB snapshots are kept until you explicitly delete them with the console or the DeleteDBSnapshot API. The snapshots which Amazon RDS performs for enabling automated backups are available to you for copying, using the AWS console or the copy-db-snapshot command, or for the snapshot restore functionality. You can identify them using the automated snapshot type. In addition, you can identify the time at which the snapshot was taken by viewing the Snapshot Created Time field. Alternatively, the identifier of the automated snapshot also contains the time, in UTC, at which the snapshot was taken.
So let's think about this for a minute. If we'd been told early on that RDS was being used, we could assume that automated backups would be turned on and that we might be able to use them as a way of recovering to within 15 minutes, within three hours, in this scenario. But this option just says turning on Multi-AZ databases. That doesn't automagically do anything that is going to make it any easier for us to restore to a point in time. So again, it's coming down to what options we're given in the question. Because we haven't been given enough information to just assume that automated backups will provide us with this, we're going to have to work on the assumption that we'll be using a snapshot of some sort, or our own backup service. So we're going to skip past this option.
Let's work on the smaller option set and see what we've got left. Now option A. This looks feasible at first, but on closer reading it doesn't quite meet our requirements.
The 15-minute DB backups and five-minute frequency on transaction logs are in line with our Recovery Point Objective. The problem with this option is that it says backup to Glacier. Glacier isn't going to enable us to restore quickly, as we have a two to three-hour wait time to recover an archive from Glacier. Now if the RTO were, say, eight hours, we could consider this scenario viable. If our DB backup went to S3, the option would look even better. So while the frequency of backups and the ability to go back to a recovery point are viable, the actual storage mechanism suggested isn't.
Okay, so option C: take hourly DB backups to Amazon S3, with transaction logs stored in S3 every five minutes. That's a plausible restore solution. First of all, we are backing up to S3, which means a considerably faster restore time than the one we would experience if we were using Amazon Glacier. So Amazon S3 would help us achieve our Recovery Time Objective. The frequency of snapshots is not ideal. With an hourly backup and five-minute transaction logs, however, we could rebuild the database and restore to within five minutes of the identified corruption time. Of course, this is not the ideal solution. Closer to ideal would be the outline given in option A, but with the archives stored in Amazon S3 rather than in Amazon Glacier. However, exam options stand on their own merit, so this may not be the best possible design; it is simply the best of the options thus far.
Now two other things could have made this option even better. The first would be if we were told that we were actually using Amazon RDS, and a Multi-AZ deployment of it, because that would give us automatic backups as well as more durability. Multi-AZ means no interruption for patches or platform updates, et cetera. And the second thing that would have improved our choices here would be if the customer had AWS Storage Gateway in place.
If we were using gateway-cached volumes, potentially we could keep archives available while backing up or restoring before, during or after the outage.
Okay, so we've got one more option left, option D, which is to take hourly DB backups to an Amazon EC2 instance store volume, with transaction logs stored in Amazon S3 every five minutes. Hopefully you have picked up what is wrong with this option already. The backup frequency is acceptable for our Recovery Point Objective. However, the backup media proposed is not. EC2 instance store volumes are ephemeral storage, and instance store volumes are deleted when an instance store-backed instance is stopped or terminated. So this option won't persist any actual archive for us, and it's certainly not going to work for our backup and restore solution. Option D is probably a bit of a red herring.
Okay, so looking at our options, option C would probably be the best of the bad bunch.
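The elimination reasoning for options A, C and D can be summarised in a short sketch. The `Option` fields and the one-hour figure for restore work are assumptions made up for illustration. The Glacier delay uses the upper end of the two to three-hour wait mentioned above, and option B is left out because it is not described in detail here.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Option:
    name: str
    log_interval: timedelta     # how often transaction logs are shipped
    retrieval_delay: timedelta  # wait before the backup can be read
    durable: bool               # does the backup survive instance stop/terminate?


def meets_targets(opt, rpo, rto, restore_work):
    """Check worst-case data loss against the RPO and recovery time against the RTO."""
    if not opt.durable:
        return False  # the backup may no longer exist when we need it
    worst_data_loss = opt.log_interval
    worst_recovery = opt.retrieval_delay + restore_work
    return worst_data_loss <= rpo and worst_recovery <= rto


options = [
    Option("A", timedelta(minutes=5), timedelta(hours=3), durable=True),   # Glacier
    Option("C", timedelta(minutes=5), timedelta(0), durable=True),         # S3
    Option("D", timedelta(minutes=5), timedelta(0), durable=False),        # instance store
]

results = {
    opt.name: meets_targets(
        opt,
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=3),
        restore_work=timedelta(hours=1),  # assumed time to rebuild and replay logs
    )
    for opt in options
}
print(results)  # {'A': False, 'C': True, 'D': False}
```

Under these assumptions only option C passes: option A fails on recovery time because of the retrieval wait, and option D fails because its backups are not durable.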
About the Author
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.