
Architecting with SLAs in Mind


This course covers the core learning objective to meet the requirements of the 'Designing for disaster recovery & high availability in AWS - Level 2' skill

Learning Objectives:

  • Analyze the amount of resources required to implement a fault-tolerant architecture across multiple AWS Availability Zones
  • Evaluate an effective AWS disaster recovery strategy to meet specific business requirements
  • Understand SLA for AWS services to ensure the high availability of a given AWS solution
  • Analyze which AWS services can be leveraged to implement a decoupled solution

Let’s consider a simple architecture for a small company. The company wants to measure their uptime percentage for two reasons: 

  1. They want to commit to a percentage of uptime and publish that to their users.

  2. They want to track how they’re performing against a goal, so they can improve their availability over time.

Let’s say this company runs a traditional three-tier app. They have a web tier, an app tier, and a database tier. The web tier is made up of multiple EC2 instances across multiple AZs. The app tier is also made up of multiple EC2 instances across multiple AZs, and the database tier uses a multi-AZ RDS cluster.

With this architecture, the company is introducing redundancy of components both at the EC2 instance level and at the Availability Zone level, which increases their availability level overall. So in this case, I can use the Region-level EC2 SLA which commits to at least 99.99% uptime. For both the web tier and app tier, I can put 99.99%. 

And I know from looking at the SLA documentation earlier that the multi-AZ RDS cluster currently has a 99.95% availability SLA.

If you want the total availability of this system, you multiply the availability of each component. So, in this case, it would be 99.99% * 99.99% * 99.95%, which gives us 99.93% after moving the decimal back a few places to accommodate the percentages. 99.93% makes sense because the availability of the whole system can’t be greater than that of the least available tier, which means we can’t have a higher number than 99.95% for this architecture. Ultimately, this number is really good, as it sets an expectation of only about 30 minutes of downtime per month (0.07% of a 30-day month).
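The multiplication above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `serial_availability` and `monthly_downtime_minutes` are made up for this example, the month is assumed to be 30 days long, and the tier figures are the SLA percentages quoted in this lecture written as fractions.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def serial_availability(tiers):
    """Availability of a system in which every tier must be up (fractions, 0 to 1)."""
    return prod(tiers)


def monthly_downtime_minutes(availability):
    """Expected downtime per month implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_MONTH


# Web tier, app tier, database tier, using the SLA figures from the lecture
tiers = [0.9999, 0.9999, 0.9995]
total = serial_availability(tiers)

print(f"Total availability: {total:.2%}")
print(f"Expected downtime: {monthly_downtime_minutes(total):.1f} minutes per month")
```

Because every factor is at most 1, the product can never exceed the smallest tier value, which is why the result lands just under 99.95%.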

Of course, this 99.93% only really covers the infrastructure. The software, application code, and deployment processes aren’t included, so you’ll need to factor in the availability of those components as well. However, it’s worth noting that your software and applications depend on the infrastructure they run on, so your application can’t be more available than that infrastructure. In this case, it can’t be more available than 99.93%.

Let’s compare this to a single EC2 instance for the web tier and a single EC2 instance for the app tier, with a single-instance RDS database. We lose the redundancy at the instance and Availability Zone level, so our availability metrics go down. Let’s see by how much. The instance-level SLA for EC2 is at least 99.5%. The single-instance RDS SLA is also at least 99.5%. Let’s multiply these together: 99.5% * 99.5% * 99.5%. Move the decimal, and we get about 98.51%, which is almost 11 hours of downtime per month. That is roughly 21 times the approximately 30 minutes per month the company could expect with redundancy. And that’s not including any downtime from your actual application or software.
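To see the gap between the two designs side by side, here is a small Python sketch. It is a rough comparison, not an official calculator: the helper `monthly_downtime_hours` is a hypothetical name, a 30-day month is assumed, and the inputs are the SLA percentages mentioned in this lecture.

```python
from math import prod

HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def monthly_downtime_hours(tier_availabilities):
    """Expected monthly downtime when all tiers must be up at the same time."""
    return (1 - prod(tier_availabilities)) * HOURS_PER_MONTH


# Web, app, and database tiers for each design (SLA figures from the lecture)
redundant = [0.9999, 0.9999, 0.9995]  # multi-AZ EC2 tiers and multi-AZ RDS
single = [0.995, 0.995, 0.995]        # single EC2 instances and single-instance RDS

redundant_hours = monthly_downtime_hours(redundant)
single_hours = monthly_downtime_hours(single)
ratio = single_hours / redundant_hours

print(f"Redundant design: {redundant_hours * 60:.0f} minutes of downtime per month")
print(f"Single-instance design: {single_hours:.1f} hours of downtime per month")
print(f"The single-instance design has about {ratio:.0f}x the downtime")
```

Changing the month length scales both downtime figures equally, so the ratio between the two designs stays the same.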

Of course, what you lack in redundancy, you save in cash…theoretically. Looking at the architectures, the single instance certainly costs less at surface level, but if your lack of availability is driving your customers elsewhere instead, then you’ll need to factor that into your costs as well. 

Overall, you can use these SLAs to inform your own business uptime requirements, so that you’re broadcasting an accurate number to your customers and users. That’s it for this one - see you next time.

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.