Architecting with SLAs in Mind
6h 2m

This section of the AWS Certified Solutions Architect - Professional learning path introduces the AWS management and governance services relevant to the AWS Certified Solutions Architect - Professional exam. These services are used to help you audit, monitor, and evaluate your AWS infrastructure and resources and form a core component of resilient and performant architectures. 


Learning Objectives

  • Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage multi-account environments with AWS Organizations and Control Tower
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Learn about AWS data transformation tools such as AWS Glue, the Amazon Athena query service, and data visualization services like Amazon QuickSight
  • Learn how AWS CloudFormation can be used to represent your infrastructure as code (IaC)
  • Understand SLAs in AWS

Let’s consider a simple architecture for a small company. The company wants to measure their uptime percentage for two reasons: 

  1. They want to commit to a percentage of uptime and broadcast that to their users.

  2. They want to track how they're performing against a goal, so they can improve their availability over time.

Let’s say this company runs a traditional three-tier app. They have a web tier, an app tier, and a database tier. The web tier is made up of multiple EC2 instances across multiple AZs. The app tier is also made up of multiple EC2 instances across multiple AZs, and the database tier uses a Multi-AZ RDS cluster.

With this architecture, the company is introducing redundancy both at the EC2 instance level and at the Availability Zone level, which increases their overall availability. So in this case, I can use the Region-level EC2 SLA, which commits to at least 99.99% uptime, and put 99.99% for both the web tier and the app tier.

And I know from the SLA documentation we looked at earlier that the Multi-AZ RDS cluster currently has a 99.95% availability SLA.

If you want the total availability of this system, you multiply the availability of each component, because every tier has to be up for the system to be up. So, in this case, it would be 99.99% * 99.99% * 99.95%. Converting each percentage to a decimal (0.9999 * 0.9999 * 0.9995), multiplying, and converting back gives about 99.93%. 99.93% makes sense because the availability of the whole system can’t be greater than that of its least available tier, which means we can’t have a higher number than 99.95% for this architecture. Ultimately, this number is really good: it implies only about 30 minutes of downtime per month.
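The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not an AWS tool: the helper names `composite_availability` and `monthly_downtime_minutes` are hypothetical, and the downtime figure assumes a 30-day month (43,200 minutes), which is only an approximation of how SLA measurement periods work.

```python
from math import prod

# Assumption: a 30-day month; real SLA measurement periods vary slightly.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def composite_availability(*availabilities: float) -> float:
    """Availability of components in series: every tier must be up,
    so the individual availabilities (as fractions) are multiplied."""
    return prod(availabilities)

def monthly_downtime_minutes(availability: float) -> float:
    """Expected downtime per month implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_MONTH

# Web tier (EC2 Region-level SLA), app tier (EC2 Region-level SLA), Multi-AZ RDS
redundant = composite_availability(0.9999, 0.9999, 0.9995)
print(f"{redundant:.4%}")                                # 99.9300%
print(f"{monthly_downtime_minutes(redundant):.1f} min")  # 30.2 min
```

Note that the composite result is always at or below the weakest tier, because each factor is at most 1.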

Of course, this 99.93% covers only the infrastructure. The software, application code, and deployment processes aren’t included, so you’ll need to factor in the availability of those components as well. However, it’s worth noting that your software and applications depend on the infrastructure they run on, so your application can’t be more available than that infrastructure: in this case, no more than 99.93%.
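To see why the application can never beat the infrastructure, treat the software as one more component in series. The 99.9% application figure below is purely hypothetical, chosen for illustration; it does not come from any AWS SLA.

```python
from math import prod

def composite_availability(*availabilities: float) -> float:
    """Multiply availabilities of components that must all be up."""
    return prod(availabilities)

infrastructure = composite_availability(0.9999, 0.9999, 0.9995)

# Hypothetical: assume the application code and deployment process
# together achieve 99.9% availability on their own.
application = 0.999

end_to_end = composite_availability(infrastructure, application)
print(f"{end_to_end:.2%}")  # 99.83%
```

Because any extra factor is at most 1, adding the software layer can only keep the end-to-end figure the same or lower it.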

Let’s compare this to an architecture with a single EC2 instance for the web tier, a single EC2 instance for the app tier, and a single-instance RDS database. We lose the redundancy at both the instance and Availability Zone levels, so our availability goes down. Let’s see by how much. The instance-level SLA for EC2 is at least 99.5%, and the single-instance RDS SLA is also at least 99.5%. Multiplying these together, 99.5% * 99.5% * 99.5%, gives roughly 98.51%, which is almost 11 hours of downtime per month. That is roughly 21 times the downtime the company would expect with the redundant architecture, and it doesn’t include any downtime from the actual application or software.
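The comparison can be checked with the same multiplication. As before, this sketch assumes a 30-day month for the downtime figures; the ratio between the two designs does not depend on that assumption, since the month length cancels out.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumed 30-day month

def monthly_downtime_minutes(*availabilities: float) -> float:
    """Downtime per month for components in series (all must be up)."""
    return (1 - prod(availabilities)) * MINUTES_PER_MONTH

redundant = monthly_downtime_minutes(0.9999, 0.9999, 0.9995)  # multi-AZ design
single = monthly_downtime_minutes(0.995, 0.995, 0.995)        # single instances

print(f"{prod([0.995] * 3):.2%}")   # 98.51%
print(f"{single / 60:.1f} h")       # 10.7 h
print(f"{single / redundant:.0f}x") # 21x
```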

Of course, what you lack in redundancy, you save in cash… theoretically. Looking at the two architectures, the single-instance design certainly costs less on the surface, but if its lower availability drives your customers elsewhere, you’ll need to factor that lost business into your costs as well.

Overall, you can use these SLAs to inform your own business uptime requirements, so that you’re broadcasting an accurate number to your customers and users. That’s it for this one - see you next time.

About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time, and to share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training and interactive webinars, and has authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.