1. Home
  2. Training Library
  3. Amazon Web Services
  4. Courses
  5. Solution Architect Professional for AWS - Domain One - High Availability, Scalability and Business Continuity

Setting the Scene


Advanced High Availability
Setting the Scene
43m 44s
Setting the Scene
3h 31m

Course Description

In this course, you'll gain a solid understanding of the key concepts for Domains One and Seven of the AWS Solutions Architect Professional certification: High Availability, Scalability and Business Continuity. 

Course Objectives

By the end of this course, you'll have the tools and knowledge you need to successfully accomplish the following requirements for this domain, including:

  • Demonstrate ability to architect the appropriate level of availability based on stakeholder requirements.
  • Demonstrate ability to implement DR for systems based on RPO and RTO.
  • Determine appropriate use of multi-Availability Zones vs. multi-Region architectures.
  • Demonstrate ability to implement self-healing capabilities.
  • Demonstrate ability to implement the most appropriate data storage scaling architecture
  • High Availability vs. Fault Tolerance.
  • Scalability and Elasticity.

Intended Audience

This course is intended for students seeking to acquire the AWS Solutions Architect Professional certification. It is necessary to have acquired the Associate level of this certification. You should also have at least two years of real-world experience developing AWS architectures. 


As stated previously, you will need to have completed the AWS Solutions Architect Associate certification, and we recommend reviewing the relevant learning path in order to be well-prepared for the material in this one. 

This Course Includes

  • 1 hour and 13 minutes of high-definition video.
  • Expert-led instruction and exploration of important concepts. 
  • Coverage of critical concepts for Domain one and Domain Seven of the AWS Solutions Architect - Professional certification exam. 

What You Will Learn

  1. Designing a back-up and recovery solution.
  2. Implementing DR based on RTO/ RPO.
  3. RDS back up and restore and self healing capabilities.
  4. Points to remember for the exam.

Welcome to CloudAcademy.com's Advanced High Availability course on Amazon Web Services. This lecture we'll be doing our availability intro. So first, we're going to go over the definition of availability one more time, the textbook definition, that is. Then after we do that, we will begin to ask some clarifying questions around the higher-level problem availability actually presents. We'll restate and rethink the problem, so we'll come up with a new problem statement for what availability really is, and what we're trying to do as DevOps engineers. We will think about the availability of the business, and talk about how it's a little bit different than the availability of a single software system or component. We'll talk about diminishing returns and at least address the considerations that we need to take into account besides just making the software more available. At what point do we need to stop making it more available, etc.? And finally, we'll restate a summary of the goals for this course.

So again, what is availability? Well, "High Availability is a characteristic of a system which aims to ensure an agreed level of operational performance for a higher than normal period." Thank you, Wikipedia. Now what this actually really means is that we're looking at designing a system such that the entirety of the system has more availability than the constituent components. So let's think about what that means. For instance, a DNS service is more highly available than, say, the single server that a DNS query might be responded to by. So we're designing a systems such that the constituent parts are enhanced by each other and the whole of the system becomes more available.

Now the Wikipedia definition points out at a high level that we are trying to exceed some threshold, but we need to think about what that actually means. So let's look at the higher level problem for high availability. Really we're trying to answer a couple questions that pertain to our business, right? We want to know if we can stay online through problems. Problems can be different types of problems. We'll get into that a little bit later, but it doesn't necessarily need to be failure of the software that we wrote. This could be failure of power outage, this could be a natural disaster, this could be any number of things could affect the availability of our system. Can we stay online through issues that we may face? How do we stay up when pieces go down? So again, particularly in the modern Cloud, and in Amazon Web Services is like this is well, we have a lot of parts that go into any given cloud system. So for instance, if you're running a RESTful Web API, you have DNS compute players, database potentially if you're storing your own state and not just referring out to a third party, all kinds of interconnected pieces. So how do we stay online and make sure we're just not showing blank screens or dropping requests when constituent parts of our system might go down?

Now I'm getting into really quickly here is the whole greater than the sum of the parts? So like we said, different pieces may break. That is, a hard drive might fail, an operating system my break, whatever. There's all kinds of things that can go wrong with little pieces of my software. I want to make sure that my entire system is available, or at least the majority of my system is available even when constituent parts break.

The main question that we're really asking is can we effectively and efficiently reduce risk? So risk, that's going to be a keyword for the rest of this course. Availability deals with the risk of having your software be not available. So the only reason that we care about making our software highly available is because this is technical risk that we assume for running a software business, or a business that is using software. For instance, we don't really care about making software available for the sake of making software available. There has to be some sort of business goal that justifies the expenditure of time, effort, and money into reducing this risk of downtime.

So when we rethink the problem, we think of high availability as a technique that we use for risk mitigation. So I'm going to sound a little bit MBA here, but this is the IT version of risk mitigation. So in thinking through the rest of this course, keep a couple of things in mind. Amazon Web Services, while awesome, is not magic. There's a tendency to assume that nothing will ever go down when you start purchasing it from a...purchasing software services from a large vendor, like Amazon Web Services. So we need to think about how we manage the risk out of a business using AWS without going bankrupt.

So there's two parts here. We can use AWS as a risk mitigation technique in and of itself versus, says, an on-site data center or private cloud, but we also need to think about parsing the sentence a little differently. How do we manage the risk out of a system that is using AWS without going bankrupt? So while using Amazon is generally less risky than using other data centers, there is some risk still associated with Amazon because again, it's not magic and they have outages sometimes.

So how much effort and money should we spend on mitigating out this risk? So for instance, if you're Netflix, Netflix runs entirely on Amazon Web Services as of the recording of this video. Netflix has a vested interest in not going down, right? They collect their revenue based on the reliability of their services, and the cost of business is very high when they are offline. Or, say, an e-commerce store.

So let's start thinking about high availability of a software system as a business availability problem, not just a technical availability problem. It is not your job to keep servers online. Let me repeat that. You are watching a CloudAcademy video, but it is not your job to keep servers online. We can't keep all the parts online all the time. That is, the old way of doing operations before we had a cloud and we could spin up different constituent parts very quickly was artisanal operations. People used to keep servers online all the time. Well we can't do that anymore because we don't have physical access. And it doesn't make sense to try to do so.

So servers aren't the only things that break, and we calculate where to focus on risk. So beyond just keeping servers online, we can calculate the parts that may or may not fail. So your job is to keep the business online. That's a key difference between keeping individual servers online and playing the artisanal operations juggling game like the people used to do when they were running on site infrastructure or bare metal.

So we have to think about everything in terms of money. There's only one goal, save money, right? That may seem a little strange talking about a technical system, but if we think about the reason that we want to stay online is that we don't want to lose revenue, or lose customers, or lose goodwill. Everything boils down to saving money. So we need to think about, when we're doing our DevOps engineering, how to save our business the most money by knowing when to stop when we're doing a high availability engineering, and only committing enough effort to meet the 80/20 rule.

So in summary, this course is going to teach us how to learn technical and non-technical DevOps skills. We're going to come up with a process to identify risk, discuss some ways to mitigate risk both technical and non-technical, and we're going to know when to stop. That is, know when we are going too far and it no longer make sense to spend time, or money, or effort on increasing availability because it costs more than it's worth. And we'll integrate availability and risk mitigation processes into our workflow because it'll become second nature to think about things this way. So we'll learn system design processes to deliver business value through IT risk mitigation.

So next step, we'll be looking at how to understand risk, going over the different kinds of risk that we can have, and the different techniques that we have to mitigate it out.

About the Author
Andrew Larkin
Head of Content
Learning Paths

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built  70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+  years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.