Advanced High Availability on AWS
2h 28m

Many businesses host critical infrastructure and technical business assets in the AWS Cloud. Yet, even with so much at stake in the AWS Cloud, many businesses neglect to ensure that their software systems stay online no matter what happens with AWS! In the CloudAcademy Advanced High Availability DevOps Video Course, you will learn critical technical and business analysis skills required to ensure customers can always interact with your cloud.

Watch and Learn:
 - Why AWS isn't magic, and you should always plan your strategy with failure in mind
 - Mental models for classifying business and IT risk in the AWS Cloud
 - The "Big Three" model for increasing the availability of software systems in a methodical way
 - Four possible ways to handle IT risk, depending on your needs
 - Clear action items for surviving various types of AWS outages, even entire region failures!
 - How to walk through and design highly automated distributed REST APIs in 30 minutes or less
 - Financial risk and cost assessment skills to sell the idea of investing in High Availability to key business stakeholders
 - When to stop investing in High Availability due to diminishing returns and business needs

This course is essential for any current or future DevOps practitioner or Advanced AWS Engineer wanting to go beyond pure technical skills and move to a business value and strategic decision making role.

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Welcome to the Advanced High Availability course on Amazon Web Services. In this lecture, we'll be doing our availability intro. So first, we're going to go over the definition of availability one more time, the textbook definition, that is. Then after we do that, we will begin to ask some clarifying questions around the higher-level problem that availability actually presents. We'll restate and rethink the problem, so we'll come up with a new problem statement for what availability really is, and what we're trying to do as DevOps engineers. We will think about the availability of the business, and talk about how it's a little bit different from the availability of a single software system or component. We'll talk about diminishing returns and at least address the considerations that we need to take into account besides just making the software more available. At what point do we need to stop making it more available? And finally, we'll restate a summary of the goals for this course.

So again, what is availability? Well, "High Availability is a characteristic of a system which aims to ensure an agreed level of operational performance for a higher than normal period." Thank you, Wikipedia. Now what this actually really means is that we're looking at designing a system such that the entirety of the system has more availability than the constituent components. So let's think about what that means. For instance, a DNS service is more highly available than, say, the single server that might respond to a given DNS query. So we're designing a system such that the constituent parts are enhanced by each other and the whole of the system becomes more available.
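To put a number on "the whole is more available than the parts," here is a minimal sketch. It assumes that replicas fail independently and that the service counts as up whenever at least one replica is up. The 99% per-server figure and the function name are illustrative assumptions, not values from the course.

```python
def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group that is up when at least one replica is up.

    Assumes replicas fail independently, which real deployments only approximate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical single-server availability of 99%.
    for n in (1, 2, 3):
        print(n, "replica(s):", redundant_availability(0.99, n))
```

Under these assumptions, each extra replica multiplies the chance of a total outage by the single-server failure probability, which is why a pool of ordinary servers can beat any one of them.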

Now the Wikipedia definition points out at a high level that we are trying to exceed some threshold, but we need to think about what that actually means. So let's look at the higher-level problem for high availability. Really we're trying to answer a couple of questions that pertain to our business, right? We want to know if we can stay online through problems. Problems can be of different types. We'll get into that a little bit later, but it doesn't necessarily need to be a failure of the software that we wrote. This could be a power outage, this could be a natural disaster, this could be any number of things that could affect the availability of our system. Can we stay online through issues that we may face? How do we stay up when pieces go down? So again, particularly in the modern cloud, and Amazon Web Services is like this as well, we have a lot of parts that go into any given cloud system. So for instance, if you're running a RESTful Web API, you have DNS, compute layers, potentially a database if you're storing your own state and not just referring out to a third party, all kinds of interconnected pieces. So how do we stay online and make sure we're not just showing blank screens or dropping requests when constituent parts of our system might go down?
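The flip side of having many interconnected pieces can also be sketched numerically. This example assumes a request succeeds only if every piece in the chain (DNS, compute, database) is up, and that the pieces fail independently. The per-piece availability figures are hypothetical placeholders, not measured AWS numbers.

```python
from math import prod
from typing import Mapping


def serial_availability(layers: Mapping[str, float]) -> float:
    """Availability of a request path that needs every layer to be up.

    Assumes independent failures, so the combined figure is the product
    of the individual availabilities.
    """
    if not layers:
        raise ValueError("need at least one layer")
    return prod(layers.values())


# Hypothetical figures for illustration only.
EXAMPLE_LAYERS = {"dns": 0.9999, "compute": 0.99, "database": 0.995}

if __name__ == "__main__":
    print("end-to-end availability:", serial_availability(EXAMPLE_LAYERS))
```

In this model the chain is always less available than its weakest piece, which is why staying online means thinking about every dependency in the chain rather than only the servers.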

Now what I'm getting at really quickly here is this: is the whole greater than the sum of the parts? So like we said, different pieces may break. That is, a hard drive might fail, an operating system may break, whatever. There are all kinds of things that can go wrong with little pieces of my software. I want to make sure that my entire system is available, or at least the majority of my system is available, even when constituent parts break.

The main question that we're really asking is can we effectively and efficiently reduce risk? So risk, that's going to be a keyword for the rest of this course. Availability deals with the risk of having your software be not available. So the only reason that we care about making our software highly available is because this is technical risk that we assume for running a software business, or a business that is using software. For instance, we don't really care about making software available for the sake of making software available. There has to be some sort of business goal that justifies the expenditure of time, effort, and money into reducing this risk of downtime.

So when we rethink the problem, we think of high availability as a technique that we use for risk mitigation. So I'm going to sound a little bit MBA here, but this is the IT version of risk mitigation. So in thinking through the rest of this course, keep a couple of things in mind. Amazon Web Services, while awesome, is not magic. There's a tendency to assume that nothing will ever go down when you start purchasing software services from a large vendor, like Amazon Web Services. So we need to think about how we manage the risk out of a business using AWS without going bankrupt.

So there are two parts here. We can use AWS as a risk mitigation technique in and of itself versus, say, an on-site data center or private cloud, but we also need to think about parsing the sentence a little differently. How do we manage the risk out of a system that is using AWS without going bankrupt? So while using Amazon is generally less risky than using other data centers, there is some risk still associated with Amazon because again, it's not magic and they have outages sometimes.

So how much effort and money should we spend on mitigating this risk? So for instance, take Netflix. Netflix runs entirely on Amazon Web Services as of the recording of this video. Netflix has a vested interest in not going down, right? They collect their revenue based on the reliability of their services, and the cost to the business is very high when they are offline. The same goes for, say, an e-commerce store.

So let's start thinking about high availability of a software system as a business availability problem, not just a technical availability problem. It is not your job to keep servers online. Let me repeat that. You are watching a CloudAcademy video, but it is not your job to keep servers online. We can't keep all the parts online all the time. That is, the old way of doing operations before we had a cloud and we could spin up different constituent parts very quickly was artisanal operations. People used to keep servers online all the time. Well we can't do that anymore because we don't have physical access. And it doesn't make sense to try to do so.

So servers aren't the only things that break, and we can calculate where to focus on risk. So beyond just keeping servers online, we can calculate which parts may or may not fail. So your job is to keep the business online. That's a key difference between keeping individual servers online and playing the artisanal operations juggling game like people used to do when they were running on-site infrastructure or bare metal.

So we have to think about everything in terms of money. There's only one goal, save money, right? That may seem a little strange when talking about a technical system, but think about it: the reason that we want to stay online is that we don't want to lose revenue, or lose customers, or lose goodwill. Everything boils down to saving money. So we need to think about, when we're doing our DevOps engineering, how to save our business the most money by knowing when to stop when we're doing high availability engineering, and only committing enough effort to meet the 80/20 rule.
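The "know when to stop" idea can be sketched as a comparison between what an availability upgrade costs and the downtime losses it avoids. Everything here is a hypothetical assumption for illustration: the tier names, their costs and availabilities, the revenue-per-hour figures, and the simplification that lost revenue is the only cost of downtime (goodwill and lost customers are ignored).

```python
from dataclasses import dataclass
from typing import Sequence

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class Tier:
    name: str
    availability: float  # fraction of the year the business is online
    annual_cost: float   # yearly cost to build and run this tier


def downtime_cost(availability: float, revenue_per_hour: float) -> float:
    """Expected yearly revenue lost while offline (simplified model)."""
    return (1.0 - availability) * HOURS_PER_YEAR * revenue_per_hour


def pick_tier(tiers: Sequence[Tier], revenue_per_hour: float) -> Tier:
    """Walk tiers from cheapest to most expensive and stop at the first
    upgrade whose extra cost exceeds the downtime losses it avoids."""
    ordered = sorted(tiers, key=lambda t: t.annual_cost)
    chosen = ordered[0]
    for candidate in ordered[1:]:
        extra_cost = candidate.annual_cost - chosen.annual_cost
        avoided = (downtime_cost(chosen.availability, revenue_per_hour)
                   - downtime_cost(candidate.availability, revenue_per_hour))
        if avoided <= extra_cost:
            break  # diminishing returns: the upgrade no longer pays for itself
        chosen = candidate
    return chosen


# Hypothetical tiers; the names and numbers are made up for this sketch.
EXAMPLE_TIERS = [
    Tier("single-az", 0.99, 10_000),
    Tier("multi-az", 0.999, 25_000),
    Tier("multi-region", 0.9999, 120_000),
]

if __name__ == "__main__":
    for revenue in (1_000, 50_000):
        print(revenue, "per hour ->", pick_tier(EXAMPLE_TIERS, revenue).name)
```

With these made-up numbers, a business earning little per hour stops at the middle tier, while one earning far more per hour can justify the most expensive tier. The right stopping point depends on the business, not on the technology alone.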

So in summary, this course is going to teach us technical and non-technical DevOps skills. We're going to come up with a process to identify risk, discuss some ways to mitigate risk, both technical and non-technical, and we're going to know when to stop. That is, know when we are going too far and it no longer makes sense to spend time, or money, or effort on increasing availability because it costs more than it's worth. And we'll integrate availability and risk mitigation processes into our workflow so that it becomes second nature to think about things this way. So we'll learn system design processes to deliver business value through IT risk mitigation.

So next up, we'll be looking at how to understand risk, going over the different kinds of risk that we can have, and the different techniques that we have to mitigate it.

About the Author

Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.