Concepts and Skills
Practical HA Design
Many businesses host critical infrastructure and technical business assets in the AWS Cloud. Yet, even with so much at stake in the AWS Cloud, many businesses neglect to ensure that their software systems stay online no matter what happens with AWS! In the CloudAcademy Advanced High Availability DevOps Video Course, you will learn critical technical and business analysis skills required to ensure customers can always interact with your cloud.
Watch and Learn:
- Why AWS isn't magic, and you should always plan your strategy with failure in mind
- Mental models for classifying business and IT risk in the AWS Cloud
- The "Big Three" model for increasing the availability of software systems in a methodical way
- Four possible ways to handle IT risk, depending on your needs
- Clear action items for surviving various types of AWS outages, even entire region failures!
- How to walk through and design highly automated distributed REST APIs in 30 minutes or less
- Financial risk and cost assessment skills to sell the idea of investing in High Availability to key business stakeholders
- When to stop investing in High Availability due to diminishing returns and business needs
This course is essential for any current or future DevOps practitioner or Advanced AWS Engineer wanting to go beyond pure technical skills and move to a business value and strategic decision making role.
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
Welcome back to CloudAcademy's course on Advanced High Availability on Amazon Web Services. In this lecture, we'll be talking about how to understand risk as it pertains to high availability. First we'll look a little bit at how we can predict the unpredictability that is inherent in this kind of discussion, where we're looking at mitigating risk. We'll look at a failures quadrant, which tells us the different ways that we can categorize risk, and helps us intuit ways that we can handle the different kinds of risks that we might come up with. We'll learn how we can calculate risk as the overall sum or product of a bunch of constituent risks from a system. We'll look at sets of common Amazon Web Services risks that you might have seen in white papers. And then we'll also look at some common non-Amazon Web Services risks. These are risks that are outside of the scope of an Amazon outage or a problem with their technical hardware. And then we'll look at four ways to handle risk as a precursor to some of the techniques that we'll be learning to mitigate risk.
So our first key takeaway for predicting unpredictability is be evil. Get creative and try to mentally break your business's Amazon Web Services processes and systems. Now you don't need to hit your computer with a sledgehammer, but seriously, there are all kinds of ways to break these systems that you probably have never thought of before. And we need to get crazy and think of all the different ways that we can break things, because it's our job to prevent those things from happening. We need to stay one step ahead of the chaos.
So let's take a look at this quadrant here, where we have a couple of dimensions by which we can measure different kinds of failures. Now these are by no means official, nor are they quantitative, but this is just a very helpful grid for us to classify the different kinds of issues we might come up against. So we have common versus rare, which is just the frequency with which we can expect things to happen. And we have human versus non-human. Is this directly attributable to the actions of some human that just occurred, or is it more chaotic and unpredictable?
So let's look at the exceptions first. A pretty common one: we have corner case code errors. These are things like throwing 500 error codes, or not parsing Unicode input correctly, or something like that. We might have EC2 instance degradation. This actually happens pretty frequently, where a single virtual machine will need maintenance, and Amazon will mark it for instance degradation and retirement in a number of weeks, days, or hours, depending on how unlucky you are. We could have a single EBS failure, where one of the underlying hard drives or pieces of hardware is breaking. We could have a disk full. This is pretty common when people are first entering the cloud. They don't think to rotate out the logs even though the logs are generating a lot of data and text, and then the disk fills up, and the operating system just shuts down and starts breaking. We also see single 5XX errors from different servers that we might be accessing. So for instance, if your application depends on Amazon S3, and you get a 500 error code, do you have the proper logic in place to do the retry? And very commonly, you might get a noisy neighbor problem. Your server is humming along nicely, but it's running on a T2, and somebody decides that they want to mine Bitcoin right next to you, and your server seems to dip in performance. Or even worse, somebody decides that they want to stream a huge movie off of their server and do it very quickly, and they saturate the network connection. You might have some minor service interruptions while that's happening.
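The retry question raised above can be made concrete with a small sketch. This is only an illustration, not how any particular SDK does it: `TransientServerError`, `call_with_retries`, and `flaky_get` are hypothetical names, and the fake dependency simply fails twice before succeeding.

```python
import random
import time


class TransientServerError(Exception):
    """Hypothetical stand-in for a 5XX response from a dependency such as S3."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The wait doubles on each attempt (exponential backoff) and adds random
    jitter so that many clients do not all retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


# Usage: a fake dependency that fails twice, then succeeds.
responses = iter([TransientServerError(), TransientServerError(), "200 OK"])


def flaky_get():
    outcome = next(responses)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


# Sleeping is stubbed out here so the example runs instantly.
result = call_with_retries(flaky_get, sleep=lambda seconds: None)
print(result)
```

The cap on attempts matters as much as the retry itself: a dependency that is truly down should produce a clear failure rather than an endless loop.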
Moving over into the outages, which are a little bit more rare but still pretty chaotic, you might have an AWS availability zone death. That would be if one of the physically isolated data centers at Amazon breaks. We might have an entire Amazon Web Services service go out. That would be something like S3 or DynamoDB going down entirely in a region. We could have a full region outage, which is a lot more uncommon, but actually happened about six months ago as of this recording, where a significant portion of resources all black out in an entire data center in Amazon. And it does happen. We could have a natural disaster. The last time I remember this happening and affecting Amazon was in, I want to say, 2012. There was a lightning strike in their Virginia data center and it shut the whole thing down for eight hours or so, and lots of sites went down. So we need to plan even for potential natural disasters. We could have a major DNS DDoS. That was one of the attacks that marred GitHub for a while there. Rather than your site being attacked, sometimes your DNS provider is DDoSed, and a single geographic area is brought down or redirected to a different area. It actually happened to YouTube in, I believe, 2009 or 2010. You could also have an Internet service provider go down. If an ISP that services Amazon goes down, Amazon might have some of their traffic throttled or reduced. Or you could have one of your customer segments have their ISP go down so that routing doesn't work, say, between continents or something. There's a whole class of errors around just the delivery from the data center to your customers' houses or businesses.
So moving over into the mistakes area, we're getting into human-related issues rather than these chaotic ones. We could have manual deploy misconfigurations. This is extremely common and one of the primary risks that you should be looking at mitigating for your availability. We could have a main code path bug. This would be something like you do the deploy correctly, but the actual code that you deploy is faulty, and it's a common code path and not one of these corner cases up in the exceptions quadrant that we have there. We could accidentally kill a server. This happens pretty frequently to customers that do not use scripts to automate their deployments. Any time that you have a human being inside of your production console in Amazon, or working on the CLI on your production install, you're going to have problems with this. You could underprovision resources. If somebody accidentally sets the maximum size of an auto scaling group too low and a whole bunch of people start hitting your website or service at the same time, you could cripple the entire system because of poor planning. You could have inconsistent environment configuration. This is almost a subclass of the manual deploy misconfiguration, but it's a very common one as well, where dev and staging are slightly different from production, and all goes well in the staging environment, but then once we move to production, the inconsistencies break the system. You could also accidentally deploy. I've seen this at some startups that are not mature in their DevOps processes, where somebody accidentally sends code to the production environment instead of staging, and the code was not ready.
So moving over into the much more morbid, not happy area of the quadrant, we have malice or tragedy. These would be bad things that people do intentionally, or really, really bad things that might happen to your people. This is kind of morbid to talk about, but when we're doing our DevOps and availability processes, we need to realize that people are mortal and it is possible that somebody could die. So what happens if you have one person that has the root account and no one else has access? You could have a very serious problem, because you might not be able to deploy patch fixes, etc., if somebody is otherwise unavailable. You could have, and I've seen this before as well, an ex-employee who wants to take revenge on you. If your DevOps processes are not high quality enough, and you're not removing people from accounts, your reliability and availability can be affected if somebody goes in and intentionally shuts you off. You could be targeted by malicious hackers. This is related to that DNS DDoS up there, but this is if your business in particular is targeted and they're trying to crack in rather than just take you down. You could have your AWS account stolen if you're not careful. This one is pretty rare, because there are a lot of security measures in place if you're using MFA and such, but it has actually happened to startups before, where their Amazon account is stolen and the entire business fails because they have no way to recover from that kind of error. Your vendors could go bankrupt. This one is almost a little bit technical, but if you're using a third-party API while you're working with Amazon, do you have processes in place to work the vendor out of your system? For instance, if you're processing with a payment provider, e.g. Balanced, which went out of business, everybody needed to transfer to another company called Stripe. You could also accidentally delete data.
So if you accidentally do a data delete, you have a pretty big problem. This is something like accidentally removing an EBS volume and deleting it, accidentally removing a bucket with a force flag on and deleting everything inside of it. All pretty bad, but uncommon human mistakes that we really don't want to have happen, and while not a systems design problem from a technical standpoint, it certainly is from a process design standpoint. And of course, disclaimer, there are many more examples, and this list is by no means comprehensive, but it is a good starting point for you to be thinking about how many different ways things can go bad.
Now usually people think about only the top three in the exceptions, and top three in the outages over there. But you have to realize that there's a huge, huge list that's beyond just an availability zone failure and instance degradation, which is what most people begin thinking about when they look at availability. But we're going a little bit further.
So let's take a look at how we calculate risk. Risk is the product of probability and severity. Let me repeat. Risk is the product of probability and severity. Probability being the chance that an event occurs. And severity being the amount of damage that is done to the business if it occurs. Now when we're talking about a business, the probability has to do with your technical systems or your human capital issues. Say for instance, the probability that somebody gets sick or the entire team gets sick. And severity has to do with the cost of the site going down in this case, the site or the service. So risk equals chance times cost of occurrences.
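As a minimal sketch of that formula, the snippet below ranks a few events by probability times severity. The event names come from the quadrant, but every probability and dollar figure is a made-up assumption for illustration only.

```python
def risk(probability, severity):
    """Risk as defined in the lecture: chance of the event times its cost."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * severity


# Hypothetical events: (chance per month, cost in dollars if it happens).
events = {
    "manual deploy misconfiguration": (0.20, 10_000),
    "availability zone outage": (0.01, 150_000),
    "disk full on a log volume": (0.05, 5_000),
}

# Sort so the largest expected loss comes first.
ranked = sorted(events, key=lambda name: risk(*events[name]), reverse=True)
for name in ranked:
    print(f"{name}: ${risk(*events[name]):,.0f} expected loss per month")
```

With these assumed numbers, the frequent but cheap deploy mistake outranks the rare but expensive outage, which is the kind of comparison the formula is meant to enable.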
Now our expected availability is the availability of the first part of a system, times the second part of the system, and so on up to the nth part of the system. This is the naive system with no redundancy, no failover, no special design in it. So for instance, let's pretend that we're looking at something rather simple: the risk that I get two tails in a game where I want to get two heads flipping a simple coin. That's one half times one half, because there are only two parts there, and it's a 50/50 chance on both coin flips. Then my risk of losing both coin tosses is 25%. This is assuming that these are independent events, and we can go into it more elsewhere, but this is the simple way to calculate expected availability for a naive solution.
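The multiplication described here is short enough to write out. This sketch assumes independent events, as the lecture does; `combined_probability` is a hypothetical helper name, and the three 99% components are invented numbers.

```python
from math import prod


def combined_probability(parts):
    """Multiply the probabilities of independent parts together."""
    return prod(parts)


# The coin example: two independent flips, each with a 1/2 chance of tails.
both_tails = combined_probability([0.5, 0.5])
print(both_tails)  # 0.25

# The same calculation for a chain of components that must all be up.
# The three availabilities below are hypothetical.
chain = combined_probability([0.99, 0.99, 0.99])
print(round(chain, 4))  # 0.9703
```

Note that every extra part in the chain can only pull the product down, which is why a naive system gets less available as it grows.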
So the expected cost of failure is equal to the expected downtime, which is the time that we are not available, times the cost of downtime. This is what we were talking about a moment ago. Say, for instance, I expect to be online all but one hour out of a month. Then I can calculate my expected cost of failure by multiplying that hour by the cost of downtime. Say I'm running a major ecommerce site, and I lose $50,000 an hour in revenue for being down. Then for the next month I can assume that I'm going to lose $50,000 in revenue due to site unavailability, because I'll be down for one hour at a cost of $50,000 an hour. So the cost can get pretty high for this kind of availability calculation.
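The ecommerce example works out as hours of downtime times cost per hour. The sketch below repeats that arithmetic and also converts the one hour into an availability percentage; the 31-day month and the function names are assumptions made for the illustration.

```python
def expected_downtime_cost(downtime_hours, cost_per_hour):
    """Expected cost of failure: hours offline times the cost of each hour."""
    return downtime_hours * cost_per_hour


def availability_from_downtime(downtime_hours, hours_in_period):
    """Fraction of the period the system is expected to be online."""
    return 1 - downtime_hours / hours_in_period


hours_in_month = 31 * 24  # assumes a 31-day month

monthly_cost = expected_downtime_cost(1, 50_000)
availability = availability_from_downtime(1, hours_in_month)

print(f"${monthly_cost:,}")      # $50,000
print(f"{availability:.4%}")     # 99.8656%
```

Seen this way, a figure that sounds excellent as a percentage still corresponds to a full hour of lost revenue every month.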
So is it the chances or the stakes? That's the important question here. Depending on the kind of business that you're running and how complex your system is, you could have a higher chance or higher stakes of the system going down. High stakes would be if you have a very high cost of downtime, and high chance would be if you have a high expected downtime, or low expected availability.
So when we're looking at common Amazon Web Services risks, let's do a very brief calculation so you can get an idea of how scary it can be if you're not engineering for high availability. So our SLA is our service level agreement, which is Amazon's promise to you. So while we may have done some high availability before, you may have not used the service level agreement to calculate how risky a deployment is. So let's pretend we're building a naive system that does not take into account high availability, and it uses that calculation on the previous slide where we multiplied everything together. So the published EC2 SLA for your servers will be 99.95% up time, which sounds pretty good, but we'll see in a minute here why that's not sufficient for your needs and why you need to engineer around that. The S3 SLA is actually not even that good. It's three nines, 99.9%. The RDS SLA is the same as the EC2 SLA, also pretty high at 99.95%. So all of these sound pretty great, but let's actually do the compound calculation if we're doing a naive system that does not use high availability techniques to increase these numbers.
Our compound availability is based solely on the SLAs, which are the thresholds at which Amazon promises to perform. The compound SLA here is found by multiplying those three numbers together, and we get approximately 99.8% availability provided by the SLA. That is because I'm using a naive system: if any of these three components fails, the whole thing goes down. So my expected availability, based on the Amazon SLAs, would be 99.8%, which sounds really good, but it's actually pretty terrible. That's roughly 90 minutes of offline time per month if we're assuming a 31-day month. That's pretty bad. We can't accept that for the majority of businesses. Your boss would be very angry if you were offline for an hour and a half every month, just blacked out because something bad happened.
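For anyone who wants to check the arithmetic, here is a short sketch using the three SLA figures quoted in the lecture and its 31-day month. It treats the SLAs as independent availabilities, which is the lecture's simplifying assumption rather than anything Amazon states.

```python
from math import prod

# SLA figures as quoted in the lecture.
sla = {"EC2": 0.9995, "S3": 0.999, "RDS": 0.9995}

# Naive system: every component must be up, so multiply them together.
compound = prod(sla.values())

minutes_in_month = 31 * 24 * 60  # 44,640 minutes in a 31-day month
downtime_minutes = (1 - compound) * minutes_in_month

print(f"{compound:.2%}")                   # 99.80%
print(f"{downtime_minutes:.0f} minutes")   # 89 minutes
```

The exact product gives about 89 minutes, which the lecture rounds to an hour and a half.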
So we have four ways to handle risk and get better than that 99.8%, even for simple deployments. We need to use one or more of the following. We can assume risk, which means that for small, acceptable risks we do nothing. For example, because we have a very small risk of a natural disaster happening, and if the natural disaster is so bad that it won't matter because people are preoccupied, then we can assume the risk of a flood or something, as long as we can recover shortly thereafter.
We can avoid the risk by using human processes to reduce the chances of things happening. For instance, on our quadrant there everything on the bottom half was human process-oriented. So say for instance, we can say you need to have two people doing a deploy at all times and do paired deploys. That might be a way to avoid risk. We can also mitigate risk by using tools to reduce the chances and potentially even the severity of these things. So for instance, rather than avoiding risk during a deploy process via a two-person deploy system, we can also build a script to do the deploy for us, which would be a mitigation tool that both reduces the chance of failure because it's scripted and the severity because I can just run the script again if I ever have a problem during deployment. And I have a way to do a rollback.
We can also transfer the risk by paying someone else to deal with the risk. This is very common when we're talking about using things like a DynamoDB where we're outsourcing our data storage logic to some other provider that replicates it across multiple availability zones. That's a very good thing, and it's cheap usually to pay Amazon to do it, so we should be looking at that pretty frequently.
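The scripted deploy mentioned under mitigation could look something like the sketch below. It is deliberately generic: `deploy`, the step functions, and `roll_back` are hypothetical placeholders, and a real script would call your actual build and release tooling instead.

```python
def deploy(steps, rollback):
    """Run each deploy step in order; on any failure, roll back and stop.

    steps    -- list of (name, callable) pairs, run in order
    rollback -- callable that receives the names of the completed steps
    Returns True if every step succeeded, False if a rollback was needed.
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            print(f"step '{name}' failed: {error}; rolling back")
            rollback(completed)
            return False
        completed.append(name)
    return True


# Usage with hypothetical steps: the second step fails its health check.
log = []


def upload_build():
    log.append("uploaded build")


def switch_traffic():
    raise RuntimeError("health check failed")


def roll_back(completed_steps):
    log.append(f"rolled back after: {completed_steps}")


succeeded = deploy([("upload", upload_build), ("switch", switch_traffic)], roll_back)
print(succeeded, log)
```

Because the same steps run in the same order every time, the chance of a mistake drops, and because the rollback path is part of the script, the severity of a failed deploy drops too.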
So beyond Amazon Web Services risks, we need to look at other things that can break. And we need to generally prefer automation over all else. For instance, we talked about flooding, natural disasters, humans breaking things, humans breaking into your systems and breaking things. As long as we have automation for our entire cloud, we can create a cloud anew in however short a time it takes for that cloud to create itself. As long as we haven't lost any customer data, then if we have full automation of our entire cloud, we can de-risk the vast majority of issues from an existential standpoint. However, we do need to continue and talk about ways that we can avoid problems occurring in the first place rather than just having recoverability.
So in summary, high availability is a process, not just a software infrastructure design. As we saw when we were looking at the quadrant, only half of the issues even pertain to technical issues, or rather non-human issues. So we know what to think about now by thinking about that quadrant. Remember, things can get really bad, potentially a team member could die or get very sick, or we can look at things that are pretty mundane, like uncommon exceptions or 500 errors from another system. But there are some other things we might not have even thought about, like lightning striking a data center, or a vendor going out of business. These are all DevOps risks that we need to think about as part of the entire technical system and entire IT system, not just the technical pieces that we write in code.
So next we need to learn about the techniques to actually mitigate these risks that we talked about. So we'll be talking about that in our next lecture: Advanced Techniques.
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.