Concepts and Skills
Practical HA Design
Many businesses host critical infrastructure and technical business assets in the AWS Cloud. Yet, even with so much at stake in the AWS Cloud, many businesses neglect to ensure that their software systems stay online no matter what happens with AWS! In the Cloud Academy Advanced High Availability DevOps Video Course, you will learn critical technical and business analysis skills required to ensure customers can always interact with your cloud.
Watch and Learn:
- Why AWS isn't magic, and you should always plan your strategy with failure in mind
- Mental models for classifying business and IT risk in the AWS Cloud
- The "Big Three" model for increasing the availability of software systems in a methodical way
- Four possible ways to handle IT risk, depending on your needs
- Clear action items for surviving various types of AWS outages, even entire region failures!
- How to walk through and design highly automated distributed REST APIs in 30 minutes or less
- Financial risk and cost assessment skills to sell the idea of investing in High Availability to key business stakeholders
- When to stop investing in High Availability due to diminishing returns and business needs
This course is essential for any current or future DevOps practitioner or Advanced AWS Engineer wanting to go beyond pure technical skills, and move to a business value and strategic decision making role.
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
Welcome back to the Cloud Academy Advanced High Availability course for Amazon Web Services. So in this lecture, we'll be doing our final lecture talking about how to assess business value by doing a costing demonstration of the same infrastructure that we just created in the planning demo. If it's been a while since you watched the previous lecture, I would suggest briefly reviewing the type of architectures that we went over since we were building a RESTful JSON API with various levels of high availability considerations taken into account. This costing demo will go over how we can assess whether or not it's a good idea to proceed along the continuum of increasing complexity. So we need to assess business value for this high availability and see if it's actually worth using super advanced techniques like doing a multi-region deployment. We can actually cost these things based on risk.
So without further ado, let's get into our costing demo. I'm going to switch over into my chart view and record my screen as we do some spreadsheeting. So first, let's take a look at the progression of the different levels of sophistication that we talked about in the previous lesson when we were actually planning out the different levels of high availability that we can build into this JSON RESTful API system. Now, keep in mind, we didn't include all components of a larger system. If we had offline jobs and other distributed system components we would also include those. I mainly focused on the real-time and user data aspects of the system.
So first, we see we have a simple system where, if we're speaking JSON, we just include the DNS system, a compute layer and a database. And this is actually the core architecture that we'll be repeating in many different flavors for different levels of high availability. So as we move across the progression, realize that we really only have a couple different aspects like this. So, moving over into another level of high availability, we might create a system that's behind an Elastic Load Balancer and has multiple instances so we can survive instance failure with switchover from the Elastic Load Balancer and removal from this auto-scaling group. We then further added auto-scaling not just to the compute layer but also to the database by adding a CloudWatch events or CloudWatch alarms-triggered Lambda function to scale up or down the DynamoDB layer. So now we have high scalability at both the compute and database layers and the DNS layer's already scaled by Amazon for you.
So here we're actually adding in an extra region so we can get a little bit of additional availability by doing an active-passive model where it might fail over to another system somehow by replicating from writes that happened in DynamoDB here over to the other side, and realizing that because we're using auto-scaling in both groups, we can, if we have a failover in the primary region, expect this single region that may be running only one server at a time, usually, to scale up relatively quickly as we begin spinning things up here.
We then looked at doing a multi-master setup where we might have a write conflict mechanism between the databases and rather than having one database as the master and one as the slave in the remote region, we copy back and forth using the same mechanism that we have in the primary region into the other one. And we have a write conflict mechanism to make sure that we're not doing dual writes.
Then finally, a more sophisticated system where we actually put Amazon CloudFront in front of the resources so we can serve up stale reads sometimes if we need to cache the gets. We're making sure that we're separating out any kind of user file uploads into a separate S3 bucket rather than using DynamoDB or rather than using disk space on these different instances and using a Lambda, perhaps, to manage the synchronization between the two or Amazon's built-in replication. Plus we're realizing that down here in addition to the bi-directional stream replication with write-conflict mechanisms, we've also added a queue-based mechanism to enqueue writes before we pass them over to the other region because we realized that we might drop writes replicated across regions during an outage if we're using this for high availability purposes. And this enables us to go down for a couple of hours on one side and have the system automatically and passively re-build the synchronization levels between the two databases.
So, like I said, we have a whole progression here, we're just looking at the different levels of sophistication that we can do for high availability. Now this is nice and everybody would love to have one of these self-managing systems so that the Ops team doesn't have to get up late at night, but we have to realize that all of these high availability questions are simply technical risk mitigation and we have things like CloudFormation to prevent humans from messing up as well. But at the end of the day, our goal is to save money and not to make a pretty system diagram or even a well-run machine. So we need to balance the amount of cost-to-business with the amount of effort it takes to produce one of these systems.
So there's actually a relatively straightforward way to do these calculations. The most difficult part when you're doing a high availability cost calculation is assessing this goodwill factor here. But let's look briefly at the kind of spreadsheet that you can create when you're actually putting these kinds of things into practice. So if you recall in our earlier lecture, we talked a lot about the different categories of risks and different issues that you might face in a quadrant. So, realizing that we have any number of different issues that we can see, we should create a table here with the different risks and either an outage duration and times per year calculation that will automatically give us the percentage uptime or a fixed percentage for things in the case of service level agreements like Amazon S3 says that we will have 99.9% availability, not durability. So availability will generally be a lot lower than durability. But anyway, we can calculate out the overall percentage uptime of a system. This is using a naive calculation here. So we may actually change these line items as we mitigate out the risk of the EC2 SLA being a problem. So once we get to multi-region, if one region goes down rather than using this percentage SLAs since that's on a region basis, we might rather input the switchover time and difficulty. So say we have a region go down once per year, and since we've done multi-region we might assign an outage duration of about five minutes so we can get switchover and scale up in a multi-master setup.
So let's look at what we actually need to calculate. We just talked about the different ways that we can calculate uptime: either from a fixed SLA that we're aware of, or by the number of times per year something occurs and the number of minutes that it costs, plus just giving it a human readable label so we can understand what risks we need to mitigate out. We then multiply the risks together to get the aggregate uptime since the assumption here is that these are independent components that can totally fail the system. So then, this right here, this calculation is for the naive diagram that we saw at first. So we can label the different versions of the architectures, realizing that there will be several versions that we can price out in terms of business impact here on this left side.
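The uptime table described here can be sketched in a few lines of Python. This is only an illustration of the naive model, not the spreadsheet from the lecture: the names `Risk`, `uptime`, and `aggregate_uptime` are invented for this sketch, and the two sample rows reuse figures mentioned earlier (the 99.9% S3 availability figure and a five-minute region switchover once a year).

```python
from dataclasses import dataclass
from math import prod
from typing import Optional

MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year, as in the lecture


@dataclass
class Risk:
    """One row of the risk table (hypothetical structure)."""
    label: str
    sla: Optional[float] = None   # fixed uptime fraction, e.g. 0.999
    outage_minutes: float = 0.0   # duration of each outage
    times_per_year: float = 0.0   # how often the outage occurs

    def uptime(self) -> float:
        # A fixed SLA is used if given; otherwise derive uptime from outages.
        if self.sla is not None:
            return self.sla
        down = self.outage_minutes * self.times_per_year
        return 1.0 - down / MINUTES_PER_YEAR


def aggregate_uptime(risks: list[Risk]) -> float:
    # Naive model: every risk is independent and any one of them can take
    # the whole system down, so the individual uptimes multiply.
    return prod(r.uptime() for r in risks)


risks = [
    Risk("S3 SLA", sla=0.999),
    Risk("Region switchover", outage_minutes=5, times_per_year=1),
]
print(f"{aggregate_uptime(risks):.4%}")
```

Each row is either a fixed percentage or a duration-and-frequency pair, which mirrors the two input styles the table allows.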
Today, we'll be mostly talking about the cost-to-business of the uptime or downtime since it's actually a relatively straightforward and obvious calculation to do. These calculations here as we know the exact price of each of these instances and each of these icons on this diagram have a fixed cost associated with them based out of Amazon's pricing tables. So you can go and do that on your own. There are a lot of AWS costing tools that they provide to you, however, there's no tool like this that is provided off the shelf to you from Amazon. So again, looking back over here, we understand that this is the uptime percentage calculator. I have over here the number of hours per year. This one is just 365 times 24. So this might be a little bit longer for a leap year, but I'll just use a normal year. This is the revenue per year that runs through the system. So if you think about this could be, this $10 million here might be a medium-sized eCommerce company, could be pulling in $10 million a year through their website. So, we care about this number because this is the amount of money that we're pulling in through a site and the reason that we're trying to keep the business available is both to keep this number up since people can't buy things if the site is down, and also we care about this entire thing because of this other green box here which is the goodwill factor.
So, this is realizing that you can't just let your site be offline for half of the year and expect to lose only $5 million if you're a $10 million a year in revenue eCommerce company. If you're offline for six months, in all likelihood you'll lose a lot more than half of your revenue even though it was only half of chronological time because people won't want to shop there anymore and they won't trust you. So this goodwill factor is a multiplier to understand the amount of damage that downtime will affect your reputation. So two is actually a pretty low one. It's just saying, take the amount of actual averaged dollar cost, average damage from just taking this many dollars per year, this many hours per year. So this many dollars per hour. That's the naive calculation. But if we see here, my dollars per hour downtime cost also includes this goodwill factor as a multiplier. So this one is...this is the most difficult number to calculate on this entire sheet. You need to work with your business to figure this number out.
So this number of hours per year down is simply one minus this percentage expressed as a decimal. So that would be 1 minus 0.998 times the number of hours in a year, which gives us roughly 17.5. And then our dollars per year is just these two numbers multiplied together: the cost of downtime in this scenario on average. So again, this is just an average downtime calculation, since assuming you're an eCommerce company in just, say, the United States, there's a very specific four-hour time window that is the daytime or just after work or something in the United States where your revenue will be higher than, say 0300 hours for any given time zone. So again, this is naive but on average it works since we're just doing an average revenue per year and average downtime per year based on your expected uptime percentage over here on the left.
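As a rough sketch of the right-hand side of the sheet, the cost calculation can be written out as below. The function name `downtime_cost_per_year` is made up for this example; the inputs are the ones from the walkthrough (99.8% uptime, $10 million in yearly revenue, a goodwill factor of two).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_cost_per_year(uptime, revenue_per_year, goodwill_factor):
    """Return (hours down per year, average yearly cost of that downtime)."""
    # Average revenue per hour, scaled up by the goodwill multiplier.
    dollars_per_hour = revenue_per_year / HOURS_PER_YEAR * goodwill_factor
    # Expected hours of downtime implied by the uptime percentage.
    hours_down = (1.0 - uptime) * HOURS_PER_YEAR
    return hours_down, dollars_per_hour * hours_down


hours, cost = downtime_cost_per_year(0.998, 10_000_000, 2)
print(f"{hours:.2f} hours down, ${cost:,.0f} per year")
```

Note that the hours cancel out, so the yearly cost is simply revenue times goodwill factor times the downtime fraction. The hours-per-year step is kept only because it makes the sheet easier to read.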
So now that we understand what we're doing with this chart, let's move forward and actually look at...oops, just added a tab there. Let's move forward and look at the different systems that we have in place here. So, the Amazon Route 53 SLA is actually 100%, and that's because they have quadruplicate replication across different zones and I believe the SLA is based on unavailability for an entire minute and the failover happens very quickly. So four whole regions would have to go down for this SLA to be broken and that's why they actually promise you 100% when you're doing this.
So in the case here, where we have a system that's running on EC2 and DynamoDB, if we're expecting the EC2 SLA to be 99.95%, the availability of any given instance is actually a little bit lower. Let's say that you expect to have instance degradation every six months, and you're in the scenario where we're not doing failover. We can calculate this as, say, it takes us 30 minutes to spin up a new instance and it happens twice a year, and we're actually going to reuse that formula there. So, if we have that kind of uptime and we're a $10 million a year eCommerce company, say our goodwill factor is actually five, since customers won't shop at a store that they expect to be down all the time. Looking at something like this, this is actually a pretty high uptime for any given system because it's not very complex, but let's also factor in the fact that we only use a single instance in this case.
Okay, well, what does that look like? We can say on Black Friday, which in the United States is a day when people go and shop a lot after our Thanksgiving holiday, which is a national holiday in the U.S. It's famous around the world for being extremely heavy on eCommerce traffic every year. So I can plan ahead and understand that with my single instance setup here, I'll likely crash my system. Say we have three holidays like that per year, and I'm going to go down for eight hours. Say, maybe a dozen holidays like that. I'll go down for 8 hours each time which would be 480 minutes. So, if we look at holiday surges, this is actually not too uncommon for different eCommerce companies. Even large sites like Target.com went down for entire days on the major shopping holidays in 2015. Looking at our availability here, you might think "Oh wow! 98.8% uptime is not too bad for this very simple, cheap setup here." But, if we realize that we're doing a times five goodwill factor because people don't like it when your site goes down, we're going to have four entire days of downtime, or 105 hours. Four and a half almost. So, four and some fraction. So, if we look at this as a cost-to-business, there's a pretty compelling reason that we don't want to do this simple system here.
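To make the single-instance case concrete, here is a small sketch using only the rows mentioned aloud: the Route 53 and EC2 SLAs, two 30-minute instance replacements, and a dozen 8-hour holiday outages. The full sheet is not reproduced in the transcript and may contain other rows, so this lands close to the quoted 98.8% without matching the sheet exactly. The helper name `outage_uptime` is an assumption of this sketch.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60
HOURS_PER_YEAR = 365 * 24


def outage_uptime(minutes_each, times_per_year):
    """Uptime fraction for a recurring outage of fixed length."""
    return 1.0 - minutes_each * times_per_year / MINUTES_PER_YEAR


# Rows mentioned aloud for the single-instance system (not the full sheet).
rows = {
    "Route 53 SLA": 1.0,
    "EC2 SLA": 0.9995,
    "Instance degradation": outage_uptime(30, 2),
    "Holiday surges": outage_uptime(480, 12),
}

uptime = prod(rows.values())
hours_down = (1.0 - uptime) * HOURS_PER_YEAR
# $10M yearly revenue and a goodwill factor of 5, as in the lecture.
cost = 10_000_000 / HOURS_PER_YEAR * 5 * hours_down

print(f"uptime {uptime:.2%}, {hours_down:.0f} hours down, ${cost:,.0f}/year")
```

The holiday surge row dominates the result, which is why the later architectures attack that row first.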
So let's duplicate that sheet and call this one “Simplest.” By the way, I'm using a spreadsheet program available for Macintosh called Numbers. You can also do this in Excel or any other...you can even do it by hand. So any kind of modeling tool that will let you do multiplication and addition, this is really good for. Numbers just happens to be pretty and good for doing screen sharing. Okay. Let's look at another system here. We have another JSON API where we're looking at doing an auto-scaling group here where we have an Elastic Load Balancer, etc, etc. So the Elastic Load Balancer SLA is actually the same as the EC2 SLA. So we can expect, if we have instance degradation behind the ELB, then this shouldn't happen throughout the year anymore nor should we have holiday surges that cripple the system very much anymore. Say we only have it two times a year. In the United States, I can think of two holidays where this would happen, where it would surge past my ability to scale outwards without baking my AMIs or pre-creating my AMIs during deployment. And then we also introduce an extra piece of complexity where if the ELB goes down, then we can look at something like another SLA here where there's another component that I can break and we can also...and actually let's go back to this one and say, "Human Deploy Errors". Say we do that once a month and cause 60 minutes of downtime each time. Our cost is just skyrocketing as I input these pretty reasonable outages. And then, we have the AWS region SLA which is 99.95% of general availability for the region being connected to the internet. So we're looking at almost 10% of our revenue and cost being removed which is pretty serious, but actually relatively reasonable based on how I've seen different eCommerce providers go down in the event of denial of service attacks, different out-of-stocks, those kinds of things.
This is actually pretty reasonable because these availabilities might not be total blackouts throughout the year, but this is on average. This could be eight hours a month. Maybe where aggregate, it was only happening at 5 or 10 seconds at a time because the site was too slow or something and the page wouldn't load. But this is totally reasonable to happen and this is the kind of availability that you see sometimes at poorly designed websites.
So again, looking at this eCommerce vein again here, we get our ELB SLA in here. We have region SLA again which we can look at, 99.95. We have our human error that on average is going to cost us maybe an hour a month, or 12 hours throughout the entire year. So our availability has actually gone up considerably as we got rid of the instance degradation problem and reduced the number of holiday surges that we're crippled by. Even though we introduced this extra 99.95 by allowing ourselves to do auto-scaling and instance replacement in the case of a degraded instance, we significantly upped our business value. Just even for that simple auto-scaling across availability zones example here, we went from roughly $700,000 USD a year to $259,000 or $260,000 a year. So that's a huge return on investment just for a very simple change. This is a trend that we'll see: a high amount of value between these two, and a little bit less, and a little bit less. But if we put an extra zero after this revenue per year and we're looking at millions of dollars a year for a large eCommerce site, that makes perfect sense. For example, if you're looking at any of the top, I don't remember what the range is, but the top several dozen eCommerce retailers do over a billion dollars a year in revenue, which would be 100 times this value. So you're looking at losing on the order of tens of millions of dollars in goodwill and such if you're going down for this amount of time throughout the year. And that's only half a percent in outage and you could be losing a massive, massive amount of money.
So if we're looking at with auto-scaling here. So we're looking at auto-scaling compute here. Then we need to look at this next version where we're auto-scaling DynamoDB. So that would be similar to this same situation but say, because of our holiday surges, we actually moved this down to a zero value. Then we're now, just even for moving the zero value down on the surges, we're looking at a difference from almost $100,000 even at this $10 million scale which is a pretty small company, right.
So let's look again at another technique that we can use we already talked about in the other lecture. We're looking at an active-passive failover model where we have a switchover and we're replicating across data centers here. So if we're looking at something like this, then we need to realize that what we've done is we've effectively mitigated it so both regions have to go down for any of these services. We need to, basically, square the chance of downtime and then subtract it from one to get the chance of uptime, and then we're also looking at introducing a little bit of downtime when we do a switchover instead. But that's a flat number instead of a percentage number, so that's good. So, our regional SLA, we can duplicate this sheet. And then when we're trying to sell our...if we're looking at an active-passive model here, we still have a Route 53 SLA. We don't have instance degradation anymore. Our EC2 SLA has to go down in both regions for us to have downtime.
So, there's actually more complex modeling where you can do the chance of downtime and look at these as independent systems. So, what we can do here is look at the entirety of these active systems, realizing that in this version we have auto-scaling all and we have this percentage here. This is actually the percentage uptime for a single region version of this product. So if we look at that percentage there, we can just call this single region up and give it 99.663% uptime per region. Then we look at that single regional failure and say we need both regions to fail for us to go down. So we take the percentage downtime there, look at the chance of both failing, and then do one minus that number and get basically triple nines there. So we're looking at five nines for this one, but we still have human errors because we're not creating a system whereby we're automatically deploying these things to have parity. We've just doubled the number of places that we have to deploy to. So for not doing any deployment automation, we've actually just doubled our chance that we mess something up if we have no automation at all. And then, we're also looking at a switchover cost here. So we need the regions up, but if we have a switch come up, and we say that maybe twice per year a region goes down and we have a five-minute switching cost there, then we're looking at two-thousandths of a percent downtime. So now that we've increased the complexity considerably, we've saved another $30,000 here, which is not a big deal for this $10 million company but it will be a big deal for a larger eCommerce company.
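The two-region arithmetic can be sketched as follows, assuming (as the lecture does) that the regions fail independently. The function names are invented for this illustration; the inputs are the 99.663% single-region uptime and the five-minute switchover twice a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def multi_region_uptime(single_region_uptime, regions=2):
    # The system is down only if every region is down at the same time,
    # so the downtime chance is raised to the power of the region count.
    return 1.0 - (1.0 - single_region_uptime) ** regions


def switchover_uptime(minutes_per_switch, switches_per_year):
    # Flat cost: minutes lost while traffic moves to the other region.
    return 1.0 - minutes_per_switch * switches_per_year / MINUTES_PER_YEAR


both = multi_region_uptime(0.99663)
switch = switchover_uptime(5, 2)
print(f"both regions: {both:.5%}")
print(f"switchover:   {switch:.5%}")
print(f"combined:     {both * switch:.5%}")
```

Ten minutes of switching per year is about 0.0019% of the year, which is where the "two-thousandths of a percent" figure comes from. Human deployment errors are a separate row and are not included here.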
So now we've looked at the effects of what doing this high availability has for the multi-region setup.
So assuming we're not just doing multi-region failover where we're now doing a latency and failover record, not just a failover record where we're serving multiple requests from these two zones, we shouldn't have any latency when we do a switchover with the failover because your redirection at the load balancer, if we have any problem when we're entering into this side, the switchover should be on the order of half of a minute or something. So, if we do this, and we say let's move over to active-active instead. If we do an active-active, we can move our switching cost to something like 0.5 and this is just a rounding error here. So if we want to look at what that looks like, we've again, not decreased any of our human error SLA pieces out yet, and we've also assumed that our application level software is working at all times but this just makes our calculations easier. So we won't be dealing with the application code, however, we do care about the automatic deployment scripts. So we'll leave this human error in here because once we've doubled the number of places that I have to deploy the code to when we split the traffic, I still have that doubled human error calculation here.
So I've only shaved a couple minutes off here. However, because if we push with active-active, we should have two brains and that should make our uptime significantly better where this mess up will happen less frequently because we can service requests from two different places at the same time. You would have to mess up these human errors in two different zones at the same time. Now, we can't assume that they're independent events because we're likely to run the same script on both sides, but if we're just eyeballing and saying that we cause ourselves an hour of downtime. I'll cut that in half again just as a naive way to calculate this. However, you can do your own probability modeling. The important part here is understanding what the effects are of each of these technologies. You can actually calculate this out however you like based on your own probability models, based on your internal processes. So the percentage chance that failure in both regions happen depends entirely on how you do your business process to do the deployment.
Okay, so now we just saved another good chunk of change here. We just saved about $70,000 US dollars in this case again by moving over to...we're still using our compound region uptime from the auto-scaling all as our top line region out. We're still realizing that both regions need to go down for us to have a serious outage. And we're still looking at...even if there's a downtime for both and we switch over, we say it happens twice a year and we give it about 30 seconds for the system to do a reroute and realize that there's an outage based on an ELB health check or something like that.
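A rough sketch of the active-passive versus active-active comparison follows. The transcript does not show the exact cells of the sheet, so the human-error rows are an assumed reading of "doubled" and "cut in half": 24 and then 12 sixty-minute deploy errors per year. The function names are likewise invented for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def outage_uptime(minutes_each, times_per_year):
    """Uptime fraction for a recurring outage of fixed length."""
    return 1.0 - minutes_each * times_per_year / MINUTES_PER_YEAR


def yearly_cost(uptimes, revenue=10_000_000, goodwill=5):
    """Naive yearly downtime cost: independent risks multiply."""
    total = 1.0
    for u in uptimes:
        total *= u
    return revenue * goodwill * (1.0 - total)


# Both regions must fail at once; 99.663% is the single-region figure.
both_regions = 1.0 - (1.0 - 0.99663) ** 2

active_passive = yearly_cost([
    both_regions,
    outage_uptime(5, 2),     # five-minute switchover, twice a year
    outage_uptime(60, 24),   # ASSUMED: 12 deploy errors doubled to 24
])
active_active = yearly_cost([
    both_regions,
    outage_uptime(0.5, 2),   # roughly 30-second reroute, twice a year
    outage_uptime(60, 12),   # ASSUMED: human-error downtime cut in half
])

print(f"active-passive ${active_passive:,.0f}")
print(f"active-active  ${active_active:,.0f}")
print(f"saved          ${active_passive - active_active:,.0f}")
```

Under these assumptions almost all of the saving comes from the human-error row rather than from the faster switchover, which matches the point that shaving the switch time alone only removes a couple of minutes.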
So, we've gone all the way to this system but we still haven't engaged with CloudFront, CloudFormation, and S3 synchronization and queue rebuilds. So, if we do another duplication of this, we can imagine if we do active-active, fully built out. So I need to change these names, don't I? Apparently I can't type while I'm being recorded, so that's my own failing. So this would be active-active all. If we have active-active all with all of these things in place where we're also using Amazon CloudFront and rather than mounting the latency-based record directly to the client, we're going through CloudFront through edge locations to serve up cached pages for any get requests and then routing back if we need to hit the origin for any kind of write request. We are further mitigating the chance of this switchover cost since CloudFront will be able to do our failover routing and we can add any kind of failover logic we want here. We also have S3 in play here so we're not beholden to the EC2 outages and such, and we're looking at doing CloudFormation deployment so we will essentially move our human errors to zero since we can perform...zero might be a little too unrealistic, but essentially, because I can deploy via CloudFormation consistently to both regions and I can deploy into my staging environment, then I can virtually guarantee that I won't have human errors since I can just promote my staging environment that I verified with integration testing to be 100% working. So doing something like this, we're looking at almost hitting four nines there with a single region being up.
So, again, realizing that even though Amazon Web Services is really great, we're still looking at about an hour of downtime a year mostly due to human error here, and if we set that to zero, then we're looking at almost zero downtime. So if you increase your business processes to be extremely tight where you're using CloudFormation, and I actually have an advanced CloudFormation course on Cloud Academy if you want to go check it out. It's for how to actually model with one of the systems and get this 0 and 60 here. So with our best high availability practices, we can go from days and days of downtime to six minutes for fundamentally the same application that does the same exact thing if we just leverage all of the tools and technologies that AWS gives us out of the box for a good switchover system for multi-region, multi-master, that kind of thing.
So now, this doesn't look like...why would I...it's obvious why I wouldn't want this very bad system. But even for this single region where I'm not auto-scaling the database, and I'm just eating the cost on some of our auto-scaling, it's up to you to decide which of these layers you want your business at. So for instance, for this $10 million a year company, we're only looking at the difference between $260,000 and $170,000 in loss per year assuming a goodwill factor of five. So, if you're saying that your cost of going down is five times the actual average revenue cost, then that's what you'd be looking at. Most businesses that I consult with actually use 10 as their system modeling. So, if we go and move all these to 10, that is for every hour that I am down...let's speak it in a way that makes sense. So for every day that I am down per year, I lose 10 days worth of revenue because of customer goodwill loss. And that makes perfect sense. So if a major eCommerce store was just offline for an entire day or fractions of a day throughout the year, and it was spotty and unreliable, then you would get upset, you wouldn't shop there anymore, and you would think it would cost 10 times.
Now, if we're looking at this, even this $10 million business it's very obvious that we shouldn't just be using a single EC2 instance. If we're doing something like this multiple availability zone, so we're looking at this model here with the auto-scaling and the compute layer, and we still have holiday surges blocking us out because we're not auto-scaling database. We don't have appropriate automation in place and we're seeing this 12 human errors per year that might cause an hour of downtime, we're still looking at half a million dollars a year because of this goodwill factor here. If I'm a new business and I'm offline for two full days out of the year, then it's very likely that rather than just losing two days of revenue, I'm losing something like two-thirds of a month or 20 days because people just don't trust my site. It's spotty, unreliable, and nobody wants to input their credit card information for a site like that. Again, I'm using an eCommerce example but this would also hold true for something like an API company if you're doing payment processing or anything else.
Looking again at this third model that we went through. We went through auto-scaling even the database to burn down this holiday surges problem. So just traffic surges shouldn't bring us down if we're auto-scaling every layer of the entire stack. So we're looking at now, still $300,000 a year. So you'd have to ask yourself moving from roughly $520,000 a year to $336,000 a year, is it worth it to your business? So $190,000. Is it worth putting a few, probably two, full-time DevOps people on the problem? Probably, because there's enough cost here that it makes sense at this scale. Now, if you look at going multi-region, the jump is a little bit less dramatic. We're looking at $336,000 to $276,000, right. So that one's not that big of...for the active-passive rather. So that one's not that big of a jump. Is it worth it for a company of this size? Maybe, if you think a single person working on it can solve the problem in a fraction of a year.
Then moving again to another one, when we move active-active and we remove the cost of the...even those five minutes of switch plus we've reduced human error because we have two regions that can actually service even if we mess up a deployment on one, the other region will continue functioning correctly. We've cut down a significant amount of cost here. This is at least one, highly-paid, full-time DevOps person between $276,000 and $138,000. So here, when we're looking at doing something, we're now moving to active-active here. This makes perfect sense to do for an eCommerce company of this scale if we're saying our goodwill factor 10, and even if our uptime for a single region is five nines. By having two regions that operate fully independently and can service all requests, we've mitigated human risk as well. This human error is one that people forget as we had talked about in our previous lectures.
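The tier-by-tier comparison can be laid out as a small table of marginal savings. The dollar figures below are the yearly downtime costs quoted at a goodwill factor of 10. The engineer cost is a purely hypothetical placeholder, since the lecture gives no salary figure, and the name `marginal_savings` is made up for this sketch.

```python
# Yearly downtime cost per tier, as quoted at a goodwill factor of 10.
tiers = [
    ("auto-scaling compute", 520_000),
    ("auto-scaling all layers", 336_000),
    ("active-passive multi-region", 276_000),
    ("active-active multi-region", 138_000),
]

# HYPOTHETICAL: yearly cost of one engineer; replace with a real figure.
ENGINEER_COST = 150_000


def marginal_savings(tiers):
    """Savings gained by moving from each tier to the next one."""
    return [
        (next_name, current_cost - next_cost)
        for (_, current_cost), (next_name, next_cost) in zip(tiers, tiers[1:])
    ]


for name, saved in marginal_savings(tiers):
    print(f"{name}: saves ${saved:,}/year, "
          f"about {saved / ENGINEER_COST:.1f} engineer-years")
```

Comparing each step's saving against the effort needed to build it is the decision the lecture is describing. A step is worth taking only while the saving outweighs the ongoing engineering cost.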
So then we move over to something like an active-active, fully deployed, highly automated system with automated templating and CloudFormation rollouts, replication across buckets, Amazon CloudFront mounted in front for GET request caching, and auto-healing so that even outages are effectively non-events from the perspective of users, and we've effectively moved our human error basically to zero. If you have appropriate directly responsible individual processes and such, you can burn this down to zero. Some telecoms do it, and so does any major, major tech company where you can't think of the last time they went down, something like a Google maybe. This is what they do. They have multiple regions. They have highly sophisticated human processes, and all deployments are automated. Google doesn't use CloudFormation, but they do use automation that does the same thing. You can imagine that you could get this human error down to zero or, you know, not 60-minute durations but a smaller number here. Maybe one time a year for five minutes.
So when we're looking at this level of high availability, where we've gotten downtime to the order of single-digit minutes per year because we've got this highly sophisticated automated distributed system across multiple geographic regions, which doesn't really roll off the tongue, you can see that there's clear business value here for a goodwill factor of 10 and $10 million a year in revenue. So, just really quickly, say you're working for a larger company, and you're assessing whether or not it makes sense to invest in this kind of high availability and looking to justify the investment to your business owners. Say you're operating a billion-dollar operation, which is not unheard of: a major eCommerce company, or Amazon Web Services itself. You can imagine this is why they hire so many engineers to do operations and keep the thing online, because in the case of Amazon even a couple of minutes of downtime is expensive. Amazon's actually closer to $7 billion here, and because their uptime is so important, their goodwill factor might even be higher than 10.
If you imagine you're Amazon, the reason you make the system extremely highly available is that a couple of minutes of downtime is a million dollars for these guys. That's really, really bad if you're getting hours of downtime, and that's why you want these multi-region high availability systems working. So, if we're looking at these billion-dollar companies, we're looking at losses on the order of hundreds of millions of dollars if you're offline for five days out of the year, because you end up in the news, you're the laughing stock of the town, et cetera.
So with just auto-scaling on our compute layer and no database scaling in place, with a fixed-size database that's under-provisioned before we allow this scaling, at a billion dollars, and mind you, this is a billion-dollar company, we're still looking at a 5% revenue loss because of the goodwill factor. 5% is a huge number for a billion-dollar company, and as we move across the line here, we're still measuring these losses on the order of millions of dollars until we get to this fully active-active deployment with sophisticated processes that mitigate human risk.
It's very important for us to realize that this might be how you'd build things if you were not taking these advanced DevOps courses on Cloud Academy. If you go and see an ad for DynamoDB online, Amazon might say, "Oh, set up your first sample application," and they may advocate that you create this for your very first sample application, when really, if you're getting ready to deploy into production, you need something more like this, which is why you're here on Cloud Academy. Even people who claim to know Amazon Web Services might build something like this, and I'm sure you've seen something like this where, rather than DynamoDB here, we've substituted in RDS, which actually has the same SLA and doesn't have any ability to do this auto-scaling to speak of. So you might think, "Oh wow! Look how good my setup is." That just doesn't work once your company gets past a certain size. So even if you're a start-up with a million dollars or $10 million in revenue, with just the auto-scaling compute layer, this is what a lot of companies see. This is what it's going to cost them per year on average if they have not very sophisticated DevOps processes, rely on the Amazon SLAs, and only have auto-scaling on the compute layer. Pretty bad.
So we need to apply everything that we've learned in this course to go through this process that I just did, taking into account not only the availability concerns for Amazon, but also people getting sick, human error, these kinds of things. I've bundled everything into human error, but you can actually do a much more detailed cost breakdown. And this goodwill factor needs to be a sophisticated analysis typically, or just a guess if you're a start-up and you don't have the time or money to invest in that analysis. But this is the hardest number to come up with. Revenue per year is fairly straightforward, and actually the rest of these percentages and projections are pretty straightforward if you write a calculation like this. All you need to do is create a table of all of the independently adjusted probabilities, multiply them all together to get the uptime, and add the hours per year, the revenue per year, and a goodwill factor. Hours per year down is simply one minus the decimal representation of this uptime percentage, times the number of hours per year. Revenue per hour times the number of hours of downtime gives us our cost per year, and we multiply by the goodwill factor if we include it there.
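The calculation described above can be sketched in a few lines of Python. This is one possible reading of the spreadsheet, not a copy of it: the function names are illustrative, the compute and database availabilities are hypothetical placeholders, and the human error entry reuses the figure of 12 one-hour incidents per year from earlier in the lecture.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # ignores leap years


def hours_down_per_year(availabilities):
    """Multiply independent availabilities into an uptime, then convert to hours down."""
    uptime = prod(availabilities)
    return (1 - uptime) * HOURS_PER_YEAR


def expected_downtime_cost(availabilities, revenue_per_year, goodwill_factor=1.0):
    """Expected yearly cost of downtime, scaled by the goodwill factor.

    Assumes the availabilities are independent and that revenue is
    earned evenly across every hour of the year.
    """
    revenue_per_hour = revenue_per_year / HOURS_PER_YEAR
    return hours_down_per_year(availabilities) * revenue_per_hour * goodwill_factor


# Hypothetical table of independent availabilities (placeholder values).
availability_table = {
    "compute layer": 0.9995,                   # assumed, not from the lecture
    "database layer": 0.9995,                  # assumed, not from the lecture
    "human error": 1 - 12 / HOURS_PER_YEAR,    # 12 one-hour incidents per year
}

cost = expected_downtime_cost(
    availability_table.values(),
    revenue_per_year=10_000_000,
    goodwill_factor=10,
)
print(f"Hours down per year: {hours_down_per_year(availability_table.values()):.1f}")
print(f"Expected yearly cost of downtime: ${cost:,.0f}")
```

Running the same function once per architecture, with that architecture's table of availabilities, gives the row of yearly costs that the comparison between tiers is based on.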
Hopefully during this lecture you've picked up a couple of good tools. So that was the end of our costing demo and actually the end of our entire series on advanced high availability and design for Amazon Web Services on Cloud Academy. Hopefully, you learned a thing or two. You've taken away various skills around understanding the different kinds of risks that you face when designing a software system in the Amazon Web Services cloud. We've talked about how to mitigate the different risks that you might face, including rare and common errors of both the human and non-human varieties. We also walked through a planning session and the logic that we might go through when designing, for instance, a RESTful JSON API that needs to run in Amazon Web Services. And finally we took our six different options for different sophistication levels of high availability, priced them out in terms of the expected cost to the business of downtime, and then realized that we can check where the threshold is and pick which of our six we need to work on based on our goodwill factor and our revenue per year. For low-revenue businesses, it might make sense to pick one of the less sophisticated setups, because the more sophisticated ones require a lot of build-out, and for more sophisticated, high-revenue companies, it's a no-brainer that you need to employ these high availability techniques and go and work on some of those more advanced templates that we were looking at in the flow chart.
So thanks for watching. Hopefully all of that sank in and you can move forward with your Amazon Web Services DevOps engineering. Refer back to the flow charts on these slides for the different levels of sophistication as you move forward with taking your exam. Thanks.
About the Author
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.