This course is focused on the portion of the Azure 70-534 certification exam that covers designing an advanced application. You will learn how to create compute-intensive and long-running applications, select the appropriate storage option, and integrate Azure services in a solution.
Welcome back. In this lesson we're going to talk about what it means to be available, and what factors may contribute to availability. We'll also talk about how the Cloud has given everyone the potential to achieve higher availability.
We've implicitly defined availability at different times throughout the learning path in the context of specific applications. Let's define it explicitly here so that we're using a shared language throughout the course. We're going to say that availability is about having something that's ready and able to be used, which means if something is available, then it is usable. And if it's not available, then it's not usable. In the context of a web application such as Netflix or Etsy, this would mean that if you're unable to use their services and the issue is on their end, then the system is unavailable. Now that's an important distinction: their service is not unavailable if your internet connection is down. So, based on this, we're talking about up time and down time. If a system is up then it's available, and if it's down then it's unavailable.
So we have an idea of what availability is, but we haven't talked about why it matters. Now there isn't a singular answer to why system availability matters. Each system is going to have its own requirements for how available it needs to be, and so the reasons for its availability would be contextual. However, some of the common drivers for availability are money, safety, and information. Now the first two may stand out as being fairly obvious. Outages can cost money for companies and they could also cost lives. Imagine if something as important as a flight control system went down. This could potentially cost people their lives. Now the third one may be less quantifiable. Imagine that the systems that run the Large Hadron Collider had issues with availability. Would we still have discovered the Higgs boson in our lifetime? So having systems that are responsible for scientific research but that have low availability could set back important work. So there are many different reasons why availability matters for systems, and the reason is contextual based on the system in question.
This might lead you to wonder why we don't build all systems for 100% availability. And that's a good question. It'd be nice if all the systems we build were usable and available all the time. However, that ignores the real-world constraints that exist. Now there are a lot of potential reasons why we as engineers may build something with a lower availability.
First, some systems aren't that important. Imagine you have a small web application that's hosted internally to your company and its primary goal is to send a monthly company newsletter. Now obviously this is an overly simplistic system to help make a point. However, a system like this doesn't require 100% up time and probably doesn't even need to be at 95%. So not all systems are going to require high availability. And this becomes a factor that we need to consider when architecting systems. It's a business decision to determine the desired availability of the system.
Next, cost can be prohibitive. Even if a system could benefit from 100% availability, and even with the use of the Cloud, some systems would be too expensive to architect for 100% up time. So, cost is a factor when architecting for high availability. This is another business decision. Businesses need to try and get the most up time possible for their system's operational budget. The cost of availability doesn't scale linearly. As an example, going from 95% availability to 99% availability may only cost a couple hundred dollars more a month, but going from 99% to 99.999% may cost several thousand dollars more monthly.
Next, system or platform limitations could be an issue. With the adoption of the Cloud, this is becoming less of an issue. However, there are still legacy systems that need to be run and managed that don't actually support the techniques for horizontal scaling, which is one mechanism that we use to help with high availability. So these are just three of the many possible reasons why systems may not be built with 100% availability. And cost tends to be the largest factor in why something like 99.999% up time is a more achievable number.
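To illustrate why horizontal scaling is one mechanism that helps with availability, here is a rough sketch. It assumes that each instance fails independently of the others and that the system stays up as long as at least one instance is up. Real systems rarely meet both assumptions exactly, and the function name `combined_availability` is made up for this illustration.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Estimate availability of N redundant instances.

    Assumption (not from the lesson): failures are independent and the
    system is usable whenever at least one instance is up.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The system is down only when every instance is down at the same time.
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down


for n in (1, 2, 3):
    print(f"{n} instance(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

Under those assumptions, adding a second 99% instance moves the estimate to about 99.99%. A legacy system that can't run more than one instance doesn't get that option.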
So, when we talk about availability it can seem like we're sometimes not talking about time because we talk about it in percentages. We say 99% or 99.9%, et cetera. Let's take a look at this table here. This shows the actual allowed down time based on the availability number. This chart comes from Wikipedia and you can see there's a pretty sizable difference between 90% up time, which allows for 36.5 days of down time annually, and the roughly 5.26 minutes (a little over five minutes and 15 seconds) a year that a 99.999% up time allows.
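The numbers in a table like that can be reproduced with simple arithmetic. Here is a minimal sketch, assuming a 365-day year. The function name `allowed_downtime_seconds` is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the down time per year, in seconds, that a given up time allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


for percent in (90, 95, 99, 99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(percent)
    print(
        f"{percent}% up time allows about "
        f"{seconds / SECONDS_PER_DAY:.4f} days "
        f"({seconds / 60:.2f} minutes) of down time per year"
    )
```

Each extra nine cuts the allowed down time by a factor of ten, which is why the gap between 90% and 99.999% is so large.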
Now that we know that availability is about having systems that are up and running and usable by those who need to use them, we should talk about some of the reasons that systems can become unavailable. When talking about something like this, there are really just too many possible reasons. So the best we can do is narrow it down to a few common reasons. So that's what we'll do here.
First, we have software bugs. Even for websites that have minimal complexity, just a few web pages, there could still be hundreds of lines of code that are responsible for them. And if you factor in the web server and the operating system, the number of lines of code skyrockets. So a single line of code changing or a configuration file changing could break everything. And the potential for breaking changes grows as the complexity of the site grows. So the code and configuration that is supposed to keep everything running could easily break something and bring it offline.
Next we have component overload. If a server could handle all the traffic we threw at it, then we wouldn't have much to talk about. However, servers can only handle a finite amount of traffic. The exact amount depends on a lot of factors such as the hardware, the operating system, the software that's running, the network, et cetera. So when a web server, a router, or some other component gets to the point where it's working too hard, it can start throwing errors and stop serving up responses. Imagine it's like a toll booth on the highway. When traffic is light, the cars slow down, go through the toll booth one at a time and then drive off, and the cars behind them are unaffected. However, when there's a lot of traffic, this single toll booth is a bottleneck, forcing cars to come to a complete stop, sometimes for miles back.
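The toll booth analogy can be sketched as a tiny simulation. This is only an illustration under simplified assumptions: traffic arrives at a constant rate, the component serves a fixed number of requests per tick, and nothing is ever dropped. The name `simulate_backlog` and the numbers are made up.

```python
def simulate_backlog(arrivals_per_tick: int, capacity_per_tick: int, ticks: int) -> list[int]:
    """Return the number of waiting requests at the end of each tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick              # new cars reach the toll booth
        served = min(backlog, capacity_per_tick)  # the booth can only process so many
        backlog -= served
        history.append(backlog)
    return history


print("light traffic:", simulate_backlog(arrivals_per_tick=5, capacity_per_tick=10, ticks=5))
print("heavy traffic:", simulate_backlog(arrivals_per_tick=15, capacity_per_tick=10, ticks=5))
```

When arrivals stay below capacity the backlog stays at zero. Once arrivals exceed capacity the backlog grows every tick, which is the line of cars stretching back for miles.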
Next we have natural disasters. If all of the resources required to run your site are hosted in a single data center, and a natural disaster disrupts the normal operation of that data center, chances are your site will be down until normal operations for that data center are resumed. This has happened before, and it will happen again. Nature doesn't care if we have uninterrupted access to the services we want to use. This can be a challenge because it may require a separate redundant system in a different geographic region.
Next up we have hardware failure. Hardware components will eventually fail and that can cause outages for anything that relies on that hardware. This means that we need to consider the potential for hardware failures when we architect systems. Cloud systems can help us abstract away some of the potential implications of hardware failure, though not all.
Next we have malicious users. This can be in the form of a denial of service attack, or something else. A denial of service attack is a form of attack where a system is flooded with requests to the point where it can no longer handle the legitimate traffic. This is a common tactic used by groups looking to extort money from companies that require their systems to be up and running. They use a distributed denial of service attack and then offer to stop for a price. So this requires us to implement mechanisms that can help prevent these sorts of attacks from causing an outage for legitimate users.
And these have only been five of the potentially limitless causes for systems to become unavailable. However, in these five, we've covered a lot of the typical causes for system failures. Let's do a quick recap because we've covered a lot in this lesson. Availability means that a system is up and running and usable, and availability matters for different reasons depending on what the system does and what the users expect of it. Unless you start working on things like communication satellites or something along those lines, targeting 100% up time just won't be practical. And there are a variety of reasons that a system could become unavailable.
In our next lesson we're going to continue the conversation on availability. Now I know it may feel like we're spending a lot of time on it, and that's true. However, availability is an important part of modern systems, so we'll dive in a bit more. We're going to go in depth in the next lesson and talk about how to design systems that have greater availability, the underlying platform, and how managing risk can help us ensure our systems remain available even if one component fails. Okay, whenever you're ready, let's dive in.
About the Author
Ben Lambert is the Director of Engineering and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps.
When he’s not building the first platform to run and measure enterprise transformation initiatives at Cloud Academy, he’s hiking, camping, or creating video games.