High Availability

This course is focused on the portion of the Azure 70-534 certification exam that covers designing an advanced application. You will learn how to create compute-intensive and long-running applications, select the appropriate storage option, and integrate Azure services in a solution.


Welcome back. In this lesson, we're going to pick up our discussion on Availability. We talked about it in our last lesson in a rather abstract way, so in this lesson, we'll dive into more concrete information. We're going to talk about how to create highly available systems in general, not specific to any one cloud provider.

From our previous lesson and maybe even from your experiences, you know that systems can become unavailable for a lot of different reasons. Let's look back at the five that we talked about in our last lesson and see if we can't turn those potential reasons for outages into a more resilient system.

First up, we had Software Bugs. In this example, we have several servers, and let's imagine that they're all running the same web application. Software bugs can come in the form of bugs in our code or bugs in someone else's code, and by that, I mean bugs that are introduced into the operating system or some third-party piece of software that we use. So if we deployed some buggy piece of software to all of our systems at the same time, and that bug resulted in an outage, then our system would be unavailable, because none of the servers would be able to respond. And so deployment strategies that can minimize this potential for downtime are how we ensure availability from software-related changes.

In the Introduction to Continuous Delivery course, we talked about two deployment methods for avoiding downtime. If you haven't watched that course, you may want to check out at least the lesson on deployments. Mitigating downtime from software bugs requires a lot of testing at different levels of the continuous integration and continuous delivery process. It also requires deploying in a way that doesn't change the currently running servers, so that, should you need to roll back, you can use the existing servers. There are different ways of doing this, such as Blue-Green and Canary deployments. The important thing to know at this point is that there are well-established patterns for deploying software changes to production that will help avoid downtime. And you can always look them up later by searching for Zero Downtime Deployments.
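To make the canary idea a little more concrete, here's a minimal sketch in Python of weighted traffic splitting, which is the core of a canary rollout. The function name, pool names, and the 5% weight are all hypothetical, not taken from any particular load balancer's API:

```python
import random

random.seed(7)  # only so this example is reproducible

def choose_backend(canary_weight, stable="stable-pool", canary="canary-pool"):
    """Send a small fraction of requests to the new (canary) version.

    If the canary misbehaves, rolling back is just setting the weight
    to 0 -- the stable servers were never touched.
    """
    return canary if random.random() < canary_weight else stable

# Route 10,000 simulated requests with 5% going to the canary.
routes = [choose_backend(0.05) for _ in range(10_000)]
canary_share = routes.count("canary-pool") / len(routes)
```

In practice a managed load balancer or traffic-routing service does this weighting for you; the point is only that the known-good pool keeps serving while the new version is evaluated on a small slice of real traffic.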

Next we talked about Component Overload. This is a pretty common issue and can be caused both by legitimate traffic and by malicious users with the goal of causing downtime. Have you ever seen an interesting article on Hacker News or Reddit, and you clicked on the link to find that the page doesn't load, or maybe it throws an error? This is sometimes called the Reddit hug of death. What happens is that too many people are trying to go to that site at the same time, and the server or servers are overloaded with too much traffic. It could be that too many requests are hitting the servers and using up all the memory. Or it's taxing the hard drive, or some maximum number of sockets, or it could be something else. But the gist of it is that the server or servers aren't able to keep up with the demand. So, how do you avoid something like this? Well, this is where scaling can come into play, either vertical or horizontal, depending on the amount of traffic.
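As a rough illustration of the horizontal-scaling math, here's a sketch of estimating how many servers a given request rate needs, with some headroom so a spike doesn't immediately saturate the fleet. The numbers and the function itself are illustrative, not any provider's autoscaling API:

```python
import math

def servers_needed(requests_per_second, capacity_per_server, headroom=0.25):
    """Estimate fleet size for a traffic level.

    headroom reserves a fraction of each server's capacity as spare,
    so normal variance in traffic doesn't overload anything.
    """
    effective = capacity_per_server * (1 - headroom)
    return max(1, math.ceil(requests_per_second / effective))

normal = servers_needed(200, capacity_per_server=100)   # 3 servers
viral = servers_needed(2000, capacity_per_server=100)   # 27 servers
```

The jump from the first number to the second is the Reddit hug; horizontal scaling handles it by adding servers behind a load balancer rather than buying one enormous machine.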

Next we have Natural Disasters. If all of your servers are in a data center in one place, let's say they're in San Francisco, and an earthquake happens and cuts off communication to the data center, then your system is going to be unavailable. Natural disasters happen, and so when we plan for high availability, we need to consider this. There are different ways we can deal with it, depending on what you've identified as your availability requirements. Either way, it's going to result in the need to set up an environment in a different geographical region. This environment may be always up and running, syncing a read-only copy of the database. Or it may just be that you have a disaster recovery plan in place, so that you can easily get a backup of your production environment running in a nearby region. The option you choose will depend on your specific uptime requirements.

Next, we have Hardware Failure. It doesn't matter if you're using your own hardware or it's in the cloud and somebody else manages it for you; if hardware fails, then your system can too. So this is where redundancy comes into play, which means you have backup components that kick in to replace broken components. Redundancy happens at all layers of technology. Servers may have redundant power supplies, hard drives, etc. Our system may use a database, and that database may have a redundant database that's synced up and ready to become the primary, should something happen to the current primary. We often use multiple data centers inside of a geographical region to serve as a redundant copy of the system. This allows us to prevent a single point of failure. Removing single points of failure is important; if that one component breaks, then the entire system can go down.
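The "replica ready to become primary" behavior boils down to a failover rule like the sketch below. The names are illustrative; real failover systems add health checks, replication-lag checks, fencing, and so on:

```python
def pick_active(primary_healthy, replica_healthy):
    """Serve from the primary while it's healthy; promote the synced
    replica when it isn't; report an outage only if both are down."""
    if primary_healthy:
        return "primary"
    if replica_healthy:
        return "replica"
    return "unavailable"

# Normal operation, a failover, and a full outage:
states = [
    pick_active(True, True),
    pick_active(False, True),
    pick_active(False, False),
]
```

The key property is the last line: the whole tier only becomes unavailable when every redundant copy has failed, which is exactly why redundancy improves availability.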

Next we have Malicious Users. In regards to availability, malicious users can cause an overload of a component, like we talked about earlier, and cause a denial of service. Preventing a service disruption from a denial-of-service attack can require some planning before you implement your architecture. You need to start by reducing the attack surface. The attack surface is a way of saying: the systems and components that are publicly facing and can be attacked. You do this through some form of firewall. Virtual private clouds offer this sort of firewall for you in the cloud. Depending on what's being attacked, a web application firewall may also help to mitigate a flood of HTTP requests. And you'll need to be able to scale your system out, so that it can accommodate the additional load. There are different services that can help you mitigate denial-of-service attacks in the cloud, things like Route 53 and CloudFront on AWS. And App Engine allows for blacklisting IP addresses and subnets on Google Cloud. So those are a couple of cloud options that can help you.
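One concrete layer of overload protection at the application edge is rate limiting. Here's a minimal token-bucket sketch in Python; it won't stop a distributed attack on its own (that's what the upstream services mentioned above are for), but it caps what any single client can send. The class and the limits are illustrative:

```python
import time

class TokenBucket:
    """Allow bursts up to `burst` requests, then throttle to roughly
    `rate` requests per second on average."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens refilled per second
        self.capacity = burst     # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A client bursting 20 requests against a 10-request burst limit:
bucket = TokenBucket(rate=5, burst=10)
allowed = sum(bucket.allow() for _ in range(20))
```

After the initial burst is spent, further requests are rejected until tokens refill, so one abusive client can't consume the whole server's capacity.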

So we've covered the same five common reasons that we talked about previously for systems to become unavailable. Only we talked a little bit about how to combat them. Let's go through some designs and talk about the availability of each.

In this example, we have a single web server connected to a database server, and both the web and database servers are directly exposed to the world. So we have a couple of single points of failure. Now, that may sound confusing, because we're saying that there are multiple single points of failure, and it may sound like that can't happen. However, the word single is in reference to a single component, and it means a component that doesn't have a backup. If either the web server or the database server were to go down, then our system would be unavailable. Now, depending on our requirements, this may be okay. Remember, it is a business decision to determine the ideal availability. So if this configuration meets the availability, budgetary, and security requirements, then this may be all you need. Assuming that the web and database servers are large enough to handle the expected load, we need to know whether this design meets our uptime requirements. So we start by determining the availability of the underlying system. Let's assume that the hardware for our web server has an uptime of 99.95%, and our database server has the same. Because both systems need to be up and running for our system to be available, we can calculate the estimated availability by multiplying the two availability numbers together: 0.9995 x 0.9995 is 0.99900025. So without considering the potential downtime from system updates and deployments, we're at an estimated 99.9% uptime. That works out to roughly 43 minutes of potential downtime per month, or around nine hours per year.
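The arithmetic above generalizes: for components that must all be up, multiply their availabilities. A quick Python check of the numbers in this example (the helper names are my own):

```python
def series_availability(*components):
    """Availability when EVERY component must be up: multiply them."""
    total = 1.0
    for a in components:
        total *= a
    return total

def downtime_minutes(availability, period_minutes):
    """Expected downtime over a period, in minutes."""
    return (1 - availability) * period_minutes

# One web server and one database, each at 99.95%:
system = series_availability(0.9995, 0.9995)          # 0.99900025
per_month = downtime_minutes(system, 30 * 24 * 60)    # ~43 minutes
per_year_hours = downtime_minutes(system, 365 * 24 * 60) / 60  # ~8.8 hours
```

Notice that chaining components in series always lowers availability; each additional must-be-up component multiplies in a number slightly below 1.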

What if we required a better base uptime? Let's start fixing up this design and see if we can't improve that uptime. We need to add a redundant web server so that we have a backup, and so we'll start by adding a load balancer. For the sake of this discussion, we're going to assume that the load balancer is capable of scaling to handle the load as needed, so we'll be mostly ignoring it. And we'll be adding a second web server; again, we'll assume that it has a base uptime of 99.95%. Now the base uptime of our web tier has gone up to 0.99999975 by having the two of them, because each individually has a 0.0005 chance of failure, so the odds of both of them failing at the same time are 0.0005 x 0.0005, or 0.00000025. If we calculate the availability of this system, we can see that we're at roughly 99.95% uptime. So we were at 99.9% availability, and now we've increased it to 99.95% by adding an additional web server. We're now looking at a base downtime of roughly four hours per year. But we still have a single point of failure with that database, so let's take another pass and try to improve the uptime even further.
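Redundant copies combine the other way around: the tier is only down if every copy fails at once, so you multiply the failure probabilities instead of the availabilities. Checking the numbers from this design in Python (helper name is my own):

```python
def parallel_availability(*redundant_copies):
    """Availability of a redundant tier: it's down only if ALL copies
    fail at the same time, so multiply the failure probabilities."""
    p_all_fail = 1.0
    for a in redundant_copies:
        p_all_fail *= (1 - a)
    return 1 - p_all_fail

# Two web servers, each at 99.95%:
web_tier = parallel_availability(0.9995, 0.9995)  # 0.99999975

# The web tier and the (still single) database are in series:
system = web_tier * 0.9995                        # ~0.9995, i.e. ~99.95%
```

So redundancy within a tier pushes that tier's availability up dramatically, but the overall system is then limited by whatever single point of failure remains, here the database.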

So here we've added a read-only replica that will kick in only if the primary database fails. So our database is no longer a single point of failure, and we end up with an estimated base uptime of roughly 99.9999%. Up until now, we have been talking about this stuff at a generic level, so there are things we haven't considered in our design. Things like: what happens if a server does go down? How do we get it back online, or how do we get a server to replace it? The same goes for the database server; if it goes down, what happens next? What happens if our data center goes down due to some sort of natural disaster? Or what happens if our site goes viral and we're suddenly seeing ten times more traffic? So what we've done is architect a system that looks okay on paper, but it's not great. For noncritical, low-traffic sites, we could possibly include a disaster recovery plan and call this good. However, for systems that need to be highly available, we need to do a bit more work. So let's keep building this out.

In this example, we're going to be offline if something happens to our data center. So we can add in an additional level of availability by deploying to a mirrored environment. Now, the mirrored environment could be a cold, warm, or hot site. The word site in this context refers to a data center. Here's the difference between them. A cold site is basically having access to a data center, and it's used for recovery reasons. A warm site is similar to a cold site, except that the data center will contain the required hardware and is just waiting for you to load your software and data, and use it. This allows you to get up and running faster than a cold site. And a hot site is all set up with hardware and software, and you just need to switch the traffic over to it.

When talking about the cloud, this is a little bit different, because the cloud providers abstract away the hardware. A cold environment basically just means that you have your code, assets, and data backed up, so that you can redeploy them in the cloud to some location. A warm environment in the cloud often means that you have the infrastructure configured, but it's not ready to handle the amount of traffic from the production environment yet. Maybe you need to start up more VMs or get the data loaded from the production environment. So it's not an immediate switchover, but it's much faster than a cold environment, because the infrastructure is already configured. And a hot environment is one where everything is set up. The data is always synced, and it will be able to handle the traffic when you switch it over. So it's ready to use at any time. And this is another area where it's going to depend on the requirements of the system whether you go with cold, warm, or hot. The more mission critical a system, the higher the availability it will require. And that means we'll need it to be more resilient.

So let's mirror our environment. This is similar to our previous design, except we're now going to use a separate data center to run the mirrored environment. Notice that the environment on the right doesn't have the same number of servers as the primary environment on the left. That's because this is a warm environment, meaning you can switch traffic over to the backup environment should you need to, but it's not going to be able to handle the load immediately; it may need to spool up some additional servers first. This is useful when you need to minimize the risk of a data center outage, but don't have the budget for a hot environment. Here, we have a hot environment: it's a mirror of what's in the primary environment, and it's just sitting there waiting to have traffic routed to it. Again, which one you build depends on the system's requirements.

You may notice that these two data centers are located in the same geographic region. If there was a natural disaster, it's possible that they'd both be out of commission. Depending on what happened, getting your system back online may not even be the priority. If your data centers are down because aliens came to Earth and destroyed them, then you probably have other things on your mind. However, if a small earthquake disrupted your data centers' connectivity, then users outside of that area may not expect to see a disruption in availability. So that's where leveraging multiple regions makes sense. Again, this could be a cold, warm, or hot environment. Maybe having the ability to restore to another region in a reasonable amount of time is acceptable.

So as long as you have a disaster recovery plan that doesn't involve those downed data centers, then maybe you're okay. Or maybe you want a warm environment like the one here, where you could route traffic to the other region, but you're going to lose some of the requests as the new environment spools up enough resources to meet the demand. Or for a system that can tolerate less risk of an outage, maybe a hot site like this one here is optimal, and you use both environments, serving requests to users from the closest region. This will help reduce latency as well.

So with this design here, we currently have a platform that will be highly available. Though these designs don't cover things like data replication, failover, network configuration, and recovery methods. That's because those topics start to go from beginner to intermediate, and they also start to get into actual implementations. I want to keep this at a generic enough level that it will work across multiple cloud providers. The cloud has made it easier to create highly available systems, because we can push a button or enter a command on the command line and have an entire environment built for us. Cloud platforms make it so that we don't need to think about a lot of the challenges of running and maintaining hardware. We get to think about data centers in a more abstract way. Terms like zone allow us to think about the concept without needing to understand how that data center is managed.

Cloud platforms also offer a lot of value due to the ability to use resources on demand. If you need extra servers to handle a spike in traffic, then you can add them behind the load balancer and shut them down when the traffic dies down, saving you the cost of buying servers that you would only need on occasion. And cloud services for file storage tend to be highly available and highly durable. So cloud platforms offer a level of ease for highly available systems that is difficult to achieve without them.

We're going to wrap up this discussion on Availability by summarizing the key takeaways. First, don't over-engineer. You need to consider the real-world constraints when architecting systems. Second, when you need high availability, avoid single points of failure with redundancy. This applies at both the component level and the environment level. By a component, I mean things like servers, databases, etc. And by an environment, I mean the complete system, so having a mirror of that environment running in another location. Next, the level of availability is a business decision, not a technical one. Finally, understand the percentages. For example, you'll want to know the difference between 99.95% and 99.99% uptime.

The designs we've made so far have all assumed that the number of servers that exist will support the traffic. However, what if your site goes viral and you gain 10 times the traffic? In our next lesson, we're gonna talk about how to solve that problem with Scalability. So if you're ready, let's get started.

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.