This course covers the Architect an Azure Compute Infrastructure part of the 70-534 exam, which is worth 10-15% of the exam. The intent of the course is to help fill in any knowledge gaps that you might have, and to help prepare you for the exam.
Welcome back. In this lesson we'll talk about virtual machine availability.
In order to understand how to implement high availability for VMs, it's important to consider the different causes for a system to become unavailable. Now while failures do happen at the operating system and application level, this lesson is focused on the Azure specific causes.
Now there are two Azure-related causes for a VM to be unavailable: planned and unplanned maintenance. Planned maintenance is where Microsoft makes some sort of update to the Azure platform. This could be any sort of improvement. For example, it could be security patches, performance fixes, or anything like that.
A lot of the time you won't notice that anything has even happened. However, there will be times when your VM will need to be rebooted, for example, if Microsoft patches the hypervisor that's running the VM. Now unplanned maintenance, as the name suggests, happens due to some unexpected event. This could be a power failure in the rack, a hardware failure, a network failure, or some other failure that happens at the server rack level.
Because of these two possible causes for outages, Microsoft provides a concept of update domains and fault domains. These correspond to the two types of outages I just mentioned. To ensure that your solutions remain available during a planned maintenance event, you can use update domains. Update domains conceptually combine virtual machines together into groups where the VMs can be rebooted at the same time.
To help with unplanned maintenance, Azure offers fault domains. And a fault domain represents an individual rack of servers. You can use between one and three fault domains, which essentially means that your VMs are being run on physical servers that are in different racks. Now because of this separation, should there be a rack level issue, the other servers will still be available.
The way that you use fault and update domains is through an availability set. The availability set allows you to change the number of fault and update domains. And in order to ensure high availability and have an SLA of at least 99.95 percent, you need to have at least two VMs, and at least two fault domains, and two update domains. Azure will automatically alternate the distribution of the VMs into the fault and update domains.
Now here's an example, if you have two fault domains and four update domains, then the first VM will be created in fault domain one, and update domain one. The next VM will be in fault domain two and update domain two. The third VM is going to be in fault domain one and update domain three. The fourth VM will be in fault domain two and update domain four. And then if you created a fifth VM it would belong to fault domain one and update domain one.
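The placement pattern in that example can be sketched in a few lines of Python. This is only an illustration of the looping behavior described in this lesson, not Azure's actual placement logic, and the function name `assign_domains` is a hypothetical helper made up for the sketch.

```python
def assign_domains(vm_count, fault_domains, update_domains):
    """Return a (fault_domain, update_domain) pair per VM, numbered from 1.

    Illustrative only: it mirrors the looping pattern from the lesson,
    and is an assumption about the pattern, not Azure's real allocator.
    """
    placements = []
    for i in range(vm_count):
        fd = i % fault_domains + 1   # wraps back to 1 after the last fault domain
        ud = i % update_domains + 1  # wraps back to 1 after the last update domain
        placements.append((fd, ud))
    return placements


# Two fault domains and four update domains, five VMs, as in the example.
for n, (fd, ud) in enumerate(assign_domains(5, 2, 4), start=1):
    print(f"VM {n}: fault domain {fd}, update domain {ud}")
```

The fifth VM lands back in fault domain one and update domain one, which matches the walk-through.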
Notice how it loops through the fault and update domains to evenly distribute the virtual machines. When you plan out the design of highly available applications, it's common to create an availability set for each application tier.
You then use a load balancer to communicate between those tiers. Now this ensures that the individual components are created in fault and update domains of their own, reducing the likelihood that all of your servers for a particular tier are unavailable at the same time. And the load balancer ensures communication between the tiers, because the load balancer will direct traffic only to running instances.
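As a rough sketch of the idea that a load balancer only sends traffic to running instances, here is a minimal round-robin router in Python. The VM names, the `instances` dictionary, and both function names are hypothetical; a real Azure load balancer uses health probes rather than a hand-maintained flag.

```python
from itertools import cycle


def healthy_backends(instances):
    """Keep only the instances currently marked as running."""
    return [name for name, running in instances.items() if running]


def route(instances, request_count):
    """Spread requests round-robin across the running instances only."""
    targets = healthy_backends(instances)
    if not targets:
        raise RuntimeError("no running instances in this tier")
    rotation = cycle(targets)
    return [next(rotation) for _ in range(request_count)]


# Hypothetical web tier where one VM is down for maintenance.
web_tier = {"web-vm-1": True, "web-vm-2": False, "web-vm-3": True}
print(route(web_tier, 4))
```

The VM that is down never receives a request, so the tier keeps serving as long as at least one instance is running.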
When it comes time to create an availability set, you'll need to determine how many fault and update domains to use for your application based on its expected usage. Using availability sets will help to ensure that your application has a high level of availability.
However, if everything is inside of one region, then your availability will only ever be as high as that region's SLA. Regional outages aren't common, however they can and do happen. So, if you need higher availability than a single region can offer, you'll have to use a multi-region deployment.
When it comes to multi-region deployments, there are different options for how you might configure things depending on your availability requirements and your budget.
If you need an extremely high level of availability, then you can use an active/passive model with hot standby. With this approach you have another version of your solution running in a secondary region, and it doesn't serve up any traffic unless there's a failure in the primary region.
A variation on that is the active/active model, with geo-location based request routing. Now this is similar to the previous option, however, the solution that's running in the secondary region is actively serving up requests to the users who are closer to that region than the primary.
Then there's the active/passive model with cold standby, which basically means that there's no solution running in a secondary region; rather, it's dynamically created when the first region is unavailable. This is a great option if you want to balance the cost against the SLA. The switchover is not going to be immediate; however, with a well-defined automation plan, this is a viable option.
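To compare the three multi-region models side by side, here is a minimal sketch of the routing decision each one implies. The model names, the `choose_region` function, and its parameters are all hypothetical labels for this illustration, and the sketch assumes the secondary region itself is available.

```python
def choose_region(model, primary_healthy, user_near_secondary=False):
    """Return (region, needs_provisioning) for a single request.

    Illustrative assumption: 'needs_provisioning' is True only when the
    secondary environment has to be created before it can serve traffic.
    """
    if model == "active_active":
        # Both regions serve traffic; users go to the closer one.
        if not primary_healthy:
            return ("secondary", False)
        return ("secondary" if user_near_secondary else "primary", False)
    if model in ("hot_standby", "cold_standby"):
        # The secondary only takes traffic when the primary has failed.
        if primary_healthy:
            return ("primary", False)
        # Hot standby is already running; cold standby must be built first.
        return ("secondary", model == "cold_standby")
    raise ValueError(f"unknown model: {model}")


print(choose_region("hot_standby", primary_healthy=False))
print(choose_region("cold_standby", primary_healthy=False))
print(choose_region("active_active", primary_healthy=True, user_near_secondary=True))
```

The only difference between the two active/passive variants here is the provisioning flag, which is where the slower switchover of cold standby comes from.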
All right, that's going to wrap up not only this lesson but this course as well. In the next course we'll pick up on the next domain objective. So if you're ready to keep learning then I'll see you in the next course.
About the Author
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.