This is the first of six preparation courses for the Architecting Microsoft Azure Solutions 70-534 certification exam. By the end of this course, you will have gained a solid understanding of Azure data center and VPN architecture. We will cover Azure’s use of Global Foundation Services for its data centers, virtual networks, Azure Compute (IaaS, virtual machines, fault domains), VPNs, and ExpressRoute. This session will also feature a high-level discussion of Azure services (load balancing options, Traffic Manager, and more).
Welcome back. In this lesson, we'll be talking about Global Foundation Services. We're going to cover Microsoft's data centers, regions and special regions, as well as designing for failure.
So let's start off with Microsoft's datacenters. The Cloud is the modern way to create and deploy new IT solutions. It allows us access to resources on demand with no upfront investments. Companies can progressively reduce on-premises hardware and software infrastructure because Cloud platforms offer the same resources in a self-service, on-demand, pay-per-use way. Microsoft is investing in large facilities containing many thousands of servers. Between the large number of servers and the fact that continuous improvements in technology allow more VMs to run on a single device, the price for us as customers can go down.
Now Azure isn't just one datacenter or region. It's multiple datacenters distributed all over the world. Microsoft has been releasing new regions to expand its coverage, and it's up to 30 as of the time of this recording. That number is likely to keep growing. All datacenters are managed by a Microsoft organization named Microsoft Cloud Infrastructure and Operations, which used to be called Global Foundation Services, so you may see both names used interchangeably. To keep a regional issue, such as a natural disaster or a power outage, from impacting our running services, Azure pairs regions. If you replicate your workload across a pair, you can achieve higher availability than with a single region alone.
Azure currently operates in 140 countries, in 10 languages and 24 currencies, and that's because different countries have different needs. Having a datacenter in a country can affect service or data affinity. Some companies or government agencies may require that their data or services are hosted in some regions but not others. Another benefit of having worldwide datacenters is reduced latency. If your service is hosted in the West Europe region but most of the traffic comes from the west coast of the U.S., then users will experience slower requests than they need to. Moving the service to a region closer to those users gives them a better experience. Alternatively, you could have Azure automatically direct requests to the endpoint with the least latency.
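As a rough sketch of that last idea, routing each request to the endpoint with the least latency, here is a small Python example. It is not how Azure implements this. The endpoint names and the latency figures are hypothetical, made up only to mirror the West Europe versus U.S. west coast scenario above.

```python
def pick_lowest_latency(latencies_ms):
    """Return the name of the endpoint with the smallest measured latency."""
    if not latencies_ms:
        raise ValueError("no endpoints to choose from")
    # min() compares the endpoints by their latency value.
    return min(latencies_ms, key=latencies_ms.get)


# Hypothetical round-trip times (milliseconds) seen by a user on the U.S. west coast.
measured = {"west-europe": 148.0, "west-us": 23.5, "east-us": 71.2}
closest = pick_lowest_latency(measured)
```

With these made-up numbers, the user would be sent to the hypothetical west-us endpoint rather than west-europe.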
It's worth mentioning there are some regionally unique exceptions. For example, in China, Azure is hosted with a third-party ISP due to special requirements, since those datacenters aren't allowed to share data with regions outside of China. Another special region is Brazil South, which doesn't have a paired region in South America and is instead paired with South Central U.S. That's a non-reciprocal pairing, meaning South Central U.S. is not paired with Brazil South. And then there are some government-only Azure regions, which are tailored to the needs of specific government agencies. If you notice that certain regions are unavailable to you, that's possibly due to Microsoft trying to ensure a reasonable amount of latency. For example, Australian regions are available only to users with a billing address in Australia or New Zealand.
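To make the non-reciprocal pairing concrete, here is a minimal sketch that models region pairs as a lookup table. The table is only a small illustrative subset, not a complete or current list of Azure pairs, and the function name is an assumption for this example.

```python
# Illustrative subset of region pairs: each key maps to the region it replicates to.
PAIRED_REGION = {
    "North Europe": "West Europe",
    "West Europe": "North Europe",
    "North Central US": "South Central US",
    "South Central US": "North Central US",
    "Brazil South": "South Central US",  # one-way pairing, as described above
}


def is_reciprocal(region):
    """True when the region's partner points back at the region itself."""
    partner = PAIRED_REGION[region]
    return PAIRED_REGION.get(partner) == region


brazil_is_reciprocal = is_reciprocal("Brazil South")
europe_is_reciprocal = is_reciprocal("West Europe")
```

In this table, West Europe and North Europe point at each other, while Brazil South points at South Central U.S. without being pointed back at.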
So we've talked about regions, though we haven't covered services. Not all services are available in all regions. When a service is first rolled out, it's deployed only to some regions. After an initial period of time, the service is declared generally available, and then it becomes available in most regions.
Running datacenters on-premises is different from running the datacenters used by Cloud providers. The traditional key performance indicator for on-premises datacenters was "mean time between failures," usually abbreviated MTBF. And to accomplish the goal of increasing the time between failures, high-quality hardware was essential. However, the Cloud model has changed things. Cloud datacenters are big, and the traditional principles used to build on-premises facilities just aren't applicable here. Azure is made up of hundreds of thousands of servers, and that means failure of some sort is going to be likely. So using high-quality hardware in this case isn't as important, because if something breaks, it's likely there's another node running that can handle that workload. This makes automation a very important part of Cloud operations, because if hardware fails, then whatever workload was using that hardware needs to be automatically moved to another node. So with lots of hardware and a high level of automation, we have a new key performance indicator that matters more than mean time between failures. That indicator is mean time to recover, and that's abbreviated MTTR. MTTR measures how long it takes to recover from a failure, starting from the moment that failure occurs. With automated processes that can handle moving services to healthy nodes, we can achieve high availability with inexpensive hardware, and so MTTR is the key performance indicator for Cloud datacenters. We know and accept that things will break. However, because there's so much available hardware to swap over to, the time to recover from any given failure is really pretty low.
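To see why a short recovery time can matter more than a long time between failures, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The hour figures are made-up numbers for illustration, not Azure measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time a system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical on-premises setup: premium hardware fails rarely,
# but each failure takes a working day to repair by hand.
premium_manual = availability(mtbf_hours=10_000, mttr_hours=8)

# Hypothetical Cloud setup: cheaper hardware fails ten times as often,
# but automation moves the workload to a healthy node in about three minutes.
commodity_automated = availability(mtbf_hours=1_000, mttr_hours=0.05)
```

With these assumed inputs, the premium setup comes out at roughly 99.92% and the automated one at roughly 99.995%, even though its hardware fails far more often.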
Just because the datacenter is highly automated doesn't mean that there won't be people around to handle some tasks. And since there will be people working in the datacenter and around your data, the question of data access comes up. Datacenter personnel aren't allowed to just access customer data freely. There are strict policies about customer data management. There are exceptions. For example, if you open a support issue about a service, then support may need to look at your data to help resolve it. In that case, you can grant them access. However, that access is going to be limited to a specific time frame and fully monitored.
Okay, let's switch gears a bit and talk about how Azure helps us to mitigate the impact of failures. Failures inside datacenters can impact anything from a single machine to multiple racks. That's why Azure has the concepts of fault domains and update domains. Both concepts involve groupings of physical machines. Every physical machine, and the virtual machines running on it, lives inside a fault domain and an update domain. An update domain is a group of machines that can be updated at the same time. More accurately, it's a group of machines that can be rebooted at the same time. The word update in this context refers to the updates made by Microsoft, typically to the underlying infrastructure. During the update process, a machine may become unavailable, so separate update domains ensure that not all of our servers will be down at the same time. A fault domain is a group of machines that can fail together. This typically corresponds to a server rack. The machines can fail together because they share the same power supplies, cooling systems, networking, etc. So update domains deal with planned maintenance, and fault domains deal with unplanned outages. Outages can impact the service-level agreement, usually called an SLA, which is the guaranteed share of time that a service runs without an outage, expressed as a percentage. To get an SLA of 99.95% for a service running on a VM, you'll need to deploy the service on at least two VMs, and those VMs need to be in the same availability set. Two or more VMs can join the same availability set, which ensures that the machines don't all live in the same fault domain or in the same update domain. Using an availability set will mitigate the risk of an outage.
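Here is a small sketch of what spreading VMs across fault domains and update domains looks like. Azure's real placement logic isn't covered in this lesson, so the round-robin assignment, the domain counts (2 fault domains, 5 update domains), and the VM names are all assumptions chosen for illustration.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault_domain, update_domain) pair, round-robin.

    This is an assumed placement strategy, not Azure's actual algorithm.
    """
    placement = {}
    for index, name in enumerate(vm_names):
        placement[name] = (index % fault_domains, index % update_domains)
    return placement


def survivors(placement, failed_fault_domain):
    """List the VMs still running after one whole fault domain goes down."""
    return [
        name
        for name, (fault_domain, _update_domain) in placement.items()
        if fault_domain != failed_fault_domain
    ]


# Two hypothetical VMs in one availability set.
placement = place_vms(["web-0", "web-1"])
still_up_if_fd0_fails = survivors(placement, failed_fault_domain=0)
still_up_if_fd1_fails = survivors(placement, failed_fault_domain=1)
```

Because the two VMs land in different fault domains, losing either rack in this sketch still leaves one VM running. The same reasoning applies to rebooting one update domain at a time.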
If you have two VMs in the same availability set with different fault domains and you do experience an outage, it's likely that it's due to a regional failure. Regional failures aren't common, but they can and do happen, and that's where the cross-region replication we talked about earlier helps.
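To put rough numbers on that, here is a back-of-the-envelope sketch. It uses 99.95% purely as an example input, and it assumes that the two regions fail independently and that failover is instant. Those are simplifying assumptions for the arithmetic, not an Azure guarantee.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Hours per year a service may be down at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(availability, copies):
    """Availability when the service is up as long as any one copy is up.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


single_region = 0.9995
two_regions = combined_availability(single_region, copies=2)

single_region_downtime = downtime_hours_per_year(single_region)
two_region_downtime = downtime_hours_per_year(two_regions)
```

Under these assumptions, a single deployment at 99.95% allows a bit over four hours of downtime in a year, while two independent copies bring that down to a few seconds.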
Alright, we've covered a lot of information in this lesson. In our next lesson, we're gonna be talking about designing virtual networks. So if you're ready to keep learning, then let's get started with the next lesson.
About the Author
Ben Lambert is the Director of Engineering and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps.
When he’s not building the first platform to run and measure enterprise transformation initiatives at Cloud Academy, he’s hiking, camping, or creating video games.