High availability and disaster recovery are key to ensuring reliable business continuity. While SAP workloads are mainly confined to Azure's infrastructure layer, it is still possible to utilize many Azure functions and features to enhance system reliability with relatively little effort. This course looks at when, where, and how to use Azure's built-in infrastructure redundancy to improve system resiliency and how various database high availability options are supported.
Learning Objectives
- Understand the key aspects of high availability and disaster recovery
- Learn about availability and availability zones
- Learn about Azure Site Recovery and how to implement it through the Azure portal
- Learn how to set up an internal load balancer in the context of SAP workloads
- Understand the Azure support options for Pacemaker and STONITH
- Learn how to implement Data Guard mirroring via the Azure CLI
- Set up Windows Failover Cluster and SQL Server Always On through the Azure portal
Intended Audience
This course is intended for anyone who wants to use Azure's built-in infrastructure redundancy to enhance the reliability and resiliency of their SAP workloads.
Prerequisites
To get the most out of this course, you should be familiar with Azure, Azure CLI, SAP, SQL Server, and STONITH.
High availability within an Azure data center is easily enabled through availability sets, the core mechanism for handling operating system updates and localized hardware faults. An availability set is two or more virtual machines assigned to fault and update domains. Virtual machines that share a power source and a network switch are in the same fault domain. VMs belonging to the same availability set can be spread across up to three fault domains, protecting you against hardware or power failure within a rack.
Update domains enable high availability while scheduled maintenance takes place. Spreading VMs across update domains ensures that your system will continue to function while OS patching occurs, since all VMs within an update domain may be rebooted simultaneously. To ensure maximum resilience within a data center, you would use three fault domains, the maximum, and have two update domains within each fault domain. This is probably overkill, as you would hope Azure wouldn't apply OS patches while VMs are down, although patch application could itself be the cause of a VM failure. Use managed disks when creating virtual machines. Managed disks are aligned with the VM's fault domain, isolating the disks within the storage cluster and avoiding a single point of failure.
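If you prefer to script this rather than use the portal, a minimal Azure CLI sketch might look something like the following. The resource group, availability set, VM, and image names are all placeholders, and the update domain count of six reflects the two-per-fault-domain layout described above.

```bash
# Availability set with the maximum three fault domains and six update domains
# (roughly two update domains per fault domain).
az vm availability-set create \
  --resource-group rg-sap \
  --name avset-sapapp \
  --platform-fault-domain-count 3 \
  --platform-update-domain-count 6

# Create a VM inside the availability set; managed disks are used by default.
az vm create \
  --resource-group rg-sap \
  --name sapapp01 \
  --image Ubuntu2204 \
  --availability-set avset-sapapp \
  --admin-username azureuser \
  --generate-ssh-keys
```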
Proximity placement groups, which collocate VMs physically near each other to reduce latency, need to be set up in a particular order when used with availability sets. It is the availability set, rather than the individual virtual machines, that is assigned to the proximity placement group.
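That ordering translates into CLI terms roughly as follows: create the proximity placement group first, then create the availability set with a reference to it, and only then create the VMs inside the availability set. The names here are illustrative.

```bash
# 1. Create the proximity placement group first.
az ppg create \
  --resource-group rg-sap \
  --name ppg-sap \
  --location westus2

# 2. Assign the availability set (not the individual VMs) to the PPG.
az vm availability-set create \
  --resource-group rg-sap \
  --name avset-sapdb \
  --ppg ppg-sap \
  --platform-fault-domain-count 3 \
  --platform-update-domain-count 6

# 3. VMs created in this availability set inherit the PPG placement.
az vm create \
  --resource-group rg-sap \
  --name sapdb01 \
  --image Ubuntu2204 \
  --availability-set avset-sapdb \
  --admin-username azureuser \
  --generate-ssh-keys
```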
There is no charge for availability set functionality, only for the virtual machines in the set.
While availability sets operate at the data center level to ensure resiliency, availability zones operate across geographically proximate data centers. The primary purpose of availability zones is to protect against the failure of a whole data center. As the data centers in a region's availability zones tend to be located within a few miles of each other, network latency between them is lower than between data centers in different Azure regions. This proximity means availability zones can double as both a high availability and a disaster recovery solution. Availability sets and zones are mutually exclusive: a virtual machine can belong to either an availability set or an availability zone, but not both. As you would expect, a proximity placement group cannot span availability zones.
Creating an availability set requires specifying the number of fault and update domains; nothing else needs to be done. The same can't be said for availability zones. In a high-availability configuration, load balancing, replication, and failover need to be implemented between zones.
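As a rough illustration of the load-balancing piece, the sketch below creates an internal Standard SKU load balancer (Standard is zone redundant by default) with a health probe and an HA-ports rule of the kind typically placed in front of clustered SAP central services. All names, IP addresses, and the probe port are assumptions for the example.

```bash
# Internal Standard load balancer; the Standard SKU is zone redundant by default.
az network lb create \
  --resource-group rg-sap \
  --name lb-sap-ilb \
  --sku Standard \
  --vnet-name vnet-sap \
  --subnet subnet-sap \
  --frontend-ip-name fe-sapascs \
  --backend-pool-name be-sapascs \
  --private-ip-address 10.0.0.10

# Health probe the cluster answers on (port is a placeholder).
az network lb probe create \
  --resource-group rg-sap \
  --lb-name lb-sap-ilb \
  --name probe-sapascs \
  --protocol Tcp \
  --port 62500

# HA-ports rule with floating IP, as typically used for clustered SAP services.
az network lb rule create \
  --resource-group rg-sap \
  --lb-name lb-sap-ilb \
  --name rule-sapascs \
  --protocol All \
  --frontend-port 0 \
  --backend-port 0 \
  --frontend-ip-name fe-sapascs \
  --backend-pool-name be-sapascs \
  --probe-name probe-sapascs \
  --floating-ip true \
  --idle-timeout 30
```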
Azure Site Recovery is a service that replicates physical or virtual machines, enabling you to fail over to the backup machine in the case of an outage on the primary server. The service can be used with either on-premises or cloud-based servers as the primary machines; on the face of it, a straightforward business continuity solution. However, there are several factors to consider. Keeping multiple machines in sync doesn't happen by magic, so cost and synchronization latency need to be considered. Then there is the issue of the mirror machines' location. What if the whole data center goes down, or a zone, or a region? When replicating to virtual machines outside of an Azure region, will data sovereignty obligations be compromised? How are disks and data replicated?
Azure Site Recovery supports two scenarios in terms of mirror or backup machine locations: replication between Azure regions or, where supported, between availability zones. Availability zones are where VMs are replicated between geographically adjacent data centers within a region, connected by a high-performance, low-latency network. Not all Azure regions have data centers configured as availability zones, but you can check which do on the Microsoft website, and the number grows with the expansion of Azure data centers. And, of course, there is replication between regions.
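As an alternative to the website, one way to check from the CLI which VM sizes support zones in a given region is sketched below; the region name is just an example.

```bash
# List VM SKUs that support availability zones in a region.
az vm list-skus --location westus2 --resource-type virtualMachines --zone --output table
```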
An availability zone configuration can be used as a DR solution where the primary Azure region doesn't have a paired region that shares the same data sovereignty rules. In this scenario, you're putting all your eggs in adjacent baskets, so to speak, but RTO and potentially RPO will be lower. When failing over to another zone within the same region, you can choose to use the same virtual network, thereby maintaining network addressability.
Azure Site Recovery takes care of replicating virtual machines, but failing over from the primary to the backup is a manual operation.
I want to show you how to set up virtual machine replication using Azure Site Recovery as a disaster recovery solution. This will involve replicating a source or primary VM to another availability zone, performing a test failover, a production failover, and then failing back to the primary machine from the mirror. I'll start by creating a virtual machine called sapappserver and selecting availability zone as my availability option. There are three availability zones within the selected region to choose from, and I'll stick with 1, the default. Next, the disks. You must use managed disks when placing a VM in an availability zone. Even though you don't have the local VM redundancy of availability sets, disks are replicated locally by default. Apart from these settings, the VM is created using the default settings.
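For reference, a rough CLI equivalent of those portal steps might look like this; the resource group, image alias, and credentials are placeholders, and managed disks are used automatically.

```bash
# Create the source VM pinned to availability zone 1.
az vm create \
  --resource-group rg-sap-dr \
  --name sapappserver \
  --image Win2019Datacenter \
  --zone 1 \
  --admin-username azureuser \
  --admin-password '<your-password>'
```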
Once the VM has been deployed, we can see it is in availability zone 1. To set up replication using Azure Site Recovery, go down to Disaster Recovery under Operations in the left-hand menu. As you can see, we can set up disaster recovery between zones or regions with Azure Site Recovery. We can use the same virtual network when configuring between zones, but between regions, we'll need a VNet in the target region. I'll select yes for disaster recovery between availability zones. Instantly, the map changes to show only the West US 2 region, and we're reminded that not all regions support availability zones.
Under advanced settings, we see the target availability zone is 3, and a new resource group will be created with "asr" appended to the current resource group name. While proximity placement groups can't span availability zones, the target VM can be placed into a proximity group in the target zone. In a DR scenario where a whole zone goes down, you'd expect all VMs within a proximity placement group to fail over to their respective clones within the mirror PPG.
Click next to see a summary of the replication to take place. At the top are the source and target VM details, and below them, the name of the replica managed disk to be created. I'll hit start replication to kick off the process. The replication does take a few minutes, about 10 in this case. When the synchronization starts, we get a graphical representation in the form of a map describing the infrastructure and progress status. The RPO figure is an estimate and is well below the guaranteed two-hour maximum, thankfully.
We can also see the replication health within replicated items of the recovery services vault. The warning is saying we haven't conducted a test failover yet, so let's do that. Back in the VM's disaster recovery window, I'll click Test Failover. We've got the failover direction from zone 1 to 3, which is correct, and the recovery point is the latest. Recall I said that zone-to-zone recovery could use the VM's existing VNet, unlike region-to-region? The flip side of that is that you can't use the production virtual network for testing a failover. Under Azure virtual network, I'll select failovertest_vnet and click OK. That'll take a little while, so I'll fast forward until it's done. Once the test has been completed, we can see a test VM created in the target resource group. I'll clean up the failover test by clicking the cleanup button, checking delete failover test VMs, and clicking OK.
Now that we've tested successfully, let's do an actual failover by clicking the appropriate button. I'll go with the latest recovery point, have the VM shut down before failing over, and click OK. After the failover has been completed, we can go and look at virtual machines and see the mirror machine is up and running and the source VM has stopped. Looking at the target machine with the same name, we can see that it is indeed located in availability zone 3. The final failover act is committing the failover; it just involves clicking commit and then OK. Now that we're running in zone 3, which has for all intents and purposes become our source machine, we need to protect it by replicating back to the original zone 1 VM. We do this by clicking the re-protect button on the zone 3 machine, followed by OK, which starts the synchronization process. The infrastructure view shows that zones 1 and 3 have swapped places, reflecting the failover. There's only one thing left to do, and that's failing back to the original configuration. Nothing special. We fail over the zone 3 machine, committing the failover as before, and once that has completed, re-protect our original primary zone 1 virtual machine.
While Windows Server 2016 and 2019 are supported, Microsoft "strongly" recommends using the 2019 Datacenter version, as its Failover Cluster Service is Azure aware. Other cluster configuration improvements are around the cluster network name and IP address, and internal load balancing. The basic high availability design is two VMs sharing two Azure shared disks. This configuration supports Enqueue Replication Server 1 and 2, but the Enqueue Replication Server versions cannot be mixed within the cluster. All VMs in the cluster must be in the same proximity placement group. The Azure shared disk maxShares parameter stipulates how many nodes in the cluster can share the disk; typically, this would be set to two.
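To illustrate the maxShares setting, here is a hedged CLI sketch that creates a shared premium data disk and attaches it to both cluster nodes; the disk size, resource group, and VM names are assumptions for the example.

```bash
# Create a premium SSD data disk that two cluster nodes can attach simultaneously.
az disk create \
  --resource-group rg-sap-cluster \
  --name shared-data-disk \
  --size-gb 1024 \
  --sku Premium_LRS \
  --max-shares 2

# Attach the same disk to both nodes of the cluster.
az vm disk attach --resource-group rg-sap-cluster --vm-name sapascs-node1 --name shared-data-disk
az vm disk attach --resource-group rg-sap-cluster --vm-name sapascs-node2 --name shared-data-disk
```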
Clustering central services on Linux using Pacemaker is, in theory, a similar, if slightly more complex, scenario than with Windows Server. The shared disks are replaced by NFS shares that can be hosted on a highly available NFS file server or on Azure NetApp Files NFS volumes. Like the Windows configuration, it can support Enqueue Replication Server 1 and 2, but not both in the same cluster. Both SUSE and Red Hat distributions support a maximum of five nodes in a cluster, but at this time, multi-SID clustering is only available for ABAP central services and Enqueue Replication Server.
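As a small illustration of how one of those NFS shares might be brought under cluster control, the following crmsh snippet (SUSE syntax) defines a Filesystem resource for an NFS mount. The server address and export path are placeholders, and a real SAP cluster would include many more resources and constraints than this sketch shows.

```bash
# Define a Pacemaker Filesystem resource that mounts the NFS share used for /sapmnt.
sudo crm configure primitive fs_sapmnt ocf:heartbeat:Filesystem \
  params device="10.0.0.4:/sapmnt" directory="/sapmnt" fstype="nfs4" \
  op monitor interval="20s"
```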
Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile, and desktop technologies, good quality, reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.