Always On Availability Groups
High availability disaster recovery (HADR) is an integral part of any organization’s data strategy. It is also an essential string to every DBA's bow. In an online world that operates 24 hours a day, going offline or losing customers' data cannot be tolerated. This course examines the features that Azure provides to help you make sure your SQL databases, whether they are managed in the cloud or on-premise, are not the point of failure in your systems.
High availability disaster recovery encompasses two fundamental concepts. Firstly, how to minimize the amount of time your databases will be offline in the event of unforeseen events like hardware failures, power outages, or any number of natural disasters. Secondly, it looks at how to minimize data loss when any of these events occur. It is one thing to be offline, but it is another thing to lose data that is already in your custody.
In this course, we will cover the full range of Azure options from turnkey to custom and do-it-yourself solutions. If you have any feedback relating to this course, feel free to reach out to us at email@example.com.
- Understand the concepts and elements of high availability and disaster recovery
- Learn about hyperscale databases and how they are used
- Learn how to combine on-premise with the cloud to replicate and sync your data
- Understand what always on availability groups are and how they operate
- Implement geo-replication and failover groups
- Learn about the wide range of options you have for backing up your databases with Azure
This course is intended for anyone who wants to implement high availability and disaster recovery procedures in place for their Azure SQL databases.
This is an intermediate to advanced course and so to get the most out of it, you should be familiar with SQL Server management studio, database operations like backup and restore, and T-SQL, and also have some familiarity with basic active directory concepts like users and permissions.
In terms of high availability disaster recovery from on-premise to the cloud, I have left the best to last. It's called Always On availability groups, and it's built on Windows failover cluster technology, but the replica, or data copy in the cloud, is SQL Server running on a virtual machine. So, it is essentially an on-premise technology that can be extended to the cloud, and forms the basis of the business-critical and premium service tier high availability and disaster recovery technology.
Always On availability groups can be likened to multiple databases being mirrored simultaneously. There is a primary set of databases which has read-write access, and there can be up to eight read-only replicas mirroring the primary with several levels of synchronicity. Not only can you achieve very high levels of redundancy distributed across data centers and regions, but because your secondary replicas have read access you can lighten the load on the primary databases by distributing read access across the replicas, or offloading backup tasks to secondary databases.
As you can imagine, setting up multiple synchronized and distributed databases is not a trivial task, and that is what we are going to investigate next. Microsoft says, "An availability group fails over at the availability replica level." Not to state the obvious, but all this means is that when your primary databases failover, they will failover to one of your secondary replicas that are called the failover target.
Essentially, you could think of the failover scenario as database musical chairs. At any one time, there is a primary database or databases. For whatever reason, whether that is an outage due to natural disaster or some planned administrative task, any one of the databases in the availability group can become the primary. The database or databases that were primary now become a secondary replica. There are two availability modes, synchronous and asynchronous, and both pretty much do as advertised.
Synchronous commit mode will ensure that the primary and replicas are fully synchronized by including them in all of the transaction, that is all inserts and updates and deletes on the replicas must be successful for the transaction to commit. This gives you the highest level of data protection and availability, but there is potentially a performance price to pay depending on the network speed and latency between your servers.
As of SQL Server 2019, synchronous commit will support one primary and four secondaries, so five replicas in total. This is up from three replicas in SQL Server 2017. Asynchronous commit also does exactly what it says. The transactional data is applied independently to the replicas. So obviously this has the potential for the replicas to fall out of synchronization. Deciding which mode to use really becomes a trade-off between data integrity and performance, and is very much dependent on your requirements.
If to be online all the time with no data loss is the paramount requirement, then you will choose a synchronous commit strategy. If performance is your goal and you can stand the risk of potential data loss, which would be comparatively small compared to other HADR technologies, then you may consider asynchronous. However, this is not an either-or situation, you can potentially have your cake and eat it too as we shall see shortly.
Failover modes determine how your availability group will behave in the event of a primary replica becoming unavailable. Automatic failover, as the name implies, will instigate a failover at the loss of the primary. As you might expect, both the primary and the target secondary replicas must have their failover mode set to automatic. Funnily enough, if either the primary or the secondary replicas are set to manual, there will be no automatic failover.
Now, automatic failover with no data loss doesn't happen by magic. It requires there to be synchronous commit replication between the primary and replicas for no data loss to happen. When you think about it, this makes sense, as asynchronous replication is not an all or nothing transactional scenario, that is the atomic property that is integral to relational databases is violated.
To summarize, for automatic failover to work, the failover mode for primary and secondary databases needs to be automatic, the databases need to be using synchronous commit replication, and both the primary and the target secondary databases must both be in a healthy synchronized state.
The above scenario can be triggered manually by a database administrator in what's called a planned manual failover. So, both the primary and the replicas need to be set up in synchronous commit mode, and both be synchronized. Even if the primary is off-line or crashed but the replication state is synchronized it is still possible to do a manual failover.
Of course, the other scenario is when your primary and replica databases aren't synchronized. In this situation, you have to do a forced failover. This must be done manually as you are overriding the default behavior and is typically the kind of action you would take in a disaster recovery scenario.
When we consider availability and failover modes, we have several permutations or combinations at our disposal. So as I said before, it is not an either-or situation when setting up a disaster recovery plan. It is possible to combine availability and failover modes to have a tiered disaster recovery architecture that spans geographical regions. In one data center, you may have primary and secondary replicas set up with automatic failover and synchronized commits, and then in another data center, in another region, you could have further secondary replicas with asynchronous commits and manual failover mode.
Each one of these combinations is called a failover set. In the first instance, you are protecting yourself from hardware failure while maintaining high data integrity and performance, and in the second instance, if the whole data center or region goes down, you still can get back up and running with potentially only a small loss of data. In theory, making a secondary replica the primary in an automatic failover process sounds straightforward, but there are several steps that need to be performed in a specific order to accomplish this.
Firstly, if the primary server instance is still running, then the primary databases must marked as disconnected and all client connections are terminated. The target secondary replica applies any log records waiting in recovery queues to bring itself up-to-date. The process of the secondary replica transitioning to the primary role, involves setting its state to "Not Synchronized". It rolls back any uncommitted transactions in the background while allowing clients to connect.
At the same time, any other secondary replicas now connect to the new primary replica and update the synchronization state to synchronized. Finally, when the former primary replica restarts or comes back online, it connects to the new primary and makes itself a secondary replica, synchronizing itself with the new primary.
Ideally, you would want more than one synchronized secondary replica as you have no idea how long the initial primary will be off-line. Here I'm going to demonstrate how you can set up Always On availability groups using SQL Server. Now, there are a couple of caveats to this exercise.
Firstly Always On availability groups are only available in the enterprise version of SQL Server, and can be only set up when running on Windows server, that is making use of Windows failover clusters. Secondly, when you set up your SQL Server instances, life will be a whole lot easier if the MSSQL service logs on with an Active Directory account rather than the default local account. Each server instance needs to have login credentials with other instances in the availability group and access to a file share. Not that this is a major issue as you can change it through the SQL Server configuration snap-in.
So here is my very basic network topology, I have three virtual Windows server machines running under Hyper-V. A domain controller and two database servers. Availability groups using failover cluster technology only works with machines joined to the same domain. This demonstration is all about always-on availability groups, so I'm not really going to touch on failover clusters in any detail except to show you the basics.
Firstly you need to make sure that the appropriate services have been installed on all the servers participating in the availability group. This is done by selecting failover clustering as a feature. The failover cluster feature needs to be installed on all the server replicates, and requires a server restart to complete the installation. I'll just quickly fast forward through the other service failover cluster installations.
Once you've done that, you can set up the cluster by going into the server manager dashboard and under Tools, selecting Failover Cluster Manager. You can use the validate cluster function to make sure you're replica machines are configured correctly before actually creating the cluster. I'll add the two database servers and create the cluster, which will end up being a machine and active directory computers with an IP address.
Cluster validation is also performed as part of the cluster creation process. Here I'll enter the IP address for the cluster, and create the failover cluster. Now that the failover cluster has been created, let's head over to one of the database servers and create the Always On high availability group using the wizard. We should be seeing Windows server failover cluster in the cluster type dropdown. The reason isn't there is that I haven't enabled Always On availability groups in SQL Servers properties. I'll just go and do that through the SQL Server configuration snapping for all the servers participating in the group.
One more thing to check or do is make sure you have an inbound firewall role that will allow the availability group service to communicate with each other's end point on port 5022. Now that it's done, I'll go back to the wizard and give the availability group a name with the correct cluster type selected.
Next, I'll add a database with a misspelled name. Before a database can be added to a group, it needs to be backed up to a location that the other server instances can access, in this case, a file share. Now that the database meets the prerequisites, I'll add the two database servers to the availability group with availability mode set to synchronous, and automatic failover.
For initial synchronization, I'm going to use the database backup I just created in the file share. I'll click Next to validate the group, and the Next again to create it. Before we look at the secondary database, I'll just show you the SQL Server account logins that are on each of the instances that allow them to interact with each other.
We can see the database under availability databases on the secondary server, and now I need to join it to the availability group. Just right click on it, and select Join To Availability Group, and click Okay. With the database joined, it will now show under the secondary service databases with the status Synchronized.
The next thing we need to do is add an availability group listener. I'll go to availability group listeners on the primary group server and add listener. I'll give it a name, specify a port to listen on, and I'll give it a static IP address. Let's see how we can add another database using TSQL commands. I'll create another database and back it up to the file share.
On the primary server, I'll add the new database with the ultra availability group, add database command. Next, I'll head over to the secondary server and restore the new database from the file share with the No Recovery option. On the secondary replica, we'll add the database to the availability group by setting its high availability disaster recovery availability group to our group with the altered database command.
Having added the second database, we can refresh SQL Server management studio and see it as in a synchronized state, good. Now, I'll do a manual failover with the ultra availability group failover command. When I refresh the servers, we can see they have swapped primary and secondary roles.
What happens if I shut down the now primary server? Well, I'll do that through Hyper-V. BEAST2019 is now the primary again, and HOWTARGET is offline, which was also reported by the Failover Cluster Manager that we can see within the nodes view. When I turned the VM back on, it assumes a secondary role as it should.
A quorum is the minimum number of members of an assembly or society that must be present at a meeting for the proceedings of that meeting to be valid. Windows failover cluster uses the concept of a quorum to decide what actions to take in the event of a node or network failure.
The problem that a quorum addresses is when there is a break in the network connecting nodes with each other and it becomes impossible for a node to decide what action to take as it is unaware of what's going on in other parts of the network. A quorum prevents the situation of multiple nodes trying to gain control of the primary role and in turn fight for ownership of shared resources.
As in real life, a casting vote is required to break the deadlock, therefore there needs to be an odd number of voters in a quorum. In the demos so far, I have only had two nodes, so to set up a quorum I need another vote from somewhere.
Quorum configuration within the cluster manager allows you to set up what's called "Witnesses" that can vote in the quorum. A witness can be a disk witness, a file share witness, or a cloud witness. Go to Failover Cluster Manager and connect to the cluster in the Windows server. Here I'm going to select more actions from the action menu, followed by configure cluster quorum settings.
Next, I'm going to add a quorum witness which will be a file share. I've created a file share on the domain controller, so not on either of the current nodes, and I've given full permissions to the how cluster active directory computer object.
Going back to the main summary screen of cluster manager, we can now see the file share witness under the cluster core resources. Within advanced quorum configuration of failover cluster management, you can manually manipulate a node's vote. This can be useful for preventing default behavior when performing a manual failover.
Advanced settings also allow you to let the cluster dynamically adjust the number of majority votes needed for a quorum. As nodes come on and go offline, the node total can vary between an odd and even number, the dynamic quorum property mitigates this type of variability. However, any nodes that have had their vote manually set will not be subject to the effects of dynamic quorum. In the event that a quorum vote prevents a cluster from starting, you can override that with Force Cluster Start from Failover Cluster Manager.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.