Failover in the Cloud
Start course
1h 6m

High availability disaster recovery (HADR) is an integral part of any organization’s data strategy. It is also an essential string to every DBA's bow. In an online world that operates 24 hours a day, going offline or losing customers' data cannot be tolerated. This course examines the features that Azure provides to help you make sure your SQL databases, whether they are managed in the cloud or on-premise, are not the point of failure in your systems.

High availability disaster recovery encompasses two fundamental concepts. Firstly, how to minimize the amount of time your databases will be offline in the event of unforeseen events like hardware failures, power outages, or any number of natural disasters. Secondly, it looks at how to minimize data loss when any of these events occur. It is one thing to be offline, but it is another thing to lose data that is already in your custody.

In this course, we will cover the full range of Azure options from turnkey to custom and do-it-yourself solutions. If you have any feedback relating to this course, feel free to reach out to us at

Learning Objectives

  • Understand the concepts and elements of high availability and disaster recovery
  • Learn about hyperscale databases and how they are used
  • Learn how to combine on-premise with the cloud to replicate and sync your data
  • Understand what always on availability groups are and how they operate
  • Implement geo-replication and failover groups
  • Learn about the wide range of options you have for backing up your databases with Azure

Intended Audience

This course is intended for anyone who wants to implement high availability and disaster recovery procedures in place for their Azure SQL databases.


This is an intermediate to advanced course and so to get the most out of it, you should be familiar with SQL Server management studio, database operations like backup and restore, and T-SQL, and also have some familiarity with basic active directory concepts like users and permissions.


So how does Azure SQL implement redundancy and replication across regions outside of business-critical and premium service triers? Azure SQL does implement replication, but for obvious reasons related to network latency over distance it does come with a few restrictions. Replication is asynchronous and failover must be manually initiated in the event of a catastrophic failure where regionally local alternatives are unavailable.

Geo-replicated databases in the failover group do not share the same endpoint, so this is also a hindrance to automatic failover. Although auto-failover groups will allow you to implement one endpoint across replicas. This replication does allow you to have up to 4 readable mirrored instances located in the same or different Azure regions, so it does provide a high level of redundancy.

As with always-on availability groups, you can offload backup and read-only workloads to your secondary databases. Not only can this be used as a high availability disaster recovery mechanism but could also be used for data migration or application upgrades. Even though the replication is asynchronous and is not transactional, it updates secondary databases by streaming the primary’s transaction logs, only complete transactions are ever replicated. Let’s see how we can set up active geo-replication through the Azure portal next.

If we go over to our Azure SQL Server instance and click on failover groups and then click add failover group we can go through the wizard and create a new group and then add a secondary server. My current server is on the east coast of the US so I’m going to have my second server on the West Coast.

Now I can hear you say, "But Hallam you said the failover was manual yet here we have a failover policy where we can select either automatic or manual." From a disaster recovery point of view, the automatic failover is not that useful as the minimum amount of time that it will take for your primary to switch to your secondary automatically is one hour. A lot can happen in an hour. For instance, any secondary databases within the same region that are part of a high availability solution will be triggered and operational before a database halfway around the world is able to intervene.

So for the purposes of a HADR strategy, let’s read the failover as manual. I’ll add one database to the failover group and then create it. As part of the creation, data from your primary database is migrated to the secondary one. I just connect to my new secondary read-only database through Management Studio and do a select just to confirm that the data is there and yes it is. Now I’ll just try and insert some data into the secondary and sure enough, I’m not able to do that because it is a read-only database.

Now if I go back to my primary database and insert a record into TransHeader we can see that that has been replicated to the secondary database quite quickly, that’s impressive. You can test the replication lag by querying the system table dm_geo_replication_link_status. The Microsoft documentation says that a nonzero value for replication lag seconds indicates a problem with replication.

Now if I go to the master database on my primary server and run a select from the system table geo-replication links I can see the partner server and partner database replication state and the role, which is primary, of the server I’m currently connected to. Going back to the secondary server and just confirming that it is in a read-only state, now I’ll go back to the portal and do a manual failover. This involves going into the failover group and just clicking on failover at the top of the page.

Now that my databases have traded roles I'll go back to Management Studio and do and insert on my new primary server. Going back to HowDataSync and doing a select from trans-header I can see those records have replicated across to the new secondary database.

Failover groups that are built on active geo-replication technology get us a little bit closer to high availability disaster recovery without going to the full-blown business-critical database tier. In addition to what we’ve seen with geo-replication, failover groups support SQL managed instance as well as Azure SQL databases.

In the event of a failover the connection endpoint doesn’t change courtesy of the Azure traffic manager, but in no sense of high availability is the failover automatic. The minimum one-hour grace period before an automatic failover will switch to a secondary still applies. The name of a failover group must be unique within the domain.

When you create a failover group the secondary databases are automatically created on the secondary server. The implication is that the database with the same name cannot already exist on the secondary server unless it is already a secondary database and, in that case, it will just be added to the failover group.

While it is possible to pick and choose databases from an elastic pool to be added to a failover group it is not possible to only replicate specific databases from a SQL managed instance. All user databases on a SQL managed instance will be replicated to the secondary server. When adding databases from an elastic pool you need to make sure that a pool with the same name exists on the secondary server. Failover groups must be created between SQL managed instances or Azure SQL servers in different regions. You can create multiple failover groups between two Azure SQL servers, but only one failover group between a pair of managed instances.

Unlike an on-premise scenario where your secondary server could be a lesser machine in terms of memory and CPU power, for both managed instances and Azure SQL databases, the secondary servers must be identical in terms of service tier.

Here we have a graphical representation of a failover group for Azure SQL databases. As you can see the failover group URL spans both regions, so whichever region is currently the primary will be channeled the right requests. If you decide to make use of the read-only functionality you can use that same URL amended with a secondary namespace, the actual word secondary, and once again those requests will be routed to whichever region is secondary.

Just bear in mind that your read-only data may not be fully concurrent all the time due to replication latency across regions, and it is also good practice to use ApplicationIntent=ReadOnly within your connection string to indicate your read-only intention. 

To achieve the same result using SQL managed instances you must make sure that the primary and secondary instances are in the same DNS zone. You must create gateways between your two virtual networks where the gateway type is VPN and the VPN type is route-based. As with Azure SQL database failover groups, you can use the secondary namespace within the URL and ApplicationIntent=ReadOnly within the connection string to perform reads on your secondary instance.


Course Introduction - Overview - High Availability and Disaster Recovery Concepts - Hyperscale - Combining On-Premises with the Cloud - Always On Availability Groups - Database Backups - Course Summary

About the Author
Learning Paths

Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a  Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.