Disaster Recovery
Start course

This course explores the compute, storage, and networking services available on Google Cloud Platform and how you can use them to construct an enterprise-class application architecture.

Learning Objectives

  • Understand how to apply enterprise principles to a design
  • Understand how to take an organization’s requirements and translate them into the appropriate compute, storage, and network components in Google Cloud Platform
  • Design a highly available solution that is capable of recovering from disasters

Intended Audience

This course is intended for anyone who wants to learn more about Google Cloud Platform.


To get the most from this course, you should already have a basic understanding of Google Cloud Platform components.




In an earlier lesson, we covered how to design a highly available architecture that will keep running even if an instance fails, by using load balancers, instance groups, and redundant databases. However, there are more catastrophic events that might occur. I’m not talking about an entire city getting destroyed or anything like that (although it would be good to have an architecture that could handle that). But much smaller incidents can be disastrous too. For example, one of your databases could become corrupt. This is actually worse than the database server going down because it may take a while before you realize there’s a problem, and in the meantime, the corruption problem could get worse.

To recover from this sort of disaster, you need backups along with transactional log files from the corrupted database. That way you can roll back to a known-good state. Each type of database has its own method for doing this.

If you’re using Cloud SQL to run a MySQL database (which we are for the interior design application), then you should enable automated backups and point-in-time recovery. Then if your database becomes corrupt, you can restore it from a backup or use point-in-time recovery to bring the database back to a specific point in time. 

If you’re using Cloud SQL for SQL Server, then you should enable automated backups. At this time, Cloud SQL does not support point-in-time recovery for SQL Server, so you can only restore a database to the point when a specific backup was taken. For both types of databases, Cloud SQL retains up to 7 automated backups for each instance.

If you’re hosting a database on Compute Engine instances directly, then you’ll have to configure backups and transaction logging yourself. For example, suppose that instead of using Cloud SQL for SQL Server, we ran SQL Server on a Compute Engine instance. Then we’d need to set up our own disaster recovery solution for it. Luckily, Google has a very detailed white paper on this topic. I’ll give you the highlights.

First, set up an automated task that copies the SQL Server database backups to Google Cloud Storage. This is where we would finally need a service account because instances can’t write to Cloud Storage by default. The SQL Server instances need to have a service account with the Storage Object Creator role. Another way to do it would be to set a Cloud Storage access scope for the instance, but service accounts are more flexible.

Once the database is being backed up, then if disaster strikes, you would spin up a new SQL Server instance. Either use one of Google’s preconfigured SQL Server images or your own custom disk image. It doesn’t mention this in the whitepaper, but it’s the sensible thing to do and I’ll talk about it more in a minute. Next, you can use an open-source script to restore the database and re-execute the events in the log files up to the point in time desired.

When you’re designing a disaster recovery solution, you need to consider RPO and RTO. RPO stands for Recovery Point Objective. This is the maximum length of time when data can be lost. It affects your backup and recovery strategy because, for example, if it’s acceptable to lose an entire day’s worth of work, then you can just recover using the previous night’s backups. If you have a short RPO, which is usually the case, then you need to make sure you are constantly backing up your data, and when recovering from database corruption, you have to carefully consider which point in time to recover to.

RTO stands for Recovery Time Objective. This is the maximum length of time that your application can be offline and still meet the service levels your customers expect (usually in a service level agreement).

In the SQL Server example, I suggested using either one of Google’s preconfigured SQL Server images or your own custom disk image that has SQL Server installed and configured. The advantage of having a custom disk image is that it helps you meet your recovery time objective because it reduces the amount of time it takes to get a new SQL Server instance running. If you have to configure SQL Server manually, that could significantly impact how long it takes to recover from a disaster.

As with everything, though, there are tradeoffs. If your SQL Server implementation is customized, then you’ll have to weigh the benefits of fast recovery time against the maintenance effort required to keep your custom image up-to-date . If you have a very short RTO, then you may have no choice but to maintain a custom disk image. You might be able to ease the maintenance required, though, by using a startup script to perform some of the customization. Since the startup script resides on either the metadata server or Cloud Storage, you can change it without having to create a new disk image.

In some cases, you may want to run an application from your own data center or from another cloud platform and use Google Cloud as a disaster recovery solution. There are many ways you could do this, but I’ll go over a couple of common designs.

The first way is to continuously replicate your database to an instance on Google Cloud.  Then you would set up a monitoring service that would watch for failures. In the event of a disaster, the monitoring service would trigger a spin-up of an instance group and load balancer for the web tier of the application. The only part you would need to do manually is to change the DNS record to point to the load balancer’s IP address. You could use Cloud DNS or another DNS service for this.

This is already a low-cost solution because the only Google Cloud resource that needs to run all the time is the database instance. But you can reduce the costs even further by running the database on the smallest machine type capable of running the database service. Then if there’s a disaster, you would delete the instance, but with the option to keep the persistent disk, and spin up a bigger instance with the saved disk attached. Of course, this solution would require more manual intervention and would lengthen your downtime, so you wouldn’t want to do this if you have a short RTO.

If you want to reduce your downtime as much as possible, or even keep running in the event of hardware failures, you could serve your application from both your on-premises environment and your Google Cloud environment at all times.  That way if you have an on-premise failure, the Google Cloud environment would already be running and serving customers. It would just need to scale up to handle the extra load, which would be automatic if you use an autoscaling instance group.

To make this hybrid solution work, you would need to use a DNS service that supports weighted routing, so it could split incoming traffic between the two environments. In the event of a failure, you would need to disable DNS routing to the failed environment.

And that’s it for disaster recovery.

About the Author
Learning Paths

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).