Disaster Recovery
Difficulty: Intermediate
Duration: 1h 8m
Students: 7861
Ratings: 4.7/5
Description

Google Cloud Platform (GCP) lets organizations take advantage of the powerful network and technologies that Google uses to deliver its own products. Global companies like Coca-Cola and cutting-edge technology stars like Spotify are already running sophisticated applications on GCP. This course will help you design an enterprise-class Google Cloud infrastructure for your own organization.

When you architect an infrastructure for mission-critical applications, not only do you need to choose the appropriate compute, storage, and networking components, but you also need to design for security, high availability, regulatory compliance, and disaster recovery. This course uses a case study to demonstrate how to apply these design principles to meet real-world requirements.

Learning Objectives

  • Map compute, storage, and networking requirements to Google Cloud Platform services
  • Create designs for high availability and disaster recovery
  • Use appropriate authentication, roles, service accounts, and data protection
  • Create a design to comply with regulatory requirements

Intended Audience

This course is intended for anyone looking to design and build an enterprise-class Google Cloud Platform infrastructure for their own organization.

Prerequisites

To get the most out of this course, you should have some basic knowledge of Google Cloud Platform.

Transcript

In an earlier lesson, we covered how to design a highly available architecture that will keep running even if an instance fails, by using load balancers, instance groups, and redundant databases. However, there are more catastrophic events that might occur. I’m not talking about an entire city getting destroyed or anything like that (although it would be good to have an architecture that could handle that). But much smaller incidents can be disastrous too. For example, one of your databases could become corrupt. This is actually worse than the database server going down because it may take a while before you realize there’s a problem, and in the meantime, the corruption problem could get worse.

To recover from this sort of disaster, you need backups along with transactional log files from the corrupted database. That way you can roll back to a known-good state. Each type of database has its own method for doing this.

If you’re using Cloud SQL to run a MySQL database (which we are for the interior design application), then you should enable automated backups and point-in-time recovery. Then if your database becomes corrupt, you can restore it from a backup or use point-in-time recovery to bring the database back to a specific point in time. 
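Just to make that concrete, here's a rough Python sketch of turning those settings on through the Cloud SQL Admin API. The project and instance names are placeholders, and in practice you'd probably do this from the console or with gcloud instead.

# Rough sketch: enable automated backups and point-in-time recovery on a
# Cloud SQL for MySQL instance using the Cloud SQL Admin API.
# The project and instance names are placeholders.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")

body = {
    "settings": {
        "backupConfiguration": {
            "enabled": True,           # automated daily backups
            "startTime": "03:00",      # backup window (UTC)
            "binaryLogEnabled": True,  # binary logging enables point-in-time recovery for MySQL
        }
    }
}

sqladmin.instances().patch(
    project="my-project",           # placeholder project ID
    instance="interior-design-db",  # placeholder instance name
    body=body,
).execute()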

If you’re using Cloud SQL for SQL Server, then you should enable automated backups. At this time, Cloud SQL does not support point-in-time recovery for SQL Server, so you can only restore a database to the point when a specific backup was taken. For both types of databases, Cloud SQL retains up to 7 automated backups for each instance.
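Restoring works through the same API. Here's a hedged sketch, again with placeholder names, that finds the most recent automated backup for an instance and restores it.

# Rough sketch: list backup runs for a Cloud SQL instance and restore the
# most recent one. Assumes at least one backup exists and that runs are
# returned newest first. Project and instance names are placeholders.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")
project, instance = "my-project", "interior-design-db"

runs = sqladmin.backupRuns().list(project=project, instance=instance).execute()
latest = runs["items"][0]

restore_body = {
    "restoreBackupContext": {
        "backupRunId": latest["id"],
        "instanceId": instance,
    }
}
sqladmin.instances().restoreBackup(
    project=project, instance=instance, body=restore_body
).execute()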

If you’re hosting a database on Compute Engine instances directly, then you’ll have to configure backups and transaction logging yourself. For example, suppose that instead of using Cloud SQL for SQL Server, we ran SQL Server on a Compute Engine instance. Then we’d need to set up our own disaster recovery solution for it. Luckily, Google has a very detailed white paper on this topic. I’ll give you the highlights.

First, set up an automated task that copies the SQL Server database backups to Google Cloud Storage. This is where we would finally need a service account because instances can’t write to Cloud Storage by default. The SQL Server instances need to have a service account with the Storage Object Creator role. Another way to do it would be to set a Cloud Storage access scope for the instance, but service accounts are more flexible.
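Here's a rough sketch of what that copy step could look like in Python with the Cloud Storage client library. The bucket name and backup file path are made up, and the upload only works if the instance's service account has that Storage Object Creator role.

# Rough sketch: upload a SQL Server backup file to Cloud Storage.
# Intended to run as a scheduled task on the database instance, using the
# instance's service account (which needs the Storage Object Creator role).
# The bucket name and file path are placeholders.
from datetime import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-sqlserver-backups")   # placeholder bucket

backup_file = r"D:\backups\appdb.bak"            # placeholder local backup path
object_name = f"appdb/{datetime.utcnow():%Y%m%d-%H%M}.bak"

bucket.blob(object_name).upload_from_filename(backup_file)
print(f"Uploaded {backup_file} to gs://{bucket.name}/{object_name}")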

Once the database is being backed up, then if disaster strikes, you would spin up a new SQL Server instance, using either one of Google’s preconfigured SQL Server images or your own custom disk image. The whitepaper doesn’t spell this out, but it’s the sensible thing to do, and I’ll talk about it more in a minute. Next, you can use an open-source script to restore the database and replay the events in the transaction log files up to the desired point in time.
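To give you an idea of the recovery step, here's a hedged sketch that creates a replacement instance from one of Google's public SQL Server image families using the Compute Engine API. The project, zone, machine type, and image family are just example values, and a real setup would also specify things like the service account and the right network.

# Rough sketch: create a replacement SQL Server instance from a public image
# family with the Compute Engine API. All names and sizes are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
project, zone = "my-project", "us-central1-a"

body = {
    "name": "sqlserver-dr-1",
    "machineType": f"zones/{zone}/machineTypes/n1-standard-4",
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            # One of Google's preconfigured SQL Server image families;
            # a custom disk image would go here instead.
            "sourceImage": "projects/windows-sql-cloud/global/images/family/sql-std-2019-win-2019",
        },
    }],
    "networkInterfaces": [{
        "network": "global/networks/default",
        "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
    }],
}

compute.instances().insert(project=project, zone=zone, body=body).execute()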

When you’re designing a disaster recovery solution, you need to consider RPO and RTO. RPO stands for Recovery Point Objective. This is the maximum amount of data loss you can tolerate, measured as a length of time. It affects your backup and recovery strategy because, for example, if it’s acceptable to lose an entire day’s worth of work, then you can just recover using the previous night’s backups. If you have a short RPO, which is usually the case, then you need to make sure you are constantly backing up your data, and when recovering from database corruption, you have to carefully consider which point in time to recover to.

RTO stands for Recovery Time Objective. This is the maximum length of time that your application can be offline and still meet the service levels your customers expect (usually in a service level agreement).
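As a back-of-the-envelope illustration of how RPO drives the schedule: with plain backups, the worst-case data loss is roughly the backup interval, so the interval has to be no longer than the RPO (transaction-log shipping or point-in-time recovery can shrink that window much further). Here's that arithmetic as a trivial sketch with made-up numbers.

# Toy illustration: a plain backup schedule meets an RPO only if the backup
# interval is no longer than the RPO. The numbers are made up.
rpo_hours = 1               # example: at most one hour of data loss is acceptable
backup_interval_hours = 24  # example: nightly backups only

worst_case_loss_hours = backup_interval_hours
print("Meets RPO" if worst_case_loss_hours <= rpo_hours else "Does not meet RPO")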

In the SQL Server example, I suggested using either one of Google’s preconfigured SQL Server images or your own custom disk image that has SQL Server installed and configured. The advantage of having a custom disk image is that it helps you meet your recovery time objective because it reduces the amount of time it takes to get a new SQL Server instance running. If you have to configure SQL Server manually, that could significantly impact how long it takes to recover from a disaster.

As with everything, though, there are tradeoffs. If your SQL Server implementation is customized, then you’ll have to weigh the benefits of fast recovery time against the maintenance effort required to keep your custom image up to date. If you have a very short RTO, then you may have no choice but to maintain a custom disk image. You might be able to ease the maintenance required, though, by using a startup script to perform some of the customization. Since the startup script resides on either the metadata server or Cloud Storage, you can change it without having to create a new disk image.
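Here's a rough sketch of that idea: keeping a small customization script in instance metadata so it can change without rebuilding the image. The names and the script itself are placeholders, and I'm using the Linux startup-script key; on a Windows SQL Server instance you'd use one of the windows-startup-script keys instead.

# Rough sketch: attach a startup script to an instance via metadata, so the
# customization can change without rebuilding the disk image.
# Project, zone, instance name, and the script itself are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
project, zone, name = "my-project", "us-central1-a", "sqlserver-dr-1"

script = """#! /bin/bash
# placeholder customization that would otherwise be baked into the image
echo "applying post-boot configuration" >> /var/log/dr-setup.log
"""

# setMetadata replaces the whole collection, so fetch the current metadata
# first to get its fingerprint and preserve any existing keys.
instance = compute.instances().get(project=project, zone=zone, instance=name).execute()
metadata = instance["metadata"]
items = [i for i in metadata.get("items", []) if i["key"] != "startup-script"]
items.append({"key": "startup-script", "value": script})

compute.instances().setMetadata(
    project=project, zone=zone, instance=name,
    body={"fingerprint": metadata["fingerprint"], "items": items},
).execute()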

In some cases, you may want to run an application from your own data center or from another cloud platform and use Google Cloud as a disaster recovery solution. There are many ways you could do this, but I’ll go over a couple of common designs.

The first way is to continuously replicate your database to an instance on Google Cloud.  Then you would set up a monitoring service that would watch for failures. In the event of a disaster, the monitoring service would trigger a spin-up of an instance group and load balancer for the web tier of the application. The only part you would need to do manually is to change the DNS record to point to the load balancer’s IP address. You could use Cloud DNS or another DNS service for this.
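The DNS change itself is small. Here's a hedged sketch using the Cloud DNS client library to repoint an A record at the load balancer's IP address; the zone, record name, and addresses are placeholders.

# Rough sketch: repoint an A record at the DR load balancer's IP with Cloud DNS.
# Zone name, record name, and IP addresses are placeholders.
from google.cloud import dns

client = dns.Client(project="my-project")
zone = client.zone("app-zone", "example.com.")  # placeholder managed zone

old = zone.resource_record_set("app.example.com.", "A", 300, ["203.0.113.10"])
new = zone.resource_record_set("app.example.com.", "A", 300, ["198.51.100.20"])

changes = zone.changes()
changes.delete_record_set(old)  # drop the record pointing at the failed environment
changes.add_record_set(new)     # add the record pointing at the GCP load balancer
changes.create()                # submit the change set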

This is already a low-cost solution because the only Google Cloud resource that needs to run all the time is the database instance. But you can reduce the costs even further by running the database on the smallest machine type capable of running the database service. Then if there’s a disaster, you would delete the instance, but with the option to keep the persistent disk, and spin up a bigger instance with the saved disk attached. Of course, this solution would require more manual intervention and would lengthen your downtime, so you wouldn’t want to do this if you have a short RTO.
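For that cost-reduction variant, the key detail is making sure the persistent disk outlives the instance. Here's a hedged sketch of the sequence with placeholder names: turn off auto-delete on the data disk, delete the small instance, and then attach the saved disk when creating the bigger one.

# Rough sketch: keep the database's persistent disk when deleting the small
# instance, then attach it to a larger replacement instance.
# Names, zone, machine type, and image are placeholders, and in practice you'd
# wait for each operation to finish before starting the next step.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
project, zone = "my-project", "us-central1-a"

# 1. Make sure the data disk survives instance deletion.
compute.instances().setDiskAutoDelete(
    project=project, zone=zone, instance="db-small",
    deviceName="db-data", autoDelete=False,
).execute()

# 2. Delete the small instance (the disk is kept).
compute.instances().delete(project=project, zone=zone, instance="db-small").execute()

# 3. Create a bigger instance with the saved disk attached as a data disk.
body = {
    "name": "db-large",
    "machineType": f"zones/{zone}/machineTypes/n1-highmem-8",
    "disks": [
        {"boot": True, "autoDelete": True,
         "initializeParams": {"sourceImage": "projects/debian-cloud/global/images/family/debian-12"}},
        {"source": f"projects/{project}/zones/{zone}/disks/db-data", "deviceName": "db-data"},
    ],
    "networkInterfaces": [{"network": "global/networks/default"}],
}
compute.instances().insert(project=project, zone=zone, body=body).execute()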

If you want to reduce your downtime as much as possible, or even keep running in the event of hardware failures, you could serve your application from both your on-premises environment and your Google Cloud environment at all times. That way if you have an on-premises failure, the Google Cloud environment would already be running and serving customers. It would just need to scale up to handle the extra load, which would be automatic if you use an autoscaling instance group.

To make this hybrid solution work, you would need to use a DNS service that supports weighted routing, so it could split incoming traffic between the two environments. In the event of a failure, you would need to disable DNS routing to the failed environment.
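A weighted record could look something like the sketch below, which uses the Cloud DNS API's weighted round-robin routing policy; the zone, record name, addresses, and weights are placeholders, and the routingPolicy fields are my assumption about the wrr policy schema. To fail an environment out, you'd update the record so that environment's weight is zero (or remove its entry).

# Rough sketch: create a weighted round-robin A record in Cloud DNS that
# splits traffic between the on-premises environment and Google Cloud.
# Zone, record name, IP addresses, and weights are placeholders.
from googleapiclient import discovery

dns = discovery.build("dns", "v1")
project, managed_zone = "my-project", "app-zone"

record = {
    "name": "app.example.com.",
    "type": "A",
    "ttl": 300,
    "routingPolicy": {
        "wrr": {
            "items": [
                {"weight": 0.8, "rrdatas": ["203.0.113.10"]},   # on-premises front end
                {"weight": 0.2, "rrdatas": ["198.51.100.20"]},  # Google Cloud load balancer
            ]
        }
    },
}

dns.changes().create(
    project=project, managedZone=managed_zone,
    body={"additions": [record]},
).execute()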

And that’s it for disaster recovery.

About the Author
Students: 201283
Courses: 97
Learning Paths: 167

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).