Google Cloud Platform (GCP) lets organizations take advantage of the powerful network and technologies that Google uses to deliver its own products. Global companies like Coca-Cola and cutting-edge technology stars like Spotify are already running sophisticated applications on GCP. This course will help you design an enterprise-class Google Cloud infrastructure for your own organization.
When you architect an infrastructure for mission-critical applications, not only do you need to choose the appropriate compute, storage, and networking components, but you also need to design for security, high availability, regulatory compliance, and disaster recovery. This course uses a case study to demonstrate how to apply these design principles to meet real-world requirements.
- Map compute, storage, and network needs to Google Cloud Platform services
- Create designs for high availability and disaster recovery
- Use appropriate authentication, roles, service accounts, and data protection
- Create a design to comply with regulatory requirements
In an earlier lesson, we covered how to design a highly available architecture that will keep running even if an instance fails, by using load balancers, instance groups, and redundant databases. However, there are more catastrophic events that might occur. I'm not talking about an entire city getting destroyed, or anything like that. Although, it would be good to have an architecture that could handle that. But much smaller incidents can be disastrous, too. For example, one of your databases could become corrupt. This is actually worse than the database server going down, because it may take a while before you realize there's a problem. And in the meantime, the corruption problem could get worse.
To recover from this sort of disaster, you need backups, along with transactional log files, from the corrupted database. That way you can roll back to a known good state. Each type of database has its own method for doing this.
If you're using Cloud SQL to run a MySQL database, which we are for the interior design application, then you should enable automated backups and binary logging. Then if your database becomes corrupt, you can restore the most recent backup to a new Cloud SQL instance, and then re-execute the database events in the binary log up to the last known good point. Cloud SQL retains up to seven automated backups for each instance.
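The restore-then-replay idea can be sketched in a few lines of Python. This is only an illustration, not Cloud SQL's API: the `Backup` class and `plan_point_in_time_recovery` function are hypothetical names, and the sketch assumes you pick the newest backup that finished before the last known good point (a backup taken after the corruption began would not help).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Backup:
    backup_id: str         # hypothetical identifier for an automated backup
    finished_at: datetime  # when the backup completed


def plan_point_in_time_recovery(
    backups: Sequence[Backup], last_good: datetime
) -> Optional[Tuple[Backup, datetime, datetime]]:
    """Return (backup to restore, replay-from time, replay-to time).

    Returns None when no backup finished at or before the last known good point.
    """
    candidates = [b for b in backups if b.finished_at <= last_good]
    if not candidates:
        return None
    base = max(candidates, key=lambda b: b.finished_at)
    # Binary log events between the backup and last_good get re-executed.
    return base, base.finished_at, last_good
```

With nightly backups and corruption noticed mid-afternoon, this would select the previous night's backup and replay the log from that point up to the chosen time.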
If you're hosting a database on Compute Engine instances directly, then you'll have to configure backups and transaction logging yourself. For example, we have a Microsoft SQL Server in our payment processing environment, so we'll need to set up our own disaster recovery solution for it. Luckily, Google has a very detailed whitepaper on this topic. I'll give you the highlights.
First, set up an automated task that copies the SQL Server database backups to Google Cloud Storage. This is where we'll finally need a service account, because instances can't write to Cloud Storage by default. The SQL Server instances need to have a service account with the Storage Object Creator role. Another way to do it would be to set a Cloud Storage access scope for the instance. But service accounts are more flexible.
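As a rough sketch of what that copy task might look like, here is a Python version. It assumes the `google-cloud-storage` client library is installed on the instance and that the instance's service account supplies the credentials. The bucket name, the `sqlserver-backups` prefix, and both function names are hypothetical; the whitepaper may do this differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def backup_object_name(
    backup_file: Path, taken_at: datetime, prefix: str = "sqlserver-backups"
) -> str:
    """Build a timestamped object name so each upload creates a new object."""
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
    return f"{prefix}/{stamp}-{backup_file.name}"


def upload_backup(bucket_name: str, backup_file: Path, taken_at: datetime) -> str:
    """Copy one backup file to Cloud Storage and return the object name."""
    # Imported here so the naming helper works without the library installed.
    from google.cloud import storage

    client = storage.Client()  # picks up the instance's service account
    blob = client.bucket(bucket_name).blob(backup_object_name(backup_file, taken_at))
    blob.upload_from_filename(str(backup_file))
    return blob.name
```

Timestamped names mean every run writes a fresh object instead of overwriting an old one, which suits a role that can create objects but not delete them.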
Once the database is being backed up, then if disaster strikes, you would spin up a new SQL Server instance, using either one of Google's preconfigured SQL Server images or your own custom disk image. The whitepaper doesn't mention this, but it's the sensible thing to do, and I'll talk about it more in a minute. Next, you can use an open source script to restore the database and re-execute the events in the log files up to the desired point in time.
When you're designing a disaster recovery solution, you need to consider RPO and RTO. RPO stands for Recovery Point Objective. This is the maximum span of time over which it's acceptable to lose data. It affects your backup and recovery strategy. For example, if it's acceptable to lose an entire day's worth of work, then you can just recover using the previous night's backups. If you have a short RPO, which is usually the case, then you need to make sure you are constantly backing up your data, and when recovering from database corruption, you have to carefully consider which point in time to recover to.
RTO stands for Recovery Time Objective. This is the maximum length of time that your application can be offline, and still meet the service levels your customers expect, usually in a service level agreement.
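To make the two objectives concrete, here is a small Python sketch that checks a design against them. The function names and the breakdown of recovery time into detection, provisioning, and restore are assumptions for illustration. It also assumes the worst-case data loss equals the interval between backups, which holds only when you have no transaction logs to replay.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With backups alone, you can lose up to one full backup interval of data."""
    return backup_interval <= rpo


def meets_rto(
    detection: timedelta,
    provisioning: timedelta,
    restore: timedelta,
    rto: timedelta,
) -> bool:
    """Downtime is roughly detect + provision a new instance + restore data."""
    return detection + provisioning + restore <= rto
```

For example, nightly backups satisfy a 24-hour RPO but not a 1-hour one, and a long provisioning step can by itself push you past a 1-hour RTO.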
In the SQL Server example, I suggested using either one of Google's preconfigured SQL Server images, or your own custom disk image that has SQL Server installed and configured. The advantage of having a custom disk image is that it helps you meet your recovery time objective, because it reduces the amount of time it takes to get a new SQL Server instance running. If you have to configure SQL Server manually, that could significantly impact how long it takes to recover from a disaster.
As with everything, though, there are tradeoffs. If your SQL Server implementation is customized, then you'll have to weigh the benefits of fast recovery time against the maintenance effort required to keep your custom image up-to-date. If you have a very short RTO, then you may have no choice but to maintain a custom disk image. You might be able to reduce the maintenance required, though, by using a startup script to perform some of the customization. Since the startup script resides on either the Metadata Server or Cloud Storage, you can change it without having to create a new disk image.
In some cases, you may want to run an application from your own data center, or from another cloud platform, and use Google Cloud as a disaster recovery solution. There are many ways you could do this, but I'll go over a couple of common designs.
The first way is to continuously replicate your database to an instance on Google Cloud. Then you'd set up a monitoring service that would watch for failures. In the event of a disaster, the monitoring service would trigger the spin-up of an instance group and load balancer for the web tier of the application. The only part you would need to do manually is to change the DNS record to point to the load balancer's IP address. You could use Cloud DNS, or another DNS service for this.
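The monitoring piece could work in many ways. One minimal sketch, in Python, is to trigger the spin-up only after several health checks fail in a row, so a single blip doesn't cause a failover. The `FailoverMonitor` class, its threshold, and the `on_failover` callback are all hypothetical; in practice the callback would be whatever creates the instance group and load balancer.

```python
from typing import Callable


class FailoverMonitor:
    """Fire a callback once after N consecutive failed health checks."""

    def __init__(self, threshold: int, on_failover: Callable[[], None]) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.on_failover = on_failover
        self.consecutive_failures = 0
        self.failed_over = False

    def record(self, healthy: bool) -> None:
        """Feed in the result of one health check of the primary site."""
        if self.failed_over:
            return  # already failed over; don't trigger a second spin-up
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        if self.consecutive_failures >= self.threshold:
            self.failed_over = True
            self.on_failover()
```

A higher threshold lowers the chance of a false alarm but adds to detection time, which counts against your RTO.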
This is already a low-cost solution, because the only Google Cloud resource that needs to run all the time is the database instance. But you can reduce the cost even further by running the database on the smallest machine type capable of running the database service. Then if there's a disaster, you would delete the instance, but with the option to keep the persistent disk, and spin up a bigger instance with the saved disk attached. Of course, this solution would require more manual intervention, and would lengthen your downtime, so you wouldn't want to do this if you have a short RTO.
If you want to reduce your downtime as much as possible, or even keep running in the event of hardware failures, you could serve your application from both your on-premises environment and your Google Cloud environment at all times. That way if you have an on-premises failure, the Google Cloud environment would already be running and serving customers. You would just need to scale up to handle the extra load, which would be automatic if you use an autoscaling instance group.
To make this hybrid solution work, you would need to use a DNS service that supports weighted routing, so it could split incoming traffic between the two environments. In the event of a failure, you would need to disable DNS routing to the failed environment.
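Here is a toy Python model of weighted routing, just to show the behavior. It is not any DNS provider's API: the environment names, the weights, and the `pick_environment` and `disable` functions are made up for illustration.

```python
import random
from typing import Dict, Optional


def pick_environment(
    weights: Dict[str, int], rng: Optional[random.Random] = None
) -> str:
    """Choose an environment with probability proportional to its weight."""
    rng = rng or random.Random()
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no healthy environment to route to")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


def disable(weights: Dict[str, int], failed: str) -> Dict[str, int]:
    """Return new weights with the failed environment set to zero."""
    return {name: (0 if name == failed else w) for name, w in weights.items()}
```

Starting from something like `{"on_prem": 70, "gcp": 30}`, calling `disable(weights, "on_prem")` leaves Google Cloud as the only target, so every request goes there.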
And that's it for disaster recovery.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).