
High Availability


Course Introduction
Case Study
Mapping Needs to GCP Services
Disaster Recovery

The course is part of these learning paths

Google Professional Cloud Developer Exam Preparation
Google Data Engineer Exam – Professional Certification Preparation
Google Cloud Platform for Solution Architects
Duration: 1h 6m


Google Cloud Platform (GCP) lets organizations take advantage of the powerful network and technologies that Google uses to deliver its own products. Global companies like Coca-Cola and cutting-edge technology stars like Spotify are already running sophisticated applications on GCP. This course will help you design an enterprise-class Google Cloud infrastructure for your own organization.

When you architect an infrastructure for mission-critical applications, not only do you need to choose the appropriate compute, storage, and networking components, but you also need to design for security, high availability, regulatory compliance, and disaster recovery. This course uses a case study to demonstrate how to apply these design principles to meet real-world requirements.

Learning Objectives

  • Map compute, storage, and network needs to Google Cloud Platform services
  • Create designs for high availability and disaster recovery
  • Use appropriate authentication, roles, service accounts, and data protection
  • Create a design to comply with regulatory requirements



Before we talk about networks, we need to talk about how we can make our applications highly available.

If you have an application that's running on only one VM instance, then of course it's a single point of failure, and if it goes down, your application goes down. So at a minimum you should always have at least two VMs for every component of your solution. But where should those instances be located?

When you create a VM, it gets created in a particular zone, such as us-central1-a. A zone is an isolated location. You can think of a zone as a data center or an isolated portion of a data center.

If you put both instances in the same zone, then both of them could potentially go down if there's a problem in that zone. So you should put the instances in different zones. For performance reasons, you may need to put them in zones that are in the same region, such as us-central1. Notice that the zone name is just the region name with a dash and a letter at the end. All of the zones in a region have high-bandwidth, low-latency network connections between them, so if instances that are spread across a region need to mirror data with each other, they can do this quickly.
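As a sketch, spreading two VMs across zones in the same region might look like this with the gcloud CLI. The instance names, machine type, and zones are illustrative placeholders, not values from the case study:

```shell
# Create two VMs in different zones of the same region, so a single
# zone outage can't take both down. Names and machine type are placeholders.
gcloud compute instances create web-1 \
    --zone us-central1-a --machine-type e2-medium
gcloud compute instances create web-2 \
    --zone us-central1-b --machine-type e2-medium
```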

Although region sounds like a geographic area, it's just a data center campus in one location. For example, all of the zones in the us-central1 region are in Council Bluffs, Iowa. So for maximum availability, you may also want to distribute your instances across different regions.

For a higher level of availability, you can use autoscaling instance groups. This was covered extensively in the Google Cloud Platform Systems Operation course, so I'll just go over the highlights.

An instance group consists of identical instances that perform processing for your application. If one of the instances fails, then a health check will notice this and replace the instance with a new one. If the load on the instance group gets too high, then the autoscaler will add more instances to maintain good application performance.

To ensure availability even if an entire zone fails, you should distribute the instances across multiple zones. Luckily, this is very easy to do. You just have to select the multi-zone option when you're creating the instance group.
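On the command line, the equivalent of selecting the multi-zone option is creating a regional managed instance group. This sketch assumes an instance template named web-template already exists; all names are placeholders:

```shell
# A regional managed instance group spreads its instances across the
# zones of us-central1 automatically. Names here are placeholders.
gcloud compute instance-groups managed create web-mig \
    --region us-central1 \
    --template web-template \
    --size 6
```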

If you want to make sure you'll still have enough instances to handle the load if an entire zone goes down, then you should overprovision by 50%. For example, if your instances are spread across three zones, and you need six instances to handle your normal traffic load, then you should provision nine instances. That way, if one of the zones goes down, which would take out three of the instances, you'll still have six instances left in the two remaining zones.
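The 50% figure is just this general rule applied to three zones: to survive the loss of one zone out of Z while keeping N instances serving, you need ceil(N * Z / (Z - 1)) instances. A quick shell sketch of the arithmetic:

```shell
# Instances to provision so that losing one of the zones still leaves
# enough instances serving traffic: ceil(normal * zones / (zones - 1)).
normal=6
zones=3
provision=$(( (normal * zones + zones - 2) / (zones - 1) ))
echo "$provision"   # 9
```

With three zones this works out to 50% extra; with four zones it would be only about 33% extra, since a single zone failure takes out a smaller share of the group.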

You can either overprovision by 50% at all times or you could save money by just setting the upper limit on your autoscaler to at least 50% more than the normal number of instances. If you decide to depend on the autoscaler during a zone failure, then the instances in your remaining two zones will be very heavily loaded until the autoscaler provisions additional instances, so only choose this option if you can tolerate this temporary performance degradation.

Since GreatInside has six web-tier instances for its main application, this is how it should be set up. For the two customer-facing IIS instances in the payment processing system, you'd set an upper limit of three instances, which is 50% more than the two instances it normally needs.
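Configuring the autoscaler's limits this way might look like the following sketch, assuming a regional instance group named web-mig; the CPU utilization target is an illustrative value, not from the case study:

```shell
# Autoscale between the normal load (6 instances) and 50% above it (9),
# so the group can absorb a zone failure. The CPU target is illustrative.
gcloud compute instance-groups managed set-autoscaling web-mig \
    --region us-central1 \
    --min-num-replicas 6 \
    --max-num-replicas 9 \
    --target-cpu-utilization 0.6
```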

To make the instance group work as a high-availability solution, you'll need a couple of other components. First, the instance group has to be behind a load balancer that distributes incoming requests to different instances. Second, the instances cannot hold any stateful data; otherwise, the same instance would have to handle all requests from a given user. Although you can enable the session affinity option in this situation, it will undermine your high availability, since a failed instance will impact all of the users pinned to it.
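As a rough sketch, putting the group behind an HTTP load balancer involves a health check and a backend service. All of the names, the port, and the health-check path below are placeholders:

```shell
# Health check used to route traffic only to healthy instances.
gcloud compute health-checks create http web-hc \
    --port 80 --request-path /healthz

# Global backend service fronting the (placeholder) instance group.
gcloud compute backend-services create web-backend \
    --protocol HTTP --health-checks web-hc --global
gcloud compute backend-services add-backend web-backend \
    --instance-group web-mig \
    --instance-group-region us-central1 \
    --global
```

Session affinity, if you really do need it, is a flag on the backend service (for example, --session-affinity CLIENT_IP), but as noted above it works against high availability.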

Since most applications do have stateful data, you have to put it on other components, such as a database or Cloud Storage. Unfortunately that just moves the availability issue to a different layer, but fortunately, Google Cloud has good ways to handle storage availability.

If Cloud Storage is sufficient for your stateful data needs, then you're covered because Cloud Storage is automatically replicated either across zones in a region, for the regional type, or across regions, for the multiregion type.
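For example, creating a regional bucket with gsutil might look like this; the bucket name is a placeholder, and note that newer releases fold the regional and multi_regional storage classes into a single standard class, with replication determined by the bucket location:

```shell
# A regional bucket: data is replicated across zones within us-central1.
gsutil mb -c regional -l us-central1 gs://example-stateful-data/
```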

If you need a database for your stateful data, then there are different availability solutions depending on the data service.

With Cloud SQL, you can create a failover replica in another zone by simply checking a box when you create a Cloud SQL instance. In the event of a failure, Cloud SQL will automatically fail over to the replica. Note that the failover replica option is only available with the InnoDB storage engine on MySQL.
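As a sketch, the command-line equivalent depends on the gcloud release: older releases created an explicit failover replica, while current ones use --availability-type REGIONAL, which provisions a standby in another zone. Instance names and the tier are placeholders:

```shell
# Current style: a MySQL instance with a standby in another zone.
gcloud sql instances create main-db \
    --database-version MYSQL_5_7 \
    --tier db-n1-standard-1 \
    --region us-central1 \
    --availability-type REGIONAL

# Older style: an explicit failover replica of an existing instance.
gcloud sql instances create main-db-failover \
    --master-instance-name main-db \
    --replica-type FAILOVER
```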

Since Cloud Datastore is a NoSQL database, it scales horizontally, which makes high availability easier than with Cloud SQL. Cloud Datastore automatically replicates data across zones in a region. When you create a Datastore instance, you specify which region, and it does the rest.

Bigtable is also a NoSQL database that scales horizontally, but if you want it to replicate across multiple zones, then you’ll have to configure it to do that. You can even configure it to support replication across regions if you need that. But in its simplest configuration, it only stores data in a single zone, which gives it higher performance. It’s still stored redundantly in that configuration but within the same zone.
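A sketch of a replicated Bigtable instance with clusters in two zones; the instance ID, cluster IDs, and node counts are placeholders. Replication is configured simply by specifying more than one cluster:

```shell
# Two clusters in different zones of the same region give Bigtable
# zone-level redundancy; clusters in different regions also work.
gcloud bigtable instances create orders-bt \
    --display-name "orders" \
    --cluster-config id=orders-c1,zone=us-central1-a,nodes=3 \
    --cluster-config id=orders-c2,zone=us-central1-b,nodes=3
```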

BigQuery automatically replicates data within a region, but it's a data warehouse, so it's not suitable for real-time stateful data storage.

Cloud Spanner also automatically replicates data within a region, so it's highly available out of the box. Unlike Cloud SQL, it doesn't need a failover replica, which is a less robust solution.

In summary, if a NoSQL database is sufficient for your application, then Cloud Datastore is your best choice for storing stateful data. If you need to use a relational database, then either use Cloud SQL, and set it up with a failover replica, or use Cloud Spanner for higher availability, if you're willing to pay a higher price.

Since GreatInside is going to use Cloud SQL, you just need to set it up with a failover replica. Then there's Microsoft SQL Server; making it highly available will be more difficult. Microsoft's recommended solution is to use Always On availability groups, which is a topic that's outside the scope of this course.

And that's it for this lesson.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).