High Availability


Case Study
Mapping Needs to GCP Services
8m 26s
8m 23s
11m 15s
Disaster Recovery
Start course
1h 8m

Google Cloud Platform (GCP) lets organizations take advantage of the powerful network and technologies that Google uses to deliver its own products. Global companies like Coca-Cola and cutting-edge technology stars like Spotify are already running sophisticated applications on GCP. This course will help you design an enterprise-class Google Cloud infrastructure for your own organization.

When you architect an infrastructure for mission-critical applications, not only do you need to choose the appropriate compute, storage, and networking components, but you also need to design for security, high availability, regulatory compliance, and disaster recovery. This course uses a case study to demonstrate how to apply these design principles to meet real-world requirements.

Learning Objectives

  • Map compute, storage, and networking requirements to Google Cloud Platform services
  • Create designs for high availability and disaster recovery
  • Use appropriate authentication, roles, service accounts, and data protection
  • Create a design to comply with regulatory requirements

Intended Audience

This course is intended for anyone looking to design and build an enterprise-class Google Cloud Platform infrastructure for their own organization.


To get the most out of this course, you should have some basic knowledge of Google Cloud Platform.


Before we talk about networks, we need to talk about how we can make our applications highly available.

If you have an application that’s running on only one VM instance, then, of course, it’s a single point of failure, and if it goes down, your application goes down. So, at a minimum, you should always have at least two VMs for every component of your solution. But where should those instances be located?

When you create a VM instance, it gets created in a particular zone, such as us-central1-a. A zone is an isolated location. You can think of a zone as a data center or an isolated portion of a data center.

If you put both instances in the same zone, then both of them could potentially go down if there’s a problem in that zone. So, you should put the instances in different zones. For performance reasons, you may need to put them in zones that are in the same region, such as us-central1. Notice that the zone name is just the region name with a dash and a letter at the end. All of the zones in a region have high-bandwidth, low-latency network connections between them, so if instances that are spread across a region need to mirror data with each other, then they can do this quickly.

Although “region” sounds like a geographic area, it’s just a data center campus in one location. For example, all of the zones in the us-central1 region are in Council Bluffs, Iowa. So, for maximum availability, you may also want to distribute your instances across different regions.

For a higher level of availability, you can use autoscaling instance groups.

An instance group consists of identical instances that perform processing for your application. If one of the instances fails, then a health check will notice this and replace the instance with a new one. If the load on the instance group gets too high, then the autoscaler will add more instances to maintain good application performance.

To ensure availability even if an entire zone fails, you should distribute the instances across multiple zones. Luckily, this is very easy to do. You just have to select “Multizone” when you’re creating the instance group.

If you want to make sure you’ll still have enough instances to handle the load if an entire zone goes down, then you should overprovision by 50%. For example, if your instances are spread across 3 zones and you need 6 instances to handle your normal traffic load, then you should provision 9 instances. That way if one of the zones goes down (which would take out 3 of the instances), you’ll still have 6 instances left in the two remaining zones.

You can either overprovision by 50% at all times or you could save money by just setting the upper limit on your autoscaler to at least 50% more than the normal number of instances. If you decide to depend on the autoscaler during a zone failure, then the instances in your remaining two zones will be very heavily loaded until the autoscaler provisions additional instances, so only choose this option if you can tolerate this temporary performance degradation.

Since GreatInside has 6 web tier instances for its main application, this is how it should be set up. For the 2 customer-facing IIS instances in the payment processing system, you’d set an upper limit of 3 instances, which is 50% more than the 2 instances that it normally needs.

To make the instance group work as a high-availability solution, you’ll need a couple of other components. First, the instance group has to be behind a load balancer that will distribute incoming requests to different instances. Second, the instances cannot have any stateful data. Otherwise, the same instance would have to handle all requests from a given user. Although you can enable the “session affinity” option in this situation, it will ruin your high availability since a failed instance will impact all of the users on it.

Since most applications do have stateful data, you have to put it on other components, such as a database or Cloud Storage. Unfortunately, that just moves the availability issue to a different layer, but fortunately, Google Cloud has good ways to handle storage availability.

If Cloud Storage is sufficient for your stateful data needs, then you’re covered because Cloud Storage is automatically replicated either across zones in a region (for the Regional type) or across regions (for the Multi-Region type). If you need a database for your stateful data, then there are different availability solutions depending on the data service.

With Cloud SQL, you can simply check the “High availability” box when you create a Cloud SQL instance. This will create a failover replica in another zone. In the event of a failure, Cloud SQL will automatically fail over to the replica. This option is available for  MySQL, PostgreSQL, and SQL Server.

Since Firestore is a NoSQL database, it scales horizontally, which makes high availability easier than with Cloud SQL. Firestore automatically replicates data across zones in a region. When you create a Firestore instance, you specify which region and it does the rest.

Bigtable is also a NoSQL database that scales horizontally, but if you want it to replicate across multiple zones, then you’ll have to configure it to do that. You can even configure it to support replication across regions if you need that. But in its simplest configuration, it only stores data in a single zone, which gives it higher performance. It’s still stored redundantly in that configuration but within the same zone.

BigQuery automatically replicates data within a region, but it’s a data warehouse, so it’s not suitable for real-time stateful data storage. Cloud Spanner also automatically replicates data within a region, so it’s highly available out of the box, and unlike Cloud SQL, it doesn’t need a failover replica, which is a less available solution.

In summary, if a NoSQL database is sufficient for your application, then Firestore is your best choice for storing stateful data. If you need to use a relational database, then either use Cloud SQL and enable high availability or use Cloud Spanner for even higher availability if you’re willing to pay a higher price.


Since GreatInside is going to use Cloud SQL for both MySQL and SQL Server, then we just need to enable the high availability option when we create those databases.

About the Author
Learning Paths

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).