Adding Resiliency to Google Cloud Container Engine Clusters (GKE)

Resiliency Defined




Resiliency should be a key component of any production system. GKE provides numerous features for adding resiliency to the containerized applications and services you deploy on Google Cloud Platform, so that they can serve your users efficiently and reliably.

Intended audience

This course is for developers or operations engineers looking to expand their knowledge of GKE beyond the basics and start deploying resilient, production-quality containerized applications and services.


Viewers should have a good working knowledge of creating GKE container clusters and deploying images to those clusters.

Learning objectives

  • Understand the core concepts that make up resiliency.
  • Maintain availability when updating a cluster.
  • Make an existing cluster scalable.
  • Monitor running clusters and their nodes.

This Course Includes

  • 60 minutes of high-definition video
  • Hands-on demos

What You'll Learn

  • Course Intro: What to expect from this course.
  • Resiliency Defined: The key components of resiliency.
  • Cluster Management: In this lesson, we’ll cover Rolling Updates, Resizing a Cluster, and Multi-zone Clusters.
  • Scalability: The topics covered in this lesson include Load Balancing Traffic and Autoscaling a Cluster.
  • Container Operations: In this lesson, we’ll demo Stackdriver monitoring and take a look at the Kubernetes dashboard.
  • Summary: A wrap-up and summary of what we’ve learned in this course.


Hello, and welcome back. In this section, we're going to talk about resiliency: what it is, and why it's important for our GKE deployments. Site Reliability Engineering, or SRE for short, is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems.

As defined by Ben Treynor, founder of Google's SRE team, SRE is what happens when a software engineer is tasked with what used to be called operations. So creating applications that are both resilient and scalable is an essential part of any enterprise architecture. A well-designed application should be able to scale seamlessly as demand increases or decreases, and also be resilient enough to withstand the loss of one or more resources.

But first off, I'd like to talk a little bit about scalability, which is the ability to match capacity to demand. Scalability is inextricably linked to resiliency.

When talking about the cloud, you will often hear the term elasticity. Elasticity is the ability to increase or decrease resources as needed to meet the current capacity needs of your applications or services. So, for example, a scalable web application is one that works well with one user or a million users, and gracefully handles peaks and dips in traffic automatically.
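As a rough illustration of matching capacity to demand, the sketch below computes how many replicas a service would need for a given load. It uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler; the function name, the bounds, and the sample numbers are assumptions made for this example and are not taken from the course.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Return a replica count that brings per-replica load near the target.

    current_load and target_load are average load per replica
    (for example, percent CPU). Names and defaults are illustrative.
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Proportional rule: scale the replica count by observed/target load.
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp so the result stays within the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 90, 60))  # traffic peak: scale out, prints 5
print(desired_replicas(4, 20, 50))  # traffic dip: scale in, prints 2
```

The clamp to a minimum and maximum reflects a common design choice: a lower bound keeps the service available during quiet periods, and an upper bound keeps costs predictable during spikes.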

By adding and removing nodes only when needed, scalable apps consume only the resources necessary to meet demand. For an application to be resilient, it needs to be able to automatically replace instances that have failed or become unavailable. In our diagram, we show a load-balanced container cluster attached to replicated instances of Cloud SQL.
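The idea of automatically replacing failed instances can be sketched as a small reconciliation step: compare what is running against what is wanted, drop anything unhealthy, and start replacements. This is a toy in-memory model, not GKE's implementation; `reconcile`, `new_instance`, and the node names are hypothetical.

```python
import itertools


def reconcile(instances, desired_count, is_healthy, new_instance):
    """Return a list of desired_count healthy instances.

    Unhealthy instances are dropped and replaced with fresh ones;
    surplus healthy instances are trimmed.
    """
    healthy = [inst for inst in instances if is_healthy(inst)]
    while len(healthy) < desired_count:
        healthy.append(new_instance())
    return healthy[:desired_count]


# Hypothetical instance factory that hands out sequential names.
_counter = itertools.count(1)


def new_instance():
    return f"node-{next(_counter)}"


cluster = [new_instance() for _ in range(3)]  # node-1, node-2, node-3
failed = {"node-2"}                           # simulate one node going down
cluster = reconcile(cluster, 3, lambda n: n not in failed, new_instance)
print(cluster)  # ['node-1', 'node-3', 'node-4']
```

Running a step like this repeatedly is what lets a system converge back to its desired state after a failure without manual intervention.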

Our cluster is configured for resiliency, and therefore availability, by distributing itself across regions as well as replicating across multiple nodes within the cluster itself. Now for Cloud SQL, I'm showing two instances set up for read replication, where one would be the master and the other would be a read replica. Google Cloud SQL provides the ability to replicate a master instance to one or more read replicas.

A read replica is a copy of the master that reflects changes to that master instance in almost real time.
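To make the master and read replica relationship concrete, here is a minimal in-memory sketch. It only models the behavior described above, where writes go to the master and the replica catches up shortly afterwards; it does not use the Cloud SQL API, and the class and method names are hypothetical.

```python
class Master:
    """Accepts writes and records them in an ordered change log."""

    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class ReadReplica:
    """Serves reads from a copy that is refreshed from the master's log."""

    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def sync(self):
        # Apply only the changes made since the last sync.
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)

    def read(self, key):
        return self.data.get(key)


master = Master()
replica = ReadReplica(master)

master.write("user:1", "Ada")
print(replica.read("user:1"))  # None: the change has not been replicated yet
replica.sync()
print(replica.read("user:1"))  # Ada
```

The gap between the write and the `sync()` call stands in for replication lag, which is why the replica is described as "almost" real time: a read served by a replica can briefly return older data than the master holds.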

Okay, in the next section, we're going to talk about cluster management: the ability to update our clusters based on the needs of our applications and services.

About the Author

Steve is a consulting technology leader for Slalom Atlanta, a Microsoft Regional Director, and a Google Certified Cloud Architect. His focus for the past 5+ years has been IT modernization and cloud adoption with implementations across Microsoft Azure, Google Cloud Platform, AWS, and numerous hybrid/private cloud platforms. Outside of work, Steve is an avid outdoorsman spending as much time as possible outside hiking, hunting, and fishing with his family of five.