
Cluster Management




Resiliency should be a key component of any production system. GKE provides numerous features focused on adding resiliency to your deployed containerized applications and services on Google Cloud Platform, allowing them to serve your users efficiently and reliably.

Intended audience

This course is for developers or operations engineers looking to expand their knowledge of GKE beyond the basics and start deploying resilient, production-quality containerized applications and services.


Viewers should have a good working knowledge of creating GKE container clusters and deploying images to those clusters.

Learning objectives

  • Understand the core concepts that make up resiliency.
  • Maintain availability when updating a cluster.
  • Make an existing cluster scalable.
  • Monitor running clusters and their nodes.

This Course Includes

  • 60 minutes of high-definition video
  • Hands-on demos

What You'll Learn

  • Course Intro: What to expect from this course
  • Resiliency Defined: The key components of resiliency.
  • Cluster Management: In this lesson, we’ll cover Rolling Updates, Resizing a Cluster, and Multi-zone Clusters.
  • Scalability: The topics covered in this lesson include Load Balancing Traffic and Autoscaling a Cluster.
  • Container Operations: In this lesson, we’ll demo Stackdriver monitoring and take a look at the Kubernetes dashboard.
  • Summary: A wrap-up and summary of what we’ve learned in this course.


In this section we're going to cover cluster management, which includes managing the geographic distribution of your cluster, resizing clusters, performing rolling updates to running clusters, and connecting clusters to other GCP components like Cloud SQL.

So multi-zone Container Engine clusters are primarily used as a way to improve the availability of your application in the unlikely event of a zone outage. When you create a multi-zone cluster, Container Engine makes the underlying or supporting resource footprint the same across all zones; that is, the managed instance groups are the same size in every zone.

For example, suppose you request four VM instances with four cores each and you ask for your cluster to be spread across two zones. Because the footprint is replicated per zone, that gets you eight instances and 32 cores in total, with 16 cores allocated to each of the two zones. The reason for spreading resources evenly across zones is to ensure that pods of containers get scheduled evenly across zones, which improves availability and failure recovery. If computing resources were spread unevenly across the zones, the scheduler that's used under the covers might not be able to spread pods evenly across the zones, even though it makes a best effort to do so.

Now remember that we can have multiple node pools associated with a container cluster. All node pools within one of these multi-zone clusters are replicated to the zones of the cluster automatically, and any new node pool you create will automatically be created in those zones.

Okay, like most things in GCP, creating a multi-zone cluster can be performed via the CLI, the portal, or the APIs. It's also good to note that this is one of the few properties of a container cluster that can actually be updated after the cluster is created. All of the zones used for a multi-zone deployment must be in the same region, so in our example that's going to be us-east1.
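A hedged sketch of what this looks like from the CLI. The cluster and zone names are made up, and `--additional-zones` reflects the gcloud SDK of this course's era (newer releases use `--node-locations` instead):

```shell
# Create a cluster whose nodes are replicated across three zones of us-east1.
# Note: --num-nodes is PER zone, so this yields 12 nodes total.
gcloud container clusters create demo-cluster \
    --zone us-east1-b \
    --additional-zones us-east1-c,us-east1-d \
    --num-nodes 4

# Convert an existing single-zone cluster to multi-zone after creation.
gcloud container clusters update demo-cluster \
    --zone us-east1-b \
    --additional-zones us-east1-c,us-east1-d
```

Both commands require an authenticated gcloud session and an active project, so treat them as a template rather than something to paste verbatim.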

We're over in our browser looking at the GCP portal, in the Container Engine section. What I want to show you is the multi-zone capability of a Container Engine cluster and how we might set that via the portal. I've got two example clusters set up; both of these are in the us-central1-a zone.

So if we take a look at one of them (it doesn't matter which, they're pretty much the same), we'll look at gke-resiliency, and if we look at the properties of this cluster you'll see the size, the master zone, and the node zone; right now this is just set to a single zone. What we want to do is update this to be multi-zonal, and you would go through this same process if you were setting it up from scratch. But in this case, like I mentioned before, this is one of those properties of an existing cluster that we can actually edit. So we're going to do that now: we're going to edit this cluster.

We're gonna look at the main properties, and if you look at the master zone you'll see that this is in us-central1-a. Now we have the option to add the additional zones that are in that same region, us-central1. We've got f, b, and c, so let's just select all three. Once we save our container cluster, it's going to apply these changes, and at the end of this update we'll have a multi-zone cluster that is actually spread across all of those zones. So we're now going to have four times as many resources as we did before we updated this to be multi-zonal, and we have it spread across zones.

When increasing the size of a container cluster, the new instances are created with the same configuration as the existing instances. Existing pods are not moved onto the new instances, but new pods, such as those created by resizing a replication controller, will be scheduled onto the new instances. Replication controllers handle scheduling pods in and out of the pool as it grows or shrinks.

The main driver for resizing a cluster is a change in utilization needs, whether that's an increase or a decrease in demand. What we're talking about here is a manual process where you're committing a size change to your cluster configuration. Ideally, in most situations this will be handled by the autoscaler functionality, which we'll talk about in just a minute.

That said, there will also be use cases like load management or maybe throttling where you might want to manually control the size of your cluster. Another use case is when you need a lot of resources to spin up immediately for use and don't really want to wait on something to auto-scale up based on load increasing over a period of time. And when resizing a cluster with a node pool that spans across multiple zones, the size represents the number of nodes in the node pool per zone.

So for example, if you have a node pool of size two spanning us-central1-a and us-central1-b, the total node count will be four. If you then resize this cluster to size four, the total node count will be eight, spread across those two zones. Now, to resize a container cluster you can run one of two commands from the CLI.

The first command that you see will resize the cluster. If you have multiple node pools, you need to specify which node pool to resize by using the node pool flag in the second CLI command. You are not required to use the flag if you have just a single node pool. Now we're going to go over to our console to look at how you would do this via the portal UI.
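The two commands described above look something like this; cluster and pool names are hypothetical, and `--size` reflects the gcloud SDK of this era (newer releases use `--num-nodes`):

```shell
# Resize a cluster that has a single (default) node pool.
# Remember: with a multi-zone cluster, the size is the node count PER zone.
gcloud container clusters resize demo-cluster \
    --zone us-central1-a \
    --size 4

# With multiple node pools, name the pool to resize explicitly.
gcloud container clusters resize demo-cluster \
    --zone us-central1-a \
    --node-pool high-mem-pool \
    --size 4
```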

Now we're back over in the portal looking at one of our specific container clusters. What we want to do is show how to resize it. It's very easy, and like adding multiple zones, it's one of those things you can do after the cluster is already created (obviously, since you're resizing it). We can do this via the portal by editing our cluster, and the UI automatically jumps down to the node pool, which is what we actually want to resize.

But if you scroll up, you'll see the total size here, which is essentially an aggregate of the node pools. So when we resize our node pool, that will be reflected for the container cluster as a whole as well. If we scroll down to our node pool, this is how easy it is to resize: if we want to go from three to five, we just increase this to five, that's it, and then we save.

If we had multiple node pools we could resize them accordingly by editing each specific node pool and changing its size. So let's actually take this down: if we change this to two and hit save, it's going to go through the process, resize the node pool, and then the resulting container cluster. Now we see that the change has taken effect and we have a container cluster of size two. If we go back and look at our list of container clusters, there we go, the change is reflected: a cluster size of two. And remember that this is a manual change, so in probably 90% of scenarios you're going to want an autoscaling solution that scales up or down based on your capacity and needs.

As mentioned before, there are scenarios where this makes sense, where you need to guarantee that, immediately or for a period of time, you have a certain amount of capacity within one of your container clusters. For me, rolling updates are one of those things that are just extremely cool; an extremely cool part of containerization, or really of distributed architectures in general. A rolling update is the process of updating an application or service (or a set of microservices), whether it's a new version or just an updated configuration, in a serial, step-by-step fashion.

By updating one instance at a time, you're able to keep the application up and running. If you were to update all instances at the same time, your application would likely experience some downtime, or at the very least some performance degradation. A very simple process for deploying a rolling update would be: update your local Docker image.

This might be on your local machine or part of your build process. You push that updated image to Container Registry, and then finally you update your container cluster's deployment with the updated image. That last step is probably the most important: it's where the rolling part of the rolling update occurs. Container Engine's rolling update mechanism ensures that your application remains up and available even as the system replaces instances of your old container image with your new one across all running replicas. And you can create an image for a new version of your application by simply building your source code and tagging it as a new version, with v2 or whatever.

If we look at the commands we have here, they map to the process we just looked at: we update a Docker image, we push that image up to Container Registry, and then we use the kubectl command line to set that image on a deployment, which performs a rolling update across our container cluster. So we're getting ready to demo deploying a new image into a running container cluster. Before we do that, I want to talk a little bit about what's going to happen, which we just saw on the slide. We've got another image that's already deployed to our Container Registry, so we've got our image in the cloud, and that image is tagged as gke v2. Now we want to take that image that's already in the cloud and run the kubectl set image command against the deployment in our container cluster, which is going to update our pods with that image. So if we execute this, it's going to roll out the deployment.
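The commands referenced above might look like the following; the project, image, and deployment names are hypothetical, and `gcloud docker -- push` was this era's way of authenticating pushes to Container Registry:

```shell
# 1. Build and tag a new version of the image locally (or in your build process).
docker build -t gcr.io/my-project/hello-node:v2 .

# 2. Push the updated image to Container Registry.
gcloud docker -- push gcr.io/my-project/hello-node:v2

# 3. Point the deployment at the new image. Kubernetes performs the
#    rolling update, replacing old pods with new ones a batch at a time.
kubectl set image deployment/hello-node hello-node=gcr.io/my-project/hello-node:v2
```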

So now what's happening behind the scenes is Kubernetes is performing a rolling update. It happens very fast in this case because it's a fairly small image and a minor change. What we can do now is take a look at our container cluster: if we run a kubectl command to get pods, there we go, we'll see our gke-monitoring pods, and one of them is in a terminating status. What I did was deploy an image that will fail, so if we look at that again, we should see it cycling through a CrashLoopBackOff.
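Checking on the rollout is a single command; the pod names and output below are illustrative, not actual demo output:

```shell
# List pods in the default namespace.
kubectl get pods

# Illustrative output while a bad image is cycling:
#   NAME                    READY   STATUS             RESTARTS   AGE
#   hello-node-3756qx       0/1     CrashLoopBackOff   4          2m
#   hello-node-old-1192zk   1/1     Terminating        0          1h

# Dig into why a particular pod is failing.
kubectl describe pod <pod-name>
```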

Okay, so that's essentially what happens when something goes wrong with a container: when it cannot load, it cycles through a couple of statuses that show something is going wrong. But the crux of this is that we've deployed an image to our running container cluster that does not work. It's a good way to see that yes, it definitely did the rolling update, but now we've also got a failing container cluster. What we're going to look at next is how to remedy this.

Well, we just finished looking at a demo of how to perform a rolling update with Kubernetes. Unfortunately, our rolling update failed. It was planned, of course, for the demo, but it failed, so now we need to do something about it. In this case, we just need to roll back to our last deployment, which we know was good. We have a couple of options for resolving a deployment issue.

You can roll back to the previous deployment that is stable, or you can roll back to a specific revision that you know is stable, or that you want to go to for one reason or another.

The first command that we have here will list out the history of a specific deployment. So you can see all of the different revisions that have been performed against that deployment.

The second command will roll back the deployment just to the previous deployment which we're going to do in just a second.

The third command rolls back the deployment to a specific revision.

So if you want to go four revisions back, you can do that. Once the deployment is rolled back to a previous stable revision, a rollback event is generated by the deployment controller. Now we're back over in our terminal window where we can fix the failed deployment we just performed, so we're going to perform a rollback.
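The three commands described above, using a hypothetical deployment name:

```shell
# 1. List the revision history for a deployment.
kubectl rollout history deployment/hello-node

# 2. Roll back to the immediately previous revision.
kubectl rollout undo deployment/hello-node

# 3. Roll back to a specific revision from the history.
kubectl rollout undo deployment/hello-node --to-revision=9
```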

But before we do that, I want to show you some of the functionality that's been lit up in the portal. If we look at Container Engine, we've got some tabs we didn't necessarily see before, depending on when you last looked at the portal. One of those is Workloads. If we look at this right now, it says our deployment is OK, and it lists the deployments for our various container clusters, which we can dig into.

If we click on one, we get more details on what's available and what might be going wrong, and we can look at our revision history. You'll see a lot of revisions for this specific deployment: revision nine, revision eight, and so on. We know the hello-node revision works, and that gke v2 is the one that does not.

So we want to make sure that we go back to revision nine, and I'll show you how to do that in just a second. If we look back at Workloads now that it's refreshed, we'll see that the workload associated with our container cluster is in an essentially failed status. And just to recap: we have a revision history, and we know that revision nine was good.

Imagine there were a lot more revisions here, but we know revision nine was good, so we're going to roll back to that specific revision using the kubectl command line. I'm going to cheat; I've got that as one of my last executed commands.

Okay, we want to go to revision nine to get this working. Let's walk through this command again: we're doing a rollout undo, so we're undoing a specific deployment and rolling back to revision nine. Once we execute it, we get a confirmation that it was successfully rolled back, and we should immediately see that reflected both in the deployment from the command line and over in the UI: if we look at our workloads, we should now see a status of OK, so our deployment is back up and running.

I'd like you to think back to the diagram from earlier that shows a highly available container cluster connected to Cloud SQL. As it turns out, there's really an art to connecting to Cloud SQL from Container Engine, and this pattern holds true for most database services. That art involves the creation of a proxy which acts as a go-between. The reason this is necessary is that a Cloud SQL request could originate from any of the pods within a cluster.

And this becomes especially important in high-scalability scenarios, where you've got multi-zone clusters with many pods scaled out within them. You definitely don't want each pod to have to maintain its own connection to Cloud SQL, so the proxy both simplifies that and makes the solution more resilient.

The process we're going to follow, using the Cloud SQL Proxy, is only applicable to the second-generation release of Cloud SQL. Also, Cloud SQL for PostgreSQL is still in beta, so we're going to focus on Cloud SQL for MySQL. The process you'll see on the slide is: you create some user accounts, you create some credential secrets, you update your pod configuration, and you commit those changes. It's not really a lot of steps, but there's a lot to do within each of them, which you'll see described at the link at the bottom.
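A hedged sketch of the credential-secrets step; the secret names, username, password, and key-file path here are hypothetical placeholders, and the full pod-spec changes are covered in the linked guides:

```shell
# Store the service account key that the Cloud SQL Proxy sidecar
# will use to authenticate to the Cloud SQL instance.
kubectl create secret generic cloudsql-instance-credentials \
    --from-file=credentials.json=/path/to/key.json

# Store the database username and password for the application container.
kubectl create secret generic cloudsql-db-credentials \
    --from-literal=username=proxyuser \
    --from-literal=password=changeme
```

With the secrets in place, you add the proxy container (`gcr.io/cloudsql-docker/gce-proxy`) as a sidecar in your pod spec, mount the instance-credentials secret into it, and point your application at the proxy on localhost instead of the database directly.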

I've also included a couple of other links. So between these two links you will have all the information and guidance that you need to connect to Cloud SQL from Container Engine. So specifically this first link provides details and examples around all of the steps necessary to complete the process.

So everything underneath those four main steps we saw on the previous slide is detailed there. The second link goes to a GitHub repo that contains a working sample YAML configuration file for a container cluster connecting to a Cloud SQL instance. In the next section, we're going to talk about scalability, and how it builds upon everything we've covered on cluster management.

About the Author

Steve is a consulting technology leader for Slalom Atlanta, a Microsoft Regional Director, and a Google Certified Cloud Architect. His focus for the past 5+ years has been IT modernization and cloud adoption with implementations across Microsoft Azure, Google Cloud Platform, AWS, and numerous hybrid/private cloud platforms. Outside of work, Steve is an avid outdoorsman spending as much time as possible outside hiking, hunting, and fishing with his family of five.