
Container Operations


Adding Resiliency to Google Cloud Container Engine Clusters (GKE)


Resiliency should be a key component of any production system. GKE provides numerous features focused on adding resiliency to your deployed containerized applications and services on Google Cloud Platform, so that they run efficiently and reliably.

Intended audience

This course is for developers or operations engineers looking to expand their knowledge of GKE beyond the basics and start deploying resilient, production-quality containerized applications and services.


Viewers should have a good working knowledge of creating GKE container clusters and deploying images to those clusters.

Learning objectives

  • Understand the core concepts that make up resiliency.
  • Maintain availability when updating a cluster.
  • Make an existing cluster scalable.
  • Monitor running clusters and their nodes.

This Course Includes

  • 60 minutes of high-definition video
  • Hands-on demos

What You'll Learn

  • Course Intro: What to expect from this course
  • Resiliency Defined: The key components of resiliency.
  • Cluster Management: In this lesson, we’ll cover Rolling Updates, Resizing a Cluster, and Multi-zone Clusters.
  • Scalability: The topics covered in this lesson include Load Balancing Traffic and Autoscaling a Cluster.
  • Container Operations: In this lesson, we’ll demo Stackdriver monitoring and take a look at the Kubernetes dashboard.
  • Summary: A wrap-up and summary of what we’ve learned in this course.


So in this section, we're gonna talk about container operations. And this is a topic that's really at the core of SRE, that's Site Reliability Engineering. This is running your container cluster. This is running your production workloads. Monitoring them, making sure that you have alerts set up for certain scenarios so that your containers stay reliable, they stay scalable, they stay available.

You're monitoring to make sure that nothing unexpected happens. But when the unexpected does happen, that you react appropriately, because you're being alerted to that. And you're using the tools appropriately to make sure that you have the visibility into your container cluster that you need. And there's a couple of different tools that you can use to do that.

Probably the main one is going to be Stackdriver. And this is a product that is integrated into GCP. It provides extremely approachable and just really powerful monitoring, logging, and diagnostics. It gives you the visibility into the health, performance, and availability of your containerized deployments, enabling you to see what's happening and resolve bugs and performance issues easier and faster. Like I said, it's integrated into GCP. And so it's really friction-free, from an adoption standpoint. And you'll see that as we look at the Stackdriver portal and what we can see, and how it's integrated into GCP.

There's a couple of different tiers that we can use for Stackdriver. The first is basic. And this is a free tier, where you can pay if you exceed your allotments. One caveat here is Stackdriver is available to use for both GCP and AWS. And you can blend the two together if you want to monitor resources in both clouds. But you cannot use Stackdriver with AWS if your account is in the basic tier. Now the premium tier, on the other hand, has the ability to monitor a blend of resources, like I mentioned, across both GCP and AWS. And this is especially helpful when you have a cross-cloud or multi-cloud deployment strategy.

With the premium tier you have a larger allotment of resources for logs and metrics. And then, as with the basic tier, if you exceed this allotment, you can be billed for your overages. And all new Stackdriver accounts start with a 30-day premium tier trial. When that trial expires, your service is reduced to the basic tier, unless you choose to upgrade to the premium tier.

So here's a comparison of the two. You can see the various features, like price and supported clouds, as we mentioned. You'll also see some of the allotments that we've got for logging, metrics, and alerting policies. With the basic tier, you'll see that it's free, but you can only use it on GCP. You've got a logging limit, and also a retention limit for that. You've also got some allotment differences for metrics, and limitations on alerting policies.

So when you're getting started, just try the basic tier. And if you find that you need more, absolutely migrate on up to the premium tier. So if we look at a monitoring page for a specific cluster, which we'll do live in just a second, it lists all of the pods running in your cluster. It lists some recent events for that cluster, and you'll also see some graphs of the current usage across the nodes within your cluster. And from here, just like in most monitoring packages when you're looking at a dashboard, you can easily drill down to the details of the individual pods and containers.

Now that's where you see metadata about that pod, its containers, how many times they've been restarted, along with metrics about the resource usage. And you can also dig into the related components of GCP, like the Compute Engine instances that are running underneath container clusters. So you can monitor those as well, as they are supporting your container clusters. So we're gonna go over. We're gonna take a look at what this looks like from the portal. How it's integrated. And just take a look at it in real time. We're back over in the portal.

We wanna take a look at Stackdriver monitoring for our container clusters. One thing that I forgot to mention, when we were looking through the slides and the different features of Stackdriver monitoring for Container Engine, is that it's something that you have to opt in for. When you create your container cluster you have to either select or deselect whether or not you want Stackdriver monitoring associated with that cluster.

So we're not gonna create another cluster, but we will go just to the splash page so I can show you where it is. So when you create a new container cluster one of the things that you need to choose is logging and monitoring. Are you going to turn this on? Or are you gonna have it off? So if I were to deselect this checkbox I would not see any of the data for our container cluster within our Stackdriver portal.

So, luckily, I have a container cluster gke monitoring created with Stackdriver enabled. So here I see Stackdriver monitoring is enabled for this container cluster. I also have our gke resiliency cluster. If you wanna look, Stackdriver monitoring is disabled. So we will only see data for our gke monitoring cluster within the Stackdriver portal. Before we go to the portal, which I already have up in another tab, I'll show you how to get to it. We'll bring up the hamburger menu. And we just want to scroll down to the Stackdriver section. And once we get there, we'll click on monitoring. And this is going to bring up our Stackdriver portal. So here we've just got a splash page.

This will give you some very high level information on everything that is associated with this Stackdriver account. Which happens to be just my Cloud Academy GCP project. What we're really interested in is let's look under resources. And once this loads, I'm going to see everything I can look at. So under GCP we wanna take a look at, or not under GCP, I'm sorry, under infrastructure, we wanna look at container engine. So this is gonna bring up the landing page, specific to container engine, which is gonna look a lot like what we just looked at a minute ago. And we're gonna see our cluster details. We're gonna see our CPU usage. We're gonna see our disk IO. And then if we had enough time, sometimes this takes a while to load, sometimes it doesn't, depends on the health checks.

For our pods we would see all of our pods associated with our gke monitoring cluster here. But for this, now we can see any events that have happened. Nothing here, we can see our high level cluster details. What version our Kubernetes master is on, as well as the node version. And we can see if there are any instances associated with this. And then over to the right we see our graphs where we have CPU utilization, disk IO, as well as network traffic. And you can look at some details associated with those.

You can also create an alerting policy. So this is where it gets a little bit more interesting, when you look at kind of what you can do with this and not just monitor but react. Which is really important when you look at reliability engineering. So let's create an alerting policy. And this is going to go to a generic page. So this alerting policy page is not really associated with container engine. So we'll look at, really quick, I just wanna show you, we're gonna look at conditions.

But I also wanna look at notifications. So when we look at notifications this is going to be how we notify you. So this is gonna be, we can notify you by email, we can use SMS, we can integrate with things like HipChat or Slack. We can also use PagerDuty. And then there's some advanced options depending on what you wanna pay for. You can also just have a Cloud console mobile app push notification. And so that's the various options as to how you get notified.

And so when those notifications occur, you can also define, via Markdown, something else to include as part of that notification. And then, obviously, we wanna give the policy a name. But most importantly, let's go back to the conditions. And look at what conditions we want to possibly alert on for our container engine cluster. So we want to add a condition. And let's look at, probably what would be the most interesting would be some metric threshold.

I can imagine we would be interested in maybe when our cpu utilization crosses a certain threshold. And so we can look at this. So we can look at our resource type. Now we can also monitor our compute engine instances, right? That happen to be associated with our container clusters. But we can also go in and we want to look at a gke container. And what do we want to look at? Well let's look at gke monitoring. Now, we're looking at our entire container cluster. And we want to look at, if our cpu utilization for that cluster gets above, say, 80% for how long, let's say five minutes. If that happens then I want to receive an alert. And then you can also get quite fancy and you can have some amount of functionality that maybe automatically responds to an alert.

Maybe you send it to a Cloud function. Or maybe you send it to some other automated piece of functionality. That upon this alert it reacts. And maybe that is automatically spinning up more resources. Maybe that is adjusting your load balancing strategy, whatever it may be. Or whether it's just pure notification, "Hey, something weird is happening." You know, maybe you've got an auto scaling container cluster, that's auto scaled all the way up to where you thought it needed to go. But you're actually still hitting a threshold. So you need to know that you now need to go in and update that auto scaler. So that's really what we look at. Kind of the broad strokes, what you might look at when you're monitoring specifically a container cluster, in Stackdriver.
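As a rough illustration of the condition and reaction just described, here is a minimal Python sketch. It is not Stackdriver's API or how Stackdriver evaluates policies internally; the names `breaches_threshold`, `decide_action`, and `Action` are hypothetical, and the samples are assumed to be plain (timestamp, utilization) pairs. It only shows the logic of "CPU above 80% for five minutes" and the choice between scaling up and telling a person that the autoscaler has hit its ceiling.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple

# Hypothetical sample shape: (timestamp in seconds, CPU utilization as a fraction 0.0-1.0)
Sample = Tuple[float, float]


class Action(Enum):
    NONE = "none"
    SCALE_UP = "scale_up"
    NOTIFY_OPERATOR = "notify_operator"


def breaches_threshold(samples: Iterable[Sample],
                       threshold: float = 0.80,
                       duration_s: float = 300.0) -> bool:
    """Return True if utilization stays above `threshold` for at least `duration_s` seconds."""
    breach_start: Optional[float] = None
    for ts, value in sorted(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new run above the threshold
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the window
    return False


def decide_action(alert_firing: bool, current_nodes: int, max_nodes: int) -> Action:
    """Pick a reaction: add capacity while there is room, otherwise involve a person."""
    if not alert_firing:
        return Action.NONE
    if current_nodes < max_nodes:
        return Action.SCALE_UP
    # Already scaled all the way up and still over the threshold:
    # someone needs to go in and update the autoscaler limits.
    return Action.NOTIFY_OPERATOR
```

For example, feeding in one sample per minute at 90% utilization over five minutes would trip the condition, and with a cluster already at its maximum node count the sketch would choose to notify an operator rather than scale.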

Let's cancel this and kind of look at what else we might look at. So, underneath our container cluster we know we have compute, right? So we can also look at that. So that's gonna be under our instances. And so all of these instances are associated with a container cluster, because that's all we have running right now. So you'll see gke resiliency, gke monitoring. So you can take a look at the information, the CPU utilization for all of your underlying compute engine instances as well. And in addition to that, dig in and let's look specifically at one of these instances and look at what's actually happening.

We can look at endpoint latency, CPU utilization, Disk IO, et cetera, et cetera. So we can look at this in a more granular way. Looking at the actual, underlying instances that back up our container clusters. And we can set checks on these individual instances if we want to. I don't know that I would in this case. I think it's a good idea to really think about things, holistically, at your container cluster level. But you can certainly look more granularly.

Especially in some cases where you might have specific machine types that you want to know how they are performing. Looking at the memory, to make sure that they're performing as expected. Like in the instance where we just talked about having some high-memory compute instances, used specifically for certain types of load. Well I absolutely love Stackdriver and everything that we get from that. The monitoring, the alerting, everything in one nice package, completely integrated with GCP.

We do have some other options. And one of those is something that's been around for a while. Pretty much as long as Kubernetes has been around. And that's the Kubernetes Dashboard. And this is a general purpose, open source, web-based UI for Kubernetes clusters. And it allows you to manage applications running in the cluster, troubleshoot them, as well as manage the cluster itself. So it gives you a little bit more. It's a little bit of a combination of what we see in Stackdriver, and then also some of the functionality of GCP itself, both through the command line as well as the portal. And so previously, how I would have gotten to this is a little different than I would now.

So I'll talk about that first. I would probably check to see if it's already part of my Kubernetes distribution. If not, I could have gone to the dashboard GitHub page. And I could have cloned it and deployed it that way. It's very easy to get to. You essentially would just clone the dashboard. And then run a kubectl proxy command that actually runs the dashboard. So, that was very easy. It's actually even easier now, with GCP. As I'll show you now from the portal. We're back over in the GCP portal. We wanna take a look at our container clusters. And, as we just talked about a second ago, I wanna bring up the Kubernetes Dashboard associated with one of these. I talked about how I might do that before, but Google's made that even easier for us now. Giving us some kind of tailored instructions on how to do that.

So I'm gonna click into one of our clusters. And once we get there, we're gonna have a link that says, "Connect to this cluster." And it's going to give us two command line operations that we can perform to get our Kubernetes Dashboard up and running. The first of these I don't have to do. I've already got the appropriate credentials, within Kubernetes, associated with this container cluster, gke-monitoring. So, that first step I can skip. And now all I have to do is I'll copy this kubectl proxy command. We'll go over to our terminal window and execute that command. And what this is gonna do, is this is gonna spin up the portal for me, locally on my machine. On localhost, port 8001. And it's gonna get information for me for the specific container cluster that's running in GCP.
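The same local proxy on localhost port 8001 also exposes the Kubernetes API, so the pod details the dashboard shows can be read from a script. The following is a minimal Python sketch, assuming `kubectl proxy` is already running on its default address; the helper names `fetch_pod_list` and `restart_counts` are made up for this illustration and are not part of any Google or Kubernetes tooling.

```python
import json
from urllib.request import urlopen

# Assumption: `kubectl proxy` is running locally on its default port.
PROXY_URL = "http://localhost:8001"


def fetch_pod_list(base_url: str = PROXY_URL) -> dict:
    """Fetch the PodList for all namespaces through the local kubectl proxy."""
    with urlopen(f"{base_url}/api/v1/pods") as resp:
        return json.load(resp)


def restart_counts(pod_list: dict) -> dict:
    """Map each pod name to the total restart count across its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        # containerStatuses can be missing for pods that have not started yet
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[name] = sum(s.get("restartCount", 0) for s in statuses)
    return counts
```

With the proxy up, calling `restart_counts(fetch_pod_list())` would return a dictionary of pod names and restart totals, which is roughly the per-pod restart information the dashboard and Stackdriver surface in their UIs.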

So, Google has also made getting to this quite easy for me. So all I have to do is click on this link. And it's gonna bring up that Kubernetes Dashboard. And as I mentioned before, this is gonna be a lot of the same information that you can get in Stackdriver. As well as some of the functionality that you might get within GCP, within the portal and the command lines. So you'll see things like CPU usage for your entire container cluster, memory utilization. You can also look at the deployments. You can look at your replica sets. You can look down at the pod level. You can also look at things like our ingress, which we created.

So if we look at our ingresses, you'll see our basic ingress that we created via the command line as well as within the portal. And then you'll see the endpoint associated with that ingress. There's just a ton of information when we go back. There's a ton of information that you can see. So definitely, spin this up and dig through it. I think, at the end of the day, you'll probably figure out that you'll use this occasionally. But you'll probably lean on Stackdriver more often than not. Especially as Stackdriver matures and more monitoring, and alerting, and reactive functionality is put into it. Especially given how integrated Stackdriver is with GCP, the portal, and the rest of the platform.

About the Author

Steve is a consulting technology leader for Slalom Atlanta, a Microsoft Regional Director, and a Google Certified Cloud Architect. His focus for the past 5+ years has been IT modernization and cloud adoption with implementations across Microsoft Azure, Google Cloud Platform, AWS, and numerous hybrid/private cloud platforms. Outside of work, Steve is an avid outdoorsman spending as much time as possible outside hiking, hunting, and fishing with his family of five.