1h 13m

Once you have implemented your application infrastructure on Google Cloud Platform, you will need to maintain it. Although you can set up Google Cloud to automate many operations tasks, you will still need to monitor, test, manage, and troubleshoot it over time to make sure your systems are running properly.

This course will walk you through these maintenance tasks and give you hands-on demonstrations of how to perform them. You can follow along with your own GCP account to try these examples yourself.

Learning Objectives

  • Use the Cloud Operations suite to monitor, log, report on errors, trace, and debug
  • Ensure your infrastructure can handle higher loads, failures, and cyber-attacks by performing load, resilience, and penetration tests
  • Manage your data using lifecycle management and migration from outside sources
  • Troubleshoot SSH errors, instance startup failures, and network traffic dropping

Intended Audience

  • System administrators
  • People who are preparing to take the Google Professional Cloud Architect certification exam






After you’ve implemented your infrastructure in Google Cloud Platform, the first thing you’ll want to do is set up a monitoring system that will alert you when there are major problems. The easiest way to do this is to use Cloud Operations (formerly known as Stackdriver), which is Google’s powerful monitoring, logging, and debugging tool.

To get to it, select “Monitoring” from the console menu. The first time you bring up Monitoring in a project, it’ll take a while to set up. I’ll fast-forward.

Here’s where you’ll find the instructions for installing the Monitoring Agent (and the Logging Agent). You don’t need to install the agent to be able to use Monitoring, but if you do install it, you can get more information about an instance, such as about the third-party software running on it. We don’t need to install it yet, so we’ll leave that until later.

Suppose you want to monitor a web server and get notified if it goes down. First, you need to create an Uptime Check. 

For the title, let’s call it “Example”. Click “Next”. Since we want to check if a web server is up, leave the Protocol as HTTP and the Resource Type as URL.

For the hostname, I’m going to put in the IP address of an instance I have that’s running a web server. Leave “Check Frequency” set to 1 minute. And click “Next”. We can leave the Response Validation with the defaults, so click “Next” again.

This is where we set up an alert for when the uptime check fails. First, we specify how long a failure has to last before it’ll trigger an alert. Let’s leave it set to 1 minute. We should also tell it how to send alert notifications. Click the dropdown menu, and then click “Manage Notification Channels”. You can be alerted by email, text message, or a variety of other options, such as Slack. We’ll get it to send an email when the web server is down. Click “Add New”, and enter your email address and your name.

Okay, now close this tab in your browser, and go back to the previous one. Click Refresh and select your email. Now click the “Test” button. Since the web server at that address is up, it came back right away.

Now I’m going to stop Apache on the instance that’s running the web server and test it again. This time, the connection failed, as expected. Click the Create button.

To see the results of the uptime check, go to the menu on the left and select Uptime Checks. It’ll take a while before it runs for the first time, so don’t worry if you don’t see anything in the dashboard right away. I’ll skip ahead to when the uptime check has run. OK, now you can see it’s showing that the web server’s down. After a little while, it’ll send a notification email. Here’s what it looks like.

Now I’ll start Apache up again and see if the alert policy sees it. I’ll just skip ahead a few minutes. Yes, it sees that the web server is up now.

If you want to see data graphically, then click on Dashboards. It provides default dashboards for many Google services. To create your own, click Create Dashboard. I’ll call it “Example Dashboard”.

Now click the “Add Chart” button. In this search field, type “URL”. There’s the resource type we need. It’s called “Uptime Check URL”. It gives us a few different metrics to choose from. The obvious one to choose is “Check passed”, but to make things more interesting, let’s choose “Request latency”. Click Save and the graph will be added to your dashboard.

This graph shows the network latency between each region and the web server. You can see  when the web server was down, but it gives us more information than that. This network latency data from different locations around the world can be quite helpful, especially if some of your users are reporting slow performance.

Note that you’ll need to refresh this page to see the latest data. You can turn on auto-refresh if you want.

Suppose you’d like to get more information about the instance where the web server is running, such as the CPU load. Let’s create another chart.In the search field, type “cpu load”. Let’s select the 1-minute version. For the resource type, select “VM instance”. Notice that you can even monitor Amazon EC2 instances.

You’ll notice that the chart is blank. That’s because we need to install the Monitoring Agent to get CPU data.

Go back to the Overview, and then click this link to bring up the installation instructions. The instructions are different depending on which Linux distribution you’re running on your instance. I’m running Debian, so I need to follow these instructions. I’ll fast-forward to the point after I’ve run all of the install commands.

While we’re here, let’s install the logging agent too. You can find a link to the documentation in the GitHub repository I created for this course. The link to the GitHub repository is at the bottom of the Overview tab below this video.

Installing the logging agent will prepare this instance for the next lesson when we use Cloud Logging. I’ll fast-forward again.

Now let’s go back and see what happened to our chart. Okay, now there’s a line showing the load average, so it worked.

If you’ve been following along using your own account, then you should go back and delete the monitoring you set up. First, go to the Alerting page from the menu, and then delete the policy. You have to do that before it will let you delete the uptime check. Okay, now go to “Uptime Checks” and delete the one you created. Finally, go to Dashboards, click on the one you created, and delete it.

That’s it for this lesson.

About the Author
Learning Paths

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).