The Google Cloud Operations suite (formerly Stackdriver) includes a wide variety of tools to help you monitor and debug your GCP-hosted applications. This course will give you hands-on demonstrations of how to use the Monitoring, Logging, Error Reporting, Trace, and Profiler components of the Cloud Operations suite. You can follow along with your own GCP account to try these examples yourself.
If you have any feedback relating to this course, feel free to reach out to us at email@example.com.
Learning Objectives
- Use the Cloud Operations suite to monitor, log, report on errors, trace, and profile
Intended Audience
- System administrators
- People who are preparing to take the Google Associate Cloud Engineer certification exam
Prerequisites
- Overview of Google Cloud Platform course or experience with Google Cloud Platform
Resources
- The GitHub repository for this course is at https://github.com/cloudacademy/google-cloud-ops.
After you’ve implemented your infrastructure on Google Cloud Platform, the first thing you’ll want to do is set up a monitoring system that’ll alert you when there are major problems. The easiest way to do this is to use Cloud Operations (formerly known as Stackdriver), which is Google’s powerful monitoring and logging tool. To get to it, select “Monitoring” from the menu or type “monitoring” in the search bar.
One of the first options you’ll see is to install an agent on a virtual machine. An agent collects more information about an instance, such as details about the third-party software running on it, but you don’t need an agent to use Monitoring, so we’ll leave that until later.
Suppose you want to monitor a web server and get notified if it goes down. First, you need to create an Uptime Check.
Since we want to check if a web server is up, leave the Protocol as HTTP and the Resource Type as URL.
For the hostname, I’m going to put in the IP address of an instance I have that’s running a web server. I want to check the base URL rather than a path on that web server, so I’m going to leave the path blank. For “Check Frequency”, we can leave it set to 1 minute. And click “Continue”.
The Response Timeout says how long to wait for a response from the web server when the uptime check runs. We can leave it at 10 seconds. If we want to check that specific content is returned in response to the request, we can enable content matching and specify what content we’re expecting. We’ll leave that disabled.
If we want every failure to be sent to Cloud Logging, then we can check this box. It’s checked by default, and we’ll leave it that way so all failures are logged.
Here’s where we specify which response codes the uptime check should accept. By default, it’ll accept any response code in the 200s. We’ll click “Continue” again.
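To make those settings concrete, here’s a minimal Python sketch of how a single check result could be judged against them. It’s only an illustration of the logic described above (a 10-second timeout, optional content matching, and 2xx response codes), not Google’s actual implementation. The function name `check_passes` and its parameters are hypothetical.

```python
from typing import Optional


def check_passes(
    status_code: Optional[int],
    body: str,
    elapsed_s: float,
    timeout_s: float = 10.0,
    expected_content: Optional[str] = None,
) -> bool:
    """Return True if one simulated uptime check counts as a success.

    status_code is None when the server never responded.
    """
    # No response, or a response slower than the timeout, is a failure.
    if status_code is None or elapsed_s > timeout_s:
        return False
    # By default, only response codes in the 200s are acceptable.
    if not 200 <= status_code <= 299:
        return False
    # Content matching is optional. When enabled, the body must contain the text.
    if expected_content is not None and expected_content not in body:
        return False
    return True


# A fast 200 response passes. A 503, or no response at all, fails.
print(check_passes(200, "<h1>It works!</h1>", 0.2))
print(check_passes(503, "Service Unavailable", 0.2))
print(check_passes(None, "", 10.0))
```

With content matching left disabled, as in this demo, only the timeout and the response code decide the result.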
This is where we set up an alert for when the uptime check fails. You can change the name of the alert to something other than “Uptime failure” if you want, but let’s leave it. Next, we specify how long a failure has to last before it’ll trigger an alert. Let’s leave it set to 1 minute. We should also tell it how to send alert notifications. Click the dropdown menu, and then click “Manage Notification Channels”. You can be alerted by email, text message, or a variety of other options, such as Slack. We’ll get it to send an email when the web server’s down. Click “Add New”, and enter your email address and your name.
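The duration setting is easier to picture with a small example. The Python sketch below is a simplified model, assuming a single stream of check results. The real alert policy evaluates results from several locations, so treat this as an illustration only. The function name `alert_fires` is hypothetical.

```python
from typing import List, Tuple


def alert_fires(results: List[Tuple[int, bool]], duration_s: int = 60) -> bool:
    """Decide whether a failure lasted long enough to trigger an alert.

    results holds (timestamp in seconds, check passed?) pairs, sorted by time.
    """
    failing_since = None  # timestamp of the first failure in the current streak
    for timestamp, passed in results:
        if passed:
            # Any successful check ends the current failure streak.
            failing_since = None
            continue
        if failing_since is None:
            failing_since = timestamp
        # Fire once the streak has lasted at least duration_s seconds.
        if timestamp - failing_since >= duration_s:
            return True
    return False


# One check per minute: two failures in a row span the full 1-minute duration.
print(alert_fires([(0, True), (60, False), (120, False)]))
# A single failed check followed by a success never reaches the duration.
print(alert_fires([(0, False), (60, True), (120, False)]))
```

A longer duration means fewer false alarms from brief blips, at the cost of a slower notification when the server really is down.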
Okay, now we can close this browser tab and go back to the previous one. If your new notification channel doesn’t show up here, click Refresh. Select your email, and click “OK”. And click “Continue”.
Let’s give the uptime check a title of “Example”. Now click the “Test” button. Since the web server at that address is up, it came back right away.
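If you’d rather define the same uptime check outside the console, the choices we just made map roughly onto an uptime check configuration in the Cloud Monitoring API. Here’s a hedged Python sketch that only builds the request body as a dictionary and prints it. The field names reflect my understanding of the API’s `UptimeCheckConfig` resource, so check them against the API reference before relying on them. The helper name `build_uptime_check_body` is hypothetical, and the IP address is a placeholder from a documentation-only range.

```python
import json


def build_uptime_check_body(host: str, display_name: str = "Example") -> dict:
    """Assemble a request body mirroring the settings chosen in the console."""
    return {
        "displayName": display_name,
        # Resource Type "URL" in the console.
        "monitoredResource": {"type": "uptime_url", "labels": {"host": host}},
        "httpCheck": {
            "requestMethod": "GET",
            "useSsl": False,  # Protocol HTTP rather than HTTPS
            "path": "/",      # base URL, since we left the path blank
            "port": 80,
            # Accept any response code in the 200s.
            "acceptedResponseStatusCodes": [{"statusClass": "STATUS_CLASS_2XX"}],
        },
        "period": "60s",   # Check Frequency of 1 minute
        "timeout": "10s",  # Response Timeout of 10 seconds
    }


# 203.0.113.10 is a placeholder address, not the instance used in this demo.
body = build_uptime_check_body("203.0.113.10")
print(json.dumps(body, indent=2))
```

This sketch doesn’t call the API. It only shows how the wizard’s settings line up with a configuration you could keep in version control.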
Now I’m going to stop Apache on the instance that’s running the web server and test it again. This time, the connection failed, as expected. Click the Create button.
This is where we see the results of the uptime check. It’ll take a while before it runs for the first time, so don’t worry if you don’t see anything in the dashboard right away. I’ll skip ahead to when the uptime check has run. OK, now you can see it’s showing that the web server’s down. After a little while, it’ll send a notification email. Here’s what it looks like.
Now I’ll start Apache up again and see if the alert policy sees it. I’ll skip ahead a couple of minutes. Yes, it sees that the web server is up now.
To see a graph of the uptime data, click on the uptime check. This percent uptime is only for the last hour, which is why it’s so low even though the web server wasn’t down for very long.
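The arithmetic behind that number is simple. Here’s a small Python sketch with made-up figures: a server that was down for 6 minutes, measured first over one hour and then over 30 days. The function name `percent_uptime` and the sample data are assumptions for illustration.

```python
from typing import List


def percent_uptime(results: List[bool]) -> float:
    """Percentage of checks in the window that passed."""
    if not results:
        raise ValueError("no check results in the window")
    return 100.0 * sum(results) / len(results)


# Hypothetical data: one check per minute, 6 failed checks.
last_hour = [True] * 54 + [False] * 6
print(percent_uptime(last_hour))  # 90.0

# The same 6 failed checks spread over 30 days barely register.
minutes_in_30_days = 30 * 24 * 60
last_30_days = [True] * (minutes_in_30_days - 6) + [False] * 6
print(round(percent_uptime(last_30_days), 2))
```

The shorter the window, the more a brief outage drags the percentage down.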
This graph shows the number of uptime checks passed over time. When the server is up, the line is at 1. When it’s down, the line is at 0. If we hover over the line, we can see that there are actually 6 lines on top of each other. There’s one line for each location that’s running uptime checks against the server, although it only lists 5 of them in this information box.
The Uptime Check Latency graph indirectly shows when the server was down. It shows how long the uptime check took from each location. When the server was up, each check took a fraction of a second, with Singapore having the highest latency, as you’d expect considering that the web server is hosted in the US. When the web server was down, the graph shows a latency of 10 seconds for all locations. That’s because we set the timeout for the uptime check to 10 seconds. So, when an uptime check doesn’t get a response from the server, its latency is shown as 10 seconds, even though it didn’t receive a response at all.
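Here’s a tiny Python sketch of that behavior, assuming the plotted value is simply capped at the timeout when no response arrives. The location names other than Singapore and all of the timings are invented for illustration, and `plotted_latency` is a hypothetical name.

```python
from typing import Dict, Optional

TIMEOUT_S = 10.0  # the Response Timeout we left at its default


def plotted_latency(response_time_s: Optional[float], timeout_s: float = TIMEOUT_S) -> float:
    """Latency as it would appear on the graph for one check."""
    # A check with no response (None) is plotted at the timeout value.
    if response_time_s is None or response_time_s >= timeout_s:
        return timeout_s
    return response_time_s


# Invented timings in seconds. None means the check got no response.
server_up: Dict[str, Optional[float]] = {"Iowa": 0.03, "Singapore": 0.25}
server_down: Dict[str, Optional[float]] = {"Iowa": None, "Singapore": None}

for label, samples in (("up", server_up), ("down", server_down)):
    for location, seconds in samples.items():
        print(label, location, plotted_latency(seconds))
```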
Of course, you wouldn’t normally use this graph to see downtime events. You’d use it to check the website’s response time from different locations around the world, which would be helpful if some of your users were reporting slow performance.
If you want to see graphs of other data, such as when uptime alerts have been fired, then click on Dashboards. It provides default dashboards for many Google services. To create your own, click Create Dashboard. I’ll call it “Example Dashboard”.
To create a graph showing when uptime check alerts have been fired, first drag the alert chart icon over to the canvas. Then select the alert policy from the dropdown menu over here. Now click the “Close Editor” button, and that’s it. The graph shows the alert that fired when the web server was down.
Now suppose you wanted to monitor something completely different, such as the CPU utilization of a virtual machine instance. Click “Edit Dashboard”. This time, drag the line chart icon over to the canvas. By default, it creates a graph showing the CPU utilization of all of your VM instances. To only show data for one virtual machine, click “Add Filter”, select “instance_name” for the label, leave the comparison field set to “equals”, and set the value to the name of the VM you want to monitor. I’ll select “instance-1”, which is the VM with the web server I was running the uptime check on. Click “Done”. Now the graph is only showing data for that instance.
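Behind the scenes, a chart like this is driven by a Monitoring filter. Here’s a short Python sketch that builds what I’d expect the equivalent filter string to look like for the Compute Engine CPU utilization metric. The helper name `cpu_filter` is hypothetical, and you should verify the exact metric and label names in the Metrics Explorer before using the filter elsewhere.

```python
def cpu_filter(instance_name: str) -> str:
    """Build a Monitoring filter for one VM's CPU utilization."""
    clauses = [
        # The default metric the line chart shows for VM instances.
        'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        'resource.type = "gce_instance"',
        # Same as choosing the instance_name label with "equals" in the editor.
        f'metric.labels.instance_name = "{instance_name}"',
    ]
    return " AND ".join(clauses)


print(cpu_filter("instance-1"))
```

Dropping the last clause gives you the unfiltered chart again, with one line per VM.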
Okay, now suppose you wanted a graph showing memory utilization on a VM. Click the button under “Resource & Metric”. First, we select the resource type. “VM Instance” is already selected, which is what we want. Then we choose the metric category. This is where we should find the memory category, but it isn’t here. Why not? Because I haven’t installed an agent on the VM.
If we go back to the Overview page, we can set up the agent. Click on the VM, and click the “Install Ops Agent” button.
It tells us what the agent will do for us. We’ll get additional performance metrics and system logs, plus currently running processes on the instance. The agent also supports integrations with popular software products, such as Postgres and MongoDB. To install the agent, we click “Run in Cloud Shell”. It added the command for us, so we just need to hit Enter. And click “Authorize”. I’ll fast-forward to give it time to install and start sending data to Cloud Monitoring.
Okay, now if we go back to the dashboard and edit it, we can try this again. Drag the line chart icon over. Click the button under “Resource & Metric”. And now notice that the list of metric categories is longer than before, and it includes 2 memory metrics. Click “Memory” and then “Memory utilization”. Click “Apply”.
All right, there’s the memory graph. If you’re wondering why it has so many lines, it’s because it breaks down the memory utilization to show free memory, cached memory, etc. It’s only showing data for instance-1 because that’s the only instance with the Ops Agent installed.
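To see why one instance produces several lines, here’s a small Python sketch with invented numbers. It assumes the agent reports memory utilization as a separate series per memory state and that the states together account for all of the memory. The state names are the ones I’d expect from the Ops Agent, so treat them as an assumption.

```python
# Invented sample: percentage of memory in each state on instance-1.
samples = {
    "used": 21.5,
    "cached": 30.0,
    "buffered": 3.5,
    "slab": 5.0,
    "free": 40.0,
}

# Each state would be drawn as its own line on the chart.
for state, pct in sorted(samples.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state:<9}{pct:5.1f}%")

# Together, the states should add up to all of the instance's memory.
total = sum(samples.values())
print("total", total)
```

If you only care about one state, such as used memory, you can add a filter on the state label the same way we filtered on the instance name earlier.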
And that’s it for monitoring.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).