Monitoring & Alerting
This course shows you how to monitor your operations on GCP. It starts with monitoring dashboards. You'll learn what they are and how to create, list, view, and filter them. You'll also see how to create a custom dashboard right in the GCP console.
The course then moves on to monitoring and alerting, where you'll learn about SLI-based alerting policies and third-party integrations. You'll also learn about SLO monitoring and alerting, along with integrating GCP monitoring with products like Grafana. We’ll wrap things up by touching on SIEM tools that are used to analyze audit and flow logs.
This course contains a handful of demos that give you a practical look at how to apply these monitoring techniques on GCP. If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Create, list, view, and filter dashboards
- Configure notifications, including through third-party channels
- Learn about SLI- and SLO-based alerting and monitoring
- Integrate GCP operations monitoring with Grafana
- Analyze logs with SIEM tools
This course is intended for anyone who wishes to learn how to manage GCP Operations monitoring.
To get the most out of this course, you should already have some experience with Google Cloud Platform.
Hello and welcome to SLO Alerting Policies.
In addition to alerting on SLIs, Cloud Monitoring can also trigger alerts when a monitored service is in danger of violating an SLO. To do this, you need to create an alerting policy that is based on the rate of consumption of your error budget. We talked about error budgets earlier.
When you create an alerting policy for an SLO, Anthos Service Mesh will automatically configure most of the conditions for the alert. It will base those conditions on the settings that are configured within the SLO itself. All you have to do is specify the lookback period and the consumption percentage.
The default lookback period of 60 minutes is a good starting point, but you will need some trial and error to get things configured properly. To determine the consumption percentage, monitor the behavior of the service you are interested in and see what percentage of the total error budget it consumed in the previous 60 minutes. The idea is to set the consumption percentage at a point where you don't burn more error budget in the lookback period than you can afford, while also making sure you don’t set off unnecessary alerts. It can be a fine line but, essentially, you want to generate alerts when they are necessary without generating a bunch of false positives.
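One way to take that measurement is sketched below in Python. It assumes you can count the requests that missed the SLO during the last lookback window and estimate the total number of requests expected over the compliance period. The function name and the sample figures are hypothetical and do not come from Cloud Monitoring.

```python
def budget_consumed_pct(bad_in_lookback, expected_total_in_period, slo_goal):
    """Percentage of the period's error budget used up during one lookback window."""
    if not 0 < slo_goal < 1:
        raise ValueError("slo_goal must be a fraction between 0 and 1")
    # Error budget expressed as a count of requests allowed to miss the SLO.
    budget_requests = (1 - slo_goal) * expected_total_in_period
    return 100 * bad_in_lookback / budget_requests


# Hypothetical figures: a 95% goal, 1,000,000 requests expected per week,
# and 300 slow requests observed in the previous 60 minutes.
observed = budget_consumed_pct(300, 1_000_000, 0.95)
print(f"{observed:.2f}% of the error budget consumed in the lookback window")
```

Running this a few times over typical hours gives you a feel for normal consumption, which is the baseline you would then set the alert above.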
To demonstrate, let’s take an example where we create an SLO that allows only 5% of total requests within a week to exceed a latency of 200 ms. This would mean that if we hit that 5% within a week, we consume our total error budget and violate our SLO.
So, what do we do to ensure we get the alerts we need, without generating false positives? Well, what we can do is set our lookback period to one hour. This means that each lookback period is 1/168th of our compliance period, because there are 168 hours in a week. With these values, we can calculate an hourly consumption percentage that won’t exceed our total error budget for the week:
5% ÷ 168 ≈ 0.03%
Now, we all know that latency can fluctuate. Because of this, if we set our consumption percentage to 0.03% based on our calculation, we may wind up generating unnecessary alerts. So, what we SHOULD do is start with a consumption value of twice that. In this case, we could set it to 0.06%. Once we’ve done that, we can monitor our service and adjust as needed.
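The even-pace arithmetic for this example can be checked with a few lines of Python. This is only a sketch, and the function name is made up for illustration. It reports the hourly share two ways: as a percentage of all requests in the week, and as a percentage of the error budget itself. It is worth confirming which of the two the alert condition expects before you enter a value.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in the one-week compliance period


def even_burn_shares(allowed_bad_pct, lookback_hours, period_hours=HOURS_PER_WEEK):
    """Return the per-lookback share of an evenly spent error budget.

    The first value is a percentage of all requests in the period,
    the second a percentage of the error budget itself.
    """
    windows = period_hours / lookback_hours  # number of lookback windows per period
    return allowed_bad_pct / windows, 100 / windows


of_requests, of_budget = even_burn_shares(allowed_bad_pct=5, lookback_hours=1)
print(f"Even pace: {of_requests:.3f}% of requests, {of_budget:.3f}% of budget per hour")
print(f"Doubled starting point: {2 * of_requests:.3f}% / {2 * of_budget:.3f}%")
```

Changing `lookback_hours` shows how a longer lookback raises the even-pace figure, since each window then covers a larger slice of the week.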
So, how do you actually create an alerting policy on an SLO? Well, it’s not terribly complicated.
You start by creating the SLO that you wish to alert on. The URL that you see on your screen provides some step-by-step instructions on creating an SLO.
Once you have the SLO created, you can use Anthos Service Mesh to create the alerting policy. You simply launch Anthos Service Mesh and select the SLO that you want to create an alerting policy for.
After selecting the SLO you want to alert on, you can create your policy by clicking on the “Create Alerting Policy” link.
You can then configure your SLO Burn Rate condition, which triggers the alert whenever the SLO error budget is being consumed too rapidly. You’ll want to name your condition and configure the Lookback Duration and Consumption Percentage.
Once you’ve configured your settings, you can provide a name for your alerting policy and configure any necessary triggers. You can also, optionally, configure your notifications and documentation, just like you do when configuring an SLI-based alerting policy.
When you’re all done, you can save the policy.
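If you would rather keep the policy in a file than click through the console, the same burn-rate condition can be sketched as a JSON document. The Python below only builds and prints that document; it does not call any API. The field names and the `select_slo_burn_rate` selector follow the Cloud Monitoring AlertPolicy format as we understand it, so check them against the current documentation before relying on them. The project, service, and SLO IDs are placeholders, and the helper function is hypothetical. Note that this format takes a burn rate rather than a consumption percentage. As we understand it, a burn rate of 1 means the budget is being spent at exactly the even pace, so a threshold of 2 matches the "twice the even pace" starting point from the example.

```python
import json


def slo_burn_rate_policy(slo_name, lookback, burn_rate_threshold, display_name):
    """Build an alerting policy document with a single SLO burn-rate condition."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"SLO burn rate over {lookback}",
                "conditionThreshold": {
                    # Time-series selector for the burn rate of one SLO.
                    "filter": f'select_slo_burn_rate("{slo_name}", "{lookback}")',
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": burn_rate_threshold,
                    "duration": "0s",
                },
            }
        ],
    }


# Placeholder resource name; substitute your own project, service, and SLO IDs.
SLO_NAME = "projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives/SLO_ID"

policy = slo_burn_rate_policy(SLO_NAME, "60m", 2, "Latency SLO burn rate")
print(json.dumps(policy, indent=2))
```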
For a detailed, step-by-step tutorial, visit the URL that you see on your screen.
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.