SLO Monitoring & Alerting
Difficulty
Intermediate
Duration
44m
Description

This course shows you how to monitor your operations on GCP. It starts with monitoring dashboards. You'll learn what they are and how to create, list, view, and filter them. You'll also see how to create a custom dashboard right in the GCP console.

The course then moves on to monitoring and alerting, where you'll learn about SLI-based alerting policies and third-party integrations. You'll also learn about SLO monitoring and alerting, along with integrating GCP monitoring with products like Grafana. We’ll wrap things up by touching on SIEM tools that are used to analyze audit and flow logs.

This course contains a handful of demos that give you a practical look at how to apply these monitoring techniques on the GCP platform. If you have any feedback relating to this course, feel free to reach out to us at support@cloudacademy.com.

Learning Objectives

  • Create, list, view, and filter dashboards
  • Configure notifications, including through third-party channels
  • Learn about SLI- and SLO-based alerting and monitoring
  • Integrate GCP operations monitoring with Grafana
  • Analyze logs with SIEM tools

Intended Audience

This course is intended for anyone who wishes to learn how to manage GCP Operations monitoring.

Prerequisites

To get the most out of this course, you should already have some experience with Google Cloud Platform.

Transcript

Hello, and welcome to SLO Monitoring and Alerting. In this lesson, we are going to look at some key concepts that you need to understand in order to effectively monitor SLOs and to alert on them.

Let’s start by defining what an SLO is. An SLO, or service level objective, is a target value for an SLI (service level indicator) that’s measured over a period of time. While the service itself determines which SLIs are available, you are the one who specifies the SLOs based on those SLIs. You use the SLO to define what qualifies as good service. Cloud Monitoring allows you to create up to 500 SLOs for each service.

When you build an SLO, you need to define an SLI, a performance goal, and a time period, which is usually referred to as the “compliance period”. The defined SLI measures the performance of a specific service, while the performance goal specifies the desired level of performance you want to achieve. The time period, or compliance period, is used to measure how your defined SLI compares to the performance goal.

For example, you might have a requirement that latency can exceed 200 ms in no more than 3% of requests over the last 30 days, while maintaining 99% availability over those same 30 days.

Requirements like these are what you would express as SLOs.
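As a rough sketch, the two requirements above can be written down as data. The `Slo` class and its field names are hypothetical and used only for illustration; they are not part of Cloud Monitoring. The sketch also assumes that a period with no traffic counts as compliant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container: an SLI description, a goal, and a compliance period."""
    sli: str              # what is measured, in plain words
    goal: float           # target fraction of good events, e.g. 0.97
    compliance_days: int  # length of the compliance period in days

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True if the measured fraction of good events reaches the goal."""
        if total_events == 0:
            return True  # assumption: no traffic counts as compliant
        return good_events / total_events >= self.goal


# Latency may exceed 200 ms in no more than 3% of requests over 30 days.
latency_slo = Slo(sli="requests served in 200 ms or less", goal=0.97, compliance_days=30)

# 99% availability over the same 30 days.
availability_slo = Slo(sli="requests served successfully", goal=0.99, compliance_days=30)
```

For instance, `latency_slo.is_met(9_800, 10_000)` returns `True`, while `availability_slo.is_met(9_800, 10_000)` returns `False`, because 98% good requests satisfies the 97% goal but not the 99% goal.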

If your application, system, or environment starts missing its SLOs, it can mean that something is wrong. That being the case, it’s important to monitor for these changes so you have a chance to address the underlying cause before a small problem becomes a big problem. Implementing alerting policies in connection with your monitoring ensures you are notified when you start missing SLOs.

I should mention that SLOs are most useful when they are set at a value below 100%. This is because the SLO that you define determines your error budget, and a 100% target leaves no budget at all. You know those terms you hear about? 4 nines, 3 nines, and the like? This is how SLOs are usually described. Although the highest value you can set for an SLO is 99.9%, you can use any value below that if it is appropriate for your specific service.
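To get a feel for what the "nines" mean in practice, here is a minimal sketch that converts a number of nines into an SLO goal and then into the amount of time a service could be bad in a compliance period. It assumes a time-based view in which every minute counts equally, which is a simplification and not the only way to measure an SLO. The function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def nines_to_goal(nines: int) -> float:
    """Convert a count of nines to a goal fraction, e.g. 3 nines -> about 0.999."""
    return 1 - 10 ** -nines


def allowed_bad_minutes(goal: float, compliance_days: int) -> float:
    """Minutes that may be bad in the period, assuming every minute counts equally."""
    return (1 - goal) * compliance_days * MINUTES_PER_DAY


# Compare 2 nines and 3 nines over a 30-day compliance period.
for nines in (2, 3):
    goal = nines_to_goal(nines)
    minutes = allowed_bad_minutes(goal, compliance_days=30)
    print(f"{nines} nines over 30 days: about {minutes:.1f} bad minutes allowed")
```

Under these assumptions, 2 nines (99%) leaves roughly 432 minutes over 30 days, while 3 nines (99.9%) leaves roughly 43 minutes.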

Speaking of error budgets, what exactly does the term “error budget” refer to? As I mentioned previously, the SLO refers to how a service needs to perform during a compliance period. The error budget is whatever is left over once the SLO goal is accounted for in the compliance period. It’s a number that essentially quantifies the degree to which a service can fail to perform during the compliance period and still meet the established SLO.

What an error budget does is allow you to track how many bad individual events are allowed during the remainder of the compliance period before the SLO is missed. As the error budget gets used up, you want to avoid risky actions like pushing new updates, because such actions could result in missing the SLO.

To calculate your error budget for a compliance period, use the formula that you see on your screen:

(1 − SLO goal) × (eligible events in compliance period)

So, using this formula, if you have an SLO for 85% of requests to be good in a rolling period of one week, your error budget allows 15% of these requests to be bad. 

Let’s assume you received 50,000 requests in the past week.

With these numbers, we can calculate the error budget by taking 15% (because the formula says 1 minus the SLO goal of 85%) and multiplying it by the 50,000 requests. This means that our error budget, the number of requests that can be bad, is 7,500 requests.

If more than 7,500 bad requests are served, the service would be out of SLO for the 1-week compliance period.
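The same calculation can be sketched in a few lines of Python. The function names are hypothetical, and the result is rounded to whole events because floating-point arithmetic would otherwise leave a tiny fractional remainder.

```python
def error_budget(slo_goal: float, eligible_events: int) -> int:
    """(1 - SLO goal) x (eligible events in compliance period), in whole events."""
    return round((1 - slo_goal) * eligible_events)


def remaining_budget(slo_goal: float, eligible_events: int, bad_events: int) -> int:
    """Bad events still allowed before the SLO is missed (negative means missed)."""
    return error_budget(slo_goal, eligible_events) - bad_events


# The example from this lesson: an 85% goal and 50,000 requests in one week.
budget = error_budget(0.85, 50_000)
print(budget)  # 7500

# If 2,000 bad requests have been served so far, 5,500 remain in the budget.
print(remaining_budget(0.85, 50_000, bad_events=2_000))  # 5500
```

A negative result from `remaining_budget` means that the service is out of SLO for the compliance period.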

By creating an alerting policy for your SLO, you can get notified when things go bad – and doing so isn’t terribly different from creating an alerting policy for any other metric.

To create an alerting policy for an SLO, start by identifying the SLO you want to create the alerting policy for. 

Once you’ve identified the SLO you want to alert on, define a condition for the alerting policy that uses the SLO you are interested in. When you define the condition, what you are doing is using the time-series selector to pull the data for the SLO. You also need to specify the threshold and duration of violations of the SLO before an alert is triggered. 

Once you’ve defined your condition, you need to decide on a notification channel that should be used in your alerting policy. You can use an existing channel or create a new one.

You should then create documentation that explains what triggers the alerting policy.

Lastly, you assemble these pieces into a single request that creates the alerting policy.
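The steps above can be sketched as a small Python helper that assembles the policy as JSON. This is only an illustration: the field names follow my understanding of the Cloud Monitoring alerting policy format and the `select_slo_burn_rate` time-series selector, so verify them against the current documentation before relying on them. The project, service, SLO, and notification channel names are placeholders, and the threshold, lookback, and duration values are arbitrary.

```python
import json


def slo_burn_rate_filter(slo_name: str, lookback: str) -> str:
    """Build a time-series selector that pulls the burn rate for one SLO."""
    return f'select_slo_burn_rate("{slo_name}", "{lookback}")'


def build_slo_alert_policy(slo_name, threshold, lookback, duration, channels, doc_text):
    """Assemble the condition, notification channels, and documentation into one policy."""
    condition = {
        "displayName": f"SLO burn rate exceeds {threshold}",
        "conditionThreshold": {
            "filter": slo_burn_rate_filter(slo_name, lookback),
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,  # threshold for a violation
            "duration": duration,         # how long the violation must last
        },
    }
    return {
        "displayName": f"SLO burn rate alert (threshold {threshold})",
        "combiner": "OR",
        "conditions": [condition],
        "notificationChannels": list(channels),
        "documentation": {"content": doc_text, "mimeType": "text/markdown"},
    }


# Placeholder resource names, not real resources.
policy = build_slo_alert_policy(
    slo_name="projects/my-project/services/my-service/serviceLevelObjectives/my-slo",
    threshold=10,
    lookback="60m",
    duration="0s",
    channels=["projects/my-project/notificationChannels/1234567890"],
    doc_text="Fires when the error budget burns at more than 10x the sustainable rate.",
)
print(json.dumps(policy, indent=2))
```

The printed JSON could then be saved to a file and submitted with whichever tool you use to create alerting policies, such as the Cloud Monitoring API or the gcloud CLI.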

About the Author

Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.

In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.

In his spare time, Tom enjoys camping, fishing, and playing poker.