Troubleshooting Databases on GCP
In this lesson you will learn how to diagnose database issues using Google’s Cloud Monitoring and Cloud Logging services.
Learning Objectives
- Manage and minimize your system downtime
- Optimize the performance of your Google databases
Intended Audience
- Database administrators
- Database engineers
- Cloud architects
- Anyone preparing for a Google Cloud certification
Prerequisites
- Some experience working with databases
- Access to a GCP account
Now, monitoring and alerting are both automatically available for all Google-managed services like Cloud SQL or Firestore. However, there may be times when you need to monitor an unmanaged service. For example, what if you are running MySQL on a Compute Engine VM? Well, in that case, you can install the Ops Agent. The Ops Agent will collect telemetry data from your VMs and send it to Cloud Monitoring. This way, you can still use Cloud Monitoring even for a database that Google does not manage. Let me quickly run you through how to do that.
There are two main ways to install the Ops Agent. First, you can use the web console. Just start by clicking on “Setup Agents” here. This page will then list all the VMs that are currently running in Compute Engine. So now I just have to select the appropriate VMs, and then click on “Install/Update Agents”. And that’s it. Google will then install the agent, and after a little while, your metrics will start to show up in Cloud Monitoring. This is the simplest way to install the Ops Agent, and it also makes it very easy to install the agent on many VMs at once.
The second way is via manual installation. You can find the complete set of instructions at this URL. So let me scroll down. I will expand this section here. And then I can choose my preferred method. Let me show you how to install the agent on a single virtual machine. All I need to do is copy these commands, and then paste them into the appropriate terminal. Here are the commands for a Linux machine. And then these are the commands for a Windows machine. You can either run these commands manually, or you could include them in a custom startup script. Whatever you prefer.
So now you should understand how to monitor your databases on Google Cloud Platform. The only question left is: What metrics should I be monitoring? Now that is a great question. Unfortunately, there is no simple answer.
Typically, you want your dashboards to track your SLIs. And then you want your alerts to notify you when an SLI exceeds an SLO or an SLA. In case you are not already familiar with these terms, I am going to give you a quick explanation.
A Service Level Agreement (or SLA) is a guarantee you make to your customers. It is basically a commitment to deliver a certain level of service. For example, you might tell your customers that your service is guaranteed to have 99.5% uptime. Or you might have an API that is guaranteed to respond to any request within 500ms.
An SLA is a promise. And if you ever break this promise, you generally offer some form of compensation. So if your service goes offline for a long period of time, or it takes too long to respond, your customers might earn credits towards a future bill.
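To make the 99.5% uptime example a bit more concrete, here is a minimal sketch of the arithmetic behind it. The function name and the 30-day month are assumptions chosen for illustration; a real SLA defines its own measurement window.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target still permits.

    Assumes a simple window of `days` full days (30 is an illustrative default).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# A 99.5% uptime guarantee over a 30-day window leaves about 216 minutes
# (3.6 hours) of downtime before the promise is broken.
budget = allowed_downtime_minutes(99.5)
print(f"{budget:.1f} minutes")
```

Seen this way, an uptime percentage is just a downtime budget, which is often an easier number to reason about.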
A Service Level Objective (or SLO) is an internal goal. For every SLA, you want a corresponding SLO that is stricter than the SLA. Your SLOs define a “safe zone”. For a metric where lower is better, such as downtime or response time, everything is fine as long as your current performance stays below the SLO. Once you exceed an SLO, your service is in danger of violating an SLA, and your team needs to take immediate action to improve performance.
A Service Level Indicator (or SLI) is an actual measurement of how your service is performing. So this could be the total minutes of downtime per month. Or it could be the time it takes a query to complete and return a result.
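As a rough illustration of turning raw measurements into an SLI, the sketch below computes a latency percentile from a list of query durations. The sample values and the `percentile` helper are hypothetical. In practice, Cloud Monitoring would aggregate these numbers for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: rank is 1-based, so subtract 1 when indexing.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical query durations in milliseconds.
query_ms = [120, 95, 310, 180, 640, 150, 210, 99, 175, 230]

p90 = percentile(query_ms, 90)
print(p90)  # 310
```

A percentile is often a more useful SLI than an average, because a single very slow query (the 640ms sample here) can hide behind an otherwise healthy mean.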
So, in summary:
SLIs represent your current performance.
- When an SLI is lower than the corresponding SLO, then you are in the “green zone”. That means your service is performing as expected.
- When an SLI is higher than the SLO but lower than the SLA, you are in the “yellow zone”. That means you have a problem and you need to fix it before it gets worse.
- When an SLI is greater than your SLA, then you are in the “red zone”. At this point, you need to notify your customers about the problem and offer them some compensation.
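The three zones can be sketched as a small function. This is only an illustration, and it assumes a metric where lower is better, such as response time or downtime. The function name, the 400ms SLO, and the decision to count a value exactly on a threshold as still inside the safer zone are all assumptions made for the example.

```python
def classify_zone(sli: float, slo: float, sla: float) -> str:
    """Classify an SLI against its SLO and SLA for a lower-is-better metric."""
    if slo > sla:
        # The internal goal should be at least as strict as the customer promise.
        raise ValueError("SLO must not be looser than the SLA")
    if sli <= slo:
        return "green"   # performing as expected
    if sli <= sla:
        return "yellow"  # past the internal goal, SLA not yet violated
    return "red"         # SLA violated


# Hypothetical API latency thresholds: 400ms SLO (assumed), 500ms SLA.
for latency_ms in (350, 450, 600):
    print(latency_ms, classify_zone(latency_ms, slo=400, sla=500))
```

For a metric where higher is better, such as uptime percentage, the comparisons would need to be reversed.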
So, you need to identify what metrics you care about. Then you can begin to measure those metrics (called SLIs) using your monitoring dashboards. You also need to define appropriate SLOs and SLAs. Then use alert policies to be notified when you exceed one of them.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.