Start course
1h 13m

Once you have implemented your application infrastructure on Google Cloud Platform, you will need to maintain it. Although you can set up Google Cloud to automate many operations tasks, you will still need to monitor, test, manage, and troubleshoot it over time to make sure your systems are running properly.

This course will walk you through these maintenance tasks and give you hands-on demonstrations of how to perform them. You can follow along with your own GCP account to try these examples yourself.

Learning Objectives

  • Use the Cloud Operations suite to monitor, log, report on errors, trace, and debug
  • Ensure your infrastructure can handle higher loads, failures, and cyber-attacks by performing load, resilience, and penetration tests
  • Manage your data using lifecycle management and migration from outside sources
  • Troubleshoot SSH errors, instance startup failures, and network traffic dropping

Intended Audience

  • System administrators
  • People who are preparing to take the Google Professional Cloud Architect certification exam






Looking at real-time monitoring is great, but there will be many times when you’ll want to look at what happened in the past. In other words, you need logs.

For example, suppose I wanted to see when a VM instance was shut down. Compute Engine, like almost every other Google Cloud Platform service, writes to the Cloud Audit Logs. These logs keep track of who did what, where, and when.

There are three types of audit logs. Admin Activity logs track any actions that modify a resource. This includes everything from shutting down VMs to modifying permissions.

System Event logs track Google’s actions on Compute Engine resources. Some examples are maintenance of the underlying host and reclaiming a preemptible instance.

Data Access logs are pretty self-explanatory. They track data requests. Note that this also includes read requests on configurations and metadata. Since these logs can grow very quickly, they’re disabled by default. One exception is BigQuery Data Access logs, which are not only enabled by default, but it’s not even possible to disable them. Fortunately, you won’t get charged for them, though.

In the console, select “Logging”.

There are lots of options for filtering what you see here. You can look at the logs for your VM instances, firewall rules, projects, and many other components. You can even send logs from other cloud platforms like AWS to here. You just need to install the logging agent on any system that you want to get logs from.

This is a great way to centralize all of your logs. Not only does centralizing your logs make it easier to search for issues, but it can also help with security and compliance, because the logs aren't easy to edit from a compromised node.

In this case, we need to look at the VM instance logs. You can choose a specific instance or all instances. I only have one instance right now, called instance-1. Since we installed the logging agent on instance-1 in the last lesson, there are already some log entries for it. 

Here you can choose which logs you want from the instance, such as the Apache access and error logs. I could set it to “syslog” since that’s where the shutdown message will be, but I’ll just leave it at “All logs” because sometimes you might not know which log to look in.

You can also filter by log level, and for example, only look at critical entries. I’ll leave it at “Any log level”.

Finally, you can change how far back it will look for log entries. I’ll change it to the last 24 hours.

OK, now I’ll search for any entries that contain the word “shutdown” so I can see if this instance was shut down in the last 24 hours.

If you need to do really serious log analysis, then you can export the logs to BigQuery, which is Google’s data warehouse and analytics service. Before you can do that, you need to have the right permissions to export the logs. If you are the project owner then, of course, you have permission. If you’re not, then the “Create Sink” button will be greyed out, and you’ll have to ask a project owner to give you the Logs Configuration Writer role.

First, click the “Create Sink” button. A sink is a place where you want to send your data. Give your sink a name, such as “example-sink”. Under “Sink Service”, you have quite a few options, such as BigQuery, Cloud Storage, or a Custom destination. We’ll choose BigQuery. 

Under “Sink Destination”, you have to choose a BigQuery dataset to receive the logs. If you don’t have one already, then click “Create new BigQuery dataset”. Give it a name, such as “example_dataset”. Note that I used an underscore instead of a dash because dashes are not allowed in BigQuery dataset names. Now click the “Create Sink” button.

It says the sink was created, so let’s jump over to BigQuery and see what’s there. Hmmm. It created our example dataset, but it doesn’t contain any tables, which means it doesn’t have any data. That’s weird, right? Well, it’s because when you set up a sink, it only starts exporting log entries that were made after the sink was created.

OK, then let’s generate some more log entries and see if they get exported. I’ll restart the VM, which will generate lots of log entries. Okay, I’ve restarted it. Now if we go back to the Logging page, do we see the new messages? Yes, we do.

Now let’s go back to BigQuery and see if the data’s there. Yes, there are two tables there now. Click on the syslog table. Now click the “Query Table” button. To do a search in BigQuery, you need to use SQL statements, so let’s write a simple one just to verify that the log entries are there.

Thankfully, it already gave me the skeleton of a SQL statement. I just need to fill in what I’m selecting. I’ll put in an asterisk to select everything, but I’ll restrict it by using a WHERE clause with the column name “textPayload” (which is the column that contains the text in the log entry)...”LIKE '%shutdown%'”. The percent signs are wildcards, so this SQL statement says to find any log entries that have the word “shutdown” in them somewhere.

Now we click the “Run” button...and it returns the matching log entries. If we scroll to the right, then we can see the textPayload field and it does indeed contain the word “shutdown” in each of the entries.

Of course, we did exactly the same search on the Logging page and it was way easier, so why would we want to go through all of this hassle of exporting to BigQuery and writing SQL statements? Well, because sometimes you may need to search through a huge number of log entries and need to do complicated queries. BigQuery is lightning fast when searching through big data, and if you build a complex infrastructure in Google Cloud Platform, then the volume of log data it will generate will easily qualify as big data.

Since we don’t want our example sink to keep exporting logs to BigQuery and incurring storage charges, let’s delete what we’ve created. On the Logging page, click on “Logs Router” in the left-hand menu, then select the sink and delete it.

We should also delete the BigQuery dataset, so go back to the BigQuery page, select the dataset, and click “Delete Dataset”. It wants you to be sure that you actually want to delete the dataset, so you have to type the dataset name before it will delete it.

One concern that you or your company may have is how to ensure the integrity of your logs. Many hackers try to cover their tracks by modifying or deleting log entries. There are a number of steps you can take to make it more difficult to do that.

First, apply the principle of least privilege. That is, give users the lowest level of privilege they need to perform their tasks. In this case, only give the owner role for projects and log buckets to people who absolutely need it.

Second, track changes by implementing object versioning on the log buckets. The Cloud Storage service automatically encrypts all data before it is written to the log buckets, but you can increase security by forcing a new version to be saved whenever an object in a log bucket is changed. Unfortunately, this won’t prevent an owner from deleting an incriminating object, which is why you need to keep tight control on which users are given the owner role.

Third, you could add more protection by requiring two people to inspect the logs. You could copy the logs to another project with a different owner using either a cron job or the Cloud Storage Transfer Service. Of course, this still won’t prevent an owner in the first project from deleting the original bucket before the copy occurs or from disabling the original logging.

So the bottom line is that a person with the owner role can get around just about anything you put in place, but you can make it nearly impossible for someone without the owner role to change the logs without you knowing about it.

And that’s it for this lesson.


Course Introduction - Monitoring - Error Reporting and Debugging - Tracing - Testing - Storage Management - Cloud SQL Configuration - Cloud CDN Configuration - Instance Startup Failures - SSH Errors - Network Traffic Dropping - Conclusion

About the Author
Learning Paths

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).