Once you have implemented your application infrastructure on Google Cloud Platform, you will need to maintain it. Although you can set up Google Cloud to automate many operations tasks, you will still need to monitor, test, manage, and troubleshoot it over time to make sure your systems are running properly.
This course will walk you through these maintenance tasks and give you hands-on demonstrations of how to perform them. You can follow along with your own GCP account to try these examples yourself.
- Use Stackdriver to monitor, log, report on errors, trace, and debug
- Ensure your infrastructure can handle higher loads, failures, and cyber-attacks by performing load, resilience, and penetration tests
- Manage your data using lifecycle management and migration from outside sources
- Troubleshoot SSH errors, instance startup failures, and network traffic dropping
Looking at real-time monitoring is great, but there will be many times when you'll want to look at what happened in the past. In other words, you need logs.
For example, suppose I wanted to see when a VM instance was shut down, such as when I shut down my web server instance in the Monitoring lesson. Compute Engine, like almost every other Google Cloud Platform service, writes to the Cloud Audit Logs in Stackdriver. These logs keep track of who did what, where, and when.
There are two ways to get to the Logging page. If you're already on the Stackdriver Monitoring page, then you can select Logging from there. Notice that it takes you back to the Google Cloud console. Clearly, Stackdriver is not yet seamlessly integrated into Google Cloud: some parts of it, like Logging, are in the regular console, while others, like Monitoring, are on a separate Stackdriver page. If you are in the regular Google Cloud Console, you can select Logging from there.
There are lots of options for filtering what you see here. You can look at the logs for your VM instances, firewall rules, projects, and many other components. You can even send logs from other cloud platforms like AWS to Stackdriver Logging. You just need to install the logging agent on any system that you want to get logs from.
This is a great way to centralize all of your logs. Not only does centralizing your logs make it easier to search for issues, but it can also help with security and compliance, because the logs aren't easy to edit from a compromised node.
In this case, we need to look at the VM instance logs. You can choose a specific instance or all instances. I only have one instance right now, called instance-1. Since we installed the logging agent on instance-1 in the last lesson, Stackdriver has already captured some log entries for it. I have also had other instances in the past, which is why there are all of these ID numbers in the menu. If I wanted to, I could look at the logs for instances that no longer exist.
Here you can choose which logs you want from the instance, such as the Apache access and error logs. I could set it to syslog since that's where the shutdown message will be, but I'll just leave it at All logs, because sometimes you might not know which log to look in.
You can also filter by log level, and for example, only look at critical entries. I'll leave it at Any log level.
Finally, you can jump to a particular date and time. I did the Monitoring lesson two days ago, so I could set the date to March 25th to see when I shut down the VM on that day, but bear in mind that it also takes the time into account. It will only show log entries before this time on that day, so if the shutdown happened after that time, then it won't show those log entries. To be safe, I'll just leave it with today's date and it will show me all of the log entries that happened before now.
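To make the date-and-time behavior concrete, here is a minimal sketch in Python. It does not call Stackdriver at all. The entries, the year, and the function name entries_up_to are made up for illustration. It only shows why a cutoff that is too early hides the entry you are looking for.

```python
from datetime import datetime

# Hypothetical entries for illustration only; the year is an assumption.
ENTRIES = [
    (datetime(2018, 3, 25, 9, 0), "instance-1 started"),
    (datetime(2018, 3, 25, 15, 30), "shutdown requested for instance-1"),
    (datetime(2018, 3, 27, 8, 0), "instance-1 started"),
]


def entries_up_to(entries, cutoff):
    """Return only the entries logged at or before the cutoff."""
    return [entry for entry in entries if entry[0] <= cutoff]


# Jumping to noon on March 25th hides the 15:30 shutdown entry.
early = entries_up_to(ENTRIES, datetime(2018, 3, 25, 12, 0))

# A later cutoff (standing in for "now") includes it.
late = entries_up_to(ENTRIES, datetime(2018, 3, 27, 12, 0))
```

With the noon cutoff only the first entry comes back, which is the same trap as picking a time on March 25th that is earlier than the shutdown.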
OK, now I'll search for any entries that contain the word shutdown. Now we can see when I shut down the VM on that day.
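The same search can also be written as a filter expression instead of using the menus. The sketch below just builds that expression as a string. The helper name build_log_filter is hypothetical, and I'm assuming the advanced filter syntax, where a colon means "contains". It does not send anything to Google Cloud.

```python
def build_log_filter(resource_type="gce_instance", text=None, min_severity=None):
    """Build a Logging filter string from the choices made in the menus.

    This is a hypothetical helper for illustration, not part of any
    Google library. It only assembles text.
    """
    clauses = ['resource.type="{}"'.format(resource_type)]
    if min_severity:
        # For example CRITICAL, to mirror the log level dropdown.
        clauses.append("severity>={}".format(min_severity))
    if text:
        # The colon operator matches entries whose text contains the value.
        clauses.append('textPayload:"{}"'.format(text))
    return " AND ".join(clauses)


# All VM instance logs, any log level, containing the word shutdown.
shutdown_filter = build_log_filter(text="shutdown")

# Only critical (or worse) VM instance entries.
critical_filter = build_log_filter(min_severity="CRITICAL")
```

A string like shutdown_filter could then be pasted into the advanced filter box, assuming your entries keep their message in textPayload as they do in this lesson.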
If you need to do really serious log analysis, then you can export the logs to BigQuery, which is Google's data warehouse and analytics service. Before you can do that, you need to have the right permissions to export the logs. If you are the project owner then, of course, you have permission. If you're not, then the Create Export button will be greyed out and you'll have to ask a project owner to give you the Logs Configuration Writer role.
First, click the Create Export button. You'll notice that it added some fields about something called a sink. A sink is a place where you want to send your data. Give your sink a name, such as example-sink. Under Sink Service, you have four options: BigQuery, Cloud Storage, Cloud Pub/Sub, or a Custom destination. We'll choose BigQuery. Under Sink Destination, you have to choose a BigQuery dataset to receive the logs. If you don't have one already, then click Create new BigQuery dataset. Give it a name, such as example_dataset. Note that I used an underscore instead of a dash because dashes are not allowed in BigQuery dataset names. Now click the Create Sink button.
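The naming rule is easy to check before you type a name into the form. This is a rough sketch, assuming the documented rule that dataset names may contain only letters, numbers, and underscores, up to 1,024 characters. The function names and the project ID my-project are made up for the example, and the destination string follows the format I understand sinks to use for BigQuery.

```python
import re

# Assumed rule: letters, numbers, and underscores only, 1 to 1024 characters.
_DATASET_NAME = re.compile(r"[A-Za-z0-9_]{1,1024}")


def is_valid_dataset_name(name):
    """Return True if the name looks like a legal BigQuery dataset name."""
    return _DATASET_NAME.fullmatch(name) is not None


def bigquery_sink_destination(project_id, dataset):
    """Build the destination string for a sink that writes to BigQuery."""
    if not is_valid_dataset_name(dataset):
        raise ValueError("invalid dataset name: {!r}".format(dataset))
    return "bigquery.googleapis.com/projects/{}/datasets/{}".format(
        project_id, dataset
    )


# "my-project" is a placeholder project ID, not one from the lesson.
destination = bigquery_sink_destination("my-project", "example_dataset")
```

Calling is_valid_dataset_name("example-dataset") returns False, which is why the dataset in this lesson is named with an underscore while the sink itself can keep its dash.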
It says the sink was created, so let's jump over to BigQuery and see what's there. Hmmm. It created our example dataset, but it doesn't contain any tables, which means it doesn't have any data. That's weird, right? Well, it's because when you set up a sink, it only starts exporting log entries that were made after the sink was created.
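If that behavior seems surprising, this toy model may help. It is not how Stackdriver is implemented. The class ExampleSink and the timestamps are invented purely to show that a sink forwards entries as they arrive and never goes back for older ones.

```python
from datetime import datetime


class ExampleSink:
    """Toy stand-in for a log sink; hypothetical, for illustration only."""

    def __init__(self, created_at):
        self.created_at = created_at
        self.exported = []

    def receive(self, timestamp, message):
        # Only entries logged after the sink exists are exported.
        if timestamp >= self.created_at:
            self.exported.append(message)


# Assumed timestamps; the lesson does not give exact times.
sink = ExampleSink(created_at=datetime(2018, 3, 27, 10, 0))
sink.receive(datetime(2018, 3, 25, 15, 30), "shutdown (before sink existed)")
sink.receive(datetime(2018, 3, 27, 10, 5), "shutdown (after sink was created)")
```

Only the second message ends up in sink.exported, which matches the empty dataset we saw right after creating the sink.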
OK, then let's generate some more log entries and see if they get exported. Shutting down the VM will generate lots of log entries. Now if we go back to the Logging page and do a refresh, do we see the new messages? Yes, we do.
Now let's go back to BigQuery and see if the data is there. Yes, there's a table there now. Click on the table name. Now click the Query Table button. To do a search in BigQuery, you need to use SQL statements, so let's write a simple one just to verify that the log entries are there.
Thankfully, it already gave me the skeleton of a SQL statement. I just need to fill in what I'm selecting. I'll put in an asterisk to select everything, but I'll restrict it by using a where clause with the column name textPayload, which is the column that contains the text in the log entry. Then I put in LIKE '%shutdown%'. The percent signs are wildcards, so this SQL statement says to find any log entries that have the word shutdown in them somewhere.
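If you want to try the shape of this query without a BigQuery table, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. The table name logs and the sample rows are made up. Only the textPayload column and the LIKE pattern come from the lesson. One difference to be aware of is that SQLite's LIKE ignores case for ASCII letters by default, while BigQuery's LIKE is case-sensitive as far as I know.

```python
import sqlite3

# In-memory database standing in for the exported BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (timestamp TEXT, textPayload TEXT)")

# Hypothetical log entries for illustration only.
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        ("2018-03-27T10:05:00Z", "systemd: Starting shutdown sequence"),
        ("2018-03-27T10:05:01Z", "apache2: caught SIGTERM, stopping"),
        ("2018-03-27T10:05:02Z", "System shutdown complete"),
    ],
)

# Same idea as the lesson's query: select everything, keep only the rows
# whose textPayload contains the word shutdown somewhere.
rows = conn.execute(
    "SELECT * FROM logs WHERE textPayload LIKE '%shutdown%'"
).fetchall()

conn.close()
```

Here rows holds the two entries that mention shutdown and skips the Apache one, just as the real query skips entries without that word.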
Now we click the Run Query button and it returns four log entries. If we scroll to the right, then we can see the textPayload field and it does indeed contain the word shutdown in each of the entries.
Of course, we did exactly the same search on the Logging page and it was way easier, so why would we want to go through all of this hassle of exporting to BigQuery and writing SQL statements? Well, because sometimes you may need to search through a huge number of log entries and need to do complicated queries. BigQuery is lightning fast when searching through big data, and if you build a complex infrastructure in Google Cloud Platform, then the volume of log data it will generate will easily qualify as big data.
Since we don't want our example sink to keep exporting logs to BigQuery and incurring storage charges, let's delete what we've created. On the Logging page, click on Exports in the left-hand menu, then select the sink and delete it. We should also delete the BigQuery dataset, so go back to the BigQuery page. It's not obvious how to delete a dataset, but if you put your mouse pointer over the dataset name, then there will be a dropdown arrow on the right and its menu contains a delete option. It wants you to be sure that you actually want to delete the dataset, so you have to type the dataset name before it will delete it.
One concern that you or your company may have is how to ensure the integrity of your logs. Many hackers try to cover their tracks by modifying or deleting log entries. There are a number of steps you can take to make it more difficult to do that.
First, apply the principle of least privilege. That is, give users the lowest level of privilege they need to perform their tasks. In this case, only give the owner role for projects and log buckets to people who absolutely need it.
Second, track changes by implementing object versioning on the log buckets. The Cloud Storage service automatically encrypts all data before it is written to the log buckets, but you can increase security by forcing a new version to be saved whenever an object in a log bucket is changed. Unfortunately, this won't prevent an owner from deleting an incriminating object, which is why you need to keep tight control on which users are given the owner role.
Third, you could add more protection by requiring two people to inspect the logs. You could copy the logs to another project with a different owner using either a cron job or the Cloud Storage Transfer Service. Of course, this still won't prevent an owner in the first project from deleting the original bucket before the copy occurs or from disabling the original logging.
So the bottom line is that a person with the owner role can get around just about anything you put in place, but you can make it nearly impossible for someone without the owner role to change the logs without you knowing about it.
That's it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).