Once you have implemented your application infrastructure on Google Cloud Platform, you will need to maintain it. Although you can set up Google Cloud to automate many operations tasks, you will still need to monitor, test, manage, and troubleshoot it over time to make sure your systems are running properly.
This course will walk you through these maintenance tasks and give you hands-on demonstrations of how to perform them. You can follow along with your own GCP account to try these examples yourself.
- Use Stackdriver to monitor, log, report on errors, trace, and debug
- Ensure your infrastructure can handle higher loads, failures, and cyber-attacks by performing load, resilience, and penetration tests
- Manage your data using lifecycle management and migration from outside sources
- Troubleshoot SSH errors, instance startup failures, and network traffic dropping
We've covered what to do when your instance won't boot at all, but what if it boots at least partway and you still can't connect to it using SSH?
When you can't connect with SSH, a feeling of helplessness can set in. But don't worry, there are quite a few things you can do to resolve the issue.
First, check your firewall rules. By default, your network will contain a firewall rule that allows SSH traffic, but you should make sure that the rule wasn't deleted or modified. The easiest way to check is to go into the networking section of the Google Cloud console and click on firewall rules. There should be a rule called default-allow-ssh that allows traffic on TCP port 22. The source should be all zeroes (0.0.0.0/0), meaning it will allow SSH requests from any IP address, and the target should say apply to all targets.
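If you'd rather check from the command line than the console, the same look-up can be sketched with the gcloud tool. This is only a sketch: it assumes gcloud is installed and pointed at your project, and the function name check_ssh_firewall_rule is made up for illustration.

```shell
# Show the SSH firewall rule for a network (defaults to "default").
# In the output, look for tcp:22 under "allowed" and 0.0.0.0/0 under
# "sourceRanges". An error here suggests the rule has been deleted.
check_ssh_firewall_rule() {
  network="${1:-default}"
  gcloud compute firewall-rules describe "${network}-allow-ssh"
}

# Usage: check_ssh_firewall_rule default
```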
If you don't have this rule, then you can easily create it. Click create firewall rule. You don't have to name it default-allow-ssh, but you probably should, just so it follows the convention of network name (that is, default), allow, and protocol (that is, SSH). Change the priority to 65534, and leave direction of traffic as ingress and action on match as allow. Change targets to all instances in the network, and leave the source filter as IP ranges. In the source IP ranges field, put 0.0.0.0/0, and put tcp:22 in the protocols and ports field. Now click the create button and that's it. Let's try connecting with SSH again and see if that fixed it. Great, it did!
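The same rule can also be created from the command line. Here's a rough equivalent of the console settings above, wrapped in a hypothetical helper function (create_ssh_firewall_rule is my name for it, not something from Google).

```shell
# Recreate the <network>-allow-ssh rule with the settings from the lesson:
# ingress, allow, priority 65534, tcp:22, from any source address.
create_ssh_firewall_rule() {
  network="${1:-default}"
  gcloud compute firewall-rules create "${network}-allow-ssh" \
    --network="$network" \
    --direction=INGRESS \
    --priority=65534 \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=0.0.0.0/0
}

# Usage: create_ssh_firewall_rule default
```

Leaving out a target option means the rule applies to all instances in the network, which matches the console choice.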
If you didn't have a problem with the firewall rules, but you still can't get in with SSH, then try connecting to port 22 manually and see if the SSH server responds. Use the nc command on the external IP address of your instance and then specify port 22. If you see the SSH banner, then you know that the network connection is okay, and the SSH server is running. If you don't see the banner, then it could be a problem with either the network connection or the SSH server.
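As a sketch, the nc check might look like this. The function name check_ssh_banner and the five-second timeout are my own additions, and nc options vary a little between versions, so treat it as a starting point.

```shell
# Connect to port 22 and print whatever the server sends first.
# A healthy SSH server answers with a banner line that starts with "SSH-".
# -w 5 sets a five-second timeout; press Ctrl-C if the session stays open.
check_ssh_banner() {
  nc -w 5 "$1" 22
}

# Usage: check_ssh_banner EXTERNAL_IP
```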
Next, try accessing the serial console, which I showed you in the previous lesson. Start off by looking at the serial port output to see if that will tell you why you can't connect using SSH. If that doesn't help, then try connecting to the serial port. However, there is a potential problem with this. Let me show you.
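The serial port output can also be pulled from the command line. A minimal sketch, assuming you know the instance's zone (the function name view_serial_output is just for illustration):

```shell
# Print the serial port output for an instance so you can scan it
# for boot or sshd errors.
view_serial_output() {
  gcloud compute instances get-serial-port-output "$1" --zone="$2"
}

# Usage: view_serial_output instance-1 us-central1-a
```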
First, I'll click on the connect to serial port button. It's giving me a login prompt. What user ID and password should I use? Well, it created a user account called guide to match my local user ID, but there's no way to know the password for that account. So why is it throwing up this roadblock? Well, as you saw in the last lesson, if this instance had not booted up past single-user mode, then it would've given me access without asking me to log in. But since in this case it has booted into multi-user mode, I need to log in.
Google's solution is to log in using SSH and set a password for the newly created user. Now I can log in with that user through the interactive serial port console.
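Roughly, those two steps could look like the following. The first function is meant to be run inside the instance over SSH and the second from your own machine; both function names are hypothetical, and I'm assuming interactive serial console access has already been enabled on the instance.

```shell
# Run inside the instance (over SSH): set a password for the current
# user, which the serial console login prompt will then accept.
set_my_password() {
  sudo passwd "$(whoami)"
}

# Run from your workstation: open the interactive serial console.
open_serial_console() {
  gcloud compute connect-to-serial-port "$1" --zone="$2"
}

# Usage: open_serial_console instance-1 us-central1-a
```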
But what was the point of that? If you can log in using SSH, then why do you need to go in through the serial port? Well, you probably don't.
Let's recap. If your instance hasn't booted past single-user mode, then you won't be able to log in using SSH, but you can access the instance through the serial port and it won't ask you for a password. If your instance has booted into multi-user mode, then you probably can access it using SSH, so you won't need serial port access. However, if the instance has booted up and you can't access it using SSH, then you won't be able to access it using the serial port either.
The next thing to check is if there's a problem with your account. Try connecting with another username like this. You can put in whatever username you want and the gcloud tool will update the project's metadata to add the new user and allow SSH access. I'm going to call it newuser.
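For reference, connecting as a different user might look something like this. The wrapper function ssh_as_user is my own; the user@instance form is what tells gcloud which username to use.

```shell
# SSH to an instance as a specific user instead of your local username.
ssh_as_user() {
  gcloud compute ssh "$1@$2" --zone="$3"
}

# Usage: ssh_as_user newuser instance-1 us-central1-a
```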
If that didn't work, and your instance boots from a persistent disk, which is the case by default, then you can detach the persistent disk and attach it to a new instance. Of course this will take down your existing instance, so if the instance is serving production users properly and you don't want to cause an outage, then skip this procedure and go to the next one.
First, delete the instance and be sure to use the keep disks option.
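A sketch of the delete step, assuming you only need to keep the boot disk (the function name is hypothetical, and --keep-disks also accepts all or data):

```shell
# Delete the instance but keep its boot disk so it can be reattached.
delete_instance_keep_disks() {
  gcloud compute instances delete "$1" --zone="$2" --keep-disks=boot
}

# Usage: delete_instance_keep_disks instance-1 us-central1-a
```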
Then create a new instance. I'm going to create it with the same name as the original one and attach the disk that we saved from the original instance. Also add the auto-delete=no option so the disk isn't automatically deleted when you delete this instance.
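Here's roughly what the create step could look like. I'm assuming the saved boot disk has the same name as the original instance, which is the usual default; the function name is made up for this example.

```shell
# Create a new instance that boots from an existing disk.
# auto-delete=no keeps the disk around if this instance is deleted later.
recreate_instance_from_disk() {
  instance="$1"; disk="$2"; zone="$3"
  gcloud compute instances create "$instance" --zone="$zone" \
    --disk="name=$disk,boot=yes,auto-delete=no"
}

# Usage: recreate_instance_from_disk instance-1 instance-1 us-central1-a
```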
Now SSH into the new instance.
That worked. If your instance is serving production users properly, and you don't want to cause an outage, then you can follow this procedure instead of the one I just did.
First create an isolated network that only allows SSH connections. This is because you're going to clone your production instance and you don't want the clone to interfere with your production services.
Now add a firewall rule to allow SSH connections to the network.
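These two steps might be sketched like this. The network name you pass in (debug-network in the usage line) and the function name are placeholders of mine, and I'm assuming an auto mode network is acceptable for debugging.

```shell
# Create an isolated network, then allow only SSH (tcp:22) into it.
create_debug_network() {
  network="$1"
  gcloud compute networks create "$network" --subnet-mode=auto &&
  gcloud compute firewall-rules create "${network}-allow-ssh" \
    --network="$network" --allow=tcp:22
}

# Usage: create_debug_network debug-network
```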
Then create a snapshot of the boot disk.
Now you can create a new disk with the snapshot you just created.
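As a sketch, the snapshot and the new disk could be created like this. The snapshot and disk names in the usage line are placeholders, and clone_boot_disk is a hypothetical helper.

```shell
# Snapshot the production boot disk, then build a new disk from the snapshot.
clone_boot_disk() {
  source_disk="$1"; snapshot="$2"; new_disk="$3"; zone="$4"
  gcloud compute disks snapshot "$source_disk" --zone="$zone" \
    --snapshot-names="$snapshot" &&
  gcloud compute disks create "$new_disk" --zone="$zone" \
    --source-snapshot="$snapshot"
}

# Usage: clone_boot_disk instance-1 debug-disk-snapshot debug-disk us-central1-a
```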
Then create a new instance in the new network.
Now attach the cloned disk.
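Creating the debugging instance and attaching the cloned disk might look something like this. The names debugger, debug-network, and debug-disk are placeholders, and the function is only a sketch.

```shell
# Create an instance in the isolated network, then attach the cloned disk
# to it as an additional (non-boot) disk.
create_debug_instance() {
  instance="$1"; network="$2"; disk="$3"; zone="$4"
  gcloud compute instances create "$instance" --zone="$zone" \
    --network="$network" &&
  gcloud compute instances attach-disk "$instance" --zone="$zone" \
    --disk="$disk"
}

# Usage: create_debug_instance debugger debug-network debug-disk us-central1-a
```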
By the way, Google recommends that you don't give this instance an external IP address, but that makes it much more difficult to connect to it. Considering that the instance is sitting in a network that blocks everything except SSH requests, it shouldn't cause any disruption to your production services, even though it has an external IP address.
Now SSH into this instance.
Although the cloned disk is attached to this instance, it isn't mounted anywhere, so you'll have to do that manually. First create a mount point. Then see what the device name of the disk is. The boot disk is disk zero, so the extra disk attached has to be disk one. Then mount the filesystem.
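On the debugging instance, the mount steps could be sketched as follows. The device path in the usage line is an assumption on my part, so list /dev/disk/by-id/ first and use whatever name your cloned disk actually shows up under.

```shell
# Create a mount point and mount the cloned disk's filesystem on it.
mount_cloned_disk() {
  device="$1"; mount_point="$2"
  sudo mkdir -p "$mount_point" &&
  sudo mount "$device" "$mount_point"
}

# Find the device name first:  ls -l /dev/disk/by-id/
# Usage (device path is an assumed example):
#   mount_cloned_disk /dev/disk/by-id/google-persistent-disk-1-part1 /mnt/myinstance
```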
Now you can finally try to debug why you can't connect to the original instance using SSH. You can look through the logs, for example.
That was a pretty complicated process, but that's the sort of thing you have to do if you don't want to disrupt your production service while you're debugging it.
If you still can't find the reason why your instance won't accept SSH connections, and you're able to restart the instance at some point, then you can use a startup script to gather information. If you're not sure what to put in the startup script, then you can use one provided by Google.
The script will run the next time the instance boots, so when you're ready you can run gcloud compute instances reset instance-1. The script sends its output to the serial port, so look there for the debugging info.
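Putting the startup script steps together might look like this. The script URL is passed in as a parameter because the location of Google's script isn't given here, so the gs:// path in the usage line is purely a placeholder, and the function name is hypothetical.

```shell
# Point the instance at a diagnostic startup script, then reset it so
# the script runs on the next boot.
run_diagnostic_startup_script() {
  instance="$1"; zone="$2"; script_url="$3"
  gcloud compute instances add-metadata "$instance" --zone="$zone" \
    --metadata=startup-script-url="$script_url" &&
  gcloud compute instances reset "$instance" --zone="$zone"
}

# Usage (placeholder URL):
#   run_diagnostic_startup_script instance-1 us-central1-a gs://example-bucket/diagnostics.sh
# Afterwards, read the results with:
#   gcloud compute instances get-serial-port-output instance-1 --zone=us-central1-a
```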
And that's it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).