Troubleshooting VM Networking Issues
In this course, we’ll go over some of the most common networking issues when connecting to or from a virtual machine on Google Cloud Platform. We’ll cover what to do when you get an SSH error or when you discover that network traffic between your VM instance and other systems is dropping.
- Learning objective: Troubleshoot the most common networking issues when connecting to or from a virtual machine on Google Cloud Platform
- Intended audience: Anyone who works with virtual machine instances on Google Cloud Platform
- Prerequisites: Basic experience using virtual machines on Google Cloud Platform (or go through one of our introductory GCP VM labs, such as Starting a Linux Virtual Machine on Google Compute Engine)
Welcome to “Troubleshooting Virtual Machine Networking Issues on GCP”. To get the most from this course, you should already have some basic experience using virtual machines on Google Cloud Platform. If you don’t have this experience, then you can go through one of our introductory GCP VM labs, such as “Starting a Linux Virtual Machine on Google Compute Engine”.
All right, let’s get started. One relatively common issue is when you can’t connect to a virtual machine instance using SSH.
When this happens, a feeling of helplessness can set in. But don’t worry, there are quite a few things you can do to resolve the issue.
Before we get into how to troubleshoot the “Connection failed” error you see here, let’s go over how to deal with the “Permission denied” error. This error can occur for obvious reasons, such as not using the right flags on your ssh command when you try to connect, but there are also some potential issues that are specific to Google Cloud Platform.
To understand why, you need to know the different ways you can configure SSH connections on GCP instances. First up is “OS Login”, which is the preferred method. It lets you control access by using IAM roles. The advantage of using IAM roles is that they’re the standard way to control access to all other resources on GCP, too. OS Login can also manage your public SSH keys for you, and it even supports two-factor authentication.
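As a rough illustration of what turning on OS Login can look like from the command line, here is a minimal sketch. The function name and its two arguments (a project ID and a user's email address) are made up for this example, and it assumes the gcloud CLI is installed and authenticated with enough permissions to change project metadata and IAM bindings.

```shell
# Sketch only: enable OS Login for a whole project and grant one user
# SSH access through IAM. enable_os_login is a hypothetical helper name.
enable_os_login() {
  project_id="$1"   # assumed: your project ID
  user_email="$2"   # assumed: the Google account that should get access

  # enable-oslogin=TRUE in project metadata turns OS Login on for
  # instances in the project that do not override the setting.
  gcloud compute project-info add-metadata \
    --project "$project_id" \
    --metadata enable-oslogin=TRUE &&

  # roles/compute.osLogin grants SSH access without administrator rights.
  gcloud projects add-iam-policy-binding "$project_id" \
    --member "user:$user_email" \
    --role roles/compute.osLogin
}
```

You would call it with something like `enable_os_login my-project alice@example.com`, where both values are placeholders.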
If you can’t use the OS Login method for some reason, then you can add public SSH keys to metadata. One easy way to do it is to add project-wide SSH keys by putting them in a project’s metadata. Any user who has an SSH key in a project’s metadata can connect to any instance in that project with the exception of instances that specifically block project-wide SSH keys. The other way is to add SSH keys to the metadata of each instance. This gives you more fine-grained control of who can access individual instances, but it takes a lot more effort to manage.
Okay, so how might these different configurations lead to a “Permission denied” error? There are several different ways. For example, OS Login disables SSH keys in metadata, so if you try to connect to an OS Login-enabled VM using a key in metadata, you’ll get an error. The opposite configuration also generates an error. That is, if you try to use a key stored in an OS Login profile to connect to a VM that doesn’t have OS Login enabled, then it won’t work. Another scenario is if you try to use a project-wide SSH key to connect to a VM that has project-wide keys disabled.
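One way to find out which of these scenarios you're in is to look at the metadata on the instance and on the project. The sketch below is only an illustration: the function name is invented, and the instance name and zone are whatever applies to your VM. In the output, look for the `enable-oslogin`, `block-project-ssh-keys`, and `ssh-keys` keys.

```shell
# Sketch only: print the metadata that decides which SSH keys are honored.
# show_ssh_metadata is a hypothetical helper name.
show_ssh_metadata() {
  instance="$1"   # assumed: instance name, for example instance-1
  zone="$2"       # assumed: the zone the instance runs in

  # Instance-level metadata (can set enable-oslogin, block-project-ssh-keys,
  # and instance-specific ssh-keys).
  gcloud compute instances describe "$instance" --zone "$zone" \
    --format "yaml(metadata.items)"

  # Project-level metadata (can set enable-oslogin and project-wide ssh-keys).
  gcloud compute project-info describe \
    --format "yaml(commonInstanceMetadata.items)"
}
```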
All right, now let’s get back to the “Connection failed” error. Here’s how to troubleshoot it. First, check your firewall rules. By default, your network will contain a firewall rule that allows SSH traffic, but you should make sure that the rule wasn’t deleted or modified. The easiest way to check is to go into the Networking section of the Google Cloud console and click on Firewall Rules. There should be a rule called “default-allow-ssh” that allows traffic on tcp port 22. The source should be 0.0.0.0/0 (meaning it allows SSH requests from any IP address) and the target should say “Apply to all”.
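If you'd rather check from the command line than from the console, something like the following sketch should show the same information. The function name is invented for this example, and it assumes the gcloud CLI is pointed at the right project.

```shell
# Sketch only: print the details of the SSH firewall rule.
# describe_ssh_rule is a hypothetical helper; the rule name defaults to
# the one mentioned above.
describe_ssh_rule() {
  rule_name="${1:-default-allow-ssh}"
  gcloud compute firewall-rules describe "$rule_name"
}
```

If the rule is missing, the command reports an error instead of printing the rule.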
If you don’t have this rule, then you can easily create it. Click “Create Firewall Rule”. You don’t have to name it “default-allow-ssh”, but you probably should, just so it follows the convention of network name (that is, “default”), allow, and protocol (that is, “ssh”). Change the priority to 65534. Leave “Direction of traffic” as “Ingress” and “Action on match” as “Allow”. Change “Targets” to “All instances in the network”. Leave the “Source filter” as “IP ranges”. In the “Source IP ranges” field, put 0.0.0.0/0. And put “tcp:22” in the “Protocols and ports” field. Now click the “Create” button and that’s it. Let’s try connecting with SSH again and see if that fixed it. Great, it did.
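The console steps above have a rough command-line equivalent, sketched below. The function name is made up, and the network argument defaults to “default” to match the walkthrough. Leaving out any target flags means the rule applies to all instances in the network.

```shell
# Sketch only: recreate the allow-SSH ingress rule described above.
# create_allow_ssh_rule is a hypothetical helper name.
create_allow_ssh_rule() {
  network="${1:-default}"

  # Name follows the convention: network name, "allow", protocol.
  gcloud compute firewall-rules create "${network}-allow-ssh" \
    --network "$network" \
    --priority 65534 \
    --direction INGRESS \
    --action ALLOW \
    --rules tcp:22 \
    --source-ranges 0.0.0.0/0
}
```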
If you didn’t have a problem with the firewall rules, but you still can’t get in with SSH, then try connecting to port 22 manually and see if the SSH server responds. Run the “nc” command with the external IP address of your instance followed by port 22. If you see the SSH banner, then you know that the network connection is OK and the SSH server is running. If you don’t see the banner, then it could be a problem with either the network connection or the SSH server.
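Here is a small sketch of that check. The function name is invented, and the `-w 5` timeout is an addition for this example so the command doesn't hang if packets are being dropped. A healthy SSH server answers with a line that starts with `SSH-2.0-`.

```shell
# Sketch only: connect to port 22 and show whatever the server sends first.
# check_ssh_banner is a hypothetical helper name.
check_ssh_banner() {
  external_ip="$1"   # assumed: the external IP address of your instance

  # -w 5 is an assumption added here: give up after five idle seconds.
  nc -w 5 "$external_ip" 22
}
```

You would run it as, say, `check_ssh_banner 203.0.113.10`, substituting your instance's address.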
Next, try accessing the serial console, which I showed in the previous lesson. Look at the serial port output to see if that will tell you why you can’t connect using SSH.
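You can also pull the serial port output from the command line. This is a minimal sketch with an invented function name; the instance name and zone are yours to fill in.

```shell
# Sketch only: print the instance's serial port output to the terminal.
# show_serial_output is a hypothetical helper name.
show_serial_output() {
  instance="$1"   # assumed: instance name, for example instance-1
  zone="$2"       # assumed: the zone the instance runs in
  gcloud compute instances get-serial-port-output "$instance" --zone "$zone"
}
```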
The next thing to check is if there’s a problem with your account. Try connecting with another username, like this.
You can put in whatever username you want and the gcloud tool will update the project's metadata to add the new user and allow SSH access. I’m going to call it “newuser”.
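The command itself isn't captured in this transcript, so here is a sketch of what connecting as a different user can look like with gcloud. The function name is made up, “newuser” is the username from the walkthrough, and the instance name and zone are placeholders.

```shell
# Sketch only: connect as a different username with gcloud.
# ssh_as_user is a hypothetical helper name.
ssh_as_user() {
  username="$1"   # for example newuser
  instance="$2"   # assumed: instance name, for example instance-1
  zone="$3"       # assumed: the zone the instance runs in
  gcloud compute ssh "${username}@${instance}" --zone "$zone"
}
```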
If that didn’t work and your instance boots from a persistent disk (which is the case by default), then you can detach the persistent disk and attach it to a new instance. Of course, this will take down your existing instance, so if the instance is serving production users properly and you don’t want to cause an outage, then skip this procedure and go to the next one.
First, delete the instance and be sure to include the --keep-disks option.
Then create a new instance (I’m going to create it with the same name as the original one) and attach the disk that we saved from the original instance. Also add the “auto-delete=no” option so the disk isn’t automatically deleted when you delete this instance.
Now SSH into the new instance.
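Those three steps can be sketched roughly as follows. The function name is invented, and the sketch assumes the boot disk has the same name as the instance, which is the default when you don't name the disk yourself. It also creates the new instance with default settings, so add whatever machine type or network flags your original instance used.

```shell
# Sketch only: delete the instance but keep its boot disk, recreate the
# instance from that disk, then try SSH again. This causes downtime.
# recreate_instance_from_boot_disk is a hypothetical helper name.
recreate_instance_from_boot_disk() {
  instance="$1"   # assumed: instance name; boot disk assumed to share it
  zone="$2"       # assumed: the zone the instance runs in

  # Keep the boot disk when the instance is deleted.
  gcloud compute instances delete "$instance" --zone "$zone" \
    --keep-disks boot &&

  # Boot the new instance from the saved disk, and don't auto-delete it.
  gcloud compute instances create "$instance" --zone "$zone" \
    --disk "name=${instance},boot=yes,auto-delete=no" &&

  gcloud compute ssh "$instance" --zone "$zone"
}
```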
That worked. If your instance is serving production users properly and you don’t want to cause an outage, then you can follow this procedure instead of the one I just did.
First, create an isolated network that only allows SSH connections. This is because you’re going to clone your production instance and you don’t want the clone to interfere with your production services.
Now add a firewall rule to allow SSH connections to the network.
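As a sketch, the network and its firewall rule could be created like this. The function name and the “debug-network” default are assumptions made for this example. A new network blocks incoming traffic until you add rules, so the SSH rule is the only way in.

```shell
# Sketch only: create an isolated network that accepts nothing but SSH.
# create_debug_network is a hypothetical helper name.
create_debug_network() {
  network="${1:-debug-network}"   # assumed name for the isolated network

  # Auto mode creates a subnet in each region for you.
  gcloud compute networks create "$network" --subnet-mode auto &&

  # Allow inbound SSH only.
  gcloud compute firewall-rules create "${network}-allow-ssh" \
    --network "$network" \
    --allow tcp:22
}
```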
Then create a snapshot of the boot disk.
Now you can create a new disk with the snapshot you just created.
Then create a new instance in the new network.
Now attach the cloned disk.
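The snapshot, disk, instance, and attach steps can be sketched as one function. Everything named here (the function, “debug-disk-snapshot”, “example-disk-debugging”, “debugger”, and the “debug-network” default) is a placeholder chosen for this example, and the sketch assumes the clone lives in the same zone as the original disk.

```shell
# Sketch only: clone a boot disk and attach the clone to a fresh instance
# in the isolated network. clone_boot_disk_to_debugger is a hypothetical
# helper name, and all resource names below are placeholders.
clone_boot_disk_to_debugger() {
  source_disk="$1"                # assumed: boot disk of the broken instance
  zone="$2"                       # assumed: zone of that disk
  network="${3:-debug-network}"   # assumed: the isolated network's name

  # 1. Snapshot the production boot disk (the instance keeps running).
  gcloud compute disks snapshot "$source_disk" --zone "$zone" \
    --snapshot-names debug-disk-snapshot &&

  # 2. Create a new disk from the snapshot.
  gcloud compute disks create example-disk-debugging --zone "$zone" \
    --source-snapshot debug-disk-snapshot &&

  # 3. Create the debugging instance in the isolated network.
  gcloud compute instances create debugger --zone "$zone" \
    --network "$network" &&

  # 4. Attach the cloned disk as an additional (non-boot) disk.
  gcloud compute instances attach-disk debugger --zone "$zone" \
    --disk example-disk-debugging
}
```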
By the way, Google recommends that you don’t give this instance an external IP address, but that makes it much more difficult to connect to it. Considering that the instance is sitting in a network that blocks everything except SSH requests, it shouldn’t cause any disruption to your production services even though it has an external IP address.
Now SSH into this instance.
Although the cloned disk is attached to this instance, it isn’t mounted anywhere, so you’ll have to do that manually.
First create a mount point.
Then find the device name of the disk. The boot disk is disk 0, so the extra disk we attached has to be disk 1.
Then mount the filesystem.
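Run on the debugging instance, those three steps might look like the sketch below. The function names and the `/mnt/myinstance` default are assumptions for this example. The device path you pass in should come from the listing on your own VM, because the exact name depends on how the disk was attached.

```shell
# Sketch only: helpers to run on the debugging VM itself.
# Both function names are hypothetical.

# List the disks the VM can see, so you can spot the attached clone.
list_disk_ids() {
  ls -l /dev/disk/by-id/
}

# Create the mount point and mount the clone's filesystem on it.
mount_cloned_disk() {
  device="$1"                           # path taken from list_disk_ids
  mount_point="${2:-/mnt/myinstance}"   # assumed mount point

  sudo mkdir -p "$mount_point" &&
  sudo mount "$device" "$mount_point"
}
```

For a cloned boot disk, the filesystem usually sits on the first partition, so the call might be something like `mount_cloned_disk /dev/disk/by-id/google-persistent-disk-1-part1`. Treat that path as a guess to verify against the listing.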
Now you can finally try to debug why you can’t connect to the original instance using SSH. You can look through the logs, for instance.
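As one example of looking through the logs, the sketch below searches the cloned disk's log directory for lines that mention the SSH daemon. The function name and the default mount point are assumptions, and which log files exist depends on the Linux distribution on the disk.

```shell
# Sketch only: show the most recent log lines mentioning sshd on the
# mounted clone. search_sshd_logs is a hypothetical helper name.
search_sshd_logs() {
  mount_point="${1:-/mnt/myinstance}"   # assumed mount point of the clone

  # -r searches every file under var/log, -i ignores case.
  # tail keeps only the last 50 matching lines.
  sudo grep -ri sshd "${mount_point}/var/log" | tail -n 50
}
```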
That was a pretty complicated process, but that’s the sort of thing you have to do if you don’t want to disrupt your production service while you’re debugging it. Here’s a summary of what I just showed you. First, create an isolated network that doesn’t allow any connections. Then add a firewall rule to allow SSH connections. Next, create a snapshot of the boot disk. After that, you can create a new disk from the snapshot. Then create a new instance in the new network, and attach the cloned disk to it. Finally, SSH into the instance, and mount the disk so you can inspect it.
If you still can’t find the reason why your instance won’t accept SSH connections and you are able to restart the instance at some point, then you can use a startup script to gather information. If you’re not sure what to put in the startup script, then you can use one provided by Google.
The script will run the next time the instance boots, so when you’re ready, you can run [gcloud compute instances reset instance-1]. The script sends its output to the serial port, so look there for the debugging info.
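Putting the startup script steps together gives roughly the following sketch. The function name is invented, and the script URL is deliberately left as an argument: pass in the location of whichever debugging script you decide to use. Resetting the instance restarts it, so only run this when a restart is acceptable.

```shell
# Sketch only: register a startup script, reset the instance so the
# script runs, then read the serial port output.
# run_debug_startup_script is a hypothetical helper name.
run_debug_startup_script() {
  instance="$1"     # assumed: instance name, for example instance-1
  zone="$2"         # assumed: the zone the instance runs in
  script_url="$3"   # assumed: URL of the debugging script you chose

  # The startup-script-url metadata key tells the instance which script
  # to fetch and run at boot.
  gcloud compute instances add-metadata "$instance" --zone "$zone" \
    --metadata "startup-script-url=${script_url}" &&

  # Reset the instance so it boots again and runs the script.
  gcloud compute instances reset "$instance" --zone "$zone" &&

  # The script writes to the serial port, so read its output from there.
  gcloud compute instances get-serial-port-output "$instance" --zone "$zone"
}
```

The script's output may take a little while to appear after the reset, so you might need to fetch the serial port output again a minute or so later.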
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).