In this course, we’ll show you how to diagnose a virtual machine that fails to boot on Google Cloud Platform and what steps to take to fix it.
Learning Objectives
- Troubleshoot and fix a virtual machine that fails to boot on Google Cloud Platform
Intended Audience
- Anyone who works with virtual machine instances on Google Cloud Platform
Prerequisites
- Basic experience using virtual machines on Google Cloud Platform (or go through one of our introductory GCP VM labs, such as Starting a Linux Virtual Machine on Google Compute Engine)
What can you do if your VM instance fails to boot up completely? You can’t use SSH because the SSH server isn’t running yet. If you were running the VM on your desktop, then you could look at the console, but how do you do that for a Google Cloud instance? Luckily, there’s a solution. You’d look at the serial port.
By default, you can see the output of the serial port by clicking on the instance and then clicking “Serial port 1”. This might be enough information to help you troubleshoot your problem, but in many cases, you’ll need to interact with the VM to see what’s going on. You’ll notice that there’s a button called “Connect to serial console”, but it’s grayed out. To enable interactive access, you need to add metadata to the instance. This isn’t a terribly user-friendly way of enabling a feature, but it’s actually not too difficult.
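You can also read the same output from the command line. Here is a minimal sketch, assuming the instance is called “instance-1” (the name used later in this course) and sits in a placeholder zone, “us-central1-a”. The `run` helper is my own dry-run wrapper, not part of gcloud: it only prints the command unless you set DRY_RUN=0.

```shell
# Dry-run helper (hypothetical, not part of gcloud): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

INSTANCE="instance-1"   # instance name used in this course
ZONE="us-central1-a"    # assumption: replace with your instance's zone

# Print everything the instance has written to serial port 1 so far.
show_serial_output() {
  run gcloud compute instances get-serial-port-output "$INSTANCE" \
    --zone="$ZONE" --port=1
}
```

Call `show_serial_output` to see the command it would run, then set DRY_RUN=0 and call it again to actually fetch the output.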
First, you have to decide whether you want to enable interactive access for an individual instance or for an entire project. If you enable it on individual instances, then you’ll have to enable it manually for every instance. For convenience, you might want to enable it for an entire project, but that carries a higher security risk, because there is currently no way to restrict serial port access by IP address. So attackers could try to break into any of your VMs through the serial port. It wouldn’t be easy, though, because they’d need to know the correct SSH key, username, project ID, zone, and instance name.
To enable interactive access to an individual instance, you can use the gcloud command “gcloud compute instances add-metadata”, followed by the instance name (which is “instance-1” in my case), and then “--metadata=serial-port-enable=1”.
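Written out, the two options look roughly like this. It's a sketch, not the only way to do it: the zone is a placeholder, and `run` is my own dry-run wrapper (it prints the command unless DRY_RUN=0), not a gcloud feature.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

INSTANCE="instance-1"   # instance name used in this course
ZONE="us-central1-a"    # assumption: replace with your instance's zone

# Option 1: enable interactive serial console access on a single instance.
enable_serial_for_instance() {
  run gcloud compute instances add-metadata "$INSTANCE" \
    --zone="$ZONE" --metadata=serial-port-enable=1
}

# Option 2: enable it for every instance in the current project.
# More convenient, but it carries the higher security risk discussed above.
enable_serial_for_project() {
  run gcloud compute project-info add-metadata \
    --metadata=serial-port-enable=1
}
```

The metadata key is the same in both cases. The only difference is whether you set it on the instance or on the project.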
Now when I refresh the page, the “Connect to serial console” button lights up. If I click on it, it brings up another window where I can interact with the serial console.
By the way, if you’re connecting to a Windows instance, then you will need to go into the drop-down menu and select “Port 2”.
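If you prefer the command line to the browser window, gcloud can open the same interactive session. This sketch picks the port for you based on the operating system. The function names and the zone are my own placeholders, and `run` only prints the command unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Pick the serial port: 2 for Windows, 1 for everything else.
serial_port_for() {
  case "$1" in
    windows) echo 2 ;;
    *)       echo 1 ;;
  esac
}

# usage: connect_serial INSTANCE ZONE [linux|windows]
connect_serial() {
  run gcloud compute connect-to-serial-port "$1" --zone="$2" \
    --port="$(serial_port_for "${3:-linux}")"
}
```

For example, `connect_serial instance-1 us-central1-a windows` builds a command that targets port 2.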
If the serial port output showed that you have a problem with the filesystem on your boot disk, then you can attempt to fix it by attaching the disk to another instance.
First, delete the instance, but be sure to include the --keep-disks option. Notice that it still gives me a warning about deleting disks even though I used the keep disks option. That’s normal.
Then create a new instance. I’ll call it “debug-instance”.
Now attach the disk that we saved from the original instance. Notice that, by default, the name of a boot disk is the same as the name of the instance (“instance-1” in this case). You can also add the device name flag so it will be obvious which device corresponds to this disk, which will be helpful in a later step.
Then SSH into the new instance.
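Here are the four steps so far as one sketch. The instance and disk names come from this walkthrough. The zone and the device name “debug-disk” are my assumptions, so change them to suit. `run` is my own dry-run wrapper and prints each command unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

ZONE="us-central1-a"   # assumption: replace with your zone

rescue_boot_disk() {
  # 1. Delete the broken instance but keep its boot disk.
  run gcloud compute instances delete instance-1 \
    --zone="$ZONE" --keep-disks=boot || return 1
  # 2. Create a healthy instance to work from.
  run gcloud compute instances create debug-instance \
    --zone="$ZONE" || return 1
  # 3. Attach the saved disk, giving it a recognizable device name
  #    ("debug-disk" is an assumed name, pick any you like).
  run gcloud compute instances attach-disk debug-instance \
    --zone="$ZONE" --disk=instance-1 --device-name=debug-disk || return 1
  # 4. Log in to the debug instance.
  run gcloud compute ssh debug-instance --zone="$ZONE"
}
```

Run it in dry-run mode first so you can check the commands before anything gets deleted.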
Now you need to find out the device name for the debug disk. Look in the /dev/disk/by-id directory.
Remember when I mentioned that naming the disk device would be helpful? You can see that the debug disk is “sdb”. The filesystem is on the first partition (or “part1”), so the device name we need to use is “sdb1”. Now you can run fsck on it.
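If you’d rather not read the symlinks by eye, a small helper can resolve them. This is a sketch under two assumptions: the disk was attached with the device name “debug-disk”, and the entries follow the “google-<device-name>” pattern. Check the actual listing on your instance. The function names are mine.

```shell
# Resolve an entry under /dev/disk/by-id to the real device node it links to.
# BY_ID_DIR can be overridden, which is handy for trying this out safely.
# usage: resolve_disk google-debug-disk-part1
resolve_disk() {
  readlink -f "${BY_ID_DIR:-/dev/disk/by-id}/$1"
}

# Run a filesystem check on the resolved partition (needs root, and the
# filesystem must not be mounted while fsck runs).
# usage: check_fs google-debug-disk-part1
check_fs() {
  sudo fsck "$(resolve_disk "$1")"
}
```

On the debug instance in this walkthrough, `resolve_disk google-debug-disk-part1` would be expected to print the “sdb1” device path, and `check_fs` runs fsck against it.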
Of course, I’m doing this on a good disk, so fsck doesn’t see any problems, but if this disk had come from an instance that couldn’t boot properly, then there’s a good chance that fsck would find lots of problems.
Let’s pretend that fsck had to clean the filesystem and it was successful. After that, you should verify that it will mount properly.
You should also check that it has a kernel file.
It does.
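Those two checks can be sketched like this. The mount point “/mnt/debug” is a name I made up, and I’m assuming the kernel lives under /boot with a name starting with “vmlinuz-”, which is typical but depends on the distribution.

```shell
# Return success if the mounted filesystem has at least one kernel image.
# Assumes kernels are stored as boot/vmlinuz-* (common, but varies by distro).
# usage: has_kernel MOUNTPOINT
has_kernel() {
  for k in "$1"/boot/vmlinuz-*; do
    # If nothing matches, the pattern stays literal and this test fails.
    [ -e "$k" ] && return 0
  done
  return 1
}

# On the debug instance (needs root; /mnt/debug is a hypothetical mount point):
#   sudo mkdir -p /mnt/debug
#   sudo mount /dev/sdb1 /mnt/debug
#   has_kernel /mnt/debug && echo "kernel found"
```

If the mount command fails, the filesystem still needs attention, so there’s no point checking for a kernel yet.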
But before you celebrate, you should check one more thing: that the disk has a valid master boot record.
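One common way to do this is to print the partition table, for example with “sudo parted /dev/sdb print”. As a rough complement, the sketch below uses a different, narrower technique: it reads the last two bytes of the first sector and checks for the 55 AA boot signature. It’s my own helper, and it only confirms that the marker is present, not that the partition table itself is sane.

```shell
# Return success if the first 512-byte sector ends with the MBR boot
# signature (bytes 0x55 0xAA at offsets 510 and 511).
# usage: has_boot_signature DEVICE_OR_IMAGE   (e.g. /dev/sdb, needs root)
has_boot_signature() {
  sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "55aa" ]
}

# On the debug instance:
#   sudo parted /dev/sdb print          # prints the partition table
#   has_boot_signature /dev/sdb && echo "boot signature present"
# (run the second command from a root shell so dd can read the device)
```

Note that you point this at the whole disk (“sdb”), not at the partition (“sdb1”).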
It printed out information about the filesystem, so this disk is good to go. Now, you would create a new instance and use this disk as its boot disk.
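That last step might look like the sketch below. The name “new-instance” and the zone are placeholders of mine. I’m also assuming you unmount the disk and detach it from the debug instance first, which the walkthrough doesn’t show. `run` is my own dry-run wrapper and prints the commands unless DRY_RUN=0.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

ZONE="us-central1-a"   # assumption: replace with your zone

promote_repaired_disk() {
  # Release the disk from the debug instance (assumed step; unmount it
  # inside the debug instance before doing this).
  run gcloud compute instances detach-disk debug-instance \
    --zone="$ZONE" --disk=instance-1 || return 1
  # Create a new instance that boots from the repaired disk.
  # "new-instance" is a placeholder name.
  run gcloud compute instances create new-instance \
    --zone="$ZONE" --disk=name=instance-1,boot=yes,auto-delete=no
}
```

Setting auto-delete to “no” keeps the repaired disk around even if you later delete the new instance.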
That took a bit of work, but it was relatively straightforward, wasn’t it?
And that’s it for troubleshooting VM startup failures.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).