Exercise: Monitoring Processes

In this course, we will explore how to set up a network of virtual machines, how to implement SSH key authentication, and how to execute commands on remote systems. We'll look at how to install and remove software on local and remote systems. You'll learn about the continue and break statements, their benefits, and their use cases. You'll also learn how to automate processes on Linux through the use of cron jobs and how to examine running processes.

This course is part of the Linux Shell Scripting learning path. To follow along with this course, you can download all the necessary resources here.

Learning Objectives

  • Learn how to create a network of virtual machines and how to configure SSH key authentication and execute commands on remote systems via SSH
  • Learn how to install and remove software packages both on your local system as well as on remote systems
  • Understand continue and break statements in loops and what they're used for
  • Understand what cron is and how to use it to schedule the running of scripts in Linux at various intervals
  • Learn how to examine running processes on a Linux system and how to determine their process IDs

Intended Audience

  • Anyone who wants to learn Linux shell scripting 
  • Linux system administrators, developers, or programmers


To get the most out of this course, you should have a basic understanding of the Linux command line.


Before we start writing our script, let's get some background information and set the scene. Let's say a new application, or service, was developed at your company and you put it into production. Then the weekend rolls around and you get paged or texted or called because the service you deployed earlier that week is not available and people want to use it. So you log into the server and you notice that the reason no one can get to this service is that it's down; it stopped. You immediately start it, it appears to stay up, people can keep doing what they wanna do, and everything is well for the moment.
But of course you're a good system administrator, and you don't like to restart things without knowing why they failed in the first place. So you start digging around in some logs and you notice some OOM, or Out of Memory, error messages in the syslog. You suspect that this newly deployed application, in production for the first time, may have a memory leak. It lasted several days, but it gets to a point where the memory leak consumes all available memory on the system and the process dies or gets killed. And then you have to restart it.
So you go talk to the developers and you say, "Hey, I think there's a memory leak in your application. Can you do something about this?" Well, for whatever reason, it's gonna take them a while to fix it. Perhaps they don't even know why the memory leak exists. Or maybe they're not convinced and they wanna see it happen again. Or perhaps there's some kind of constraint, like one of the main developers being on vacation for a couple of weeks. Or there are internal priorities that are gonna pull them away from this new application, and they just can't devote any resources to it.
Well, no matter the case, you're the one that's gonna get woken up if this process happens to die in the middle of the night. And you like to sleep. So you think, "Hm, what am I going to do about this?" Well, the simplest thing to do is write a shell script that watches for this process and, if it happens to die, restarts it. Down the road we might not need this, because the development team will have fixed the application, but for now this is a perfectly acceptable solution. We already know what the issue is, and it's gonna get fixed. Right now we just need to restart the process so everybody can keep on keeping on until then.
So that brings us to our requirements today. We're gonna write a script and call it watchdog.sh. We wanna make sure that this script gets executed with superuser privileges, and if it doesn't, we're going to exit with an exit status of one. Then we're going to create a configuration file named watchdog.conf.HOSTNAME, where HOSTNAME is the host name, or server name, of a particular server. So, for example, it could be watchdog.conf.server01, watchdog.conf.server02, and so on. This way you can create a different configuration file for each server, based on its host name. You could've done this a different way, but we've decided to do it this way this time.
In this configuration file there is going to be one line per service. The very first thing on each line is going to be the process name or the service name. This is what we're going to look for in the ps output, or rather, this is the name we're going to try to get the PID of. Anything else that remains on that line we're going to assume is the command to start the process again.
So, for example, if we wanted to monitor the sshd process, the command to start it would be systemctl start sshd. We'll have one line like that for each process or service that we're monitoring on the system. If for some reason the configuration file cannot be found or read, we're gonna exit with an exit status of one.
As I've been describing, what our script is going to do is check whether at least one process with the exact name listed in the configuration file is running. If it is, no action is needed and none will be taken. But if the process is not running, it will be started, and then we're going to give our shell script an exit status of two. What we're doing here is differentiating the exit statuses. Exit status one is for real errors, such as the configuration file not existing, or you not having the proper permissions to execute the script, and so on. Exit status two is going to tell people, "Hey, the script worked, but it just so happened to have restarted a service." It did its job and everything was okay, but it wasn't the case that all the services were running and nothing had to be done.
So here I'm using our multinet Vagrant project, and I've already run vagrant up. That takes a while, so I'm gonna save you the pain of waiting through that. Here you can see that all three of those systems, admin01, server01, and server02, are up and running. I've SSHed into admin01, and now I'm going to move into our shared folder, /vagrant, and start working on our watchdog.sh script. Let's give it a shebang. And let's tell anyone who happens to look at the script what this thing is gonna do. Okay, pretty simple: the script starts a process if it is not running. We're gonna have a different configuration file for each host, so let's specify this in a variable. Let's call it CONF_FILE.
And we're going to use the built-in HOSTNAME variable that Bash gives us. This way, like I said, we can have a different configuration file for each host. You could've solved this in a different way. Perhaps you could've put the file in /usr/local/etc and named it just watchdog.conf, so every server has the exact same name for the configuration file, and then you just make the contents different on each server. Or, as we're doing here, you can create all the configuration files with different names. It's a matter of style or choice; either way would be perfectly acceptable.
Okay, so we're also going to create a log file for each one of these hosts, and we're gonna do something very similar here. Again, we're using the special shared file system at /vagrant, so anything in there can be seen on admin01, server01, and server02. If we had all three of those systems writing to the same log file, we would run into a situation where entries from different hosts get mixed together. In the real world you would probably write this somewhere in /var. Or you would have this entire structure in /opt/watchdog, and then you could have /opt/watchdog/logs or /var/opt/watchdog or something like that, to keep the logs locally on each one of the systems. So that's how we're handling this particular shared file system.
Okay. We know that eventually we're gonna run this script via cron, and we know the environment that you have in your shell is not always the same one that cron uses. I happen to know that I'm gonna use the pidof command today, and that's in /usr/sbin. I'm not 100% sure that path is gonna be included when we run this in cron, so I'm gonna go ahead and add it so that we're covered. We'll take the existing path and append /usr/sbin to it.
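A minimal sketch of how the top of watchdog.sh might look at this point, assuming both per-host files live in the shared /vagrant folder. The variable names follow the narration; the exact paths and comment wording are assumptions rather than the instructor's literal text.

```shell
#!/bin/bash
# This script starts a process if it is not running.

# One configuration file and one log file per host, keyed on the HOSTNAME
# variable that Bash sets automatically. The /vagrant location is assumed.
CONF_FILE="/vagrant/watchdog.conf.${HOSTNAME}"
LOG_FILE="/vagrant/watchdog.log.${HOSTNAME}"

# cron may run with a shorter PATH than an interactive shell, so append
# /usr/sbin to be sure pidof can be found.
PATH="${PATH}:/usr/sbin"
```

On admin01 this would resolve to /vagrant/watchdog.conf.admin01 and /vagrant/watchdog.log.admin01, so the three hosts never share a file.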
All right, so we wanna make sure that the script is being executed as root, and we've done this several times now, so I'm just gonna go ahead and type in this check. We know that root always has the UID of zero, so if the UID is not equal to zero, then they're not running this script as root. Okay, the next check we wanna do is to make sure that the configuration file exists. What this if statement checks for is the existence of the conf file. It reads, "If not exists conf file, then echo an error message and exit with an exit status of one."
Speaking of exit statuses, we're going to assume that our program is going to exit with a zero exit status. But we know we're going to be looping through possibly several different services, and we may have to restart one. If we do, we wanna make sure that we eventually end the entire script with an exit status of two. So we need a variable to hold our exit status; we can reassign it if the conditions call for it, and then exit with that status at the end of our script. So let's create a variable to hold our exit status.
What we need to do next is create a while loop and read this file line by line. We can do that with a very common convention: while read LINE. "do" starts our loop. I'm gonna leave a little bit of space here and finish the loop out with "done". And we're gonna redirect in the contents of the configuration file. So the first time this loop executes, the first line of the configuration file is assigned to the variable LINE. The second time through, the second line of the configuration file is assigned to LINE, and so on. Now what we need to do is actually restart the service if it's not running, and there are a couple of ways to handle this.
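Before looking at those options, the two checks and the read loop described so far can be sketched as follows. This is an illustration under assumptions: in the real script the checks would run at the top level and call exit 1, whereas here they are wrapped in a hypothetical helper named preflight so the example can be tried without ending your shell. The sample file and the LINES_READ counter are also stand-ins.

```shell
#!/bin/bash
# preflight USER_ID CONF_FILE   (hypothetical helper, not in the course script)
# Returns 1 if USER_ID is not 0 (root) or if CONF_FILE does not exist.
preflight() {
  if [ "${1}" -ne 0 ]
  then
    echo "Please run with sudo or as root." >&2
    return 1
  fi

  if [ ! -e "${2}" ]
  then
    echo "Cannot find configuration file ${2}." >&2
    return 1
  fi

  return 0
}

# A throwaway stand-in for watchdog.conf.HOSTNAME.
SAMPLE_CONF=$(mktemp)
printf '%s\n' 'sshd systemctl start sshd' 'rsyslogd systemctl start rsyslog' > "${SAMPLE_CONF}"

# Assume success; the loop body would later set this to 2 after a restart.
EXIT_STATUS=0

# Read the file one line at a time. Redirecting the file into "done" keeps the
# loop in the current shell, so variables set inside it survive the loop.
LINES_READ=0
while read -r LINE
do
  LINES_READ=$((LINES_READ + 1))
  echo "Line ${LINES_READ}: ${LINE}"
done < "${SAMPLE_CONF}"
```

In the script itself, the first argument to the check would be Bash's UID variable and the second would be the per-host configuration file.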
We could code that bit of functionality right here in our loop, and that would be totally fine. If we did it that way, we would have to extract the first item on the line as the service and let the rest of the line be the command. What I'm leaning toward instead is writing a function and using positional parameters. So $1 will be the service name that we're looking for, and $2 and beyond, the rest of the positional parameters, are going to be the command. We've done that a lot in this course, so I'm gonna go ahead and make this a function. I'm gonna call it restart_service_if_not_running. I like long function names. And then we're just gonna pass it LINE. Obviously we need to go back and fill in the code for that function; we haven't coded it up yet.
Okay. What I wanna do is, again, make sure the exit status is two if we happen to restart a service. So let's have our function return a non-zero status if it does a restart, and we can check for that status here. $? contains the status. If it's not zero, we know a service got restarted, and we'll set our exit status to two. We'll close out our if statement there, and the main execution loop for our program is done.
So let's go back up to the top of the script and add our function. I'm going to put it after our variables. We use the name of the function, parentheses, and an opening brace, and the body of the function goes here. The function needs a closing curly brace as well. Now let's comment our function. We're saying that this function restarts a given service if it's not running, and that it requires two pieces of information. The first is the service name, or the process that we're looking for.
And the second is the command to start the service in the case that it's not running. This function is going to return zero if the service is already running, or one, a non-zero return status to be more exact, if the service was restarted. The first thing that gets passed in is the service name, and everything else is the start-service command. So once we have our service name, we shift everything down by one, and that leaves us with everything else being the start-service command.
Let's create a timestamp so we have a nice log for our script. By the way, the format I'm using for the date command is the same timestamp format that you'll see in /var/log/messages. I decided to make our log look that way, to conform to a standard convention, if you will. Now we're gonna write to the log file that we're checking our service, and of course that line has our timestamp in it as well.
The next thing we're going to do is get the PID or PIDs of the service. There are a few different ways to do this; I'm gonna use the pidof command, like I mentioned earlier, and save the result in a variable. We execute pidof followed by the service name, and if any PIDs are returned, they're assigned to the SERVICE_PID variable. If it doesn't find any processes by that name, pidof returns nothing, and that means SERVICE_PID is empty as well. We can use that in a test with an if statement: if -n SERVICE_PID, then. The -n says, "Hey, if this variable is not empty, then it's true." So if there are PIDs, we add a message to our log file. Since there were PIDs, the process is running, so we report it as running and return zero.
Now, in the other case there were no PIDs, so the process that we're looking for is not running and we need to restart it. I'm going to send that message to the log file as well, and then actually execute the start-service command. Let's send any output that it generates to the log file, and that includes standard output as well as standard error. And of course we're going to return from this function with a status of one. Okay, that is it for our if statement, and that should complete our function. This leaves us with only one other thing to do, and that is to exit our script with whatever the proper exit status happens to be. So I'll go ahead and write my changes and set the permissions.
The next thing we need to do is create the configuration files for our script. Let's create watchdog.conf.admin01, and in here we're going to monitor the sshd process. If it fails, we need to run the systemctl start sshd command. Let's also watch the rsyslogd process. This one happens to have a different name to restart it: the service name is rsyslog, but the process that actually runs is rsyslogd. That's why we needed this particular format. Otherwise you could've just listed the process name and used that same name to restart it. You may run into this mismatch in more than this one case, especially if you're working with customers and applications and so on. So this is one of the reasons why we're doing it this way.
I'm going to use that as the base of our next configuration file. I'll call this one watchdog.conf.server01, and what I need to do here is add httpd as the process that we're watching. The command to start it is systemctl start httpd. Our server02 is gonna have the exact same configuration, so I'm just going to copy watchdog.conf.server01 to watchdog.conf.server02.
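Putting the description of the function together, a sketch might look like the following. The function name and the SERVICE_PID variable follow the narration; the date format string, the log message wording, the other variable names, and the throwaway LOG_FILE default are assumptions made so the example can run outside the Vagrant environment.

```shell
#!/bin/bash
# LOG_FILE falls back to a temporary file here (an assumption) so the sketch
# does not need the /vagrant shared folder.
LOG_FILE="${LOG_FILE:-$(mktemp)}"

# Restarts a given service if it is not running.
# $1 is the process name to look for; $2 and beyond are the start command.
# Returns 0 if the process is already running, 1 if it had to be started.
restart_service_if_not_running() {
  local SERVICE_NAME="${1}"
  shift
  local START_SERVICE_COMMAND="${*}"

  # Month, day, and time, laid out like the entries in /var/log/messages.
  local TIMESTAMP
  TIMESTAMP=$(date +'%b %e %H:%M:%S')
  echo "${TIMESTAMP} Checking service ${SERVICE_NAME}" >> "${LOG_FILE}"

  # pidof prints nothing when no process has that exact name.
  local SERVICE_PID
  SERVICE_PID=$(pidof "${SERVICE_NAME}" 2>/dev/null)

  if [ -n "${SERVICE_PID}" ]
  then
    echo "${TIMESTAMP} ${SERVICE_NAME} is running with PID(s) ${SERVICE_PID}" >> "${LOG_FILE}"
    return 0
  fi

  echo "${TIMESTAMP} Restarting ${SERVICE_NAME} with command: ${START_SERVICE_COMMAND}" >> "${LOG_FILE}"
  # Left unquoted on purpose so the command splits into words.
  ${START_SERVICE_COMMAND} >> "${LOG_FILE}" 2>&1
  return 1
}

# Demo with a made-up process name and a harmless start command.
restart_service_if_not_running no-such-process-xyz echo started
DEMO_STATUS="${?}"
```

Inside the main loop the call would pass the unquoted line from the configuration file, so a line such as "rsyslogd systemctl start rsyslog" splits into the process name followed by its start command.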
Let's go ahead and start testing our script from the command line. First let's execute it without any root privileges and see what happens. Okay, we get an error message, and the exit status is one, like we want, so that's great. Now let's execute it like we're supposed to, with root privileges. Okay, looks like it executed. Let's see if it created a log file for our server, watchdog.log.admin01. Sure enough, it did. It reports that the sshd service is running with those several PIDs, as well as rsyslogd with that one particular PID.
Now let's stop the rsyslogd process and see if our script actually restarts it. We'll do sudo systemctl stop rsyslog. Again, the service name here is rsyslog, but our process name is rsyslogd. Let's go back up and execute our script. Okay, it executed. Now let's look at our log file. Here is the next set of timestamps, at 17:43:57. It says it's checking the service rsyslogd. Then it says it's restarting rsyslogd with the command systemctl start rsyslog. So it found that the process was indeed not running, and it started it for us.
So now we've confirmed that our script works on the command line, and it's time to make it a cron job. I'm gonna have it run every minute, because I'm kind of impatient when I'm testing. In production I'd probably have this run every few minutes, for example every five minutes. I don't want the script to be in the middle of starting a service and then kick off again, so I would give it more time in production. As you know, the script requires root privileges to run, so we're going to put this in root's crontab so it executes as root. We can edit root's crontab with sudo crontab -e, or we could switch to the root user first and then run crontab -e.
So here I'm going to put five asterisks for the time specification, and then the path to our script. Any output that the script generates I'm going to save into /var/tmp, in a file I'm gonna call watchdog.cronlog. One of the reasons I like to put this in /var/tmp is that /var/tmp is not necessarily cleared on reboot, while /tmp typically is. So if we restarted the service, the system then rebooted, and we wanted to go back and look at our log, a log file in /tmp might be gone. This helps us preserve that log file a bit longer. Okay, I'm gonna save my changes.
One thing we can do now is simply tail our log file and see if the job is running or not. Our last entry here is at 17:47:01. Actually, it just executed, so let's tail it and watch it execute again. Hit enter a couple of times, and then we'll wait for the next minute, when it will execute and we'll have logs there. Okay, good. It looks like it executed again the next minute, and everything looks good. Now let's look for the cron log that our cron job writes. Okay, it was indeed created. Let's cat that file to make sure we're not getting any extra output from the cron job itself. It doesn't look like any output is being generated; the script is writing everything to the log file, like we want.
Now let's make sure that the script actually starts a down service properly when it's running via cron. We know it works from the command line, but this time let's make sure it works via cron. So let's stop the rsyslog service again, tail our log file, wait for the script to run at the top of the minute, and see if it restarts the service for us. Okay, sure enough, it says it's checking rsyslogd. And then instead of giving a PID it says, "Hey, I'm restarting that for you."
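For reference, the crontab entry described above might look like the line inside the sketch below. The script location in /vagrant, the appending redirect, and the 2>&1 are assumptions; the narration only states five asterisks, the path to the script, and output saved to /var/tmp/watchdog.cronlog. The sketch writes the entry to a scratch file for review rather than touching any real crontab.

```shell
#!/bin/bash
# Write the proposed crontab entry to a scratch file so it can be reviewed.
CRON_FILE=$(mktemp)
cat > "${CRON_FILE}" <<'EOF'
# minute hour day-of-month month day-of-week command
* * * * * /vagrant/watchdog.sh >> /var/tmp/watchdog.cronlog 2>&1
EOF

cat "${CRON_FILE}"

# To install it for root (this replaces root's current crontab), one could run:
#   sudo crontab "${CRON_FILE}"
```

For the production interval mentioned earlier, the first field would become */5 so that the job runs every five minutes instead of every minute.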
So this looks good. Now I'm going to take this crontab entry, copy it, and install it on our other systems. Let's put it on server01, for example. Okay, sudo crontab -l. Okay, that's installed. Let me get off that server, go to server02, and do the same thing there. Okay, we have one cron job for root on server02.
Now let's connect to server01 and remind ourselves of the services that we have configured to be watched there. So that's /vagrant/watchdog.conf.server01. Here we have httpd. Let's see if httpd is even running on this system. Okay, sure enough, it is. Let's kill httpd and see if our cron job restarts it for us. Okay, there's nothing there. And let's tail our log file. The script must have executed in the few seconds it took me to tail the log file because, as you can see, it says it restarted it. Before that, it had seen that it was running with other PIDs. For example here, httpd was running with PIDs 1420, 1418, and so on. But after we stopped it, our cron job did in fact restart it.
All right, that brings us to the end of this exercise. I hope it gives you some ideas of the things you can do. You can even take this as a base, or a template, and expand it to make it the way you want it. For example, you could have a configuration file that specifies the minimum and maximum number of processes. Then you could count those processes and check that the number is in the range that you want. For example, with the httpd process, we know we want more than one of those. We probably want at least four or five, maybe eight or 10 of those on a normal given day. But on the other hand, we also know we don't want a thousand of these things running, right.
Because if there are something like a thousand of these processes going, that probably means we're under attack, or that our servers aren't gonna be able to handle the load, or that connections are not being terminated properly. Those would all be good instances where you would want to restart the service. Again, we didn't do that today, but I'll leave that up to you as an exercise on your own.
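As a possible starting point for that exercise, the sketch below counts the PIDs that pidof reports and checks the count against a range. The helper names process_count and count_in_range and the limits of 4 and 10 are assumptions for illustration only; none of this is part of the script written in this lesson.

```shell
#!/bin/bash
# Hypothetical helpers for the exercise; not part of the course script.

# process_count NAME: print how many PIDs pidof reports for NAME (0 if none).
process_count() {
  echo $(( $(pidof "${1}" 2>/dev/null | wc -w) ))
}

# count_in_range COUNT MIN MAX: succeed only when MIN <= COUNT <= MAX.
count_in_range() {
  [ "${1}" -ge "${2}" ] && [ "${1}" -le "${3}" ]
}

# Example: flag httpd when fewer than 4 or more than 10 processes are running.
HTTPD_COUNT=$(process_count httpd)
if count_in_range "${HTTPD_COUNT}" 4 10
then
  echo "httpd count ${HTTPD_COUNT} is within range."
else
  echo "httpd count ${HTTPD_COUNT} is outside 4-10; a restart may be needed."
fi
```

A fuller version would read the minimum and maximum from extra fields in the configuration file, which means deciding on a new line format first.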

About the Author

Jason is the founder of the Linux Training Academy as well as the author of "Linux for Beginners" and "Command Line Kung Fu." He has over 20 years of professional Linux experience, having worked for industry leaders such as Hewlett-Packard, Xerox, UPS, FireEye, and Amazon.com. Nothing gives him more satisfaction than knowing he has helped thousands of IT professionals level up their careers through his many books and courses.
