Exercise: Parsing Log Files
1h 33m

In this course, we'll explore a range of options for manipulating data. We'll look at some commands that can be used in shell scripts to transform and process data. We'll then walk you through an example of parsing through log files to find data that matches certain criteria. Finally, you'll learn how to manipulate data and transform text with sed.

This course is part of the Linux Shell Scripting learning path. To follow along with this course, you can download all the necessary resources here.

Learning Objectives

  • Learn a range of commands used for transforming, processing, and reporting data including cut, awk, sort, and uniq
  • Learn how to parse log files for failed login attempts and identify the IP addresses they came from
  • Manipulate data and transform text with sed

Intended Audience

  • Anyone who wants to learn Linux shell scripting 
  • Linux system administrators, developers, or programmers


To get the most out of this course, you should have a basic understanding of the Linux command line.


For this shell scripting exercise, we're going to write a script named and this script is going to require that a file be provided as an argument. If a file is not provided, or we can't read it for some reason, the script will display an error message and exit with an exit status of one. This script counts the number of failed login attempts by IP address. If there are any IP addresses with more than 10 failed login attempts, the number of attempts made, the IP address from which those attempts were made, and the location of that IP address will be displayed. By the way, we're going to be using a command called geoiplookup to determine the location of that IP address. We may or may not have covered that in a previous demonstration or lesson, so just be aware of that command: give it an IP address and it'll tell you where it thinks that IP address originates from. We're going to make this script produce output in CSV, and of course CSV stands for Comma Separated Values. We're going to give the output a header of Count,IP,Location. So here I am on a system, and the sample data that we're going to use for this exercise is included in the course download. I've already placed a copy of it in the shared vagrant folder for this project, so if I cd into /vagrant, I can see all the files that are in that folder on my local physical machine. The first thing I do when working on these types of problems is to look at the data that I'm dealing with. I wanna know what the input looks like first; then I can start to look at ways to transform that data so that it's easy to work with or so that it meets my requirements. So let me just look at the contents of this file. It's a lot of data. One thing I see a lot of is lines that say Failed password for root. I'm going to bet that people are not only trying to log in as the root user but probably as some other common users as well. Let's see if that's the case.
So a common thing here on these lines for Failed password for root is the word Failed with a capital F. So let me just grep out that pattern: grep Failed syslog-sample. Here on our screen, I only see root. I think I saw a variation or two above when this output was scrolling by, so let's exclude root to see what else we get. If we look at the last five lines or so of output, we see login attempts for the ubuntu user, the admin user, the lp user, the admin user again, and a user named a. What I notice is that the lines aren't all in exactly the same format. What sticks out to me is that the words invalid user appear on most of those lines but not on all of them, for example, the lp line. That third line from the bottom says for lp from, but if you look at the next line, it says for invalid user admin from. The important piece of information that we want to isolate on each of these lines is the IP address. One way we could do that is to split the line on the word from, so let's grep for Failed. We'll pipe this to awk and use the word from as a field separator. And actually, I'm gonna use from followed by a space, so we don't end up with that extra space at the start of the second field. And then we'll just print the second part of that. So we should end up with the IP address and the rest of the information on the line. Let's see what happens. Now we're left with four columns, all separated by a single space. The first column is the IP address, the second column is the word port, followed by the actual port number itself, and then finally the protocol, which here is ssh2. Now from here, we can print the first column with either cut or awk. So let me just do this: pipe this to awk and print $1, and that gives us the IP address. Or instead of using awk here, we can use cut: cut -d, and we're going to separate on a space, so that's our delimiter, and we want the first field. And so we get the same output. So let's go back and talk about another way to solve this problem.
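The first approach described above can be sketched roughly as follows. The log lines, host name, user names, and IP addresses here are made up for illustration (the addresses come from the reserved documentation ranges), and sample_log is a hypothetical helper that stands in for the real syslog-sample file.

```shell
# Hypothetical sample lines, modeled on the sshd entries described above.
sample_log() {
cat <<'EOF'
Apr 15 00:00:01 host sshd[1001]: Failed password for root from 192.0.2.10 port 4022 ssh2
Apr 15 00:00:05 host sshd[1002]: Failed password for invalid user admin from 198.51.100.7 port 5100 ssh2
Apr 15 00:00:09 host sshd[1003]: Accepted password for alice from 203.0.113.5 port 6000 ssh2
Apr 15 00:00:12 host sshd[1004]: Failed password for lp from 192.0.2.10 port 4023 ssh2
EOF
}

# Keep only the failed attempts, split each line on "from " (with the
# trailing space), and keep the first space-separated word of what follows.
ips_by_split=$(sample_log | grep Failed | awk -F 'from ' '{print $2}' | cut -d ' ' -f 1)

echo "${ips_by_split}"
```

With this sample input, the Accepted line is dropped by grep and the remaining three lines each yield one address, whether or not the words invalid user appear on the line.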
If we count the number of fields from left to right, we end up with the IP address being in different fields, depending on whether the user was valid or invalid. But if we count the fields from right to left, from the end of the line towards the beginning, you'll see that the IP address is always the fourth column from the end. That means we can use awk's special variable NF, which represents the total number of fields on a line, and then do a little subtraction to end up with the IP address. So here we can do this: awk print, and then we take the number of fields and subtract three from that number, and we should end up with a column that has the IP address in it. So I've demonstrated two ways to extract the IP address. There are other ways, and if you came up with another way, that's totally fine, as long as you have extracted the IP address from all those lines in that file. This approach that I used here makes the most sense to me and it seems a little bit simpler, so that's what I'm going to use going forward. We know we can use the uniq command to count the number of occurrences of a line. We also know that uniq requires sorted input, so let's first sort our list of IP addresses and then send it to uniq. So I'm just going to sort this list. It doesn't have to be a numeric sort; it can be any sort as long as it's sorted. uniq doesn't care. So then what we can do here is pass it into uniq and tell uniq to count the occurrences of each line. And of course, the only things on these lines are IP addresses, so it'll count the occurrences of these IP addresses. Now that we have this bit of data, let's sort it numerically. So we'll just pipe this to sort -n. Actually, let's reverse this order and put the most failed attempts at the top of the list.
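Here is a small sketch of the NF-based extraction followed by the counting pipeline. As before, the log lines and addresses are invented for illustration, and sample_log is a hypothetical stand-in for the real log file. The reversed numeric sort is included so the busiest address ends up on top.

```shell
# Hypothetical sample lines; the real exercise reads them from a log file.
sample_log() {
cat <<'EOF'
Apr 15 00:00:01 host sshd[1001]: Failed password for root from 192.0.2.10 port 4022 ssh2
Apr 15 00:00:05 host sshd[1002]: Failed password for invalid user admin from 198.51.100.7 port 5100 ssh2
Apr 15 00:00:09 host sshd[1003]: Accepted password for alice from 203.0.113.5 port 6000 ssh2
Apr 15 00:00:12 host sshd[1004]: Failed password for lp from 192.0.2.10 port 4023 ssh2
Apr 15 00:00:15 host sshd[1005]: Failed password for invalid user a from 192.0.2.10 port 4024 ssh2
EOF
}

# The IP address is always the fourth field from the end, so $(NF - 3)
# selects it on both the valid-user and invalid-user lines.
# sort groups identical addresses, uniq -c counts each group, and
# sort -nr puts the highest count first.
counts=$(sample_log | grep Failed | awk '{print $(NF - 3)}' | sort | uniq -c | sort -nr)

echo "${counts}"
```

uniq -c pads the count with leading spaces, so the exact column alignment may differ between systems; the count and the address are still the first and second whitespace-separated fields on each line.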
So we can just add a -r to our sort command to reverse it, and we end up with the most failed login attempts first and the fewest failed login attempts last. By the way, this sample log file contains entries from just one day. That means there were 6,749 failed login attempts from the IP address of Now this could mean a couple of different things. The first thing that comes to my mind is that someone was performing a brute force attack. However, another possibility is that something is wrong with an account that we're using for some sort of automated process. Perhaps one of our servers in another data center is connecting to the system over SSH to do some work, but maybe the SSH key was accidentally changed or deleted, or the password for the account was changed, or some other configuration issue has cropped up. So what I'm going to do is find the location of this IP address. Now there's a command called geoiplookup that returns the location of an IP address, so let's run that now. That IP address is associated with China. If you happen to have servers in China or people who work from China, this still might not be an attack but just a misconfiguration of some sort. However, let's assume our people only work in the United States, Canada and Europe. Also, let's assume our data centers are located in New York, London and Amsterdam. In this particular case, I would interpret this activity as a brute force attack. It would be nice to have this location information for any IP address that fails to log into our servers more than, let's say, 10 times. If we look at the data we have, we have two columns: a count in column one and an IP address in column two. We could loop through this output, test whether the count is greater than 10, and if it is, perform the geoiplookup on the associated IP address. Now let's take the command we worked out here on the command line, put it into a script, and start working on this last bit of logic.
We'll give our script a header here. What I'm going to do is use a variable to define our limit; that way, if we decide our limit changes in the future, we can quickly update that variable at the top of our script. You could actually create this script with an option and have the user specify the limit, if you'd like, but I'm just gonna keep it simple and leave it at a hard-coded number here. However, what we are going to do is ask the user to provide us a file, so that will be the first argument on the command line, and if it doesn't exist, we need to tell them about it. Okay, this is our little check here. If the file doesn't exist or we can't open it or read it, then we're going to tell them that we can't open the file they provided (the name is going to be blank if they don't provide a file), and then we're going to exit with an exit status of one. Now what we need to do is loop through the data that we generated. So I'm just going to paste that command we worked out, and you remember that the command generated two columns of data, the first column being a count and the second column being an IP address. So what we can do is pipe this to while read, assigning the first column to a variable named count and the second column to a variable named IP. Then we can just do a simple check. If the count is greater than the limit we set, then we're going to determine a location. We'll use that geoiplookup command against the IP address and then we'll display this information. One of my common typing mistakes is to put the dollar sign outside of the quotation marks instead of inside, and I've corrected that here. Let's see if I have any more. I'm just gonna go to the top of the script real quick and look. Okay, that looks good. Let's go ahead and save our changes and test out our script. We'll run it against the sample data we have here. So this is a good start, but we need to clean up this output a bit.
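The logic described so far might look something like the sketch below. This is not the course's exact script: the function names, the wording of the error message, and the choice of a readability test are assumptions made for illustration, and the pieces are arranged as functions so they can be tried one at a time. It assumes geoiplookup is installed when report_attackers is run.

```shell
#!/bin/bash
# Sketch: count failed logins per IP address and report those over a limit.
# Function names and message wording are assumptions, not the course's script.

LIMIT='10'

# Print an error and return 1 if the file is missing or cannot be read.
check_log_file() {
  if [ ! -r "${1}" ]; then
    echo "Cannot open log file: ${1}" >&2
    return 1
  fi
}

# Print "COUNT IP" pairs for failed logins, highest count first.
count_failed_logins() {
  grep Failed "${1}" | awk '{print $(NF - 3)}' | sort | uniq -c | sort -nr
}

# Look up the location of every IP address whose count exceeds LIMIT.
report_attackers() {
  count_failed_logins "${1}" | while read -r COUNT IP
  do
    if [ "${COUNT}" -gt "${LIMIT}" ]; then
      LOCATION=$(geoiplookup "${IP}")
      echo "${COUNT} ${IP} ${LOCATION}"
    fi
  done
}

# Typical use inside a script that takes the log file as its first argument:
#   check_log_file "${1}" || exit 1
#   report_attackers "${1}"
```

Keeping the limit in one variable at the top means a later change touches a single line, and returning 1 from the check lets the calling script decide to exit with a status of one.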
First, let's remove that little bit of redundant information there, GeoIP Country Edition. Let's remove those words from the geoiplookup output. We could use cut with a comma as the field separator, but that would leave a space before the country. However, we could use awk and include the space after the comma in the field separator, so let's do that instead. While we're here, let's turn this output into CSV output and display a header. We'll display the count, the IP address and the location for our header. Now we're going to change this geoiplookup output. We're going to pipe it to awk with a field separator of a comma and a space, and that will leave us with the data we need in the second field. And also while I'm here, I'm going to change this echo command to have commas in between the data. And that looks good. Let's see what this brings us. Show attackers, syslog-sample. Okay, great. So we have a header of Count,IP,Location. Then on the first line we have a count of 6749 with an IP of and a location of China. So this is exactly how we want our output to be displayed. While we're here, let's make sure that the little file test that we wrote at the top of our script works. So let's provide some fake data to it. Okay, that looks good. We get an exit status of one. And let's make sure that this also works if we don't provide a file. Sure enough, it says, "Hey, I can't open a file that doesn't exist," and we also get an exit status of one. So this completes the exercise walkthrough for this script.
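The cleanup step can be tried on its own with a fixed string. The sample text below is an assumption about what geoiplookup prints for a country lookup, based on the description above, and the IP address is a placeholder from the documentation range, not the address from the course's log.

```shell
# Assumed shape of a geoiplookup result; the real command's text may differ.
raw_location='GeoIP Country Edition: CN, China'

# Splitting on a comma plus a space leaves the country name, with no
# leading space, in the second field.
LOCATION=$(echo "${raw_location}" | awk -F ', ' '{print $2}')

# Placeholder values for illustration only.
COUNT='6749'
IP='192.0.2.10'

# Header first, then one comma-separated row per address.
csv_header='Count,IP,Location'
csv_row="${COUNT},${IP},${LOCATION}"

echo "${csv_header}"
echo "${csv_row}"
```

In the script itself the header would be printed once, before the loop, and the row would be printed inside the loop for each address over the limit.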

About the Author

Jason is the founder of the Linux Training Academy as well as the author of "Linux for Beginners" and "Command Line Kung Fu." He has over 20 years of professional Linux experience, having worked for industry leaders such as Hewlett-Packard, Xerox, UPS, FireEye, and Nothing gives him more satisfaction than knowing he has helped thousands of IT professionals level up their careers through his many books and courses.
