
Sort and Uniq

Overview

  • Difficulty: Intermediate
  • Duration: 1h 33m
  • Students: 72
  • Rating: 5/5
Description

In this course, we'll explore a range of options for manipulating data. We'll look at some commands that can be used in shell scripts to transform and process data. We'll then walk you through an example of parsing through log files to find data that matches certain criteria. Finally, you'll learn how to manipulate data and transform text with sed.

This course is part of the Linux Shell Scripting learning path. To follow along with this course, you can download all the necessary resources here.

Learning Objectives

  • Learn a range of commands used for transforming, processing, and reporting data including cut, awk, sort, and uniq
  • Learn how to parse log files for failed login attempts and identify the IP addresses behind them
  • Manipulate data and transform text with sed

Intended Audience

  • Anyone who wants to learn Linux shell scripting 
  • Linux system administrators, developers, or programmers

Prerequisites

To get the most out of this course, you should have a basic understanding of the Linux command line.

Transcript

In this lesson, you'll learn how to sort data using the sort and uniq commands. Let's start out with some data we already have on this system, which is in the /etc/passwd file. If you wanna sort the contents of a file alphabetically, you can use the sort command. Here, you can see the line that starts with vboxadd is last because it is last alphabetically. If we look at the top of the output, we'll see what's first alphabetically. So let me just pipe this to less, and we see that adm is the first line because a is at the beginning of the alphabet and, for example, v is near the end of the alphabet. Let me hit q to get out of that view. If you want to reverse the order of the sort, use the -r option. So now, the adm user is last instead of first because we reversed the order.

Let's see what happens when we use sort with numbers. I'm going to pull out the UID in the /etc/passwd file with the cut command. So we'll use a colon as a delimiter, and the UID is in the third field. Now, let's use the output of the cut command as standard input to the sort command. This demonstrates that you don't have to run sort directly against files; it can accept standard input as well. So we'll do this through a pipe. This might not be what you expected. The list is sorted, but not numerically. We have 7, 74, 8, 81, et cetera. However, when working with numbers, you probably want a numeric sort so that 7 comes first, then 8, then 74, then 81, and so on. To do that, we can use the -n option. Now, we have the smallest numbers first and the largest numbers last. Of course, you can reverse this order with the -r option.

Let's talk about the du command quickly. It displays disk usage. So let's see how much space is being used in /var. By the way, there are gonna be some files in there that are not readable by our normal user, so I'm going to use sudo to give us root privileges to look in there. You'll notice two columns. The first column is a number that represents disk usage, and, by default, this number is in kilobytes. The second column is the directory that is using that particular amount of storage. Now, let's perform a numeric sort to find out which directory in /var is using the most space. Of course, /var itself is at the very bottom, since it contains all the sub-directories within it, so it uses the most space. But the one before that, /var/lib, is the sub-directory in /var that is using the most space. And then, above that, /var/lib/rpm, /var/lib/yum, and so on. If we look at the /var/lib directory, it says it's using 92,744 kilobytes.

If you don't want to see the size in kilobytes, you can use the -h option with du, which makes it print the sizes in a human-readable format. So now, we see at the bottom, 93 meg, 4K, 91 meg, 83 megabytes, and so on. If we try to sort this human-readable data, it's not really going to work how we would like it to, so let's just demonstrate that now. At the bottom three lines, for example, you have 93 megabytes, then 96 kilobytes, and then 972 kilobytes, so it's not in the proper human-readable order. We could even try this with the -n option, and, again, we have 700K, 800K, 972K, which are smaller amounts than those megabyte numbers we were just talking about. Now, the good news is that sort has an -h option that performs a human-numeric sort. It understands that a number that ends in a capital G is a gigabyte, a number that ends in a capital M is a megabyte, and so on.
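For reference, here's a sketch of the commands typed in this part of the video. The exact output will differ by system; the /etc/passwd contents and /var layout shown on screen come from a CentOS-style machine.

    # Sort a file alphabetically; add -r to reverse the order
    sort /etc/passwd
    sort -r /etc/passwd | less

    # Pull out the UIDs (field 3, colon-delimited) and sort them numerically
    cut -d ':' -f 3 /etc/passwd | sort -n

    # Disk usage under /var, numerically sorted; then human-readable
    # on both sides with du -h and sort -h
    sudo du /var | sort -n
    sudo du -h /var | sort -h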
So now, we have 93M at the very bottom, then 91, then 83, then 5 megabytes, and so on, and the Ks, the smaller sizes, are up at the top of this sorted list. So sort -h works with human-readable numbers.

In a previous lesson, we used the netstat command to display open ports, and let's just walk through that again. So we have netstat with -n to display numbers instead of port names, -u for UDP, -t for TCP, and -l for listening ports. We enter that, and then we can see we have some data here. The data that we're most interested in is in the fourth column, and we also need to remove the headers. One way we talked about doing that was to look for something common in all the lines, and here, a common thing is a colon, so let me just grep for a colon. Now we have that separated out; the header is out of our way. Again, like I said, the data that we're after is in the fourth column, so we can use awk print $4 to get that for us. Now, what we can do is get the port numbers with awk because they're all in the last field, fields being separated by a colon in this instance. So we can do this: awk -F to separate by a colon, and then print $NF. So now, we're left with a list of ports.

However, they're not sorted, so let's fix that. Since they're all numbers, we can use sort -n to sort numerically. Now, we have a sorted list, but there are duplicate ports. Sort can handle this situation too with its -u option. The -u stands for unique, and it only displays a line if it has not been displayed before. So let's use -n -u and hit Enter. Above, we had 22, 22, 25, 25, but the output below, when we used the -u option, is 22, 25, and so on. By the way, we don't have to combine the -u option with the -n option. We can use -u on its own. So here, we have a unique list of ports; they're just not numerically sorted.

In addition to sort's -u option, there's a command called uniq, spelled U-N-I-Q, which does something very similar to the -u option. With uniq, however, the lines coming to it have to be sorted in order for it to work, because it only compares the current line to the previous line. So let me show you. Let's do a sort -n and then pipe that to uniq. Now, we have a unique list of ports with no duplicates. And just to show you that uniq doesn't work with an unsorted set of lines, let's remove this sort here and see what happens. So now, we have 22, 25, 22. Well, 22 is a repeat, and it didn't get removed because uniq was comparing 22 to 25. But if we had sorted the input, it would have been 22, 22, and then uniq would have noticed, "Oh, that's a duplicate, so I'm only gonna print the first one and not the second one."

At first glance, this might seem like an extra step. Why would you ever wanna use the uniq command if you have to give it sorted data anyway and sort already has a -u option? Well, when you want to know how many occurrences of each line there were, use uniq -c. So let's bring up our command, sort -n piped to uniq, and add the -c option. The first column is the number of times the line appeared in the output, followed by the line itself. So here, we can see there were two instances of 22, two instances of 25, one instance of 68, and so on.

Let's say you wanna find out how many syslog messages a program is generating, and you can do that by doing something like this. So let's look at the data we're working with: cat /var/log/messages.
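A sketch of the pipeline built up above. The netstat flags are the ones named in the lesson; grep ':' drops the header lines, which contain no colons.

    # Listening TCP and UDP sockets, numeric output
    netstat -nutl

    # Strip the headers, keep the fourth column (the local address), keep
    # the port (the last colon-separated field), then sort and count
    netstat -nutl | grep ':' | awk '{print $4}' | awk -F ':' '{print $NF}' | sort -n | uniq -c

    # sort's own -u option deduplicates without a separate uniq
    netstat -nutl | grep ':' | awk '{print $4}' | awk -F ':' '{print $NF}' | sort -n -u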
And if we count the fields here, one, two, three, four, five, the fifth field contains the application name, or program name, that is writing to syslog. So let's pull that out. Here, you can see a systemd entry, then an su entry, back to systemd. At the bottom, you see systemd-logind and so on. So let's sort this list. And now, let's feed this list to uniq and get a count. So here, we have some counts: one occurrence of lvm, eight of network, 86 of NetworkManager, and so on. But let's say we wanna sort this output. So let's take the output from uniq and run it back through sort. We can do this with sort -n. So now, we know there were 342 messages generated from the kernel, 310 from systemd, and so on. You can apply this to all sorts of situations where you want to know how many occurrences of something there are. For example, if you want to know what IPs are hitting your web server the most, you can strip out the IP addresses, sort them, feed them to uniq -c, and then you'll end up with a count of hits by unique IP address.

While we're on the subject of counting, I wanna spend just a quick minute here on the wc command. You can think of it as standing for word count, but it not only counts words, it can count bytes, characters, and lines. Personally, I end up using the line count option most often. So let's provide the /etc/passwd file as an argument to the wc command. The first column is the number of lines in the file, the second column is the number of words, and the third column is the number of bytes. Just to be clear, wc really doesn't understand language. It considers a word to be any non-zero-length sequence of characters delimited by whitespace. We can make wc display just a word count with -w, just a byte count with -c, and finally just a line count with -l. This tells us that there are 25 accounts on the system, because there's one account on each line in the /etc/passwd file.

Let's say you wanted to know how many accounts are using the bash shell. First, you could display the lines that match the pattern bash with the grep command. Right, maybe this isn't the greatest example because we can quickly count that there are two lines, but let's say you have hundreds of accounts on a system and there's a lot more output than you can visually scan in the moment. Then, what you would want to do is let wc count the number of lines in the output, so you'd feed the output of the grep command into wc with the -l option. Now, I know someone's gonna bring this up if I don't put it in the video, so I just wanna be clear and say that, in this particular situation, you can also use the -c option for grep to perform a count. So we can do grep -c for a count of how many lines contain bash in the /etc/passwd file. However, if you're not using a command that performs a count, then you will end up having to pipe that output to wc to perform the count for you.
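A sketch of the counting commands from this part of the lesson. The lesson doesn't name the command used to pull out the fifth field, so awk is assumed here, consistent with the earlier examples; the field position assumes the traditional syslog format of /var/log/messages.

    # Count syslog messages per program, then sort the counts numerically
    sudo awk '{print $5}' /var/log/messages | sort | uniq -c | sort -n

    # wc prints lines, words, and bytes; -l alone prints just the line count
    wc /etc/passwd
    wc -l /etc/passwd

    # Two ways to count the accounts using bash
    grep bash /etc/passwd | wc -l
    grep -c bash /etc/passwd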
Okay, let's get back to sorting. There's one last option to sort that I wanna cover before we wrap things up, and that option is -k, which allows you to specify a sort key. So far, we've been sorting on the very first bit of data in a line. If you have data separated into multiple fields, perhaps you wanna sort on a field other than the first one. So let's go back to our passwd file and just cat it. Now, let's say we want to sort the /etc/passwd file based on UID, and UID is in the third column, with each column being separated by a colon. By default, sort uses whitespace as a field separator, so to tell sort to use a colon, we need to use the -t option. Then, we can use the -k option to provide a sort key. The simplest sort key is a number which represents the field to sort by. So: cat /etc/passwd, tell sort to use a colon as a field separator, and tell it to sort on the third field. This third field happens to be made up of numbers, so we're going to use a numeric sort with -n. Of course, we can combine this with other options, like -r for a reverse sort, as well. So as we all know, the root account has a UID of zero, and then you can see the account on this particular system with a UID of one is bin, daemon has a UID of two, adm has a UID of three, and so on.

Let's do a little demonstration on how to analyze a web server log file using sort and uniq. Let's say you wanna know how many times a particular URL was visited. First, let's look at what we have to work with. So I have an access_log file here in /vagrant. My first goal is to extract the URL portion from the file. Now, there are multiple ways to do this. However, what I notice is that the URL is contained within a set of quotation marks, so let me split on that and see where that takes us. We'll just feed this to cut, use double quotation marks as a delimiter, print the second field, and hit Enter. By the way, I didn't have to cat the file into cut like I did here. What I can do is supply the file to cut directly, cut -d '"' -f 2 access_log, and we get the same result. Now, it looks like we're left with three columns of data separated by a single space. The second column has the URL in it, so let's pull that out. We can do this with the cut command as well.

Again, if you saw things differently or think in a different way, that's perfectly fine. Perhaps your mind went to counting the column numbers first, something like this. So let me just cat the access log here. The first column has an IP address, so that's column one. A dash is column two, another dash is column three. Okay, so one, two, three, four, five, six, seven. It looks like the seventh column is where the URL is contained, so let's test that with awk print $7 access_log. Okay, and we end up with the same data. I just wanted to be clear that there's no one exact perfect way to do this, so just use whatever makes sense to you. And however you visualize the data, just keep extracting parts of it and transforming it until it looks like what you need.

Okay, so let's continue on. Let me go back to my command using cut. Now, I want to count the number of times each one of those URLs was visited. I know I can do that with the uniq command, and I also know that I need to provide uniq with sorted data first. So what I'm gonna have to do is pipe this through sort, and then I can go back and pipe it through uniq with the -c option to let it do the counting for me. So now, we have a count in column one and the URL in column two. Next, we can sort this uniquely counted output with the sort command. So let me bring up my command and run it through sort -n. It looks like we had 1,271 visits to /wp-admin, 1,265 to /explore, and so on. Now, let's say we only want to display the top three most visited URLs. We can do this simply by displaying the last three lines of output with the tail command. So we'll take this big long command that we've been building up, and we'll pipe it yet again to another command, and this command is tail -3, to print the last three lines.
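Here's a sketch of the commands from this section, assuming the combined-format access_log sitting in /vagrant as described above.

    # Sort /etc/passwd numerically by UID (field 3, colon-separated); -r reverses
    sort -t ':' -k 3 -n /etc/passwd
    sort -t ':' -k 3 -n -r /etc/passwd

    # Two ways to extract the URL from the access log
    cut -d '"' -f 2 /vagrant/access_log | cut -d ' ' -f 2
    awk '{print $7}' /vagrant/access_log

    # Count visits per URL and show the three most visited
    cut -d '"' -f 2 /vagrant/access_log | cut -d ' ' -f 2 | sort | uniq -c | sort -n | tail -3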
So here, you see the three most visited URLs according to that access_log file. Now, I'm gonna go ahead and put that command into a script so I don't have to solve this same problem again. So I'm actually going to copy it. Then, I will edit my script here. We start out with a shebang, then a comment that tells what the script does. We'll let the user pass in the log file, and we wanna make sure that that log file exists, so we can do a quick check here. This says: if LOG_FILE does not exist, then give the user an error message. Now, I can paste the command that I copied earlier and change the file name to the LOG_FILE variable. Okay, let me save my changes, make my script executable, and then give it a try. First, pass no data. Well, it can't open anything. Next, it can't open asdf because that doesn't exist. So let's actually give it a path to a file that does exist. And here we go, it runs the command against what we provided, giving us the three most visited URLs in that file.

So to recap, in this lesson, you learned how to sort data using the sort command. You learned how to use the -n option to sort numerically. You also learned how to use the -r option to reverse the sort order. From there, you learned how to display only unique lines using sort -u and the uniq command. You also learned how to count items with the wc command. Finally, you used the -t and -k options to sort data based on a specific field.
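For reference, here's a minimal sketch of the script assembled above. The pipeline is the one built up during the lesson; the exact error wording and the style of the existence check are assumptions, since they aren't spelled out in the transcript.

    #!/bin/bash

    # Display the three most visited URLs in a web server access log.
    # (Hypothetical reconstruction of the script written in the video.)

    LOG_FILE="$1"

    # Make sure the supplied log file actually exists.
    if [[ ! -e "$LOG_FILE" ]]; then
      echo "Cannot open log file: ${LOG_FILE}" >&2
      exit 1
    fi

    cut -d '"' -f 2 "$LOG_FILE" | cut -d ' ' -f 2 | sort | uniq -c | sort -n | tail -3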

About the Author

Jason Cannon
Founder, Linux Training Academy

  • Students: 3386
  • Courses: 60
  • Learning Paths: 8

Jason is the founder of the Linux Training Academy as well as the author of "Linux for Beginners" and "Command Line Kung Fu." He has over 20 years of professional Linux experience, having worked for industry leaders such as Hewlett-Packard, Xerox, UPS, FireEye, and Amazon.com. Nothing gives him more satisfaction than knowing he has helped thousands of IT professionals level up their careers through his many books and courses.
