In this course, we'll explore a range of options for manipulating data. We'll look at some commands that can be used in shell scripts to transform and process data. We'll then walk you through an example of parsing log files to find data that matches certain criteria. Finally, you'll learn how to manipulate data and transform text with sed.
This course is part of the Linux Shell Scripting learning path. To follow along with this course, you can download all the necessary resources here.
Learning Objectives
- Learn a range of commands used for transforming, processing, and reporting data including cut, awk, sort, and uniq
- Learn how to parse log files for failed login attempts and identify the IP addresses they came from
- Manipulate data and transform text with sed
Intended Audience
- Anyone who wants to learn Linux shell scripting
- Linux system administrators, developers, or programmers
Prerequisites
To get the most out of this course, you should have a basic understanding of the Linux command line.
Let's say we want a list of port numbers that are open on our local system, without any extra data around it. If we have SSH listening on port 22, for example, we just want the number 22 to be displayed. The netstat command can display open ports, and instead of showing you the man page for it, I'm just gonna walk you through the options that we're gonna use today. So we'll use the netstat command. We'll use the dash n option to display numbers instead of names. So, instead of displaying ssh or sshd, it will display 22 for port 22. We can use dash u to get information on UDP, dash t to get information on TCP, and dash l for listening ports. So when we run that command, we have this output. In this particular case, we have two lines that comprise the header. There are a couple of different ways to manage this. One way would be to pipe it to grep and do grep dash v, which we know how to do, and then use a pattern like this. That will get rid of the first line, and then we could continue by using grep again to get rid of the second line, like this. That will leave us with just the pure data, without the header. Now, I didn't really talk about this with grep, but what you can also do is use extended regular expressions with the dash capital E option. That allows you to do something like this. The pipe symbol in regular expressions is an "or," so if we match Active or Proto, we will have made a match. And if we hit enter, you can see that it gets the same results as using those two grep commands above. It's a little bit shorter, and some people like that. Again, that's just one way to solve this problem. Now, if we look at the data that we have left, there are some things in common between the lines and some things that are not: some lines have tcp, some have tcp6, some udp, some have LISTEN, some have numbers, and so on. But one thing that is constant throughout is the colons. So again, let me run this without any decoration or filtering.
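The header filtering described above can be sketched like this. The sample data here stands in for live `netstat -nutl` output, which will differ on your system:

```shell
# Sample "netstat -nutl" output; real output varies by system.
data='Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
udp        0      0 0.0.0.0:68              0.0.0.0:*'

# Two chained grep -v commands strip the header lines one at a time...
echo "$data" | grep -v 'Active' | grep -v 'Proto'

# ...or one grep -Ev, using the regex "or" (|), does the same in a single pass.
echo "$data" | grep -Ev 'Active|Proto'
```

On a live system you would pipe netstat straight into grep, e.g. `netstat -nutl | grep -Ev 'Active|Proto'`.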
So we see the first two lines here do not have any colons, so we could do something like this: grep for a colon, and that would just display all the lines with a colon in them. Either way, the goal here is to get the data without the header. Now, let's just go ahead and proceed this way. Let me tell you about the data we're looking to extract. On line one, we wanna pull out 22, because that's the port. On the second line, we wanna pull out 25, which is the port for that one. At first, though, you may be thinking we can just split this on a colon and print the second field. Well, let's just try that and see what happens. We'll do cut, dash d for a colon as a delimiter, and we'll print the second field. That doesn't give us exactly what we want, because we have some extra data on some lines, and on other lines we have blank lines. As you'll notice, up here this line has three colons, so when we do dash f two, we get nothing returned, because there's no data in the second field for that line. So using the cut command in that way is not going to work. Now, if we go back to our original data, we see that all the ports that we're interested in are actually in the fourth column, that column being Local Address. You can also see that these columns aren't separated by a consistent number of spaces, for example, so that rules out cut. But we know that awk handles these kinds of whitespace situations well, so we're gonna use that to pull out the fourth column. We'll narrow it down to the lines that have the data we're looking for, and then we'll print the fourth column, narrowing down the data even further. Again, we're left with a similar situation where cut wouldn't work, because on every line, the port numbers that we're after are at the end of the line: colon 22, colon 25, colon 22, colon 25, colon 68, colon 7755, and colon 26314. Now, I have the answer there in my description of the data, which is that every one of those is preceded by a colon.
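That whitespace problem can be sketched as follows, again using captured sample data rather than live netstat output:

```shell
# Header-free sample data; columns are separated by varying runs of spaces.
filtered='tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
udp        0      0 0.0.0.0:68              0.0.0.0:*'

# cut -d' ' treats every single space as a delimiter, so field numbers shift
# from line to line; awk splits on runs of whitespace by default.
echo "$filtered" | awk '{print $4}'
# 0.0.0.0:22
# :::22
# 0.0.0.0:68
```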
So we could actually use that as a delimiter and just print dollar NF, which is the last field. If we have colon colon colon 22, the port isn't in the same field number as when we have some numbers and then colon 22; in one case we have two fields, and in the other case we have more than two fields, but dollar NF is the last field either way. So let's use awk, specify a colon as the field separator, and then just print the last field on each line with dollar NF. Now we are left with the data that we want. If we go back and run our regular netstat command here, we can see that we're getting data from tcp and tcp6, so TCP v4 and TCP v6, and the same with UDP. One way to get just TCP v4 is to use the dash four option to netstat. So we can do this, and that eliminates the TCP v6 data there. I'm not really using TCP v6, so I'm not really worried about that. So let's break down this set of data with the same idea, just extracting the ports that are listening. Here again, we can use grep to either exclude the headers or include the data, using something that's common, and we know that a colon is common in the data. That leaves us with the data, and then we can do something like pull out the fourth field with awk. Now, in this particular case, we are left with two columns of data, the first column being an IP address and the second column being a port, separated by a colon. This would be an ideal situation where we could use cut. So we could use cut, dash d colon, and get the second field. Now we're left with 22, 25, 68, and 7755 on this particular system; it may not look this way on your system. We could also use the awk command in this situation. It doesn't have to be cut; we could use awk with a field separator of a colon and then print dollar sign two, for example, and that works. Or we can even go back to our original command and use dollar sign NF for the last field. Even though we know there are two fields, this will still work.
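Here is a sketch of those last-field extractions, run on sample address data rather than a live system:

```shell
# Sample "Local Address" column, mixing IPv4 and IPv6 entries.
addrs='0.0.0.0:22
:::22
0.0.0.0:68'

# With ':' as the separator, ':::22' has four fields (three of them empty),
# while '0.0.0.0:22' has two -- but $NF is the last field in both cases.
echo "$addrs" | awk -F ':' '{print $NF}'
# 22
# 22
# 68

# With IPv4-only data there is exactly one colon per line, so cut works too.
v4addrs='0.0.0.0:22
0.0.0.0:68'
echo "$v4addrs" | cut -d ':' -f 2
# 22
# 68
```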
So in this regard, it works with dash four, and without dash four it also works with TCP v6 and TCP v4. Let's put this little command into a script so we can have it when we need it later; this way, we don't have to solve the same problem again in the future. I'm really just gonna copy this and put it in a script. We're going to allow a dash four option, as we were doing with netstat, so as to only display the TCP v4 ports. If I paste our command in here, this gets us 99% of the way. The last piece here is passing in this dash four. The simplest way would be to just pass along whatever was passed to our script to the netstat command itself, without doing any checks. If this is a personal script that I'm not sharing and I'm not worried about, there's a good chance that I would leave it as it is. Let's try it out, and then let's talk about some safeguards that we could put in place here. We need to make sure it's executable, and then run it, and it shows us our ports. And if we use dash four, then it limits those ports to the v4 protocols. But if we do something like this, then we get an error from netstat, because dash blah is not a valid option to netstat. We could also send all kinds of different things to netstat that we may or may not want to. So if you wanna be more exact, you could put in a check like this: if dollar sign one is equal to dash four, then do something like that. I'm not gonna do that; I'm gonna leave it as it is, and that works for our situation here. One last thing that isn't exactly related to scripting, but is kind of useful to know, is the dash p option to netstat. It displays the PID and the name of the program that has the port open. But to get that information, you need to run netstat with superuser privileges. So we can do this. Actually, let's limit this data down a little bit.
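The finished script might look like the sketch below. The filename open-ports.sh is an assumption for this example, and it includes the optional argument check described above rather than the bare pass-through:

```shell
# Write the script out; "open-ports.sh" is an assumed name for this sketch.
cat > open-ports.sh <<'EOF'
#!/bin/bash
# open-ports.sh: print the listening port numbers, one per line.
# An optional -4 argument is passed through to netstat to show IPv4 only.

# Optional safeguard: accept -4 or nothing, and reject anything else so we
# don't forward arbitrary options straight to netstat.
if [ -n "$1" ] && [ "$1" != '-4' ]; then
    echo "Usage: $0 [-4]" >&2
    exit 1
fi

# $1 is deliberately unquoted so it disappears when no argument was given.
netstat -nutl $1 | grep ':' | awk -F ':' '{print $NF}'
EOF
chmod +x open-ports.sh
```

Running `./open-ports.sh` lists every listening port, `./open-ports.sh -4` limits it to IPv4, and anything else (like `-blah`) prints the usage message and exits nonzero before netstat ever runs.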
What this shows is that we have sshd, with a PID of 909, listening on port 22 on our system. So that's just a handy command to know. Okay, enough random tips; let's get back to what you learned here today. In this lesson, you learned how to extract sections from input using the cut command. You learned how to cut based on byte, character, and field. You also learned how to specify a delimiter, so you can easily work with CSV files, password files, and any other type of data that is organized in columns. You also learned about the caret and dollar sign regular expression anchors, and used them with the grep command. Finally, we spent some time with awk, and how it handles multi-character delimiters and whitespace better than cut.
Jason is the founder of the Linux Training Academy as well as the author of "Linux for Beginners" and "Command Line Kung Fu." He has over 20 years of professional Linux experience, having worked for industry leaders such as Hewlett-Packard, Xerox, UPS, FireEye, and Amazon.com. Nothing gives him more satisfaction than knowing he has helped thousands of IT professionals level up their careers through his many books and courses.