
Shell Scripting: Transforming, Processing, and Reporting Data

The course is part of this learning path

Cut and Awk
Difficulty: Intermediate
Duration: 1h 33m
Students: 73
Ratings: 5/5
Description

In this course, we'll explore a range of options for manipulating data. We'll look at some commands that can be used in shell scripts to transform and process data. We'll then walk you through an example of parsing through log files to find data that matches certain criteria. Finally, you'll learn how to manipulate data and transform text with sed.

This course is part of the Linux Shell Scripting learning path. To follow along with this course, you can download all the necessary resources here.

Learning Objectives

  • Learn a range of commands used for transforming, processing, and reporting data including cut, awk, sort, and uniq
  • Learn how to parse log files for failed login attempts and identify the IP addresses those attempts came from
  • Manipulate data and transform text with sed

Intended Audience

  • Anyone who wants to learn Linux shell scripting 
  • Linux system administrators, developers, or programmers

Prerequisites

To get the most out of this course, you should have a basic understanding of the Linux command line.

Transcript

In this lesson, you'll learn how to use the cut and awk commands. The cut command cuts out sections from each line of input it receives and displays those sections to standard output. You can use cut to extract pieces of a line by byte position, by character position, or by a delimiter. This makes cut ideal for extracting columns from a CSV file, for example. cut is not a shell builtin; it's a standalone utility, so we can use the man command to get some information on it. You can cut by bytes with the -b option, by characters with the -c option, and by fields with the -f option. For each one of these options you'll need to supply a range. Ranges are pretty simple, and I'll demonstrate them with a couple of examples. Also, you'll probably use the -d option to specify a delimiter when using the -f option, unless you are working with tab-delimited data.
Okay, I'm just gonna use the text that already exists in the /etc/passwd file and then use the cut utility to slice it up in different ways. So let's just look at the contents of that file now. Let's start out by cutting the password file by character. To print the first character of each line we'll use -c 1, and 1 here is the range that we're specifying. So we'll do cut -c 1 /etc/passwd. If you notice, we have a vagrant user and a vboxadd user at the bottom of our screen, so when we execute this command the last two lines should begin with the letter v. Okay, sure enough, it's v on the last two lines, just like we anticipated. If you supply a single number, then only the character at that position is displayed. So to display the seventh character of each line, use -c 7. Okay, that's not too useful. This next one isn't too useful either, but hang in there with me. We'll do some actual work in just a minute. You can specify a starting position and an ending position by connecting them with a hyphen.
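As a minimal sketch of the selections described so far, the commands below feed two made-up passwd-style entries to cut on standard input rather than reading the real /etc/passwd, so the result is the same on any system. The sample entries and the variable names are assumptions for illustration only.

```shell
# Two sample lines in /etc/passwd format (illustrative, not read from the real file).
sample='vagrant:x:1000:1000:vagrant:/home/vagrant:/bin/bash
vboxadd:x:997:1::/var/run/vboxadd:/bin/false'

# -c 1 selects the first character of every line.
first_chars=$(printf '%s\n' "$sample" | cut -c 1)

# -c 7 selects only the seventh character of every line.
seventh_chars=$(printf '%s\n' "$sample" | cut -c 7)

# -c 4-7 selects characters four through seven; note there are no spaces in the range.
four_to_seven=$(printf '%s\n' "$sample" | cut -c 4-7)

printf '%s\n' "$first_chars" "$seventh_chars" "$four_to_seven"
```

Swapping the printf pipeline for a file argument, as in cut -c 1 /etc/passwd, gives the form used in the lesson.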
So to cut out characters four through seven we could use -c 4-7. It's important when you're specifying a range that you do not use spaces. There is no space between the 4, the hyphen, and the 7 in our command. So just keep that in mind. Okay, let's say you want to display every character on a line, starting with character four. To do that, use a range of 4-. This is useful if you don't know how long each line is or if the lines are of varying lengths. So cut -c 4- /etc/passwd. You can do the opposite, which is to display every character up to and including a position. So to display the first four characters you can use -c -4. By the way, that range is exactly the same as 1-4, and we get the same output.
So here's one last range. You can pick out multiple individual characters by separating them with a comma. For example, to print the first, third, and fifth characters use 1,3,5. It's important to point out here that cut won't rearrange the order even if you specify a different order in the range. For example, this command that I'm about to type generates the exact same output: cut -c 5,3,1. There are a couple of different methods for rearranging the data, but we'll get to that later. And one more thing: if you supply a range that doesn't match anything, then you'll get a blank line. So let's try to print the 999th character in the password file. There isn't a 999th character in any of the lines, so there is nothing to display for the line and you end up with a blank line being displayed instead. Okay, so that really covers how to use ranges.
You'll notice that I've been using the -c option to cut on characters, but you can use the -b option to cut by byte. To display the first byte of every line in the /etc/passwd file, we would use -b 1. That's the same in this particular case as -c 1.
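The range forms covered above can be tried side by side. This is a sketch on a single made-up passwd-style line; the sample line and the variable names are assumptions, not part of the lesson.

```shell
# One sample line in /etc/passwd format (illustrative only).
line='root:x:0:0:root:/root:/bin/bash'

# 4- : from character four to the end of the line.
from_four=$(printf '%s\n' "$line" | cut -c 4-)

# -4 and 1-4 : the first four characters, written two ways.
up_to_four=$(printf '%s\n' "$line" | cut -c -4)
one_to_four=$(printf '%s\n' "$line" | cut -c 1-4)

# A comma-separated list; cut emits the positions in input order either way.
listed=$(printf '%s\n' "$line" | cut -c 1,3,5)
reversed=$(printf '%s\n' "$line" | cut -c 5,3,1)

# A position past the end of the line yields an empty line.
out_of_range=$(printf '%s\n' "$line" | cut -c 999)

# For single-byte (ASCII) text, -b 1 selects the same thing as -c 1.
first_byte=$(printf '%s\n' "$line" | cut -b 1)

printf '%s\n' "$from_four" "$up_to_four" "$one_to_four" \
  "$listed" "$reversed" "[$out_of_range]" "$first_byte"
```

The brackets around the out-of-range result are only there to make the empty value visible when printed.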
However, a byte is not always the same as a character, because some characters are made up of multiple bytes and are thus called multi-byte characters. For example, many UTF-8 characters are multi-byte characters. Let me display a multi-byte character to the screen with echo. To display the first character in that string, use -c 1. So we'll just pipe the output of echo as the standard input to the cut command and do -c 1. What we did there with a pipe is pretty standard, and you've been doing it a lot throughout this course. I just wanted to be explicit and say that you don't have to supply a file for cut to operate on some data. You can use standard input as well. Okay, so back to the difference between a byte and a character. So -c 1 prints the ñ. And I hope I pronounced that correctly. Apologies to anyone who speaks Spanish. But look what happens when we only print the first byte with -b 1. Only the first byte of the multi-byte ñ character is displayed. In most cases, this is not what you want. Just something to keep in mind.
Let's move on to the -f option. It allows you to cut lines by field. By default -f splits on a tab. Anything before the first tab is considered to be the first field. Anything after the first tab and before the second tab is considered to be the second field, and so on. cut uses the term field, but you can think of these as columns if you wish. Let me generate some tab-delimited data with the echo command. The -e option to echo allows you to use backslash escapes to generate things like a tab character, a newline, and so on. Backslash t represents a tab. So we can do this: echo -e, then the word one, and then backslash t, which will produce a tab. We'll use the word two, backslash t for another tab, and the word three. So if we want to display just the first field we can do this: cut -f 1. To display the second field, cut -f 2, and the third field is of course cut -f 3.
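Here is a small sketch of the tab-delimited field selection just described. It uses printf rather than echo -e to produce the tabs, since printf behaves the same across shells; the row helper is a hypothetical name introduced for this example. The byte-versus-character comparison is left out because its result depends on the cut build and locale: some builds currently treat -c the same as -b.

```shell
# Hypothetical helper: prints one line with three tab-separated fields.
# printf turns \t into a tab, much as echo -e does in bash.
row() { printf 'one\ttwo\tthree\n'; }

# With no -d option, -f splits on tabs.
first=$(row | cut -f 1)
second=$(row | cut -f 2)
third=$(row | cut -f 3)

printf '%s\n' "$first" "$second" "$third"
```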
So what happens if you have data that is not tab separated? Let's say we're dealing with a CSV, or comma-separated values, file. In this case, you need to tell the cut command what to use as the delimiter. Here it's a comma. There's the first field, the second field, and the third field. Sometimes you'll see people do what I'm about to do here, which is to not use quotes around the delimiter when specifying it. So you might see something like this: cut -d , -f 2, for example. You may also see people not put a space after the -d and do this. Either one of those methods works, as long as the delimiter is a character that's not interpreted by the shell. If you try to do this, it won't work. So we'll do this. Here you have to quote the backslash, otherwise the shell interprets it as a line continuation character. It's just one of those little gotchas that can happen, and that's why I suggest you always quote your delimiter.
The password file is actually made up of a series of columns, or fields, all separated by a colon. So let's print the username and UID of every user in the password file. We'll specify the delimiter as a colon and say give us fields one and three from the /etc/passwd file. Notice that the output is delimited by the original delimiter. To change it, use the --output-delimiter option. So let's change it to something else here.
Here's a common situation that you'll face. You'll have a CSV file with a header, or some other type of data that contains a header. Let me create a CSV file on the fly here. When you do this, you get the header in the output. So you have two choices: remove the header before you send the data to cut, or remove it after cut has done its work. Before we do that, let's review the grep command quickly. By default grep displays the lines that match a pattern that you supply. So if we look for the pattern first, it will display the line or lines that match that pattern.
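The delimiter handling above can be sketched as follows. The on-screen commands and the CSV file are not captured in the transcript, so the sample data, including the contents of people_csv, is an assumption chosen to give three matches for the pattern first. Note that --output-delimiter is a GNU cut option and may not exist in other implementations.

```shell
# Comma-separated input: -d names the delimiter, -f picks the field.
csv_second=$(printf 'one,two,three\n' | cut -d ',' -f 2)

# A backslash delimiter has to be quoted so the shell passes it to cut literally.
backslash_second=$(printf 'one\\two\\three\n' | cut -d '\' -f 2)

# Two sample lines in /etc/passwd format (illustrative only).
passwd_sample='root:x:0:0:root:/root:/bin/bash
vagrant:x:1000:1000:vagrant:/home/vagrant:/bin/bash'

# Fields 1 and 3 (username and UID); the output keeps the input delimiter.
name_uid=$(printf '%s\n' "$passwd_sample" | cut -d ':' -f 1,3)

# --output-delimiter (GNU cut) replaces the delimiter on output.
name_uid_csv=$(printf '%s\n' "$passwd_sample" | cut -d ':' -f 1,3 --output-delimiter=',')

# Hypothetical CSV data with a header row.
people_csv='first,last
John,Smitt
firstly,mclasty
mr. firstly,mclasty'

# cut has no notion of a header, so "first" shows up in the output.
first_column=$(printf '%s\n' "$people_csv" | cut -d ',' -f 1)

# grep prints every line containing the pattern, header included.
matches=$(printf '%s\n' "$people_csv" | grep 'first')

printf '%s\n' "$csv_second" "$backslash_second" "$name_uid" "$name_uid_csv" \
  "$first_column" "$matches"
```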
So we do grep first people and here we have three matches. Notice that it doesn't display any of the other lines that do not match. Let's narrow down our search such that it only matches the header. You can do that by supplying more information or a more exact pattern. If you wanna be exact, you can use regular expression anchors. Speaking of regular expressions, what I'm about to show you are my two most commonly used regular expressions ever. If you never learn any more about regular expressions, you'll have at least these two very important ones at your disposal. The first regular expression is the caret symbol. It matches the beginning of a line. It matches a position and not a character. So if we wanna match all the lines that start with first, use ^first, like so. Notice that the results returned are different from when you do not use the caret. The second regular expression is the dollar sign. It matches the end of a line. It too matches a position and not a character. So if you wanna find all the lines that end in t we can do this: grep 't$'. To force an exact match, you can start your pattern with a caret and end it with a dollar sign.
Now we have isolated the header of the file, but we want everything except that. Luckily, grep has a handy option that inverts matching. That option is -v. The -v option makes grep display any lines that do not match the pattern supplied. Now that we've removed the header, we can send the rest to cut. Another option is to perform the cut first and then remove the header. I don't like this as much because cut alters the output first, making the header change too, but it does work. So we can do this. This is what we get with the header, and this removes the header.
By the way, cut only handles single-character delimiters. This is fine in most cases, but there might be occasions where you would want or need to split on multiple characters. Take this example.
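The on-screen data is not captured in the transcript, so what follows is a stand-in sketch: it first replays the anchor and header-removal steps on made-up CSV data, then sets up a made-up record whose real delimiter is the multi-character string data: and splits it on the colon alone. All sample values and variable names here are assumptions.

```shell
# Hypothetical CSV data with a header row.
people_csv='first,last
John,Smitt
firstly,mclasty
mr. firstly,mclasty'

# ^ anchors the pattern to the start of the line, $ to the end.
starts_with_first=$(printf '%s\n' "$people_csv" | grep '^first')
ends_with_t=$(printf '%s\n' "$people_csv" | grep 't$')

# Both anchors together force an exact match on the header line.
header_only=$(printf '%s\n' "$people_csv" | grep '^first,last$')

# -v inverts the match: drop the header, then cut the first field.
first_names=$(printf '%s\n' "$people_csv" | grep -v '^first,last$' | cut -d ',' -f 1)

# Alternative: cut first, then remove the header, which cut has shortened to "first".
first_names_alt=$(printf '%s\n' "$people_csv" | cut -d ',' -f 1 | grep -v '^first$')

# Hypothetical record where the real delimiter is the string "data:".
record='data:firstdata:last'

# Splitting on the colon alone leaves "data" stuck to the value.
colon_split=$(printf '%s\n' "$record" | cut -d ':' -f 2)

printf '%s\n' "$starts_with_first" "$ends_with_t" "$header_only" \
  "$first_names" "$first_names_alt" "$colon_split"
```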
At first glance you might think, oh, I can just split this on the colon. Let's try that and see what happens. That leaves the string data, which really should be considered part of the delimiter. It's not part of the actual data itself. It's a pointer to the real data. What you would really like to do is this. But as you can see, that doesn't work. We can do that with awk, however. Now, I'm not saying that this is the only way to handle the situation, but it is one way. Plus, it gives me a chance to briefly cover awk. Every good shell scripter should at least be aware of awk. Let me just give you the answer first and then I'll explain it to you in just a second. This is an entire awk program on a single line. The -F option, that's a capital F, allows you to specify a field separator. We're telling it to use data: as the field separator. The entire program is contained in the next set of single quotes. The braces in awk mean an action. This makes awk do things or take actions. The action we want awk to take is to print. As you have probably figured out by now, $2 represents the contents of the second field. So $1 is the data in the first field, $2 is the data in the second field, and so on and so forth.
Let's go back to a previous cut example. Here we're displaying the first and third fields from the /etc/passwd file. To do something very similar in awk you can do this. Here you see that awk separates $1 and $3 with a space. That's because the comma in the print statement represents the output field separator. By default the output field separator is a space. In awk, if you leave out that comma, then the fields just run together like this. Let's go back to our print statement here. I'll just execute this. awk has a special built-in variable named OFS, that's capital OFS, and that stands for output field separator.
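A sketch of the awk commands described above, run against made-up input. The record and the passwd-style lines are assumptions; only the -F option, the print action, and the $1, $2, $3 field references come from the lesson.

```shell
# Hypothetical record where the real delimiter is the string "data:".
record='data:firstdata:last'

# cut rejects a multi-character delimiter, so this branch records the failure.
if printf '%s\n' "$record" | cut -d 'data:' -f 2 2>/dev/null; then
  cut_result='ok'
else
  cut_result='failed'
fi

# awk accepts a multi-character field separator through -F.
# The text before the first "data:" is an empty first field, so $2 is "first".
awk_second=$(printf '%s\n' "$record" | awk -F 'data:' '{print $2}')

# Two sample lines in /etc/passwd format (illustrative only).
passwd_sample='root:x:0:0:root:/root:/bin/bash
vagrant:x:1000:1000:vagrant:/home/vagrant:/bin/bash'

# The comma in print inserts the output field separator, a space by default.
with_comma=$(printf '%s\n' "$passwd_sample" | awk -F ':' '{print $1, $3}')

# Without the comma, the two fields are concatenated.
run_together=$(printf '%s\n' "$passwd_sample" | awk -F ':' '{print $1 $3}')

printf '%s\n' "$cut_result" "$awk_second" "$with_comma" "$run_together"
```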
You can change the default from a space to anything you would like by changing the value of that variable. To set a variable in awk, use the -v option and then perform the variable assignment. So to change OFS to a comma, we can do this. We'll go here, add a -v option, set OFS equal to a comma, and hit Enter. To be clear, it's definitely the OFS variable that controls the output delimiter being displayed and not the space used in the print statement. We get the same results even if we do this. Let me just add a bunch of spaces in between here and some here. When I hit Enter, you're gonna see that the data is exactly the same. Instead of setting the OFS variable, you can just give print a string to print, like so. The string we're going to print is a comma, and then we'll specify field three here. If you want a space after the comma, for example, just add that space in your string. So let me go back up here and just put a space in my string and hit Enter. awk is really lenient with spacing. So this is the exact same command: I can run $1 right up against the string and $3 there, or I could put a lot of space in between these like this, and you'll see that that doesn't really affect the execution of awk. That may be clear to you or it may not. Now let's add some more text to our print statement here. Let's do that. Let's say we wanna print column one. So hopefully that gives you an idea of how you can use strings with your print statements and how the output field separator works.
So if you remember, earlier I said you can't control the order of the data being displayed with cut. Let's take this example. It displays the fields in the order that they appear in the input. With awk you can change it like so. We just tell it to print the third field first and then the first field. Then you can combine it with any additional strings you want.
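The OFS, string, and reordering behavior described above can be sketched on one made-up passwd-style line. The sample line, the variable names, and the exact wording of the labels in the last command are assumptions for illustration.

```shell
# One sample line in /etc/passwd format (illustrative only).
line='root:x:0:0:root:/root:/bin/bash'

# -v OFS=',' changes the separator that the comma in print stands for.
ofs_comma=$(printf '%s\n' "$line" | awk -F ':' -v OFS=',' '{print $1, $3}')

# Extra spaces inside the print statement do not change the output.
ofs_spaced=$(printf '%s\n' "$line" | awk -F ':' -v OFS=',' '{print $1,      $3}')

# A literal string between the fields works without touching OFS.
string_sep=$(printf '%s\n' "$line" | awk -F ':' '{print $1 ", " $3}')

# cut keeps input order even when the list is written as 3,1 ...
cut_order=$(printf '%s\n' "$line" | cut -d ':' -f 3,1)

# ... while awk prints fields in whatever order the print statement names them.
reordered=$(printf '%s\n' "$line" | awk -F ':' '{print $3, $1}')

# Fields combined with label strings (the label wording is an assumption).
labeled=$(printf '%s\n' "$line" | awk -F ':' '{print "UID: " $3 ";LOGIN: " $1}')

printf '%s\n' "$ofs_comma" "$ofs_spaced" "$string_sep" "$cut_order" "$reordered" "$labeled"
```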
So maybe we want to say, this is the UID, and we'll separate that with a semicolon, then the login, like so. In addition to $1, $2, $3, and so on, awk gives us the NF variable, which holds the number of fields found, so $NF refers to the last field. So to print the last field for every line in a file, use $NF. The password file is very uniform, so using $NF isn't exactly groundbreaking here. But if you are dealing with irregular data that doesn't fit nicely into columns, you can often find something common in each line and then use $NF to shorten the data down or to normalize it. Even if the number of fields is consistent in the data, it might be easier to say print the last field, print $NF, so you don't have to count the number of fields first. If you have a CSV file with 47 different fields or columns and need the last one, it's a lot quicker to use $NF instead of counting all those columns. You can do some math with awk too; just surround it in parentheses. So check this out. What this command does is print the field at NF minus one, which is of course the second-to-last field. So if there are seven fields, NF is seven, seven minus one is six, and it prints the sixth field, for example.
Let's generate some irregular data. You can see that what we really have here is a file with four lines in it, and each line is made up of two columns separated by varying lengths of whitespace, whitespace being spaces and/or tabs. It would be really hard to make sense of this data with cut because it only allows us to split on a single character. Even if we split on a space, we wouldn't end up with what we wanted because different lines have different numbers of spaces separating the columns. Also, it wouldn't handle lines with tabs. However, awk performs really well in this situation. By default, the field separator for awk is whitespace.
Or, to say it another, maybe even more accurate way, awk considers each run of non-whitespace characters to be a field by default. awk easily handles extraneous spaces at the beginning and end of each line, for example. So really those are the two times that I personally use awk. One is to use a delimiter that's made up of more than a single character. The other is to handle fields separated by whitespace.
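To round out the lesson, here is a sketch of $NF, $(NF - 1), and awk's default whitespace splitting. The passwd-style line, the irregular helper, and the four rows it prints are made-up stand-ins for the on-screen data, which is not captured in the transcript.

```shell
# One sample line in /etc/passwd format (illustrative only).
line='root:x:0:0:root:/root:/bin/bash'

# $NF is the last field; $(NF - 1) is the second-to-last field.
last_field=$(printf '%s\n' "$line" | awk -F ':' '{print $NF}')
second_to_last=$(printf '%s\n' "$line" | awk -F ':' '{print $(NF - 1)}')

# Hypothetical helper: four rows, two columns each, separated by a mix
# of spaces and tabs, with stray whitespace at the line edges.
irregular() {
  printf 'L1C1     L1C2\n    L2C1 L2C2   \n  L3C1\tL3C2\nL4C1 \t L4C2\n'
}

# Splitting the first row on a single space gives an empty second field,
# because cut treats every space as its own delimiter.
cut_second=$(irregular | head -n 1 | cut -d ' ' -f 2)

# awk's default separator treats any run of spaces and tabs as one delimiter
# and ignores leading and trailing whitespace.
columns=$(irregular | awk '{print $1, $2}')

printf '%s\n' "$last_field" "$second_to_last" "[$cut_second]" "$columns"
```

The brackets around the cut result only make the empty field visible when printed.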

About the Author
Jason Cannon
Founder, Linux Training Academy
Students: 3453
Courses: 61
Learning Paths: 8

Jason is the founder of the Linux Training Academy as well as the author of "Linux for Beginners" and "Command Line Kung Fu." He has over 20 years of professional Linux experience, having worked for industry leaders such as Hewlett-Packard, Xerox, UPS, FireEye, and Amazon.com. Nothing gives him more satisfaction than knowing he has helped thousands of IT professionals level up their careers through his many books and courses.
