AWK for Text Processing
- Teaching: 20
- Exercises: 5
- Questions:
- How do I print specific columns from a text table?
- How can I use patterns to select only certain lines in a file?
- How do I count lines or matched lines in a file?
- Objectives:
- Select and print fields with $0, $1, $2, $NF, and NF.
- Use a field separator with -F to handle CSV input.
- Match lines using simple regex like /^ATOM/.
- Count total or matching lines with a counter and the END block.
- Explain the difference between wc -l and awk 'END {print NR}' for line counting.
If we need to count the number of lines in a file, we can use wc, the word-counting command shown previously:
wc -l example.txt
As you probably remember, -l is an option that asks for the number of lines only.
However, wc -l actually counts the newline characters in the file. If the last line does not end with a newline character, the result is the actual number of lines minus one.
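The under-count is easy to reproduce. The following is a minimal sketch, assuming a POSIX shell with printf; the three-line input is made up for illustration and is not example.txt. The tr call only strips the padding that some wc implementations print.

```shell
# Three lines, each terminated by a newline: wc -l counts 3 newlines.
with_newline=$(printf 'one\ntwo\nthree\n' | wc -l | tr -d ' ')
echo "$with_newline"      # 3

# Same three lines, but the last one has no trailing newline:
# wc -l now sees only 2 newline characters.
without_newline=$(printf 'one\ntwo\nthree' | wc -l | tr -d ' ')
echo "$without_newline"   # 2
```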
A workaround is to use awk. awk is a command-line program that takes as input a set of instructions and one or more files. The instructions are executed on each line of the input file(s).
The instructions are either enclosed in single quotes on the command line or read from a file.
Example:
awk '{print $0}' example.txt
This command has the same output as cat: it prints each line from the example.txt file.
The structure of the instruction is the following:
- curly braces surround the set of instructions
- print is the instruction that sends its arguments to the terminal
- $0 is a variable, it means “the content of the current line”
As you can see, the file contains a table.
Awk automatically splits each line it processes on whitespace (runs of spaces or tabs), so in our case it already knows about the different columns in the table.
Each column value for the current line is stored into a variable: $1 for the first column, $2 for the second and so on.
So, if we want to print only the second column from the table, we execute
awk '{print $2}' example.txt
We can also print more than one value, or add text to the printed line:
awk '{print "chr",$2,$4}' example.txt
The comma puts a space between the printed values. Strings of text should be enclosed in double quotes. In this case we are printing the text “chr”, the second and the fourth column for each row in the table.
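To see what the comma does, compare it with leaving the comma out. This is a small sketch on a made-up one-row table (the values are hypothetical and not taken from example.txt); the space that the comma inserts is awk's default output field separator.

```shell
# Hypothetical one-row table with four whitespace-separated columns.
row='1 12345 rs001 A'

# Comma between arguments: awk joins them with a single space.
with_comma=$(printf '%s\n' "$row" | awk '{print "chr",$2,$4}')
echo "$with_comma"     # chr 12345 A

# No comma: the values are concatenated with nothing in between.
no_comma=$(printf '%s\n' "$row" | awk '{print "chr" $2}')
echo "$no_comma"       # chr12345
```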
So, $0 is the whole line, $1 the first field, $2 the second and so on. What if we want to print the last column, but we don’t know its number? Maybe it is a huge table, or maybe different lines have a different number of columns.
Awk helps us thanks to the variable NF. NF stores the number of fields (our columns) in the row. Let’s see for our table:
awk '{print NF}' example.txt
We can see that some lines contain 6 fields while others contain 7 of them. Since NF is the number of the last field, $NF contains its value.
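As a quick check on made-up data, NF and $NF can be tried on a tiny ragged table. The three rows below are hypothetical and are not the contents of example.txt.

```shell
# Hypothetical ragged table: rows with 3, 4 and 2 fields.
table='a b c
d e f g
h i'

# NF is the number of fields on each line.
field_counts=$(printf '%s\n' "$table" | awk '{print NF}')
echo "$field_counts"    # 3, 4, 2 (one per line)

# $NF is the value of the last field, whatever its position.
last_fields=$(printf '%s\n' "$table" | awk '{print $NF}')
echo "$last_fields"     # c, g, i (one per line)
```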
awk '{print "This line has",NF,"columns. The last one contains",$NF}' example.txt
Data comes in many file formats: fields may be separated by commas (CSV), tabs (TSV), semicolons, or any other character.
To specify the field separator, we provide it on the command line with the -F option:
awk -F "," '{print $2}' example2.txt
In this case, we are printing the second field in each line, using comma as the separator. Note that any space characters are now part of the field value, since space is no longer the separator.
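The effect on spaces is easier to see with brackets around the field. This sketch uses a made-up CSV row (not taken from example2.txt) and also shows one possible way to drop the space, by making the separator a comma followed by a space.

```shell
# Hypothetical CSV row with a space after each comma.
row='Ada, 36, London'

# With -F ",", the space after the comma stays inside the field.
# The brackets make the leading space visible.
comma_only=$(printf '%s\n' "$row" | awk -F "," '{print "[" $2 "]"}')
echo "$comma_only"      # [ 36]

# A separator of comma plus space leaves the field without it.
comma_space=$(printf '%s\n' "$row" | awk -F ", " '{print "[" $2 "]"}')
echo "$comma_space"     # [36]
```

The second form only works when every comma in the file is followed by exactly one space, so it is a convenience for tidy input rather than a general CSV parser.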
Pattern-Action Model
AWK reads a file line by line, splits each line into fields, and then applies pattern { action } rules.
Sometimes we want to run different instructions on different lines. Awk allows you to specify a matching pattern, as the grep command does.
Let’s look at the file content:
awk '{print $0}' example.pdb
It appears to be an abridged PDB file. If we would like to print only lines starting with the word “ATOM”, we type:
awk '/^ATOM/ {print $0}' example.pdb
In this case, we specify the pattern before the instructions: only lines starting with the text “ATOM”. As you remember, ^ means “at the beginning of the line”.
We can specify more than one pattern:
awk '/^ATOM/ {print $7,$8,$9} /^HEADER/ {print $NF}' example.pdb
In this case, we are printing the spatial coordinates of each atom, and the last field of each HEADER line.
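The two-rule command can be tried on a few made-up, PDB-like lines. The content below is hypothetical, written only for illustration, and is not taken from example.pdb.

```shell
# Made-up, PDB-like lines for illustration only.
pdb='HEADER    HYDROLASE    01-JAN-00    1ABC
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
TER'

# Each input line is tested against every rule, in order.
# Lines that match no pattern (TER here) produce no output.
result=$(printf '%s\n' "$pdb" |
  awk '/^ATOM/ {print $7,$8,$9} /^HEADER/ {print $NF}')
echo "$result"
# 1ABC
# 27.340 24.430 2.614
# 26.266 25.413 2.842
```

The output follows the order of the input lines, not the order of the rules, which is why the HEADER value comes first.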
The special block END { ... } runs after all lines are processed. It’s ideal for printing totals collected while scanning.
NR is the current line number. After the last line, NR equals the number of lines read:
awk 'END { print NR }' example.txt
This avoids the missing final newline issue that can affect wc -l (if the last line lacks a trailing newline, wc -l under-counts by 1).
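A short sketch, again on made-up three-line input rather than example.txt, shows that the awk count does not depend on the final newline:

```shell
# Hypothetical input: three lines, the last one without a trailing newline.
awk_count=$(printf 'one\ntwo\nthree' | awk 'END { print NR }')
echo "$awk_count"       # 3

# With the trailing newline present, the answer is the same.
awk_count_nl=$(printf 'one\ntwo\nthree\n' | awk 'END { print NR }')
echo "$awk_count_nl"    # 3
```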
To count only matching lines, increment a counter inside the action for that pattern, then report it in END:
awk '/^ATOM/ { count++ } END { print "ATOM lines:", count+0 }' example.pdb
- /^ATOM/ matches lines that begin with ATOM.
- count++ adds 1 for each match.
- In END, we print the total. count+0 safely prints 0 if there were no matches.
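The role of count+0 shows up when nothing matches. The sketch below uses made-up input (not example.pdb) with two ATOM lines and no HETATM lines; the HETATM pattern is only there to give a case with zero matches.

```shell
# Made-up input with two ATOM lines and no HETATM lines.
lines='HEADER demo
ATOM 1
ATOM 2'

atoms=$(printf '%s\n' "$lines" |
  awk '/^ATOM/ { count++ } END { print "ATOM lines:", count+0 }')
echo "$atoms"        # ATOM lines: 2

# No match: count was never set, and count+0 turns it into the number 0.
none=$(printf '%s\n' "$lines" |
  awk '/^HETATM/ { count++ } END { print "HETATM lines:", count+0 }')
echo "$none"         # HETATM lines: 0

# Without +0 the unset variable prints as an empty string.
blank=$(printf '%s\n' "$lines" |
  awk '/^HETATM/ { count++ } END { print "HETATM lines:", count }')
echo "[$blank]"      # [HETATM lines: ]
```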
- $0 is the whole line; $1..$NF are its fields; NF is the count of fields.
- -F sets the field separator (comma, tab, etc.).
- Use /pattern/ { action } to run code only on matching lines.
- Increment a variable inside the action and print totals in END {}.
- NR gives total lines read; wc -l can undercount if the last newline is missing.