AWK for Text Processing

Overview

Teaching: 20
Exercises: 5
Questions:
- How do I print specific columns from a text table?
- How can I use patterns to select only certain lines in a file?
- How do I count lines or matched lines in a file?
Objectives:
- Select and print fields with $0, $1, $2, $NF, and NF.
- Use a field separator with -F to handle CSV input.
- Match lines using simple regex like /^ATOM/.
- Count total or matching lines with a counter and the END block.
- Explain the difference between wc -l and awk 'END {print NR}' for line counting.

If we need to count the number of lines in a file, we can use wc, the word-counting command shown previously:

wc -l example.txt

As you probably remember, -l is an option that asks for the number of lines only.

However, wc -l counts the newline characters in the file. If the last line does not end with a newline character, that line is not counted, and the result is the actual number of lines minus one.

A workaround is to use awk. awk is a command-line program that takes as input a set of instructions and one or more files. The instructions are executed on each line of the input file(s).

The instructions are enclosed in single quotes or they can be read from a file.

Example:

awk '{print $0}' example.txt

This command produces the same output as cat: it prints each line of the example.txt file.

The structure of the instruction is the following:
- curly braces surround the set of instructions
- print is the instruction that sends its arguments to the terminal
- $0 is a variable; it means “the content of the current line”
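As mentioned earlier, the instructions can also be read from a file instead of being typed between single quotes. The following is a minimal sketch using the -f option; the file names demo_lines.txt and print_all.awk and their content are hypothetical, chosen only for this illustration.

```shell
# Create a small input file (made-up content, for illustration only).
printf 'first line\nsecond line\n' > demo_lines.txt

# Store the awk instructions in a file instead of typing them in quotes.
printf '{ print $0 }\n' > print_all.awk

# -f tells awk to read its instructions from the given file.
awk -f print_all.awk demo_lines.txt
```

This should print the two lines of demo_lines.txt unchanged, just as the quoted version does.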

As you can see, the file contains a table.

Awk automatically splits each line into fields wherever it finds whitespace (spaces or tabs), so in our case it knows about the different columns of the table.

Each column value for the current line is stored into a variable: $1 for the first column, $2 for the second and so on.

So, if we want to print only the second column of the table, we execute

awk '{print $2}' example.txt

We can also print more than one value, or add text to the printed line:

awk '{print "chr",$2,$4}' example.txt

The comma separates the printed values with the output field separator, which is a single space by default. Strings of text should be enclosed in double quotes. In this case we are printing the text “chr”, the second and the fourth column for each row in the table.
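To try these commands without depending on the content of example.txt, here is a small sketch on a made-up table. The file name sample_table.txt and its values are assumptions for illustration only. The last command is an addition to what is described above: when two values are written next to each other without a comma, awk joins them with nothing in between.

```shell
# A tiny whitespace-separated table (hypothetical data).
printf 'geneA 1 100 200\ngeneB 2 150 300\n' > sample_table.txt

# Print only the second column: prints 1, then 2.
awk '{print $2}' sample_table.txt

# Values separated by commas are printed with a space between them:
# "chr 1 200", then "chr 2 300".
awk '{print "chr",$2,$4}' sample_table.txt

# Without the first comma, "chr" and $2 are joined directly:
# "chr1 200", then "chr2 300".
awk '{print "chr" $2, $4}' sample_table.txt
```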

So, $0 is the whole line, $1 the first field, $2 the second and so on. What if we want to print the last column, but we don’t know its number? Maybe it is a huge table, or maybe different lines have a different number of columns.

Awk helps us thanks to the variable NF. NF stores the number of fields (our columns) in the row. Let’s see for our table:

awk '{print NF}' example.txt

We can see that some lines contain 6 fields while others contain 7 of them. Since NF is the number of the last field, $NF contains its value.

awk '{print "This line has",NF,"columns. The last one contains",$NF}' example.txt
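The same idea can be checked on a small table whose rows have different lengths. This is only a sketch: ragged_table.txt and its content are made up for the illustration.

```shell
# Two rows with a different number of fields (hypothetical data).
printf 'a b c\nd e f g\n' > ragged_table.txt

# NF is the number of fields; $NF is the value of the last field.
# Prints "3 c", then "4 g".
awk '{print NF, $NF}' ragged_table.txt
```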

Field separator

Data files come in different formats: the fields may be separated by commas (CSV), tabs (TSV), semicolons, or any other character.

To specify the field separator, we provide it on the command line with the -F option:

awk -F "," '{print $2}' example2.txt

In this case, we are printing the second field of each line, using the comma as the separator. Notice that space characters are now part of the field values, since the space is no longer the separator.
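The effect of the leftover space can be made visible by printing the field between square brackets. The sketch below uses a made-up file, sample.csv, and assumes a very simple CSV layout (no quoted fields containing commas). The second command is one possible workaround, not something the lesson prescribes: it passes a comma followed by a space as the separator.

```shell
# A simple comma-separated file with a space after each comma (hypothetical data).
printf 'name, city, age\nAda, London, 36\n' > sample.csv

# With "," as separator the leading space stays in the field:
# "[ city]", then "[ London]".
awk -F "," '{print "[" $2 "]"}' sample.csv

# With ", " as separator the space is consumed by the separator:
# "[city]", then "[London]".
awk -F ", " '{print "[" $2 "]"}' sample.csv
```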

Pattern-Action Model

AWK reads a file line by line, splits each line into fields, and then applies pattern { action } rules.

We may want to perform different instructions on different lines. Awk allows you to specify a matching pattern, as the grep command does.

Let’s look at the file content

awk '{print $0}' example.pdb

It appears to be an abridged PDB file. If we want to print only the lines starting with the word “ATOM”, we type:

awk '/^ATOM/ {print $0}' example.pdb

In this case, we specify the pattern before the instructions: only lines starting with the text “ATOM”. As you remember, ^ means “at the beginning of the line”.

We can specify more than one pattern:

awk '/^ATOM/ {print $7,$8,$9} /^HEADER/ {print $NF}' example.pdb

In this case, we are printing the spatial coordinates of each atom (fields 7, 8, and 9 of every ATOM line) and the last field of each HEADER line.
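To see how several pattern { action } rules interact, the sketch below runs the same command on a tiny PDB-like file. The file mini.pdb and every value in it are invented for this illustration and do not come from example.pdb; the layout assumes that each ATOM line has a chain identifier, so that the coordinates land in fields 7 to 9.

```shell
# A tiny PDB-like file (all values are hypothetical).
cat > mini.pdb <<'EOF'
HEADER    HYDROLASE    01-JAN-00    1ABC
REMARK    made-up data for illustration
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
EOF

# Each line is tested against every rule:
#   HEADER line -> prints its last field
#   ATOM lines  -> print fields 7, 8, 9 (x, y, z)
#   REMARK line -> matches no rule, so nothing is printed
awk '/^ATOM/ {print $7,$8,$9} /^HEADER/ {print $NF}' mini.pdb
```

With this input the command should print 1ABC followed by one line of three coordinates per atom. Output follows the order of the lines in the file, not the order of the rules.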

The special block END { ... } runs after all lines are processed. It’s ideal for printing totals collected while scanning.

NR is the current line number. After the last line, NR equals the number of lines read:

awk 'END { print NR }' example.txt

This avoids the missing final newline issue that can affect wc -l: if the last line lacks a trailing newline, wc -l undercounts by 1.
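The difference between the two counts can be reproduced with a file that deliberately lacks the final newline. This is a minimal sketch; the file name no_final_newline.txt is hypothetical.

```shell
# Three lines of text, but no newline character after the last one.
printf 'one\ntwo\nthree' > no_final_newline.txt

# wc -l counts newline characters, so it reports 2.
wc -l < no_final_newline.txt

# awk counts the lines it actually read, so it reports 3.
awk 'END { print NR }' no_final_newline.txt
```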

To count only matching lines, increment a counter in the action attached to the pattern, then report it in END:

awk '/^ATOM/ { count++ } END { print "ATOM lines:", count+0 }' example.pdb

/^ATOM/ matches lines that begin with ATOM.
count++ adds 1 for each match.
In END, we print the total. count+0 safely prints 0 if there were no matches.
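The role of count+0 becomes clear on a file with no matching lines. The sketch below uses a made-up file, no_atoms.pdb: a variable that was never incremented is empty, and adding 0 turns it into the number 0.

```shell
# A file without any ATOM line (hypothetical content).
printf 'HEADER demo\nREMARK no atoms here\n' > no_atoms.pdb

# Without +0 the counter was never set, so nothing is printed after the label.
awk '/^ATOM/ { count++ } END { print "ATOM lines:", count }' no_atoms.pdb

# With +0 the empty value is converted to a number: prints "ATOM lines: 0".
awk '/^ATOM/ { count++ } END { print "ATOM lines:", count+0 }' no_atoms.pdb
```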

Challenge: Counting and Selecting (Simple)

Using only the ideas covered above (field selection, patterns, NF, and END):

Write an awk command that prints the number of lines in example.txt.
Write an awk command that prints the number of lines in example.pdb that start with ATOM.
Write an awk command that prints the last field of each ATOM line in example.pdb (just the values, one per line).

Bonus (optional): Print both the count of ATOM lines and, at the end, the total number of characters across all those last fields.

Solution

Total lines (robust):

awk 'END { print NR }' example.txt

Count lines starting with ATOM:

awk '/^ATOM/ { c++ } END { print c+0 }' example.pdb

Last field of each ATOM line:

awk '/^ATOM/ { print $NF }' example.pdb

Bonus (count and accumulate character lengths of last field):

awk '/^ATOM/ { c++; total += length($NF) } END { print "ATOM lines:", c+0; print "Total chars in last field:", total+0 }' example.pdb

Explanation:
- NR gives total lines after reading the file.
- /^ATOM/ pattern restricts actions to lines starting with ATOM.
- $NF is the last field; length($NF) measures its size.
- Counters (c, total) are printed in END.

Key Points

$0 is the whole line; $1..$NF are its fields; NF is the count of fields.
-F sets the field separator (comma, tab, etc.).
Use /pattern/ { action } to run code only on matching lines.
Increment a variable inside the action and print totals in END {}.
NR gives total lines read; wc -l can undercount if last newline is missing.
