What’s AWK?

AWK is a line-oriented programming language with a long history, born in the 1970s. It is one of the UNIX tools I’ve been using more and more over the last year.

Why use AWK?

I often find myself parsing a lot of data. When working with JSON, I like using jq. When working with CSV data, my favourite tool is csvkit, a Python program. But a lot of data is still available only in a non-standardized (unlike CSV) line-oriented format.

One of the things I like about line-oriented data is that it’s easily stream-able. It’s difficult to stream data in a format like JSON or YAML, since the program has to be told precisely where to consider the boundary between one “record” and the next. See JSON streaming for the details.

The records in line-oriented data, on the other hand, are delimited by nothing more than a newline. This will continue to be the easiest way to stream large amounts of data until the programming profession rediscovers the wisdom of the ancients, who encoded record and field separators into ASCII itself.

Until that happy day, in circumstances where I’m working with line-oriented data, I often find myself drawn to the UNIX toolkit of grep, sed, and cut. These tools are featureful, standardized, ancient, battle-tested, heavily documented, and easily google-able. But sometimes I find myself streaming (piping) the output of one command into another, and sometimes those pipes get a little out of hand, growing to more than five stages. That starts looking a little ridiculous; it’s a code smell to me.

That is when I reach for AWK. AWK runs each line through a series of rules. Each rule attempts to match the line against a regex (or other) pattern. If the pattern matches, AWK runs the body of the rule. It’s a conceptually very simple model, and it’s fast. A lot faster, say, than piping each line into Ruby and processing it manually. Once again, I find the old ways to be better.
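As a tiny illustration of that model (the input lines here are made up), each rule is a pattern followed by an action in braces, and every line of input is tested against every rule:

```shell
# Each rule is `pattern { action }`; every input line runs through every rule.
printf 'cat\nwhale\nox\n' | awk '
    length($0) <= 2 { print $0, "is short" }
    length($0) > 2  { print $0, "is long" }
'
# cat is long
# whale is long
# ox is short
```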

Example use-cases

Filter on multiple whitespace-delimited portions of a logfile

If you are handed a logfile and want to limit the output to lines where the response code (field 4) was 4XX and the URL (field 3) had “medium=water” in it, you could grep for something like “medium=water” and then pipe that into another grep of something like “^([^ ]+ ){3}4[0-9][0-9]” (not tested, I’m just giving an example).

In AWK, you’d just have two filters (rules) that would look like

    $3 !~ /medium=water/ { next }
    $4 ~ /4[0-9][0-9]/ { print $0 }

which says “skip this line if the 3rd field does not match the regex” and “print this line if the 4th field does match the regex”. The same filter can also be written as a one-liner:

    $3 ~ /medium=water/ && $4 ~ /4[0-9][0-9]/ { print $0 }

which means “print this line if the 3rd field matches the first regex and the 4th field matches the second”.
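To see the one-liner in action, here it is run against a few made-up log lines (the field layout, with the URL in field 3 and the status code in field 4, is assumed; note `/4[0-9][0-9]/` rather than `/4\d\d/`, since POSIX awk regexes don’t support `\d`):

```shell
# Hypothetical log lines: date, client, URL, status code.
printf '%s\n' \
  '2024-01-01 10.0.0.1 /api?medium=water 404' \
  '2024-01-01 10.0.0.2 /api?medium=fire 404' \
  '2024-01-01 10.0.0.3 /api?medium=water 200' |
awk '$3 ~ /medium=water/ && $4 ~ /4[0-9][0-9]/ { print $0 }'
# 2024-01-01 10.0.0.1 /api?medium=water 404
```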

If you want a count of the matching lines, you’d pipe that into “wc -l”. Or you could spare the “wc -l” and just have AWK keep a running total.

    BEGIN { count = 0 }
    $3 ~ /medium=water/ && $4 ~ /4[0-9][0-9]/ { count += 1; print $0 }
    END { printf("Finished parsing, the total count of matched lines is %d\n", count) }

Easy-peasy beautiful.
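Run against the same sort of made-up log lines as before (field layout assumed), the counting version prints the matches followed by the total:

```shell
printf '%s\n' \
  '2024-01-01 10.0.0.1 /api?medium=water 404' \
  '2024-01-01 10.0.0.2 /api?medium=water 500' \
  '2024-01-01 10.0.0.3 /api?medium=water 403' |
awk '
    BEGIN { count = 0 }
    $3 ~ /medium=water/ && $4 ~ /4[0-9][0-9]/ { count += 1; print $0 }
    END { printf("Finished parsing, the total count of matched lines is %d\n", count) }
'
# 2024-01-01 10.0.0.1 /api?medium=water 404
# 2024-01-01 10.0.0.3 /api?medium=water 403
# Finished parsing, the total count of matched lines is 2
```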

Calculating running totals

Consider the case where you want to do something more complicated, like checking another field (field 7) that logs a state such as solid, liquid, or gas, and getting the counts for each state. If you were using just “grep | wc -l”, you’d have to run through your input three times, matching a different state each time. Or you could run through the input just once and have AWK collect a running total for each state:

    BEGIN { solid = 0; liquid = 0; gas = 0 }
    $3 !~ /medium=water/ { next }
    $7 ~ /solid/ { solid += 1 }
    $7 ~ /liquid/ { liquid += 1 }
    $7 ~ /gas/ { gas += 1 }
    END { printf("Finished parsing, the total count of:\nsolid: %d\nliquid: %d\ngas: %d\n", solid, liquid, gas) }
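Here is that program run end-to-end against some made-up lines where field 7 carries the state:

```shell
# Hypothetical input: field 3 is the URL, field 7 is the state.
printf '%s\n' \
  'a b medium=water d e f solid' \
  'a b medium=water d e f liquid' \
  'a b medium=water d e f solid' \
  'a b medium=fire d e f gas' |
awk '
    BEGIN { solid = 0; liquid = 0; gas = 0 }
    $3 !~ /medium=water/ { next }
    $7 ~ /solid/ { solid += 1 }
    $7 ~ /liquid/ { liquid += 1 }
    $7 ~ /gas/ { gas += 1 }
    END { printf("Finished parsing, the total count of:\nsolid: %d\nliquid: %d\ngas: %d\n", solid, liquid, gas) }
'
# Finished parsing, the total count of:
# solid: 2
# liquid: 1
# gas: 0
```

If the set of states isn’t known in advance, an AWK associative array spares you from declaring each counter: `$3 ~ /medium=water/ { count[$7] += 1 } END { for (s in count) print s, count[s] }`.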
