Text Processing

Core idea

Unix took a strong design bet: most data worth processing can be represented as a sequence of newline-terminated lines, and most transformations can be expressed as a pipeline of small programs that each consume lines on standard input and emit lines on standard output. The text-processing toolkit is the practical payoff of that bet. Each tool does one verb — cat concatenates, sort orders, uniq collapses adjacent duplicates, cut slices columns, paste glues columns, join does relational joins on a shared key, tr substitutes characters, sed rewrites lines via a small editing language, diff shows differences, patch applies them. Almost any text problem becomes a composition of these verbs separated by |.

Shotts's argument: Most of system administration and a surprising fraction of data work reduces to plumbing these utilities together. The pipeline replaces the bespoke script you would otherwise write — and is faster to type, faster to run, and easier to debug because each stage is observable.

Why it matters

Composition beats one-off scripts

A pipeline like cat access.log | grep ' 404 ' | cut -d' ' -f7 | sort | uniq -c | sort -nr | head answers "what are my top broken URLs?" in one line, with no scripting language and no temp files. The same six tools, recombined, answer dozens of unrelated questions, so the time spent learning them once is repaid many times over.

Plain text is the lowest common denominator

Every Unix configuration file, log, manifest, CSV, and source tree is text. The text-processing tools therefore work uniformly across the entire system — /etc/passwd, nginx.conf, Makefile, journalctl output, and your own CSV exports are all addressable with the same vocabulary. No format converters required.

Streaming is cheap

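Most of these tools are filters that read a line, act on it, and write the result, so the kernel pipe hands output to the next stage in small chunks and no stage has to hold the whole file. A pipeline can therefore process data larger than memory, and the first answers can appear before the input has been fully consumed. The exception is any stage that must see all of its input before it can emit anything: sort is the usual one, and GNU sort copes with large inputs by spilling to temporary files rather than by streaming. This is the architectural property that lets the same pipeline scale from a 10-line .txt to a multi-gigabyte log.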
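A small sketch of the early-answer behaviour, assuming seq is available (it is part of GNU coreutils and most BSD userlands). The 100-million-line input is an arbitrary choice for illustration:

```shell
# seq would print 100 million lines if left alone. head exits after three,
# which closes the pipe and stops the upstream stages long before that.
first_three=$(seq 1 100000000 | grep '7$' | head -n 3)
printf '%s\n' "$first_three"
```

The pipeline returns almost immediately because nothing downstream ever asks for the rest of the input.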

Key takeaways

Mental model

The composition pipeline

The deepest idea is the pipe operator itself. Each program is a node; | wires the previous stdout to the next stdin. The bash shell sets up the kernel pipes, forks the processes, and waits for every process in the pipeline to finish; by default the pipeline's exit status is that of the last command. You compose by describing the transformation steps in order, left to right, in the same way you would on paper.
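A minimal sketch of both points, the wiring and the exit status, using made-up input. It assumes the shell's default settings (in particular, that the pipefail option is off):

```shell
# printf's stdout feeds sort's stdin, and sort's stdout feeds head's stdin.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)
printf '%s\n' "$first"

# With default settings, a pipeline reports the exit status of its last
# command, so the failure of the first stage is not visible in $?.
false | true
status=$?
printf 'pipeline status: %s\n' "$status"
```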

Slice, dice, edit — the three sub-toolboxes

The verbs cluster naturally into three families. Slicers (cut, paste, join) treat lines as records with columns. Comparers (comm, diff, patch) look at the difference between two streams. Editors (tr, sed) rewrite the content of every line that flows past.
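One representative from each family, as a rough sketch. The file names and the colon-separated sample records are invented for illustration:

```shell
workdir=$(mktemp -d)   # scratch directory for the made-up sample files
printf '1:alice\n2:bob\n3:carol\n' > "$workdir/names.txt"
printf '1:admin\n3:staff\n' > "$workdir/roles.txt"

# Slicers: cut picks a column; join pairs records that share field 1
# (both inputs must already be sorted on that field).
names=$(cut -d':' -f2 "$workdir/names.txt")
joined=$(join -t':' "$workdir/names.txt" "$workdir/roles.txt")

# Comparer: comm -23 keeps lines found only in the first sorted input.
cut -d':' -f1 "$workdir/names.txt" > "$workdir/name_ids.txt"
cut -d':' -f1 "$workdir/roles.txt" > "$workdir/role_ids.txt"
without_role=$(comm -23 "$workdir/name_ids.txt" "$workdir/role_ids.txt")

# Editor: tr rewrites characters on every line that passes through.
shouted=$(printf '%s\n' "$names" | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$joined" "$without_role" "$shouted"
rm -r "$workdir"
```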

sed as a tiny editing language

sed looks like an alien artifact until you realize it has the same two-part shape as any text editor command: an address (which lines to act on) followed by a command (what to do). 1,5p means "print lines 1 through 5"; run it as sed -n '1,5p', because without -n sed also prints every line by default and the selected lines appear twice. /SUSE/d means "delete every line matching SUSE". s/old/new/g means "globally replace old with new on the current line." Stack a few of those and you have a one-line script that beats most ad-hoc Python.
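The three commands above, tried on a three-line sample. The sample text is made up, and the address range is shortened to 1,2 to fit it:

```shell
sample=$(printf 'Fedora old\nSUSE old\nUbuntu old old\n')

# Address 1,2 plus command p; -n suppresses sed's default printing.
printed=$(printf '%s\n' "$sample" | sed -n '1,2p')

# Address /SUSE/ plus command d.
deleted=$(printf '%s\n' "$sample" | sed '/SUSE/d')

# No address (every line) plus the s command with the g flag.
replaced=$(printf '%s\n' "$sample" | sed 's/old/new/g')

# Stacking: each -e adds one more address-and-command pair.
stacked=$(printf '%s\n' "$sample" | sed -e '/SUSE/d' -e 's/old/new/g')
printf '%s\n' "$stacked"
```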

Practical application

Three habits keep these tools honest in long pipelines. First, always sort before uniq: uniq only sees adjacent duplicates, so unsorted input quietly retains duplicates. Second, know your delimiter: cut -d':', sort -t':', and join -t':' all need the same -t or -d setting when working on /etc/passwd-style files. Third, use cat -A to inspect whitespace: tabs, trailing spaces, and DOS carriage returns are invisible to the eye but break every column-aware tool.
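Each habit can be checked in a line or two. This sketch uses invented input, and the last step assumes GNU cat, since -A is not a POSIX option:

```shell
# Habit 1: without sort, the two b lines are not adjacent, so both survive.
unsorted=$(printf 'b\na\nb\n' | uniq)
sorted=$(printf 'b\na\nb\n' | sort | uniq)

# Habit 2: tell cut which delimiter the file uses.
login_shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -d':' -f7)

# Habit 3: GNU cat -A shows a tab as ^I, a carriage return as ^M,
# and the end of each line as $.
visible=$(printf 'a\tb \r\n' | cat -A)

printf '%s\n' "$unsorted" "$sorted" "$login_shell" "$visible"
```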

For ad-hoc work, prefer the smallest tool that does the job. Reach for sed only when tr and cut cannot express the transformation. Reach for awk (covered briefly at the end of Shotts's discussion) when you need arithmetic per line. Reach for a real scripting language only when the pipeline grows past four or five stages or needs control flow.
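As a rough illustration of the point where awk takes over, here is per-line arithmetic that cut, tr, and sed cannot express. The two-column input (URL, byte count) is a made-up example:

```shell
# Add up the second column across all lines and print the total at the end.
total=$(printf '/a 100\n/b 250\n/a 50\n' | awk '{ sum += $2 } END { print sum }')
printf '%s\n' "$total"
```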

Example

Suppose you have an access.log file in standard Apache format and you want to know your top ten requested URLs that produced HTTP 404 responses, with a count per URL. The English description has six steps: read the file, keep only the 404 lines, slice out the request URL field, group identical URLs together, count each group, and take the ten largest counts.

Each English clause becomes one pipeline stage, except the last, which needs two (sort -nr to rank, then head to truncate):

cat access.log \
  | grep ' 404 ' \
  | cut -d' ' -f7 \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -10

Read top to bottom and translate back:

  • cat access.log — stream the whole log to stdout.
  • grep ' 404 ' — keep only lines whose status code is 404 (the surrounding spaces avoid matching 404 inside a URL or timestamp).
  • cut -d' ' -f7 — fields are space-separated; field seven in the common Apache combined format is the request URL.
  • sort — group identical URLs together so uniq will collapse them.
  • uniq -c — collapse each run of identical lines into one line prefixed by the count.
  • sort -nr — numeric, reverse: largest counts first.
  • head -10 — keep just the top ten.
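To try the pipeline without a real server log, the sketch below writes four invented lines in Apache combined format to a scratch file and runs the same stages against it. The IP addresses, timestamps, and URLs are all made up:

```shell
workdir=$(mktemp -d)   # scratch directory for the sample log
cat > "$workdir/access.log" <<'EOF'
10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"
10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "curl/8.0"
10.0.0.3 - - [10/Oct/2024:13:55:38 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"
10.0.0.1 - - [10/Oct/2024:13:55:39 +0000] "GET /gone HTTP/1.1" 404 153 "-" "curl/8.0"
EOF

# Same seven stages as above, pointed at the sample file.
top=$(cat "$workdir/access.log" \
  | grep ' 404 ' \
  | cut -d' ' -f7 \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -10)

# Each output line is a count followed by a URL; /missing ranks first here.
printf '%s\n' "$top"
rm -r "$workdir"
```

Because every stage is observable, truncating the pipeline after any | shows exactly what that stage receives.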

If the result has noise from query strings (/foo?bar=1 and /foo?bar=2 count separately), add a stage between cut and sort: | sed 's/?.*$//' strips everything from the first ? to end of line. That tiny sed patch keeps the pipeline intact while normalizing the data — exactly the workflow Shotts is pointing at.
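The effect of that extra stage can be seen on three made-up URLs, standing in for the output of cut:

```shell
# Without the sed stage, /foo?bar=1 and /foo?bar=2 would be counted separately.
counts=$(printf '/foo?bar=1\n/baz\n/foo?bar=2\n' \
  | sed 's/?.*$//' \
  | sort \
  | uniq -c \
  | sort -nr)
printf '%s\n' "$counts"
```

In sed's basic regular expressions an unescaped ? is a literal question mark, so the pattern needs no backslash.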

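A second variation: to find the URLs that appeared in last week's top ten but not in this week's, strip the counts, sort each week's list of URLs, save each to a file, and run diff lastweek.txt thisweek.txt; lines prefixed with < exist only in last week's file. With sorted inputs, comm -23 lastweek.txt thisweek.txt prints the same set difference without the diff markup. Same toolkit, different shape: comparing the outputs of two pipelines is just one more step built from the same verbs.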
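A small sketch of that comparison, assuming each file holds one URL per line, already sorted and without counts. The file contents are invented:

```shell
workdir=$(mktemp -d)   # scratch directory for the two weekly lists
printf '/a\n/b\n/c\n' > "$workdir/lastweek.txt"
printf '/a\n/c\n/d\n' > "$workdir/thisweek.txt"

# diff marks lines that come only from the first file with "< ";
# sed keeps those lines and strips the marker.
dropped_diff=$(diff "$workdir/lastweek.txt" "$workdir/thisweek.txt" | sed -n 's/^< //p')

# comm -23 suppresses columns 2 and 3, leaving lines unique to the first file.
dropped_comm=$(comm -23 "$workdir/lastweek.txt" "$workdir/thisweek.txt")

printf '%s\n' "$dropped_diff" "$dropped_comm"
rm -r "$workdir"
```

Both approaches report /b here; comm is the simpler choice when only the set difference matters.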
