Concept

Text Processing

Definition

Text processing is the family of Unix tools and patterns for transforming streams of plain text. Each tool does one small job — cat concatenates, sort orders lines, uniq collapses adjacent duplicates, cut selects columns, paste merges files side by side, tr translates characters, sed edits with regular expressions, awk performs field-based transformations — and the shell composes them into pipelines that solve larger problems no single tool was designed for.

The toolkit dates to the earliest Unix systems and remains the fastest path from a messy log file or CSV to a useful summary. Modern alternatives exist — pandas in Python, dplyr in R, full SQL engines — but the classic chain still wins for ad-hoc work where the data is plain text and the goal is exploratory rather than long-lived.

Why it matters

How it works

The unit of work is a line of text. Most tools read lines from standard input, transform each one or compute a result over the whole stream, and write the output to standard output. cut slices columns out of each line; tr substitutes one set of characters for another; sort buffers everything and emits it in order; uniq collapses adjacent duplicate lines, optionally prefixing each with a count (-c), which is why its input is usually sorted first. The combination of sort, then uniq -c, then sort -rn, with head -n N on the end to keep the first N rows, is the canonical way to compute a top-N frequency table from any stream of repeated tokens.
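The frequency-table idiom can be sketched as a small shell function. This is a minimal illustration, not a canonical implementation: the sample data and the name top_n are hypothetical, and the input is assumed to hold one token per line.

```shell
# Hypothetical sample input: one token per line.
tokens='apple
banana
apple
cherry
banana
apple'

# top_n N: read tokens on stdin and print the N most frequent, with counts.
#   sort       groups identical lines so they become adjacent
#   uniq -c    collapses each group and prefixes it with its count
#   sort -rn   orders by that count, largest first
#   head -n N  keeps only the first N rows
top_n() {
  sort | uniq -c | sort -rn | head -n "$1"
}

printf '%s\n' "$tokens" | top_n 2
```

With this sample the two rows printed are apple (count 3) and banana (count 2). The width of the count column's left padding differs between uniq implementations, so anything that parses the output should split on whitespace rather than rely on fixed columns.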

The two largest tools in the family are sed and awk. sed reads a script of substitution and edit commands and applies them to each line, making it the right tool for regex transformations that can be decided one line at a time. awk reads a script written in its own small programming language; each input line is automatically split into fields, the script tests the line against pattern conditions, and matching lines trigger arbitrary computations. awk is in practice a one-pass mini-database engine for tabular text, and a script of a few lines can do what a SQL group-by would do in a real database. Reaching for the right tool, or the right combination, is what turns five lines of code into a one-liner.
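As a rough sketch of that division of labour, the example below uses sed for a per-line cleanup and awk for a group-by sum. The sample records (a "region,amount" layout) and the function names strip_trailing and sum_by_key are hypothetical, chosen only for illustration.

```shell
# Hypothetical sample: "region,amount" records, one with stray trailing spaces.
sales='east,10
west,5
east,7  
west,3
north,4'

# sed: a per-line regex edit that removes trailing whitespace.
strip_trailing() {
  sed 's/[[:space:]]*$//'
}

# awk: split each line on commas, keep only lines with exactly two fields
# (the pattern NF == 2), and accumulate the second field per first field.
# The END block prints one row per key, roughly what
#   SELECT region, SUM(amount) ... GROUP BY region
# would return. awk's for-in order is unspecified, so the result is sorted.
sum_by_key() {
  awk -F, 'NF == 2 { total[$1] += $2 }
           END { for (k in total) print k "," total[k] }' | sort
}

printf '%s\n' "$sales" | strip_trailing | sum_by_key
# prints:
# east,17
# north,4
# west,8
```

Keeping the cleanup in sed and the aggregation in awk mirrors the pipeline style of the rest of the toolkit, although awk alone could do both steps.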

Where it goes next
