Regular Expressions

Core idea

A regular expression is a small string of symbols that describes a set of strings. Where shell wildcards say "this filename pattern," regex says "this textual shape" — and applies the same vocabulary to any stream a program can read: log lines, dictionary entries, source code, network captures. The notation is older than Unix and survived because it composes: a handful of metacharacters (anchors, character classes, alternation, quantifiers, grouping) recombine into matchers for almost any text problem. Every command-line text tool that takes a search pattern speaks some dialect of it.

Shotts's argument: Regex looks arcane only until you internalize that it has exactly five ideas — match a literal, match "any of these," match "either this or that," repeat, and anchor to position. The rest is syntax.

Why it matters

One notation, many tools

Once you can read regex, every text tool on the system gets more powerful at the same time. grep is the obvious one, but sed, awk, find -regex, vim/less searches, log shippers, editor find-and-replace, and most programming languages all accept the same alphabet (with small dialect variations). The skill compounds across decades of tools.

Pattern thinking transfers

Decomposing a vague request ("find all lines that look like a phone number") into a small set of shape rules is a skill regex forces you to practice. That same decomposition shows up in form validation, data scrubbing, log triage, and parsing — places where you would otherwise reach for ad-hoc string slicing that breaks on the first edge case.

POSIX is the lingua franca

Regex dialects differ: Perl, PCRE (the Perl-compatible library), JavaScript's flavor, Python's re, and POSIX BRE/ERE all diverge in small but painful ways. Learning the POSIX subset first gives you a portable mental baseline; you then add dialect-specific features (lookahead, named groups, non-greedy quantifiers) only when you need them.

Key takeaways

Mental model

The five building blocks

Almost every regex you will ever read decomposes into the same five categories. Once you can name each piece on sight, an "ugly" BRE expression like ^\([0-9]\{3\}\)[ -]\{0,1\}[0-9]\{3\}-[0-9]\{4\}$ becomes legible: anchors at the edges, classes in the middle, quantifiers controlling repetition, grouping for scope, and literals filling in the gaps.
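A minimal sketch of the building blocks in isolation, assuming a grep that supports -E. The sample words and the match helper are hypothetical, chosen only so each pattern selects a different subset.

```shell
# Hypothetical sample input, one candidate per line.
sample='cat
cot
cart
dog
catalog'

# Helper (the name is illustrative): run one ERE against the sample.
match() { printf '%s\n' "$sample" | grep -E -e "$1"; }

match 'cat'        # literal: cat, catalog
match 'c[ao]t'     # class ("any of these"): cat, cot, catalog
match 'dog|cart'   # alternation ("either this or that"): cart, dog
match 'car?t'      # quantifier (optional r): cat, cart, catalog
match '^cat$'      # anchors (whole line): cat only
```

Reading a longer expression is then mostly a matter of spotting which of these pieces each fragment belongs to.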

BRE versus ERE in one picture

POSIX deliberately split regex into two tiers. Basic Regular Expressions (BRE) keep the conservative early-Unix defaults: ?, +, {, }, (, ), and | are literal characters. A backslash turns \{ \} and \( \) into the interval and grouping operators; \?, \+, and \| work only as GNU extensions, not in strict POSIX BRE. Extended Regular Expressions (ERE) flip the defaults: all of those characters are metacharacters, and a backslash makes them literal. Tools differ in which they accept: grep and sed default to BRE, while grep -E, sed -E, and awk use ERE.
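The flip is easiest to see by running the same characters through both modes. This is a small sketch, assuming a POSIX-style grep; the two input lines are made up for the demonstration.

```shell
# Two hypothetical input lines: a repeated "ab", and the literal text "(ab){2}".
input='abab
(ab){2}'

# ERE: ( ) { } are operators, so this matches "abab".
printf '%s\n' "$input" | grep -E '^(ab){2}$'

# BRE: the same operators need backslashes to match "abab".
printf '%s\n' "$input" | grep '^\(ab\)\{2\}$'

# BRE without backslashes: parens and braces are literal, so this matches "(ab){2}".
printf '%s\n' "$input" | grep '^(ab){2}$'

# ERE with backslashes: the escaped characters are literal, so this also matches "(ab){2}".
printf '%s\n' "$input" | grep -E '^\(ab\)\{2\}$'
```

The second and fourth commands use the identical pattern text and select different lines, which is the whole difference between the two tiers.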

Practical application

When you build a regex on the command line, three habits prevent most pain. First, quote the whole expression in single quotes so the shell does not glob-expand *, ?, or [...], treat | as a pipe, or consume backslashes and $ before the tool sees them. Second, prefer grep -E (ERE) for any expression that uses groups, alternation, or +, because the backslash soup of BRE quickly becomes unreadable. Third, prefer POSIX character classes ([[:upper:]], [[:digit:]], [[:space:]]) over raw ranges like [A-Z] or [0-9] when the input might cross locales, because [A-Z] in a non-C locale can include lowercase letters in collation order.
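The three habits might look like this in practice. The input strings are invented for illustration, and the last command shows one additional option (forcing the C locale for a single command) that is an assumption about your workflow rather than something the habits require.

```shell
# Single quotes pass the pattern to grep untouched. Unquoted, the shell
# could glob-expand the brackets against filenames or treat | as a pipe.
printf '%s\n' 'id 42' 'id x' | grep -E '^id [[:digit:]]+$'

# [[:upper:]] names the class directly instead of relying on how the
# current locale orders the range A-Z.
printf '%s\n' 'ABC' 'abc' 'Abc' | grep -E '^[[:upper:]]+$'

# Forcing the C locale for one command is another way to make a raw
# range like [A-Z] mean the ASCII uppercase letters only.
printf '%s\n' 'ABC' 'abc' | LC_ALL=C grep -E '^[A-Z]+$'
```

The first command prints only "id 42"; the other two print only "ABC".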

For exploration, pipe small synthetic inputs into the tool rather than running over a large file. echo and a heredoc let you build a tiny corpus that exercises both the cases you want to match and the cases you want to reject — that's how you stop a regex from quietly accepting garbage.
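One way to set up such a corpus is sketched below. The check_corpus function, the match/reject labels, and the four-digit-year pattern are all hypothetical; the idea is only that each synthetic line carries its own expectation, so surprises print themselves.

```shell
# check_corpus (illustrative name): $1 is the ERE under test; stdin holds
# lines of the form "<match|reject> <text>". Prints only the surprises.
check_corpus() {
  while read -r expect text; do
    if printf '%s\n' "$text" | grep -Eq -e "$1"; then
      got=match
    else
      got=reject
    fi
    [ "$got" = "$expect" ] || printf 'UNEXPECTED %s: %s\n' "$got" "$text"
  done
}

# Hypothetical pattern under test: a line that is exactly four digits.
check_corpus '^[0-9]{4}$' <<'EOF'
match 1999
match 2024
reject 199
reject 20245
reject 19a9
EOF
```

No output means every line behaved as labeled. Note that read trims leading and trailing whitespace from the text, so cases that depend on stray spaces at the edges need a different harness.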

Example

Imagine you need to validate North American phone numbers in a phones.txt file. Acceptable shapes are (555) 555-1212 and 555 555-1212. Anything else — letters, extra spaces, missing parens on one side, wrong digit count — must be flagged.

Walk through the rules in English first:

  • Optional opening paren, then exactly three digits, then optional closing paren and exactly one space — that's the area code.
  • Exactly three digits, then a hyphen, then exactly four digits — that's the local number.
  • The whole expression must match the entire line, so anchor with ^ and $.

Translate piece by piece using ERE (because we want ? and {} without backslash noise):

  • Optional opening paren: \(?
  • Three digits: [0-9]{3} or, equivalently and locale-safer, [[:digit:]]{3}
  • Optional closing paren plus one space: \)? followed by a single literal space
  • Three more digits, a hyphen, four digits: [0-9]{3}-[0-9]{4}
  • Anchored: ^...$

Concatenated:

^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$
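The same concatenation can be sketched with shell variables, which keeps each piece nameable while you experiment. The variable names here are illustrative, not part of any tool.

```shell
# Each variable holds one piece of the ERE (names are hypothetical).
digit3='[0-9]{3}'
digit4='[0-9]{4}'

# Optional "(", three digits, optional ")", then exactly one space.
area='\(?'"$digit3"'\)? '

# Anchor the whole thing; \$ keeps the shell from reading $ as an expansion.
pattern="^${area}${digit3}-${digit4}\$"

# Show the assembled expression, then try it on one sample line.
printf '%s\n' "$pattern"
printf '%s\n' '(555) 555-1212' | grep -E -e "$pattern"
```

The first printf shows the assembled pattern, which should be character-for-character the expression above.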

Used with grep, the same expression solves two complementary problems. To show valid numbers, list lines that match:

grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$' phones.txt

To audit invalid numbers, invert the match with -v:

grep -E -v '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$' phones.txt

The pattern is intentionally a little loose: it accepts unbalanced parentheses, such as 555) 555-1212 or (555 555-1212. Tightening that requires grouping and alternation: ^(\([0-9]{3}\) |[0-9]{3} )[0-9]{3}-[0-9]{4}$. The discipline of starting loose, then tightening with grouping and alternation is exactly how real regex code grows in production scripts and log pipelines.
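A quick side-by-side run makes the difference between the two patterns concrete. The five-line corpus is synthetic, standing in for phones.txt.

```shell
loose='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
strict='^(\([0-9]{3}\) |[0-9]{3} )[0-9]{3}-[0-9]{4}$'

# Synthetic corpus: two valid shapes, two unbalanced parens, one wrong separator.
corpus='(555) 555-1212
555 555-1212
555) 555-1212
(555 555-1212
555-555-1212'

echo 'loose accepts:'
printf '%s\n' "$corpus" | grep -E -e "$loose"

echo 'strict accepts:'
printf '%s\n' "$corpus" | grep -E -e "$strict"

echo 'strict flags:'
printf '%s\n' "$corpus" | grep -Ev -e "$strict"
```

The loose pattern lets the two unbalanced lines through; the strict one accepts only the first two lines and flags the remaining three.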
