Regular Expressions

Definition

A regular expression is a compact pattern that describes a set of strings. Plain characters in the pattern match themselves; metacharacters describe structure. The dot . matches any single character (in most tools, any character except a newline). A character class [abc] matches any one of the listed characters; [a-z] matches any lowercase ASCII letter; [^abc] matches any one character that is not listed. The quantifiers *, +, and ? match the preceding element zero or more, one or more, and zero or one times respectively. The anchors ^ and $ match the start and end of a line. Parentheses group; alternation | separates alternatives.
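The constructs above can be tried out directly. The following is a minimal sketch in Python, whose re module uses a Perl-style dialect rather than POSIX; the constructs listed here behave the same way in both. The EXAMPLES table and the whole_match helper are names invented for this illustration.

```python
import re

# Each entry: (pattern, a string that should match, a string that should not).
# Only the constructs described above are used.
EXAMPLES = [
    (r"c.t", "cat", "ct"),             # dot: exactly one character between c and t
    (r"[abc]x", "bx", "dx"),           # class: one of a, b, c
    (r"[a-z]+", "hello", "HELLO"),     # range with one-or-more
    (r"[^abc]", "d", "a"),             # negated class
    (r"ab*c", "ac", "abd"),            # star: zero or more b
    (r"colou?r", "color", "colouur"),  # question mark: optional u
    (r"(cat|dog)s", "dogs", "cows"),   # grouping and alternation
]


def whole_match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


results = [(p, whole_match(p, good), whole_match(p, bad)) for p, good, bad in EXAMPLES]
for pattern, good_ok, bad_ok in results:
    print(f"{pattern!r}: positive matched={good_ok}, negative matched={bad_ok}")
```

re.fullmatch anchors the pattern at both ends of the string, which plays the role of wrapping each pattern in ^ and $.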

Regular expressions originate in the formal language theory of Kleene (1956). They are the pattern language of the Unix text-processing toolchain (grep, sed, awk, egrep, vim, less, find -regex) and of the standard libraries of nearly every modern programming language. Two main dialects exist in practice: POSIX (basic and extended) and PCRE (Perl-compatible), with small but important differences in metacharacter syntax and feature support.

Why it matters

How it works

A regular-expression engine compiles the pattern into a state machine and then runs the machine against the input. The classical Thompson NFA construction (used by awk and modern grep variants) tracks every state the machine could be in at once, so matching takes time proportional to the input length multiplied by the pattern size: linear in the input for any fixed pattern. The backtracking implementation (used by PCRE, Java, JavaScript) is more expressive, since it supports backreferences and lookaround, but it can take exponential time on certain patterns against adversarial input, a class of bugs known as ReDoS (regular expression denial of service).
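The difference between the two strategies can be made concrete with a toy matcher. The sketch below is not Thompson's full construction or any real engine: it assumes a tiny pattern language (literal characters, the dot, and the * and ? quantifiers on single characters) and whole-string matching. All function names are invented for this illustration. It counts the steps each strategy takes on a well-known pathological case, the pattern a?a?...a?aa...a matched against a run of a's.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (character or ".", quantifier: "", "*" or "?")


def parse(pattern: str) -> List[Token]:
    """Split a toy pattern into (char, quantifier) tokens."""
    tokens: List[Token] = []
    i = 0
    while i < len(pattern):
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in ("*", "?"):
            quant = pattern[i + 1]
        tokens.append((pattern[i], quant))
        i += 2 if quant else 1
    return tokens


def char_ok(tok_char: str, c: str) -> bool:
    return tok_char == "." or tok_char == c


def match_backtracking(tokens: List[Token], text: str) -> Tuple[bool, int]:
    """Whole-string match by recursive backtracking; returns (matched, steps)."""
    steps = 0

    def go(i: int, j: int) -> bool:
        nonlocal steps
        steps += 1
        if i == len(tokens):
            return j == len(text)
        ch, quant = tokens[i]
        can_eat = j < len(text) and char_ok(ch, text[j])
        if quant == "*":  # greedy: try consuming first, then try skipping
            return (can_eat and go(i, j + 1)) or go(i + 1, j)
        if quant == "?":
            return (can_eat and go(i + 1, j + 1)) or go(i + 1, j)
        return can_eat and go(i + 1, j + 1)

    return go(0, 0), steps


def closure(tokens: List[Token], states: Set[int]) -> Set[int]:
    """Add every state reachable by skipping optional (* or ?) tokens."""
    result: Set[int] = set()
    for s in states:
        result.add(s)
        while s < len(tokens) and tokens[s][1] in ("*", "?"):
            s += 1
            result.add(s)
    return result


def match_state_set(tokens: List[Token], text: str) -> Tuple[bool, int]:
    """Whole-string match by tracking all live states at once."""
    steps = 0
    current = closure(tokens, {0})
    for c in text:
        nxt: Set[int] = set()
        for s in current:
            steps += 1
            if s < len(tokens) and char_ok(tokens[s][0], c):
                # A starred token loops on itself; anything else advances.
                nxt.add(s if tokens[s][1] == "*" else s + 1)
        current = closure(tokens, nxt)
    return len(tokens) in current, steps


n = 14
tokens = parse("a?" * n + "a" * n)
text = "a" * n
bt_matched, bt_steps = match_backtracking(tokens, text)
ss_matched, ss_steps = match_state_set(tokens, text)
print(f"backtracking: matched={bt_matched}, steps={bt_steps}")
print(f"state set:    matched={ss_matched}, steps={ss_steps}")
```

Both strategies report a match, but the backtracking version must try every combination of "consume" and "skip" for the optional tokens before it finds the one that works, so its step count at least doubles each time n grows by one. The state-set version never takes more than len(text) * (len(tokens) + 1) steps.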

In day-to-day Unix work, regular expressions flow through pipelines via the shell. The shell expands its own globs first, so regex patterns nearly always need to be single-quoted to pass through unchanged: grep '^[A-Z][a-z]*$' words.txt matches lines that consist of a single capitalized word (one uppercase letter followed only by lowercase letters). Anchored patterns (with ^ and $) are often faster and produce fewer surprises than unanchored ones, because the engine does not have to try a match at every position in the line. Capture groups (\(...\) in POSIX BRE, (...) in ERE) let sed substitutions reuse matched portions: sed -E 's/([0-9]+)-([0-9]+)/\2-\1/' swaps two numeric groups around a hyphen.
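For comparison, the same two patterns can be exercised outside the shell. The sketch below reproduces the grep filter and the sed substitution with Python's re module; the function names and sample data are invented for this illustration, and Python's dialect is Perl-style rather than POSIX, although these particular patterns are written identically in ERE.

```python
import re

CAPITALIZED = re.compile(r"^[A-Z][a-z]*$")
SWAP = re.compile(r"([0-9]+)-([0-9]+)")


def grep_capitalized(lines):
    """Keep lines made of one uppercase letter followed only by lowercase letters."""
    return [line for line in lines if CAPITALIZED.search(line)]


def swap_first_pair(line):
    """Swap the first digits-digits pair, like sed's s/// without the g flag."""
    return SWAP.sub(r"\2-\1", line, count=1)


words = ["Alice", "bob", "Carol Smith", "X", "McCoy"]
kept = grep_capitalized(words)
swapped = swap_first_pair("range 10-20 and 30-40")
print(kept)     # ['Alice', 'X']
print(swapped)  # range 20-10 and 30-40
```

Because there is no shell between the pattern and the engine here, no quoting against glob expansion is needed; the raw-string prefix r"..." serves a similar purpose by keeping Python from interpreting the backslashes.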
