Definition
A checksum is a short, fixed-length value computed deterministically from a piece of data, such that a change to the data almost always produces a different value. The function that computes the checksum is designed so that the output is uniformly distributed across its range and so that small input changes produce large output changes (the so-called avalanche property). The result is a compact fingerprint of arbitrary data that can be compared cheaply.
Checksum algorithms range from the fast and weak — CRC32, used to catch accidental corruption in network packets and ZIP archives — to the cryptographically strong, like the SHA-2 family, designed to resist deliberate tampering. MD5 and SHA-1 occupy a middle ground: still useful for integrity checks against accidental corruption, but no longer trusted against adversaries who can engineer collisions.
Why it matters
How it works
The underlying mathematics varies by algorithm but the structural pattern is shared. The function reads its input in fixed-size blocks, maintains an internal state of bounded size, and updates that state with each block in a way that mixes every input bit through the entire state. After the last block, the state is finalised — sometimes with additional padding and a length encoding — and emitted as the checksum. Because the function is deterministic, the same input always produces the same output; because the function is well-mixed, two near-identical inputs almost always produce very different outputs.
The limits depend on the use case. Non-cryptographic checksums like CRC32 are not designed to resist attack: someone trying to forge a file can deliberately construct inputs that hash to a target value. Cryptographic hashes resist that attack — finding a preimage or collision is computationally infeasible for SHA-256 — but they are slower and produce longer outputs. The general rule is to match the algorithm to the threat model: CRC and Adler for transport integrity, MD5 or SHA-1 for cheap-but-strong file fingerprints in trusted contexts, SHA-256 or stronger anywhere an attacker might try to substitute content.
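The Python sketch below is an added example, not part of the original text. It shows the block-by-block state update described above, the avalanche effect, and why CRC32 cannot resist tampering: its output changes in a way that is predictable from the input change alone, which is not the case for SHA-256. The 64-byte block size and the sample message are constructed for illustration.

```python
import hashlib
import zlib

BLOCK = 64  # bytes per update call; illustrative choice

def crc32_bytes(data: bytes) -> bytes:
    return zlib.crc32(data).to_bytes(4, "big")

def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def streamed(data: bytes):
    """Feed the input block by block, carrying bounded state between blocks."""
    crc, sha = 0, hashlib.sha256()
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        crc = zlib.crc32(block, crc)  # the whole CRC32 state is one 32-bit word
        sha.update(block)             # SHA-256 keeps its state inside the object
    return crc.to_bytes(4, "big"), sha.digest()

def xor(*parts: bytes) -> bytes:
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

def bits_changed(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def delta_is_predictable(h, msg: bytes, delta: bytes) -> bool:
    """True if h(msg XOR delta) follows from h(msg), delta and the length
    alone, i.e. h is affine over XOR. Such a function detects accidents but
    lets anyone who alters the data compute the matching checksum change."""
    zeros = bytes(len(msg))
    return h(xor(msg, delta)) == xor(h(msg), h(delta), h(zeros))

# Constructed sample: 200 bytes, and a change that flips one bit in byte 17.
MSG = bytes(range(200))
DELTA = bytes(17) + b"\x01" + bytes(182)
FLIPPED = xor(MSG, DELTA)

if __name__ == "__main__":
    print("streamed == one-shot:", streamed(MSG) == (crc32_bytes(MSG), sha256_bytes(MSG)))
    print("SHA-256 bits changed:", bits_changed(sha256_bytes(MSG), sha256_bytes(FLIPPED)), "of 256")
    print("CRC32 change predictable:", delta_is_predictable(crc32_bytes, MSG, DELTA))
    print("SHA-256 change predictable:", delta_is_predictable(sha256_bytes, MSG, DELTA))
```

Because the CRC32 relation holds for every message of a given length, a CRC on its own only protects against accidental corruption; the integrity of anything an adversary can touch needs a cryptographic hash obtained over a trusted channel.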