Concept

Tick Storage

Definition

Tick storage is the family of data-engineering patterns used to persist and query market data at its finest granularity — every quote update, every trade print, every order-book change. The volume is enormous: a single liquid US equity can produce millions of events per day, and a multi-asset research dataset crosses into petabyte territory quickly. Efficient storage is therefore not an optimisation — it is the difference between a working dataset and an unworkable one.

The standard pattern is columnar, append-only, partitioned by date and instrument, and compressed with a codec tuned for the predictable patterns of tick data. Parquet (on disk) and Arrow (in memory) are the modern defaults; older deployments rely on HDF5 or kdb+.

Why it matters

How it works

Three design choices do most of the work. First, columnar storage: every column lives in its own contiguous block on disk, so scans touch only the columns the query asked for and compression operates on uniform value types. Tick data has strong column-level patterns: prices stay within tight ranges, timestamps are non-decreasing, and venue codes have low cardinality. Codecs exploit each pattern in turn. Delta encoding shrinks timestamps and prices to small differences, dictionary encoding replaces repeated venue codes with compact integer indices, and run-length encoding collapses runs of identical values. A general-purpose compressor is then applied to the encoded output.
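The three encodings named above can be illustrated with a minimal pure-Python sketch. It is not how Parquet or any particular codec implements them; the function names, the sample values, and the choice to store prices as integer ticks are assumptions made for illustration only.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store each successive difference."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def dictionary_encode(values):
    """Map each distinct value to a small integer index (first-seen order)."""
    codes = {}
    indices = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), indices


def run_length_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


# Hypothetical sample columns for four ticks of one instrument.
timestamps_ns = [1_000_000, 1_000_250, 1_000_250, 1_000_900]
price_ticks = [10012, 10013, 10013, 10011]  # assumed: prices as integer ticks
venues = ["XNAS", "XNAS", "ARCX", "XNAS"]

ts_deltas = delta_encode(timestamps_ns)
price_deltas = delta_encode(price_ticks)
venue_dict, venue_idx = dictionary_encode(venues)
venue_runs = run_length_encode(venue_idx)
```

After the first entry, the delta-encoded columns contain only small integers, which is what makes the general-purpose compressor that follows more effective. Decoding with `delta_decode` recovers the original column exactly, so the encoding is lossless.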

Second, partitioning by a coarse time bucket and the instrument identifier. A query for one symbol on one day reads exactly one partition; a multi-symbol study still benefits because the engine can plan parallel reads across files.

Third, the store is treated as append-only and immutable. Late-arriving corrections are written as new records rather than in-place edits, with a sequence number or arrival timestamp letting downstream readers reconstruct the canonical view at any point in time. This preserves an audit trail and makes reproducible research possible: a backtest run last quarter can be reproduced exactly because the underlying data has not been mutated.
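The append-only correction scheme can be sketched as follows. This is a minimal illustration rather than a reference implementation: the `TickRecord` fields, the `event_id` key that links a correction to the original event, and the `canonical_view` helper are all hypothetical names, and the sketch assumes a strictly increasing arrival sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TickRecord:
    event_id: str  # assumed: stable identifier shared by an event and its corrections
    seq: int       # assumed: arrival sequence number, strictly increasing
    price: float
    size: int


def canonical_view(log, as_of_seq):
    """Return {event_id: record}, keeping the newest version of each event
    among records that had arrived by as_of_seq. The log is never modified."""
    latest = {}
    for rec in log:
        if rec.seq > as_of_seq:
            continue  # not yet visible at the requested point in time
        prev = latest.get(rec.event_id)
        if prev is None or rec.seq > prev.seq:
            latest[rec.event_id] = rec
    return latest


# Records are only ever appended; the third one corrects trade "t1".
log = [
    TickRecord("t1", 1, 100.10, 200),
    TickRecord("t2", 2, 100.12, 100),
    TickRecord("t1", 3, 100.01, 200),  # late-arriving correction
]

view_before = canonical_view(log, as_of_seq=2)  # what a reader saw before the fix
view_after = canonical_view(log, as_of_seq=3)   # current canonical view
```

A backtest that records the `as_of_seq` it ran against can be rerun later against the same view, even after further corrections have been appended.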

Where it goes next
