Store Financial Market Data for Easy Retrieval

5 min read

Core idea

The storage decision is a trade-off, not a default

Where do you put market data once you've downloaded it? The cookbook's fourth topic surveys the four most common answers — CSV (with compression), SQLite, PostgreSQL, and HDF5 — and treats them not as competitors but as points in a trade-off space. Each option optimizes one axis: portability, simplicity, concurrency, or raw speed. The competent quant picks the right one for the stage of the pipeline they're in.

The unifying mechanism is pandas.to_* / pandas.read_*. Every format the topic covers has a one-line pandas binding — to_csv, to_sql, HDFStore.put — so the storage layer is plug-and-play from pandas's perspective. The architectural decision is which format, not which API.

Automate the daily refresh once and stop touching it

The second strand of the topic is operational: once you've picked a storage format, wrap it in a script and schedule the script to run automatically at market close. cron on macOS/Linux, Task Scheduler on Windows. The exchange_calendars library answers the "is today a trading day?" question (NYSE was closed yesterday; was that a holiday or the weekend?), so your script doesn't run on Christmas and append the same Friday data three times.

A trading desk's data archive should grow by one row per ticker per trading day, automatically, with zero human intervention. That's the operational target of this topic.

Why it matters

Re-downloading is friction and friction kills iteration

Every minute spent re-pulling data from a vendor API is a minute not spent testing ideas. Free APIs throttle (Nasdaq Data Link: 50 calls/day without a key); paid APIs charge per call. The economic shape of every research project trends the same way: pull once, store, then iterate over the local copy hundreds of times. A storage layer that's well-tuned to your access pattern makes the iteration loop near-instantaneous.

The wrong format reveals itself only at scale

CSV works fine for 10 tickers and three years of history; at 3,000 tickers and twenty years it's painful (read_csv of a 2 GB file). SQLite works fine when one person queries it; under concurrent writes from multiple processes it locks. Postgres handles concurrency but needs a server. HDF5 is the fastest read for numerical arrays but offers no SQL. The mistake is committing to one format early; the discipline is migrating as your needs evolve.

Key takeaways

Mental model

The format trade-off space

The format trade-off space

Symmetry: every format reads back through pandas

A property worth internalizing: regardless of which storage format you pick, the read API into pandas is one line.

  • CSV: pd.read_csv("data.gz", compression="gzip")
  • SQLite: pd.read_sql("SELECT * FROM stock_data", conn)
  • Postgres: same as SQLite but with a SQLAlchemy engine
  • HDF5: pd.read_hdf("data.h5", key="equities/spy/prices")

This is what makes the trade-off real. Migrating from CSV to SQLite is a one-day exercise; migrating from SQLite to Postgres is one-week-ish; the downstream notebooks don't change at all. Pick the format that fits today's problem; rewrite the storage layer when today's problem changes, not before.

The scheduled-refresh skeleton

Every production pipeline looks roughly the same:

The scheduled-refresh skeleton

Practical application

Picking your starting format

For a working quant beginning a new research project today, the practical advice is:

  1. Use SQLite from day one. The setup cost is identical to CSV (just sqlite3.connect("data.sqlite") instead of opening a file), but you get SQL, transactions, and the ability to run SELECT close FROM bars WHERE symbol = 'AAPL' AND date > '2023-01-01' without read_csv-ing the whole file first.
  2. Add a daily refresh script as soon as you're returning to the data twice. The cookbook's market_data_sqlite.py skeleton — accept bulk or last from argv, use exchange_calendars to gate trading-day logic, append via to_sql — is the minimal viable shape.
  3. Schedule it once. crontab -e (Linux/macOS) or Task Scheduler (Windows). Set it to 23:00 in your local timezone and forget it.
  4. Migrate to Postgres if and when a second person needs concurrent access, or a second machine needs to write into the same store. Not before.
  5. Migrate to HDF5 or Parquet if and when the read latency of SQLite becomes the bottleneck — typically when you're sweeping thousands of parameter combinations over many years of data.

HDF5 paths as taxonomy

A non-obvious feature of HDF5 worth knowing: paths inside an .h5 file behave like a filesystem. store.put("equities/spy/stock_prices", df) writes the DataFrame at the path /equities/spy/stock_prices, and you can later query just that subtree. This is structural — different asset classes, different symbols, different expirations all fit naturally into the path hierarchy:

futures.h5
  /futures/CL/2024-12
  /futures/CL/2025-12
  /futures/ES/2024-12
  ...
stocks.h5
  /equities/SPY/prices
  /equities/SPY/options/chain
  /equities/AAPL/prices
  ...

Combine that with the fact that HDF5 reads numerical arrays roughly an order of magnitude faster than CSV, and you have a storage format ideally suited to "the trader has 200 backtests to run today and each one reads 50 series."

Example

Consider a quant who runs a "calendar-spread on every WTI crude oil expiration" strategy. The data the strategy needs: daily settlement prices for every CL futures contract over the last fifteen years, plus a few macro series (10-year yield, dollar index).

The storage architecture:

  1. One .h5 file per asset class: futures.h5, macro.h5, etf.h5.
  2. Inside futures.h5, the hierarchy is /futures/{root}/{YYYY-MM} — one path per expiration. The trader pulls each expiration once from OpenBB, then stores it. Subsequent reads are sub-millisecond.
  3. A nightly cron job runs refresh.py at 23:00 EST. The script: (a) checks exchange_calendars for the day; (b) pulls the prompt-month contract's bar for the day; (c) appends to /futures/CL/{prompt_month}.
  4. A SQLite database alongside holds metadata: which expirations have been pulled, which dates are complete, when the last refresh succeeded. This makes recovery easy when a network blip skips a day.

The trade-off the trader made: HDF5 for raw price reads (because the backtester sweeps thousands of contract pairs, and HDF5's read speed makes that tolerable); SQLite for metadata (because the trader wants to ask "which contracts am I missing data for?" in SQL). The two formats coexist in the same project; each is doing the job it does best.

Continue exploring

Tags