Store Financial Market Data for Easy Retrieval
5 min read
Core idea
The storage decision is a trade-off, not a default
Where do you put market data once you've downloaded it? The cookbook's fourth topic surveys the four most common answers — CSV (with compression), SQLite, PostgreSQL, and HDF5 — and treats them not as competitors but as points in a trade-off space. Each option optimizes one axis: portability, simplicity, concurrency, or raw speed. The competent quant picks the right one for the stage of the pipeline they're in.
The unifying mechanism is pandas.to_* / pandas.read_*. Every format the topic covers has a one-line pandas binding — to_csv, to_sql, HDFStore.put — so the storage layer is plug-and-play from pandas's perspective. The architectural decision is which format, not which API.
Automate the daily refresh once and stop touching it
The second strand of the topic is operational: once you've picked a storage format, wrap it in a script and schedule the script to run automatically at market close. cron on macOS/Linux, Task Scheduler on Windows. The exchange_calendars library answers the "is today a trading day?" question (NYSE was closed yesterday; was that a holiday or the weekend?), so your script doesn't run on Christmas and append the same Friday data three times.
A trading desk's data archive should grow by one row per ticker per trading day, automatically, with zero human intervention. That's the operational target of this topic.
Why it matters
Re-downloading is friction and friction kills iteration
Every minute spent re-pulling data from a vendor API is a minute not spent testing ideas. Free APIs throttle (Nasdaq Data Link: 50 calls/day without a key); paid APIs charge per call. The economic shape of every research project trends the same way: pull once, store, then iterate over the local copy hundreds of times. A storage layer that's well-tuned to your access pattern makes the iteration loop near-instantaneous.
The wrong format reveals itself only at scale
CSV works fine for 10 tickers and three years of history; at 3,000 tickers and twenty years it's painful (read_csv of a 2 GB file). SQLite works fine when one person queries it; under concurrent writes from multiple processes it locks. Postgres handles concurrency but needs a server. HDF5 is the fastest read for numerical arrays but offers no SQL. The mistake is committing to one format early; the discipline is migrating as your needs evolve.
Key takeaways
Mental model
The format trade-off space
Symmetry: every format reads back through pandas
A property worth internalizing: regardless of which storage format you pick, the read API into pandas is one line.
- CSV:
pd.read_csv("data.gz", compression="gzip") - SQLite:
pd.read_sql("SELECT * FROM stock_data", conn) - Postgres: same as SQLite but with a SQLAlchemy engine
- HDF5:
pd.read_hdf("data.h5", key="equities/spy/prices")
This is what makes the trade-off real. Migrating from CSV to SQLite is a one-day exercise; migrating from SQLite to Postgres is one-week-ish; the downstream notebooks don't change at all. Pick the format that fits today's problem; rewrite the storage layer when today's problem changes, not before.
The scheduled-refresh skeleton
Every production pipeline looks roughly the same:
Practical application
Picking your starting format
For a working quant beginning a new research project today, the practical advice is:
- Use SQLite from day one. The setup cost is identical to CSV (just
sqlite3.connect("data.sqlite")instead of opening a file), but you get SQL, transactions, and the ability to runSELECT close FROM bars WHERE symbol = 'AAPL' AND date > '2023-01-01'withoutread_csv-ing the whole file first. - Add a daily refresh script as soon as you're returning to the data twice. The cookbook's
market_data_sqlite.pyskeleton — acceptbulkorlastfromargv, useexchange_calendarsto gate trading-day logic, append viato_sql— is the minimal viable shape. - Schedule it once.
crontab -e(Linux/macOS) or Task Scheduler (Windows). Set it to 23:00 in your local timezone and forget it. - Migrate to Postgres if and when a second person needs concurrent access, or a second machine needs to write into the same store. Not before.
- Migrate to HDF5 or Parquet if and when the read latency of SQLite becomes the bottleneck — typically when you're sweeping thousands of parameter combinations over many years of data.
HDF5 paths as taxonomy
A non-obvious feature of HDF5 worth knowing: paths inside an .h5 file behave like a filesystem. store.put("equities/spy/stock_prices", df) writes the DataFrame at the path /equities/spy/stock_prices, and you can later query just that subtree. This is structural — different asset classes, different symbols, different expirations all fit naturally into the path hierarchy:
futures.h5
/futures/CL/2024-12
/futures/CL/2025-12
/futures/ES/2024-12
...
stocks.h5
/equities/SPY/prices
/equities/SPY/options/chain
/equities/AAPL/prices
...
Combine that with the fact that HDF5 reads numerical arrays roughly an order of magnitude faster than CSV, and you have a storage format ideally suited to "the trader has 200 backtests to run today and each one reads 50 series."
Example
Consider a quant who runs a "calendar-spread on every WTI crude oil expiration" strategy. The data the strategy needs: daily settlement prices for every CL futures contract over the last fifteen years, plus a few macro series (10-year yield, dollar index).
The storage architecture:
- One
.h5file per asset class:futures.h5,macro.h5,etf.h5. - Inside futures.h5, the hierarchy is
/futures/{root}/{YYYY-MM}— one path per expiration. The trader pulls each expiration once from OpenBB, then stores it. Subsequent reads are sub-millisecond. - A nightly cron job runs
refresh.pyat 23:00 EST. The script: (a) checksexchange_calendarsfor the day; (b) pulls the prompt-month contract's bar for the day; (c) appends to/futures/CL/{prompt_month}. - A SQLite database alongside holds metadata: which expirations have been pulled, which dates are complete, when the last refresh succeeded. This makes recovery easy when a network blip skips a day.
The trade-off the trader made: HDF5 for raw price reads (because the backtester sweeps thousands of contract pairs, and HDF5's read speed makes that tolerable); SQLite for metadata (because the trader wants to ask "which contracts am I missing data for?" in SQL). The two formats coexist in the same project; each is doing the job it does best.
Related lessons
Related concepts
- Data Pipelinelinked concept
- Time-Series Datalinked concept
- ETLlinked concept
- Algorithmic Tradinglinked concept