Analyze and Transform Financial Market Data with pandas
6 min read
Core idea
pandas is the lingua franca, and the Index is what makes it work
Wes McKinney built pandas while running quant research at AQR. Every design decision in the library — immutable indexes, label-based alignment, MultiIndex for high-dimensional data, time-aware resampling — was driven by problems quants face daily: aligning prices that arrive at different timestamps, reshaping option chains across strikes and expirations, computing rolling statistics over irregular windows. This topic is the longest in the cookbook because it is the foundation everything else stands on. Get the DataFrame contract right and the rest of the stack reads naturally; get it wrong and every downstream notebook fights you.
The central insight to internalize: pandas operations align on the index, not on row position. When you add two Series, pandas matches them by label. When you join two DataFrames, it joins by index. When a column doesn't exist on both sides, pandas fills with NaN rather than crashing. That label-alignment behaviour is what lets you take prices from different feeds, with different start dates and different missing-data patterns, and combine them without writing any merge code. It is also the source of every bug that looks like "my numbers are off by one row."
Returns, volatility, and resampling — the three primitives
Beyond mechanics, the topic teaches three computational primitives that show up in every trading pipeline:
- Returns — simple (
pct_change) for portfolio aggregation; compound / log (np.log(price / price.shift())) for time-series math. The choice is not stylistic; it determines which math is valid. - Volatility — sample standard deviation of returns, annualized by
√(252)for daily data,√(12)for monthly,√(4)for quarterly. The square-root scaling comes from variance being additive across independent periods, not standard deviation. - Resampling —
df.resample(rule="...")converts between frequencies.ohlc()aggregates intraday quotes into bars;mean()smooths;last()snapshots. The whole pipeline tends to be daily for backtesting, intraday for execution.
These three primitives, applied to a properly indexed DataFrame, cover roughly 80% of the data work in a typical strategy.
Why it matters
Most "model" bugs are actually data-shape bugs
Empirically, the time spent on a strategy splits 70/20/10 between data wrangling, modelling, and execution. Inside the data-wrangling 70%, most of the pain is shape mismatches — a forward-fill that filled too aggressively, a groupby that broke the index, a merge that introduced rows. Understanding the pandas Index types (especially DatetimeIndex, PeriodIndex, and MultiIndex) is the single highest-leverage investment a new quant can make. The topic spends so many pages on them for a reason.
MultiIndex is how options and panel data fit in two dimensions
A DataFrame is two-dimensional, but financial data isn't. Options chains live in (expiration, strike, option_type). Panel data lives in (symbol, date). The MultiIndex is pandas's answer: a hierarchical index that lets a 2D DataFrame represent N-dimensional data, with loc, groupby, unstack, and xs for slicing along any level. Quants who skip MultiIndex end up with thousands of CSV files; quants who learn it end up with one DataFrame.
Key takeaways
Mental model
The Index hierarchy
There are seven pandas index types, but four matter for trading work. The right index choice eliminates whole classes of bugs.
Selection: position vs. label vs. condition
Three families of selection coexist in pandas, and mixing them is the most common source of off-by-one errors:
.iloc— integer position.df.iloc[0]is always the first row, regardless of what its label is. Half-open ranges:df.iloc[0:3]returns rows at positions 0, 1, 2..loc— labels.df.loc["2024-01-02"]returns the row whose Index value is the timestamp2024-01-02. Closed ranges:df.loc["2024-01-02":"2024-01-11"]includes both endpoints.- Boolean indexing —
df[df.close > df.close.mean()]. Returns the rows where the condition is True.
The closed-vs-half-open distinction (loc closed, iloc half-open) trips up everyone exactly once. After that, you remember.
Returns: simple, log, and what each is for
The cookbook spends real ink distinguishing simple returns from log returns. The framing that sticks:
- Simple returns (
R = P_t / P_{t-1} - 1) add across assets, multiply across time. A portfolio's simple return is the weighted sum of its constituents' simple returns. A two-period simple return is(1 + R_1)(1 + R_2) - 1. - Log returns (
r = ln(P_t / P_{t-1})) add across time, do not aggregate cleanly across assets. A two-period log return isr_1 + r_2. But the log return of a portfolio is NOT the weighted sum of constituent log returns.
If you are computing portfolio P&L, use simple returns. If you are computing time-series statistics or fitting a model that assumes normality, use log returns. Many quant bugs come from picking one and using it for both jobs.
Volatility and the √-time rule
Volatility is the standard deviation of returns. Daily volatility annualizes to yearly via σ_daily × √252, not σ_daily × 252. Why? Because variance (not standard deviation) is additive across independent periods. Yearly variance = daily variance × 252; take the square root to get back to standard deviation. Same logic gives √12 for monthly and √4 for quarterly.
In practice you almost always want the rolling annualized volatility, not the lifetime number — markets regime-shift and a single number across decades hides everything interesting. The idiom is:
returns.rolling(window=21).std() * np.sqrt(252)
A 21-day window approximates a calendar month of trading days; tune to your strategy's holding period.
Resampling: closed and label are not symmetric
df.resample(rule="1H") looks innocuous and is the source of a surprising number of bugs. Two arguments matter:
closed— which side of the bin interval is inclusive. For daily bars from minute quotes,closed="left"means the 09:30 bar covers[09:30, 09:31). For most market data, you wantclosed="left".label— which side's timestamp labels the bin.label="left"labels with 09:30;label="right"labels with 09:31.
The defaults differ by rule (some are left-closed-right-labelled, some are right-closed-left-labelled). Always set both explicitly. The cookbook's ohlc() aggregator is the canonical example: it composes first / max / min / last into one bar.
Practical application
A typical analysis pipeline
For a quant taking raw prices and producing a tradable signal, the topic's primitives compose roughly as follows:
- Acquire:
obb.equity.price.historical(tickers, ...)produces a long-format DataFrame. - Reshape:
.pivot(columns="symbol", values="close")turns it into a wide DataFrame with one column per ticker and aDatetimeIndex. - Clean: forward-fill if appropriate (
fill_method="ffill"); drop rows where any required ticker is missing (.dropna()). - Returns:
daily = wide.pct_change()for portfolio math;log_returns = np.log(wide / wide.shift())for statistical work. - Roll:
rolling_vol = daily.rolling(21).std() * np.sqrt(252)for regime detection. - Aggregate:
daily.resample("ME").apply(lambda r: (1 + r).prod() - 1)for monthly returns suitable for a Fama-French regression. - Slice:
daily.loc["2020":, ["AAPL", "MSFT"]]for a specific window and subset.
Every step preserves the DatetimeIndex and tickers as columns, so the entire pipeline composes as method chains.
MultiIndex as the panel data convention
Whenever you work with more than one ticker over time, the MultiIndex of (symbol, date) becomes the dominant idiom. The pattern:
- Construction:
df.set_index(["symbol", "date"]). Order matters; the outer level controls howgroupbyslices the data. - Per-asset operations:
df.groupby(level="symbol").pct_change()computes returns within each symbol's series. Without the groupby,pct_changecrosses symbol boundaries and you get spurious returns at every ticker transition. - Cross-section at a date:
df.xs("2024-01-02", level="date")returns the cross-section of every symbol on that date. - Time series for one symbol:
df.xs("AAPL", level="symbol")returns AAPL's whole series.
Once this vocabulary is fluent, panel data feels native. Until then, it feels like fighting pandas.
Example
Consider a quant building a "sector rotation" signal: rank the 11 GICS sector ETFs by their 21-day return on the first trading day of each month, hold the top 3 equally weighted for the month, repeat.
The data layer for that strategy lives entirely in the primitives this topic teaches:
- Pull the 11 sector ETFs (
XLK,XLV,XLF,XLE,XLY,XLP,XLI,XLU,XLB,XLRE,XLC) via OpenBB into one long DataFrame indexed by(symbol, date). - Compute trailing 21-day returns per symbol:
groupby(level="symbol").close.pct_change(21). - Resample to month-end:
.resample("ME", level="date").last()(after pivoting to wide), which snapshots the 21-day return at each month-end for every sector. - Within each month, rank:
monthly_returns.rank(axis=1, ascending=False). - Mask to the top 3:
(ranks <= 3).astype(int)gives a binary "in-portfolio" matrix. - Compute portfolio returns: shift the mask forward by one period (to avoid look-ahead), multiply by next-month returns, divide by 3.
The whole pipeline is about 15 lines. Every line is a pandas idiom from this topic. The strategy may not work — most don't — but the cost of testing it is now small enough that the trader can move on to the next hypothesis instead of debugging the data layer.
Related lessons
Related concepts
- Time-Series Datalinked concept
- Dataframelinked concept
- Returnslinked concept
- Volatilitylinked concept
- Quantitative Financelinked concept