Assess Backtest Risk and Performance Metrics with Pyfolio

Core idea

A trading strategy is a probability distribution dressed as an equity curve, and any single number you pull off the curve will mislead you. The Sharpe ratio scales return by volatility but says nothing about how deep or how long the drawdowns run. Max drawdown captures the worst peak-to-trough loss but says nothing about the return earned for bearing it. Either one viewed alone can lead to the wrong decision. Pyfolio Reloaded, which sits on top of empyrical-reloaded and consumes Zipline backtest output, exists to make the composite view cheap to produce: return analytics, drawdown analytics, rolling-risk analytics, exposure analytics, and trade-level analytics, all from one backtest pickle and a benchmark.

Author's framing: No single risk or performance metric tells the entire story. The composite view across multiple metrics is what reveals how a strategy actually behaves under different market regimes.
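To make the point concrete, here is a minimal, dependency-free sketch (not Pyfolio's own implementation) that computes a Sharpe ratio and a max drawdown for the same set of returns in two different orders. The return values are made up for illustration, and a zero risk-free rate is assumed.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample standard deviation
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Worst peak-to-trough decline of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


# Hypothetical daily returns: the same 12 gains and 8 losses, ordered two ways.
clustered = [0.02] * 12 + [-0.02] * 8                # all losses arrive back to back
interleaved = [0.02, 0.02, -0.02, 0.02, -0.02] * 4   # losses spread out

for name, series in (("clustered", clustered), ("interleaved", interleaved)):
    print(f"{name}: Sharpe={sharpe_ratio(series):.2f}, "
          f"max drawdown={max_drawdown(series):.1%}")
```

Both orderings have the same mean and standard deviation, so the Sharpe ratio cannot tell them apart. The drawdown can: roughly 15% when the losses cluster, against roughly 2% when they are spread out.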

Why it matters

Strategy performance shifts by regime

A strategy that earns a 1.8 Sharpe over a five-year backtest may have a 2.4 Sharpe in the 2017 bull period and a 0.3 Sharpe through the 2018 selloff. Full-period summary statistics hide this. Pyfolio's rolling-window plots (plot_rolling_volatility, plot_rolling_sharpe) and per-period breakdowns (plot_monthly_returns_heatmap, plot_annual_returns) surface the regime sensitivity that single numbers conceal. The live_start_date argument splits the metrics into a backtest (in-sample) bucket and a post-deployment (out-of-sample) bucket, which is the most honest measure of whether a strategy still works.
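The idea behind a rolling Sharpe can be sketched in a few lines of plain Python. This is an illustration of the calculation, not Pyfolio's code: the two-regime return series is invented, the risk-free rate is assumed to be zero, and the 60-day window is an arbitrary choice.

```python
import math
import statistics


def rolling_sharpe(returns, window, periods_per_year=252):
    """Annualized Sharpe over each trailing window (zero risk-free rate assumed)."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        std = statistics.stdev(chunk)
        mean = statistics.fmean(chunk)
        # A flat window has no volatility, so the ratio is undefined.
        out.append(mean / std * math.sqrt(periods_per_year) if std > 0 else float("nan"))
    return out


# Hypothetical two-regime series: 60 days of steady gains, then 60 choppy days.
calm = [0.004, 0.001, 0.003, 0.002] * 15
stress = [0.010, -0.012, 0.008, -0.009] * 15
returns = calm + stress

full_period = rolling_sharpe(returns, window=len(returns))[0]
windows = rolling_sharpe(returns, window=60)

print(f"full-period Sharpe:  {full_period:.2f}")
print(f"first 60-day window: {windows[0]:.2f}")
print(f"last 60-day window:  {windows[-1]:.2f}")
```

The full-period figure sits between a strongly positive first window and a negative last one, which is the regime shift a single summary number averages away.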

Drawdown is the survival metric, not the performance one

Annual return tells you what you might earn. Max drawdown tells you the point at which you might quit. A 50% drawdown, even on a strategy that ultimately doubles, is psychologically unsurvivable for most operators and structurally unsurvivable for leveraged ones. plot_drawdown_periods, plot_drawdown_underwater, and show_worst_drawdown_periods together answer three questions: how bad does it get, how long does it stay bad, and how long until it recovers?
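Those three questions map onto three points on the equity curve: the peak, the trough, and the recovery. The following sketch locates them for the deepest drawdown. It is a simplified stand-in for what a drawdown table reports, using positional indices instead of dates and a made-up return series.

```python
def worst_drawdown(returns):
    """Locate the deepest drawdown of the compounded equity curve.

    Returns (depth, peak_idx, trough_idx, recovery_idx). Indices are positions
    in `returns`; peak_idx is -1 if the peak is the starting capital, and
    recovery_idx is None if the prior peak is never regained.
    """
    equity, peak, peak_idx = 1.0, 1.0, -1
    curve = []
    worst = (0.0, None, None, None)  # depth, peak index, trough index, peak value
    for i, r in enumerate(returns):
        equity *= 1.0 + r
        curve.append(equity)
        if equity > peak:
            peak, peak_idx = equity, i
        depth = equity / peak - 1.0
        if depth < worst[0]:
            worst = (depth, peak_idx, i, peak)
    depth, p, t, peak_value = worst
    if t is None:  # the curve never fell below a prior peak
        return 0.0, None, None, None
    recovery = next(
        (j for j in range(t + 1, len(curve)) if curve[j] >= peak_value), None
    )
    return depth, p, t, recovery


# Hypothetical period returns.
returns = [0.10, -0.20, -0.10, 0.15, 0.15, 0.10, 0.05]
depth, peak_idx, trough_idx, recovery_idx = worst_drawdown(returns)

print(f"depth: {depth:.0%}")
print(f"periods from peak to trough: {trough_idx - peak_idx}")
print(f"periods from trough to recovery: {recovery_idx - trough_idx}")
```

In this toy series the loss takes two periods to reach the bottom and three more to recover, so the time spent underwater is longer than the decline itself.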

Exposure and sector concentration explain returns

Two strategies can have identical equity curves and radically different risk profiles if one is 200% net long and the other is market-neutral. plot_gross_leverage, plot_exposures, plot_holdings, show_and_plot_top_positions, and plot_sector_allocations trace the returns back to the portfolio composition that produced them. A sector map (built from OpenBB Platform company profile data) lets you ask whether your "stock-picking" alpha was secretly a sector bet.
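A small sketch shows why gross leverage and net exposure have to be read together. It assumes one day of signed dollar positions plus a cash balance; the symbols and amounts are hypothetical, and the function is an illustration, not Pyfolio's API.

```python
def exposure_summary(holdings, cash):
    """Gross leverage and net exposure for one day of dollar positions.

    `holdings` maps symbol -> signed dollar value (negative for shorts).
    Both ratios are taken against net liquidation value (holdings plus cash).
    """
    net_liquidation = sum(holdings.values()) + cash
    long_value = sum(v for v in holdings.values() if v > 0)
    short_value = sum(v for v in holdings.values() if v < 0)  # negative number
    return {
        "gross_leverage": (long_value - short_value) / net_liquidation,
        "net_exposure": (long_value + short_value) / net_liquidation,
    }


# Two hypothetical books, each with $100 of equity.
levered_long = exposure_summary({"AAA": 120.0, "BBB": 80.0}, cash=-100.0)
market_neutral = exposure_summary({"AAA": 100.0, "BBB": -100.0}, cash=100.0)

print("levered long:  ", levered_long)
print("market neutral:", market_neutral)
```

Both books run 2x gross leverage, but the first is 200% net long and the second has zero net exposure, so a broad market move hits them very differently.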

Trade-level analysis reveals the real distribution

Round-trip extraction — pairing each opening transaction with its closing counterpart — turns the time-series of P&L into a distribution of individual bets. extract_round_trips, print_round_trip_stats, and plot_round_trip_lifetimes answer questions strategy-level metrics cannot: What's the win rate? The profit factor? The average holding period by sector? Is the strategy actually 100 mediocre trades or 5 lucky ones?
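Once round trips are extracted, the headline trade statistics are simple ratios over the per-trade P&L. The sketch below computes them for two invented P&L lists. The "largest win share" figure is an extra diagnostic added here for illustration and is not claimed to be part of Pyfolio's round-trip output.

```python
def round_trip_summary(pnls):
    """Win rate, profit factor, and concentration for a list of trade P&Ls."""
    wins = [p for p in pnls if p > 0]
    losses = [p for p in pnls if p < 0]
    gross_profit = sum(wins)
    gross_loss = -sum(losses)
    return {
        "win_rate": len(wins) / len(pnls),
        # Profit factor: gross profit divided by gross loss.
        "profit_factor": gross_profit / gross_loss if gross_loss else float("inf"),
        # Share of gross profit that came from the single best trade.
        "largest_win_share": max(wins) / gross_profit if wins else 0.0,
    }


# Hypothetical round-trip P&Ls with the same totals but different shapes.
steady = [10, -5, 10, -5, 10, -5, 10, -5]
lucky = [37, -5, 1, -5, 1, -5, 1, -5]

print("steady:", round_trip_summary(steady))
print("lucky: ", round_trip_summary(lucky))
```

Both lists have a 50% win rate and a profit factor of 2.0, yet one trade supplies more than 90% of the second list's gross profit. Looking at the distribution of individual bets is what separates the two.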

Practical application

The Pyfolio workflow has a fixed shape; only the inputs change between strategies.

  1. Prepare the triplet. Load the Zipline pickle with pd.read_pickle. Call pf.utils.extract_rets_pos_txn_from_zipline(perf) to get returns, positions, and transactions. Replace Equity objects in the transactions DataFrame's symbol column with their string representations using .apply(lambda s: s.symbol).

  2. Acquire the benchmark and sector map. Pull SPY (or another benchmark) historical prices via OpenBB, compute percent changes, localize to UTC, and align to the returns index. Call the OpenBB equity profile endpoint (obb.equity.profile) on the position symbols to build a {symbol: sector} dictionary, marking missing sectors as "Unknown".

  3. Run the return analytics. plot_rolling_returns, show_perf_stats (with live_start_date), plot_monthly_returns_heatmap, plot_annual_returns, plot_returns, plot_return_quantiles. Pass live_start_date consistently so the backtest-vs-live split is honest.

  4. Run the drawdown and rolling-risk analytics. plot_drawdown_periods(top=10), plot_drawdown_underwater, show_worst_drawdown_periods. Layer on plot_rolling_volatility and plot_rolling_sharpe to see regime shifts. Use extract_interesting_date_ranges to overlay known stress windows.

  5. Run the exposure analytics. plot_holdings, plot_long_short_holdings, plot_gross_leverage, plot_exposures, show_and_plot_top_positions, plot_sector_allocations (consuming the sector map).

  6. Run the trade-level analytics. Extract round trips from transactions, then print_round_trip_stats and plot_round_trip_lifetimes. Re-run with sector-grouped round trips to see which sectors produced the bets.
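The steps above can be sketched as follows, covering steps 1, 3, 4, and 6 plus the "Unknown" fallback from step 2. Treat this as a rough outline under assumptions rather than a definitive implementation: it assumes the plotting functions live under pf.plotting and the round-trip functions under pf.round_trips, that the benchmark returns are already aligned to the strategy returns, and that the transactions frame carries amount, price, and symbol columns. The names build_sector_map, run_core_analytics, and sector_lookup are hypothetical.

```python
def build_sector_map(symbols, sector_lookup):
    """Map each symbol to a sector, marking gaps as "Unknown".

    `sector_lookup` is a hypothetical {symbol: sector or None} dict standing
    in for whatever the data provider returned.
    """
    return {symbol: sector_lookup.get(symbol) or "Unknown" for symbol in symbols}


def run_core_analytics(pickle_path, benchmark_rets, live_start_date):
    """Run a subset of the workflow for one Zipline backtest pickle (sketch)."""
    # Imported lazily so build_sector_map works without these packages.
    import matplotlib.pyplot as plt
    import pandas as pd
    import pyfolio as pf

    # Step 1: prepare the returns / positions / transactions triplet.
    perf = pd.read_pickle(pickle_path)
    returns, positions, transactions = pf.utils.extract_rets_pos_txn_from_zipline(perf)
    transactions["symbol"] = transactions["symbol"].apply(lambda s: s.symbol)

    # Step 3: return analytics, split at live_start_date.
    pf.plotting.show_perf_stats(
        returns,
        factor_returns=benchmark_rets,
        positions=positions,
        transactions=transactions,
        live_start_date=live_start_date,
    )

    # Step 4: drawdown and rolling-risk plots, each on its own axes.
    _, ax = plt.subplots()
    pf.plotting.plot_drawdown_periods(returns, top=10, ax=ax)
    for plot in (pf.plotting.plot_drawdown_underwater, pf.plotting.plot_rolling_sharpe):
        _, ax = plt.subplots()
        plot(returns, ax=ax)

    # Step 6: trade-level analytics from paired transactions.
    round_trips = pf.round_trips.extract_round_trips(
        transactions[["amount", "price", "symbol"]]
    )
    pf.round_trips.print_round_trip_stats(round_trips)

    return returns, positions, transactions, round_trips


# The sector-map helper needs no third-party packages.
print(build_sector_map(["AAPL", "XYZ"], {"AAPL": "Technology"}))
```

Each plot gets its own axes because the plotting functions otherwise draw onto the current one. The exposure analytics in step 5 follow the same pattern, with positions and the sector map as additional inputs.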

Example

Imagine a long-short equity strategy that earns 14% annualized with a Sharpe of 1.2 over a five-year backtest. The numbers are decent. Then you run the Pyfolio tearsheet.

The rolling Sharpe shows the strategy earned 80% of its alpha in 2017 and was flat-to-negative for the eighteen months that followed. The plot_drawdown_underwater plot reveals a 22% drawdown that took fourteen months to recover. The exposure plot shows gross leverage drifting from 1.5x at the start to 2.3x by year three, which suggests return per unit of risk is deteriorating. The sector allocation plot shows 38% of the book concentrated in Technology by month 60. Round-trip stats show a 47% win rate but a 1.8 profit factor: wins are slightly less frequent than losses, but the average win is roughly twice the average loss (1.8 × 0.53 / 0.47 ≈ 2.0). That payoff profile depends on the large winners continuing to arrive, and it breaks down when they stop.

The strategy isn't necessarily bad. But the composite view shows it's a Tech-momentum bet that worked in one regime and has let leverage creep upward. That calls for a very different decision than "1.2 Sharpe, ship it." The composite analysis turned a number into an explanation.
