11. Conclusion


The current state of retail backtesting practice is characterised by a significant gap between the apparent sophistication of the tools and the actual rigour of the analysis they produce. The Python ecosystem has made it easy to generate a backtested equity curve, but the resulting curves are frequently misleading.

The errors are not random. They are systematically biased toward overstating performance, and they come from several sources: naive fill assumptions that ignore adverse selection and limit halts; understated transaction costs, including roll friction and the dynamic behaviour of margin; contaminated data that diverges between vendors and is silently rewritten over time; unmodelled exchange rate effects on cross-currency positions; the implicit assumption that the broker and exchange are always available; and the fundamental inability of coarse bar data to resolve intraday order execution. Collectively, these produce results that are more optimistic than achievable reality.

There is also a class of failure that operates on the reporting side of the backtest rather than the simulation side. Pre-tax returns can overstate the after-tax outcome by 30% or more for high-turnover strategies in a high marginal bracket. A nominal-currency equity curve says little about the home-currency balance the trader actually holds. Counterparty risk is silently assumed away by every backtest that does not apply a haircut for the historical base rate of venue failure. These gaps do not require a more sophisticated simulation engine to address. They require the researcher to acknowledge that the backtest output is a different object from the live trader’s experience, and to translate from one to the other.
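As a rough illustration of that translation step, the sketch below converts a backtested annual return into an after-tax figure and then applies an expected-value haircut for venue failure. Everything in it is an assumption made for illustration: the function names, the flat 25% tax rate on gains, the 1% annual probability of venue failure, and the total loss on failure are hypothetical placeholders, not estimates.

```python
def after_tax_return(pre_tax_return, tax_rate):
    """After-tax return under a flat tax on gains (losses are left untaxed)."""
    if pre_tax_return <= 0:
        return pre_tax_return
    return pre_tax_return * (1.0 - tax_rate)


def haircut_for_venue_failure(annual_return, failure_prob, loss_given_failure=1.0):
    """Expected annual return once a venue-failure scenario is weighted in."""
    survive = (1.0 - failure_prob) * (1.0 + annual_return)
    fail = failure_prob * (1.0 + annual_return) * (1.0 - loss_given_failure)
    return survive + fail - 1.0


pre_tax = 0.20  # hypothetical backtested annual return
after_tax = after_tax_return(pre_tax, tax_rate=0.25)
# Fraction by which the pre-tax figure flatters the after-tax one.
overstatement = pre_tax / after_tax - 1.0
# Hypothetical 1% annual chance of losing the whole balance at the venue.
adjusted = haircut_for_venue_failure(after_tax, failure_prob=0.01)
```

With these placeholder inputs the pre-tax figure overstates the after-tax one by roughly a third, before the counterparty haircut is applied. A real calculation would depend on jurisdiction, holding period, and the treatment of losses, none of which this sketch attempts to model.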

The bar resolution problem deserves particular emphasis because it is so widely underestimated. The intuition that longer-term strategies are immune to intraday data requirements is not entirely wrong (strategies can be designed to work within the constraints of daily data) but it is seriously incomplete. Any strategy that interacts with the market through price-contingent orders, which includes virtually all strategies that employ risk management, is subject to the OHLC sequence ambiguity and the stop-loss concealment problem described in Section 4. The severity of these problems scales directly with bar size: negligible at one-minute resolution, modest at hourly resolution, and potentially disqualifying at the daily level. Daily data for such strategies does not merely introduce noise; it introduces a directional bias that systematically flatters performance. The trader who uses daily data responsibly must design their strategy to avoid dependence on intra-bar order resolution, and must acknowledge the resulting limitations in their performance claims.
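The sequence ambiguity can be made concrete with a small sketch. It assumes a long position protected by a stop and a profit target, and a bar stored as a plain dictionary; the function names, the bar layout, and the prices are hypothetical and chosen only for illustration. When a single bar's range contains both levels, the bar alone cannot say which traded first, and the backtest's result depends entirely on the assumption the engine makes.

```python
def bracket_outcome(bar, stop, target):
    """Classify how a long position's stop and target interact with one OHLC bar.

    Returns 'stop', 'target', 'ambiguous' or 'none'.
    """
    # A gap through either level at the open is unambiguous: the open trades first.
    if bar["open"] <= stop:
        return "stop"
    if bar["open"] >= target:
        return "target"
    hit_stop = bar["low"] <= stop
    hit_target = bar["high"] >= target
    if hit_stop and hit_target:
        return "ambiguous"  # the bar cannot say which level traded first
    if hit_stop:
        return "stop"
    if hit_target:
        return "target"
    return "none"


def exit_price(bar, stop, target, optimistic):
    """Exit price under an optimistic or pessimistic reading of an ambiguous bar."""
    outcome = bracket_outcome(bar, stop, target)
    if outcome == "ambiguous":
        outcome = "target" if optimistic else "stop"
    if outcome == "stop":
        return min(stop, bar["open"])    # a gap down fills at the worse open
    if outcome == "target":
        return max(target, bar["open"])  # a gap up fills at the better open
    return None  # neither level was touched; the position stays open


# Hypothetical daily bar whose range spans both the stop (95) and the target (105).
daily_bar = {"open": 100.0, "high": 106.0, "low": 94.0, "close": 101.0}
flattering = exit_price(daily_bar, stop=95.0, target=105.0, optimistic=True)
conservative = exit_price(daily_bar, stop=95.0, target=105.0, optimistic=False)
```

On this bar the two readings produce exits of 105 and 95 from the same data. An engine that resolves such bars in the trade's favour introduces exactly the directional bias described above, whereas a finer bar size would usually show which level traded first.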

Layered atop these mechanical problems are statistical methodology failures that afflict even technically correct backtests: insufficient sample sizes, uncontrolled multiple testing, and the failure to account for regime dependence. The survivorship bias inherent in selecting apparently successful strategies from a large universe of tested variants (whether that universe is generated deliberately through combinatorial search or implicitly through iterative refinement) further inflates reported performance. The combination of mechanical and statistical errors means that a backtest must clear a very high bar before it constitutes meaningful evidence of a tradable edge.
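The selection effect is easy to reproduce. The sketch below generates a number of strategy variants whose returns are pure noise with zero mean, so none has any edge by construction, and then reports the best per-period Sharpe ratio among them. The function names, the Gaussian return model, the 1% per-period volatility, and the variant counts are all assumptions made for illustration.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation), not annualised."""
    return statistics.mean(returns) / statistics.stdev(returns)


def best_of_null_variants(n_variants, n_periods, seed=0):
    """Simulate variants with no edge; return (best Sharpe, average Sharpe)."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        # Zero-mean noise: any apparent performance is luck by construction.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        sharpes.append(sharpe(returns))
    return max(sharpes), statistics.mean(sharpes)


best_20, avg_20 = best_of_null_variants(n_variants=20, n_periods=252)
best_200, avg_200 = best_of_null_variants(n_variants=200, n_periods=252)
```

The average Sharpe ratio across variants sits near zero, as it should, while the best one is positive and does not shrink as more variants are tried. Reporting only the winner, without disclosing how many variants were examined, presents luck as edge.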

The robustness testing methods commonly applied to address these concerns (Monte Carlo trade shuffling, synthetic data generation, and walk-forward analysis) provide less protection than is generally assumed. Monte Carlo trade shuffling violates the independence assumption for any strategy with path-dependent position sizing or portfolio-level risk filters, producing confidence intervals derived from trade sequences that could never have occurred. Synthetic data methods either destroy the statistical properties the strategy was designed to exploit or test generalisation within the same distribution rather than across genuinely novel conditions. Walk-forward analysis, while the most methodologically sound of the three and an essential minimum standard, is vulnerable to meta-overfitting and can only validate against conditions present in the historical record. The trader who treats a successful walk-forward pass as conclusive evidence of robustness has confused a necessary condition with a sufficient one.
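The objection to trade shuffling can be shown with a toy example. The sketch assumes a hypothetical sizing rule that halves position size whenever equity is more than 10% below its peak; the rule, the function name, and the trade returns are invented for illustration. Shuffling the realised trade results treats them as fixed, whereas under the rule each result depends on where in the sequence the trade occurred.

```python
def replay(unit_returns, dd_limit=0.10):
    """Replay full-size trade returns through a drawdown-sensitive sizing rule.

    Size is halved while equity is more than dd_limit below its running peak.
    Returns the realised per-trade returns as fractions of equity.
    """
    equity = peak = 1.0
    realised = []
    for r in unit_returns:
        size = 0.5 if equity < peak * (1.0 - dd_limit) else 1.0
        pnl = size * r
        realised.append(pnl)
        equity *= 1.0 + pnl
        peak = max(peak, equity)
    return realised


# Hypothetical full-size trade returns in their historical order.
unit_returns = [-0.06, -0.06, 0.08, 0.08, -0.06, 0.08]
historical = replay(unit_returns)

# One reordering a shuffle could draw: the reverse of the historical order.
order = list(reversed(range(len(unit_returns))))

# Naive Monte Carlo: permute the realised results as if they were fixed.
naive = [historical[i] for i in order]

# Consistent alternative: permute the full-size returns, then reapply the rule.
consistent = replay([unit_returns[i] for i in order])
```

In the historical order, the third trade is taken at half size because the account is in drawdown, so its realised return is 0.04 rather than 0.08. The naive reordering carries that half-size result into a position where the rule would have traded full size, producing a sequence the strategy could never have generated. Replaying the rule avoids that particular inconsistency, although it still assumes the underlying trades themselves are independent of their order.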

Beyond the technical deficiencies, the ecosystem surrounding retail backtesting contributes to poor outcomes in ways that are less obvious but equally consequential. The self-reinforcing dominance of Python channels traders toward a particular set of tools and conventions that, while accessible, encourage methodological homogeneity and obscure the disciplines of risk management and portfolio construction that are essential to converting a trading signal into a viable system. The novice trader who emerges from this ecosystem is equipped with the ability to generate impressive-looking equity curves but is critically underprepared for the realities of live trading, including the psychological challenge of enduring drawdowns that are almost certainly deeper than those shown in the backtest.

I am not arguing that Python (or any other language) is without value for backtesting. Python remains an excellent tool for exploratory research and the preliminary stages of strategy development. Nor am I arguing that daily bar data is inherently unusable; strategies can be designed around the constraints of coarse bars, provided those constraints are understood and respected. But I do believe that easy access to a programming language and its trading libraries is sending many aspiring traders down a disappointing path, because it encourages them to believe that the programming is the hard part and the trading will follow naturally.

The reality is the reverse.

Programming is the accessible part. The trading competencies that determine success (risk management, statistical literacy, cost awareness, psychological discipline, and the judgment to distinguish a genuine edge from a statistical artefact) are far harder to acquire.

A carefully constructed simulation, built on high-quality data with realistic execution modelling and rigorous statistical methodology, remains an essential tool for strategy development.

But the distance between a credible backtest and what is commonly produced is enormous. The ease with which the latter is generated has created a false sense of confidence that is, in aggregate, likely destroying more capital than it creates.