10. Proposed Minimum Standards for Credible Backtesting

Based on the foregoing analysis, I propose the following minimum standards for a backtest to be considered credible evidence of a potentially viable trading strategy. These standards are not sufficient to guarantee forward profitability, but their absence should be treated as a strong signal that reported results are unreliable.

| Standard | Requirement summary |
|---|---|
| Data quality | Documented source; multi-vendor cross-check on intraday extremes; archived dataset snapshots; specified continuous contract methodology; equity adjustment treatment (price-only versus total-return) declared explicitly |
| Execution model | Empirical spreads; volatility-dependent slippage; limit order queue modelling; explicit handling of limit halts, circuit breakers, and session halts |
| Transaction costs | All material costs: commissions, spreads, roll costs, financing, market impact; gross and net returns reported side by side |
| Currency effects | FX on P&L, margin collateral, and interest differentials; FX applied dynamically across the sample (not only at exit); report in base currency |
| Margin dynamics | Time-varying margin schedule modelled where possible; sensitivity to percentage-margin spikes during stress periods; combined effect of percentage and notional changes acknowledged |
| Counterparty risk | Venue failure base rate acknowledged; haircut applied to long-horizon crypto returns; stablecoin peg exposure flagged |
| Tax reporting | Pre-tax and after-tax returns reported side by side; holding-period regime declared; wash-sale equivalents accounted for |
| Intraday resolution | Intraday data for any strategy using path-dependent orders |
| Statistical rigour | Confidence intervals; multiple-comparison correction (Reality Check, SPA, DSR, PBO); pre-defined OOS period |
| Regime analysis | Performance by regime; drawdown depth and duration contextualised |
| Risk management | Position sizing; risk limits; portfolio interaction analysis |
| Robustness validation | Walk-forward analysis with disclosed window configurations; limitations acknowledged |
| Capacity assessment | Estimated maximum notional before edge degradation; impact model at scale |
| Process discipline | Defined pipeline; robustness tests aligned to strategy class |

Each standard is expanded below.

  • Data quality: The data source must be documented and its known limitations disclosed. Adjustments for erroneous prints, phantom bars, missing sessions, and corporate actions must be described. Commercially sourced data must not be assumed to be clean; independent spot-checks against exchange records or alternative sources should be performed, particularly for older historical periods. For strategies sensitive to intraday extremes, the same strategy should be re-run against at least one independent vendor’s data to detect pipeline-specific artefacts. Datasets should be archived at the time the backtest is run, because vendors silently rewrite their historical records and re-running months later can produce different results from the same code. For equity strategies, the treatment of dividend and split adjustments (price-only versus total-return) must be stated explicitly, and absolute-level logic must use unadjusted prices. For futures data, the continuous contract construction methodology must be specified, and the historical trading venue regime (pit, mixed, electronic) must be identified and its implications for strategy applicability discussed.

  • Execution model: The backtest must employ a fill model that accounts for bid-ask spread (using empirical spread data where available, not a fixed assumption), slippage as a function of order size and market volatility, realistic order sequencing, and the adverse selection inherent in limit order fills. Strategies using limit orders must model queue position or, at minimum, require price to trade through the limit level by a specified buffer before assuming a fill. The model must also handle limit-locked sessions, circuit breakers, and single-stock trading halts, rather than assuming the trader could have executed at the printed price during any such period.

  • Transaction costs: All material costs must be included: commissions, exchange fees, spread costs, roll costs for futures (including calendar spread crossing costs and roll-window market impact), financing costs for leveraged positions, and an estimate of market impact for strategies intended to operate at scale. Gross and net returns should be reported side by side at every stage of the development pipeline; the gap between them is a direct measure of how much of the apparent edge depends on optimistic cost assumptions.

  • Currency effects: For any strategy trading instruments denominated in a currency other than the account’s base currency, the backtest must model exchange rate effects on trade-level profit and loss, on the value of margin collateral held in foreign currency, and on interest accrual differentials between the domestic and foreign currency. The FX overlay must be applied dynamically across the sample (at every trade, every margin revaluation, and every interest accrual), not as a single conversion at the end. Performance metrics must be reported in the trader’s actual base currency, not in the instrument’s native currency.

  • Margin dynamics: For any leveraged strategy, the backtest should model the time-varying behaviour of margin requirements rather than assume a fixed schedule. The combined effect of percentage-margin changes and notional drift must be acknowledged. Where the actual historical margin schedule is not available, sensitivity to percentage-margin spikes (doubling or tripling during stress periods) should be tested and reported.

  • Counterparty risk: For long-horizon backtests, the implicit assumption that the broker and exchange remained solvent and accessible should be made explicit. For crypto in particular, a haircut reflecting the historical base rate of venue failure should be applied, or the strategy universe should be restricted to venues with genuine segregation and audited reserves. Stablecoin peg exposure should be flagged as a credit and reserve assumption.

  • Tax reporting: Backtests must report pre-tax and after-tax returns side by side, with the holding-period regime declared and the trader’s actual marginal rates plugged in. Wash-sale equivalents and the differential treatment of futures versus securities must be accounted for where relevant. Strategies of very different turnover cannot be compared on pre-tax returns alone.

  • Intraday resolution for path-dependent orders: Any strategy that uses stop-losses, profit targets, trailing stops, or other intraday-sensitive order types must be tested with intraday data of sufficient resolution to determine order fill sequence. Daily OHLC data is insufficient for this purpose regardless of the strategy’s rebalancing frequency.

  • Statistical rigour: Results must be reported with confidence intervals. The number of strategy variants tested must be disclosed, and significance thresholds must be adjusted for multiple comparisons using established methods such as White’s Reality Check, Hansen’s Superior Predictive Ability test, or the Deflated Sharpe Ratio. Where feasible, the Probability of Backtest Overfitting should be estimated via Combinatorially Symmetric Cross-Validation. Out-of-sample results must be reported alongside in-sample results, with the out-of-sample period defined prior to optimisation.

  • Regime analysis: Performance must be decomposed by market regime, with explicit identification of the conditions under which the strategy is expected to underperform. Maximum drawdown and drawdown duration must be reported and contextualised.

  • Risk management and portfolio context: The strategy must be presented with a defined position sizing methodology, explicit risk limits, and an analysis of how it interacts with other portfolio components. Standalone signal evaluation without risk management context is insufficient.

  • Robustness validation: The strategy must be validated using walk-forward analysis with fully disclosed window configurations, fitness functions, and the number of configurations tested. The walk-forward results must be presented as a necessary filter rather than sufficient proof of robustness. If Monte Carlo analysis or synthetic data testing is employed, the assumptions underlying the method must be stated and their applicability to the specific strategy class must be justified. Claims of robustness derived from trade-shuffling Monte Carlo applied to strategies with dynamic position sizing, portfolio-level risk filters, or pattern-dependent entries should be treated with particular scepticism, as described in Section 6.

  • Capacity assessment: The backtest must include an estimate of the maximum notional the strategy can deploy before its edge is degraded by market impact. This requires, at minimum, comparing typical order sizes to historical volume and depth-of-book data for the traded instruments. A strategy that is profitable at one contract but whose returns collapse at ten contracts has a capacity constraint that must be disclosed; for practical purposes, that constraint is the binding limit on the strategy’s economic value. Market impact is both temporary (price recovers after the order is absorbed) and permanent (the information content of the order shifts the equilibrium price), and strategies with high turnover or concentrated execution windows are disproportionately affected. The absence of any capacity estimate renders a backtest incomplete: the strategy may be valid but untradeable at the scale required to justify the infrastructure investment.

  • Process discipline: The strategy creation and testing process should follow a defined, repeatable pipeline that enforces a scientific and statistically valid methodology. This pipeline should specify the sequence of steps from hypothesis formulation through data preparation, in-sample fitting, out-of-sample validation, and robustness testing, with each step’s criteria defined in advance rather than adjusted after the fact. The robustness tests and evaluation metrics employed must be appropriate to the specific class of strategy being tested and aligned with what the researcher is attempting to prove or disprove. A trend-following strategy, for example, demands different robustness criteria than a mean-reversion strategy, and a strategy intended for a single instrument requires different validation than one designed for a diversified portfolio. The absence of a structured evaluation process is itself a red flag: if the researcher cannot describe the pipeline that produced their results, the results should be treated with corresponding scepticism.
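
The trade-through requirement in the execution model standard can be sketched as a conservative fill rule. This is a minimal illustration, not a full queue model: the `Bar` type, the `limit_order_filled` function, and the `buffer` parameter are hypothetical names introduced here, and the size of the buffer (for example, one tick) is an assumption the researcher would need to justify for the instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """One price bar; only the extremes matter for this check (hypothetical type)."""
    high: float
    low: float


def limit_order_filled(side: str, limit_price: float, bar: Bar, buffer: float) -> bool:
    """Return True only if price traded through the limit by at least `buffer`.

    With a positive buffer, a mere touch of the limit level is not counted as
    a fill, because a resting order at that price may still have been behind
    other orders in the queue.
    """
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    if side == "buy":
        # A buy limit rests below the market: the low must go beyond it.
        return bar.low <= limit_price - buffer
    if side == "sell":
        # A sell limit rests above the market: the high must go beyond it.
        return bar.high >= limit_price + buffer
    raise ValueError("side must be 'buy' or 'sell'")


if __name__ == "__main__":
    bar = Bar(high=101.0, low=100.0)
    # Price only touched 100.0, so a buy limit at 100.0 is not assumed filled.
    print(limit_order_filled("buy", 100.0, bar, buffer=0.25))
    # A buy limit at 100.25 was traded through by the full buffer.
    print(limit_order_filled("buy", 100.25, bar, buffer=0.25))
```

Setting `buffer` to zero reproduces the optimistic touch-equals-fill assumption, which makes it easy to report how much of the result depends on that choice.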
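
To see why the currency effects standard asks for FX to be applied dynamically, the following sketch compares converting each trade's P&L at the rate prevailing when it was realised against a single conversion at the final rate. The function names and the example figures are illustrative assumptions, and the sketch covers trade-level P&L only; margin collateral revaluation and interest differentials would need the same treatment.

```python
def pnl_in_base_dynamic(trades):
    """Convert each trade's P&L at the FX rate prevailing when it was realised.

    `trades` is an iterable of (pnl_foreign, fx_rate) pairs, where fx_rate is
    expressed as units of base currency per unit of foreign currency.
    """
    return sum(pnl * fx for pnl, fx in trades)


def pnl_in_base_at_end(trades, final_fx):
    """Sum P&L in the foreign currency and convert once at the final rate."""
    return sum(pnl for pnl, _ in trades) * final_fx


if __name__ == "__main__":
    # Illustrative figures only: a gain realised while the foreign currency
    # was strong, then an equal loss realised after it had weakened.
    trades = [(1000.0, 0.80), (-1000.0, 0.50)]
    print(pnl_in_base_dynamic(trades))       # gain and loss converted at different rates
    print(pnl_in_base_at_end(trades, 0.50))  # flat in the instrument's native currency
```

In this made-up example the strategy is flat in the instrument's native currency but not in the base currency, which is the kind of discrepancy a single end-of-sample conversion hides.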
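
The intraday resolution standard rests on the fact that a single OHLC bar cannot say whether the high or the low came first. A minimal sketch of that check for a long position follows; `classify_long_exit` is a hypothetical helper, and the gap rule (the open is the first print of the bar, so an open beyond either level resolves the ordering) is an assumption of this illustration.

```python
def classify_long_exit(open_, high, low, stop, target):
    """Classify what one OHLC bar reveals about a long position's exit.

    Returns "stop", "target", "none", or "ambiguous". The last value means
    both levels were inside the bar's range and the bar alone cannot tell
    which was reached first, so finer-grained data is required.
    """
    if not stop < target:
        raise ValueError("expected stop < target for a long position")
    # The open is the first price of the bar, so a gap resolves the ordering.
    if open_ <= stop:
        return "stop"
    if open_ >= target:
        return "target"
    hit_stop = low <= stop
    hit_target = high >= target
    if hit_stop and hit_target:
        return "ambiguous"
    if hit_stop:
        return "stop"
    if hit_target:
        return "target"
    return "none"


if __name__ == "__main__":
    # Both the stop (96) and the target (104) lie inside the bar's range.
    print(classify_long_exit(open_=100, high=105, low=95, stop=96, target=104))
```

Counting the share of bars classified as "ambiguous" gives a rough measure of how much a daily-bar backtest depends on an arbitrary ordering assumption.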
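
For the regime analysis standard, drawdown depth and duration can be computed together from an equity curve. The sketch below is one simple convention among several: depth is measured as a fraction of the running peak, duration is counted in observations spent below a prior peak, and equity values are assumed to be positive. The function name `max_drawdown` is introduced here for illustration.

```python
def max_drawdown(equity):
    """Return (max_depth, max_duration) for a sequence of equity values.

    max_depth is the largest peak-to-trough decline as a fraction of the peak.
    max_duration is the longest run of consecutive observations spent below a
    previously reached peak, measured in periods.
    """
    if not equity:
        raise ValueError("equity curve must not be empty")
    peak = equity[0]
    max_depth = 0.0
    duration = 0
    max_duration = 0
    for value in equity:
        if value >= peak:
            # New (or equal) high-water mark: the drawdown, if any, has ended.
            peak = value
            duration = 0
        else:
            duration += 1
            max_depth = max(max_depth, (peak - value) / peak)
            max_duration = max(max_duration, duration)
    return max_depth, max_duration


if __name__ == "__main__":
    curve = [100, 120, 90, 95, 130, 125]
    depth, duration = max_drawdown(curve)
    print(depth, duration)  # 0.25 2
```

Running the same function on the equity curve restricted to each regime gives the per-regime figures the standard asks for.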
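
Disclosing the walk-forward window configuration, as the robustness validation standard requires, is easier when the windows are generated by an explicit function. The following is a sketch of a rolling (fixed-length) scheme using half-open index ranges; `walk_forward_windows` and its parameters are hypothetical, and an anchored (expanding) scheme would be an equally valid choice.

```python
def walk_forward_windows(n_obs, train_size, test_size, step=None):
    """Return rolling walk-forward windows as index tuples.

    Each tuple is (train_start, train_end, test_start, test_end), with
    half-open ranges, so the test window begins exactly where the training
    window ends and the two never overlap. `step` defaults to `test_size`,
    which makes consecutive test windows contiguous.
    """
    if step is None:
        step = test_size
    if min(train_size, test_size, step) <= 0:
        raise ValueError("train_size, test_size and step must be positive")
    windows = []
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        windows.append((start, train_end, train_end, train_end + test_size))
        start += step
    return windows


if __name__ == "__main__":
    for window in walk_forward_windows(n_obs=10, train_size=4, test_size=2):
        print(window)
    # (0, 4, 4, 6)
    # (2, 6, 6, 8)
    # (4, 8, 8, 10)
```

Reporting the arguments passed to such a function, together with the number of argument combinations tried, covers the disclosure the standard describes.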