5. Statistical Methodology Failures

Statistics·February 5, 2026·backtesting, bt-series, statistics, multiple-testing, sharpe

5.1 The Multiple Comparisons Problem

Harvey, Liu, and Zhu (2016) demonstrated that the threshold for statistical significance in backtested strategies must be adjusted for the number of strategies tested. Their work, which examined the factor zoo in academic finance, showed that a $t$-statistic of 2.0 (the traditional threshold for significance at the 5% level) is statistically inadequate when hundreds or thousands of strategy variants have been evaluated. They proposed a minimum $t$-statistic of approximately 3.0 for newly discovered factors, accounting for the implicit multiple testing that pervades the field.

This problem is even more acute in retail backtesting, where the number of variations tested is typically far greater than in academic research and almost never reported. A researcher who tests a moving average crossover strategy with twenty combinations of fast and slow periods, across ten instruments, with three different filters, has evaluated six hundred strategy variants. Presenting the best performer as “the strategy” without adjustment for multiple testing is meaningless. Yet this is standard practice.
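The effect of picking the best of six hundred variants can be illustrated with a small simulation. The sketch below is a hypothetical setup, not a model of any real strategy: every variant is pure Gaussian noise with zero mean, the variants are treated as independent (real parameter variants are correlated, which softens the effect), and the one-year sample length and 1% daily volatility are assumptions chosen for illustration.

```python
import random
import statistics


def annualised_sharpe(daily_returns, periods_per_year=252):
    """Annualised Sharpe ratio of a return series (risk-free rate taken as zero)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * periods_per_year ** 0.5


def best_of_noise(n_variants=600, n_days=252, seed=0):
    """Simulate zero-skill variants; return (best, median) in-sample Sharpe."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        # Each variant is pure noise: zero mean, 1% daily volatility (assumed).
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualised_sharpe(returns))
    return max(sharpes), statistics.median(sharpes)


best, median = best_of_noise()
print(f"best of 600 noise variants: {best:.2f}, median: {median:.2f}")
```

The median variant sits near zero, as it should, while the best one looks like a tradeable strategy. Only the best one ever appears in the report.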

Bailey, Borwein, López de Prado, and Zhu (2014) formalised this intuition with what they called the false strategy theorem. The expected maximum Sharpe ratio among $N$ independent zero-skill strategies grows approximately as $\sqrt{2 \ln N}$, measured in standard errors of the Sharpe estimate. On a single year of data, where that standard error is close to 1.0 in annualised terms, trying just ten configurations of a worthless strategy is expected to produce one with a Sharpe ratio above 1.5 in-sample, while the true out-of-sample expectation remains zero. They derived a corresponding minimum backtest length (MinBTL): with five years of daily data, no more than forty-five independent configurations should be tested before the expected maximum Sharpe ratio from pure noise reaches 1.0. With only two years of data, the budget drops to seven. Most retail traders blow past these thresholds before lunch on the first day of development.
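The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses a standard extreme-value approximation for the expected maximum of $N$ independent standard normal draws, and assumes that the annualised Sharpe estimate of a zero-skill strategy has a standard deviation of roughly $1/\sqrt{y}$ over $y$ years. The function names are illustrative only.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_z(n_trials):
    """Approximate E[max] of n_trials independent standard normal draws."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    inv = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
            + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e)))


def expected_max_sharpe(n_trials, years):
    """Expected best annualised in-sample Sharpe among zero-skill trials."""
    # Assumption: under the null, the annualised Sharpe estimate has a
    # standard deviation of about 1 / sqrt(years).
    return expected_max_z(n_trials) / math.sqrt(years)


def min_backtest_length(n_trials, target_sharpe=1.0):
    """Years of data at which noise alone is expected to reach target_sharpe."""
    return (expected_max_z(n_trials) / target_sharpe) ** 2


print(f"10 trials, 1 year : {expected_max_sharpe(10, 1):.2f}")
print(f"45 trials, MinBTL : {min_backtest_length(45):.1f} years")
print(f" 7 trials, MinBTL : {min_backtest_length(7):.1f} years")
```

Under these assumptions, ten trials on one year of data give an expected best Sharpe a little above 1.5, and the backtest lengths for forty-five and seven trials come out close to five and two years respectively, in line with the thresholds quoted above.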

The false strategy theorem describes a search across configurations of a zero-skill process: the inflated Sharpe ratio is pure noise, and the mitigation is the multiple-testing correction the theorem prescribes. The position is materially worse for any strategy whose payoff is sensitive to intra-bar order sequencing. As Section 4.2 sets out, the OHLC path assumption generates a per-trade error that is structurally one-signed in the strategy’s favour, and the per-trade bias grows with the tightness of the bracket relative to typical bar range. A parameter search then preferentially selects configurations with tight stops and tight targets, which is exactly the region of parameter space where the bias is largest. The search is no longer data-mining unbiased noise; it is data-mining a systematically positive artefact. The Bailey and López de Prado correction adjusts for the search; it does not remove a bias that was already present before the search began. For path-sensitive strategies on coarse bar data, the two effects multiply, and no multiple-testing correction alone can recover honest performance metrics from the contaminated simulator.

Wiecki, Campbell, Lent, and Stauth (2016) provided striking empirical confirmation of this problem. Using a dataset of 888 algorithmic trading strategies developed on the Quantopian platform, each with at least six months of out-of-sample performance, they found that commonly reported backtest metrics such as the Sharpe ratio offered almost no predictive value for out-of-sample results ($R^2 < 0.025$). More pointedly, they found a statistically significant positive relationship between the amount of backtesting a researcher performed on a strategy and the magnitude of the discrepancy between in-sample and out-of-sample performance: the more a strategy was tested and refined, the worse it performed in live trading relative to its backtest.

This is the multiple comparisons problem made empirically visible at scale. Each iteration of parameter adjustment constitutes an implicit additional test, inflating in-sample performance while degrading the strategy’s generalisability to unseen data.

The natural response to this evidence is to ask: what can be done? The academic literature offers a family of methods specifically designed to correct for multiple testing in strategy evaluation, yet these tools are almost unknown among retail backtesters.

White’s Reality Check (2000) uses bootstrap resampling to test whether the best-performing strategy from a set of candidates is genuinely superior to a benchmark after accounting for the number of alternatives tested. Hansen (2005) refined this into the Superior Predictive Ability (SPA) test, which is more powerful against specific alternatives and has been extended by Romano and Wolf into stepwise procedures that identify which strategies in a set retain significance after correction, not merely whether any of them do. Bailey and López de Prado (2017) proposed the Probability of Backtest Overfitting (PBO), estimated via Combinatorially Symmetric Cross-Validation (CSCV): the data is partitioned into multiple subsets, strategy parameters are fitted on every possible training combination, and the frequency with which the best in-sample configuration underperforms out-of-sample provides a direct estimate of the probability that the backtest is overfit. Their related work on the Deflated Sharpe Ratio (2014) adjusts the reported Sharpe ratio for the number of trials conducted, non-normality of returns, and sample length, producing a statistic that is far more informative than the raw Sharpe about whether the observed performance is distinguishable from chance.

The common thread is that the number of strategies tested must be treated as a parameter of the evaluation, not an incidental detail to be omitted from the report. Any researcher who has tested more than a handful of variants and does not apply some form of multiple-testing correction is, in effect, reporting the expected maximum of a set of random draws (exactly the quantity the false strategy theorem estimates) and presenting it as an expected value.
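Of the corrections above, the Deflated Sharpe Ratio is the simplest to sketch in code. The version below follows the commonly cited form of the statistic and should be read as an illustration under stated assumptions rather than a reference implementation: all Sharpe inputs are per-period (not annualised), `trial_sharpe_std` is the standard deviation of Sharpe estimates across the trials (approximated here by $1/\sqrt{T}$ as an assumption), and the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def deflated_sharpe_ratio(sharpe, n_obs, n_trials, trial_sharpe_std,
                          skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe beats the best expected from noise.

    sharpe and trial_sharpe_std are per-period values; kurt is raw kurtosis
    (3.0 for a normal distribution).
    """
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    norm = NormalDist()
    # Benchmark: expected maximum Sharpe among n_trials zero-skill trials.
    sharpe_0 = trial_sharpe_std * (
        (1 - EULER_GAMMA) * norm.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * norm.inv_cdf(1 - 1 / (n_trials * math.e))
    )
    # Sharpe standard-error term, adjusted for skewness and kurtosis.
    denom = math.sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    z = (sharpe - sharpe_0) * math.sqrt(n_obs - 1) / denom
    return norm.cdf(z)


# Hypothetical backtest: daily Sharpe of 0.10 over 1,250 bars.
T = 1250
dsr_10 = deflated_sharpe_ratio(0.10, T, 10, 1 / math.sqrt(T))
dsr_600 = deflated_sharpe_ratio(0.10, T, 600, 1 / math.sqrt(T))
print(f"DSR after 10 trials : {dsr_10:.2f}")
print(f"DSR after 600 trials: {dsr_600:.2f}")
```

The same backtest result becomes markedly less convincing as the declared number of trials rises, which is the point: the trial count is an input to the statistic, not a footnote.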

For traders who will not invest the effort to implement the formal corrections above, a useful back-of-envelope heuristic has been proposed in the practitioner literature: discount the reported expected return by a factor of $1 - 0.95^N$, where $N$ is the number of strategy variants tested.¹ The heuristic is crude. It is also vastly better than the standard retail practice of testing 50 variants and reporting the equity curve of the best.
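The heuristic is a one-liner. The sketch below simply evaluates $1 - 0.95^N$ and applies it to a reported return; the function names and the 20% example return are illustrative assumptions.

```python
def heuristic_discount(n_variants, per_test_confidence=0.95):
    """Fraction of the reported return to discard: 1 - 0.95^N."""
    return 1 - per_test_confidence ** n_variants


def discounted_return(reported_return, n_variants):
    """Reported return after applying the back-of-envelope discount."""
    return reported_return * (1 - heuristic_discount(n_variants))


for n in (1, 10, 50):
    print(f"N={n:>2}: discount {heuristic_discount(n):.1%}")

# A hypothetical 20% reported annual return after 50 tested variants.
print(f"adjusted return: {discounted_return(0.20, 50):.2%}")
```

With a single variant the discount is 5%; at fifty variants it exceeds 92%, leaving a reported 20% return at roughly 1.5%.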

5.2 Insufficient Independent Observations

A ten-year daily backtest of a monthly rebalancing strategy produces only one hundred and twenty observations, and fewer if the strategy is not always in the market. The number of truly independent observations may be smaller still if signals are serially correlated, as they often are in trend-following and momentum strategies where a single trend can generate a cluster of correlated trades.
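One rough way to quantify the loss from serial correlation is the effective sample size under a first-order autoregressive model. This is a textbook approximation for the mean of an AR(1) process, not something derived in this paper, and the autocorrelation value of 0.3 below is purely hypothetical.

```python
def monthly_observations(years, rebalances_per_year=12):
    """Number of rebalancing observations in a backtest of the given length."""
    return years * rebalances_per_year


def effective_sample_size(n_obs, autocorr):
    """Approximate independent observations for an AR(1) series."""
    if not -1 < autocorr < 1:
        raise ValueError("autocorr must lie strictly between -1 and 1")
    return n_obs * (1 - autocorr) / (1 + autocorr)


n = monthly_observations(10)
print(f"nominal observations   : {n}")
# Hypothetical lag-1 autocorrelation of 0.3 between successive observations.
print(f"effective observations : {effective_sample_size(n, 0.3):.0f}")
```

Under this assumption the one hundred and twenty nominal observations carry about as much information as sixty-five independent ones.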

The statistical power of a test with one hundred and twenty observations to detect a real but modest edge (say, a Sharpe ratio of 0.5) is discouragingly low. The confidence intervals around estimated performance metrics are wide, and the probability of a truly profitable strategy appearing unprofitable (or vice versa) in a single sample is substantial. Yet backtests are routinely presented with precision (“Sharpe ratio: 1.47”) that implies a level of certainty the data cannot support. Two decimal places do not make a number reliable. Bailey and López de Prado (2014) have argued persuasively that the Sharpe ratio, as commonly estimated from backtest data, is a deeply unreliable measure of forward-looking performance.
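The width of those confidence intervals is easy to compute. The sketch below uses the standard large-sample approximation for the standard error of a Sharpe ratio, $\sqrt{(1 + \tfrac{1}{2} SR^2)/T}$ in per-period units, which assumes independent and normally distributed returns; real return series violate both assumptions, so the true interval is likely wider still.

```python
import math
from statistics import NormalDist


def sharpe_confidence_interval(annual_sharpe, n_obs, periods_per_year=12,
                               confidence=0.95):
    """Approximate confidence interval for an annualised Sharpe ratio."""
    scale = math.sqrt(periods_per_year)
    sr = annual_sharpe / scale                    # per-period Sharpe
    se = math.sqrt((1 + 0.5 * sr ** 2) / n_obs)   # iid-normal approximation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (sr - z * se) * scale, (sr + z * se) * scale


low, high = sharpe_confidence_interval(0.5, 120)
print(f"Sharpe 0.50, 120 monthly obs: 95% CI [{low:.2f}, {high:.2f}]")
```

For an estimated Sharpe of 0.5 on one hundred and twenty monthly observations the interval runs from roughly -0.12 to 1.12. It includes zero, so the sample cannot distinguish a modest real edge from no edge at all.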

5.3 Regime Dependence and Non-Stationarity

Financial markets are non-stationary systems. The statistical properties of returns (their mean, variance, autocorrelation structure, and tail behaviour) change over time as market structure, participant composition, regulatory frameworks, and macroeconomic conditions evolve. A strategy optimised on a period dominated by a particular regime (low-volatility trending markets, for example) may perform entirely differently in an alternative regime (choppy, mean-reverting markets with elevated volatility). A backtest that spans multiple regimes may show acceptable aggregate performance while concealing extended periods of severe underperformance. The aggregate hides the pain.
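A per-regime breakdown makes the point concrete. The sketch below assumes regime labels are already available for each bar (how to obtain them is outside the scope of this section) and uses a small synthetic return series constructed purely for illustration, not taken from any market.

```python
import statistics
from collections import defaultdict


def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio (risk-free rate taken as zero)."""
    return (statistics.fmean(returns) / statistics.stdev(returns)
            * periods_per_year ** 0.5)


def sharpe_by_regime(returns, regimes):
    """Group returns by regime label and compute a Sharpe ratio for each."""
    grouped = defaultdict(list)
    for ret, label in zip(returns, regimes):
        grouped[label].append(ret)
    return {label: annualised_sharpe(rets) for label, rets in grouped.items()}


# Synthetic data: 300 profitable "trending" bars, then 100 losing "choppy" bars.
returns = [0.004, -0.001] * 150 + [0.003, -0.005] * 50
regimes = ["trending"] * 300 + ["choppy"] * 100

aggregate = annualised_sharpe(returns)
per_regime = sharpe_by_regime(returns, regimes)
print(f"aggregate Sharpe: {aggregate:.2f}")
for label, value in per_regime.items():
    print(f"{label:>9}: {value:.2f}")
```

The aggregate figure is positive even though the strategy loses money throughout the choppy regime. Reporting only the aggregate discards exactly the information a trader needs in order to decide whether the drawdown can be survived.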

The failure to account for regime dependence is related to the broader problem of overfitting. A strategy with sufficient free parameters can be fitted to any historical data set, including one containing multiple regimes, but the fitted model captures the specific sequence of regimes in the sample rather than any stable underlying relationship. Walk-forward analysis and out-of-sample testing can mitigate this problem but do not eliminate it, particularly when the out-of-sample period is short relative to the regime cycle. Worse, these techniques can only validate a strategy against conditions that have already occurred in the historical record; they offer no protection against structural changes or unprecedented market events. The limitations of these widely trusted validation methods are examined in the following section.

Cryptocurrency markets illustrate regime dependence in an especially stark form. Regulatory changes in crypto can be abrupt and binary: instruments declared securities retroactively, exchanges banned from entire jurisdictions overnight, leverage limits imposed with little warning. A crypto backtest spanning 2019 to 2025 assumes a degree of regulatory continuity that simply did not exist during that period. More broadly, crypto amplifies virtually every failure mode discussed in this section. Regime shifts are faster, leverage is higher, liquidity cliffs are more extreme, and the institutional structure of the market itself is less stable than in any traditional asset class. If the arguments in this paper apply to regulated futures and equities, they apply to crypto with considerably greater force.

5.4 The Parameter Plateau Illusion

A widely taught principle of strategy optimisation is that robust strategies should exhibit “stable regions” or “plateaux” in their parameter space: zones where performance degrades only gradually as parameters are varied, in contrast to narrow spikes where a single parameter value produces good results but neighbouring values do not. The intuition is sound: a strategy whose performance is insensitive to small parameter changes is more likely to generalise than one whose performance depends on a precise setting. However, the standard method of identifying these plateaux contains a subtle mathematical artefact that is almost never discussed.

Consider a moving average crossover strategy optimised over a lookback range of 10 to 200 bars in steps of 10. On the surface, each step represents an equal increment: 10 additional bars of data. But the proportional change in the information available to the indicator varies enormously across the range. Stepping from 10 bars to 20 bars doubles the lookback window, a 100% increase in the data the indicator considers. Stepping from 190 bars to 200 bars similarly adds 10 bars, but this represents only a 5.3% increase in the data available to the indicator. The indicator’s output changes far less in response to a 5% perturbation than to a 100% perturbation, and this is true regardless of whether the strategy has captured a genuine signal or has merely fitted to noise.

The consequence is that parameter plateaux are mathematically expected at the upper end of any lookback range, even in the complete absence of a real edge. An indicator fitted to noise at a lookback of 190 bars will produce nearly identical output at 200 bars, because the two windows share approximately 95% of their data. The apparent stability is not evidence of robustness; it is an artefact of the diminishing marginal information content of each additional bar. Conversely, the lower end of the range, where each step represents a large proportional change, will naturally exhibit greater variability, which may be misinterpreted as fragility even if the strategy does capture a genuine short-lookback effect.
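The artefact can be demonstrated directly on pure noise. For a simple moving average applied to independent noise, the correlation between the outputs at lookbacks $n < m$ is $\sqrt{n/m}$, since the two windows share $n$ bars: about 0.71 for 10 versus 20 bars and about 0.97 for 190 versus 200. The simulation below, a minimal sketch using Gaussian noise with an arbitrary seed, checks this empirically.

```python
import math
import random


def sma(series, window):
    """Simple moving average; one value per bar once the window is full."""
    running = sum(series[:window])
    out = [running / window]
    for i in range(window, len(series)):
        running += series[i] - series[i - window]
        out.append(running / window)
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sma_correlation(series, short, long):
    """Correlation between two SMAs, aligned on the same bars."""
    a = sma(series, short)[long - short:]  # drop bars before the long SMA starts
    b = sma(series, long)
    return pearson(a, b)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # no signal at all

low_end = sma_correlation(noise, 10, 20)
high_end = sma_correlation(noise, 190, 200)
print(f"SMA(10) vs SMA(20)  : {low_end:.2f}  (theory {math.sqrt(10 / 20):.2f})")
print(f"SMA(190) vs SMA(200): {high_end:.2f}  (theory {math.sqrt(190 / 200):.2f})")
```

The input contains no edge whatsoever, yet neighbouring lookbacks at the top of the range agree almost perfectly. Any performance metric computed from those indicators inherits the same apparent stability.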

This artefact has practical consequences for strategy selection. A researcher who scans a broad parameter range and selects the plateau region is systematically biased toward longer lookback values, which may correspond to slower, more smoothed versions of the strategy that appear robust in optimisation but are simply insensitive to parameter perturbation because each step changes so little. The correct approach is to evaluate parameter sensitivity on a proportional basis (for example, testing lookbacks of 10, 15, 22, 33, 50, 75, 112, 168, each approximately 50% larger than the last) so that each step represents a comparable change in the information available to the indicator. Very few retail traders or backtesting tutorials employ this technique; the standard linear parameter sweep, with its built-in bias toward apparent stability at longer lookbacks, remains the default.
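A proportional grid is simple to generate. The sketch below multiplies by a fixed ratio and rounds to whole bars, so its values differ by a bar or two from the hand-rounded list above; the function name and defaults are illustrative.

```python
def geometric_lookbacks(start=10, stop=200, ratio=1.5):
    """Lookback grid in which each step is a fixed proportional increase."""
    grid = []
    value = float(start)
    while value <= stop:
        lookback = round(value)
        # Skip duplicates that rounding can create at small lookbacks.
        if not grid or lookback > grid[-1]:
            grid.append(lookback)
        value *= ratio
    return grid


def proportional_steps(grid):
    """Percentage change between successive grid values."""
    return [100 * (b - a) / a for a, b in zip(grid, grid[1:])]


linear = list(range(10, 201, 10))
geometric = geometric_lookbacks()

print("geometric grid:", geometric)
print("linear steps   : first {:.0f}%, last {:.1f}%".format(
    proportional_steps(linear)[0], proportional_steps(linear)[-1]))
print("geometric steps:", [f"{s:.0f}%" for s in proportional_steps(geometric)])
```

The linear sweep uses twenty evaluations whose proportional step shrinks from 100% to about 5%, whereas the geometric sweep covers the same range in eight evaluations with every step close to 50%. A plateau that survives on the geometric grid says something about the strategy; one that appears only on the linear grid may say something only about the grid.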

¹ The heuristic is from a Quantreo newsletter post on the multiple-testing problem (see references). It falls well short of the formal corrections such as Reality Check or PBO in statistical power. It does at least force the researcher to acknowledge that testing 50 variants of an idea is a different statistical exercise from testing one. At $N=50$, the heuristic discounts the reported return by roughly 92%.