2. Data Quality: The Contaminated Foundation

Data · February 2, 2026 · backtesting, bt-series, data-quality, vendor-data

2.1 The Taxonomy of Data Errors

A backtest is only as reliable as the data on which it is built, and the quality of freely available and even commercially distributed historical market data is far worse than most traders realise. Data quality issues fall into several categories, each capable of producing misleading results.

Erroneous prints: A single bad tick, a trade printed at a price far from the prevailing market, whether due to a fat-finger error, an exchange reporting glitch, or a data vendor processing failure, can generate an apparent trading opportunity with outsized returns that never actually existed. In tick or minute-level data, such errors may be individually identifiable, but in daily data they can distort the reported high, low, or settlement price without any obvious indication of error. A daily bar showing an anomalous low, for example, could trigger a limit-order entry in a mean-reversion system and subsequently show a profitable exit, when in reality the print that generated the entry signal was erroneous and no fill would have occurred.
Timestamp errors: Markets operate across time zones with varying session times, and data vendors are not always consistent in how they handle session boundaries, daylight saving transitions, or the distinction between electronic and pit trading sessions. A strategy that relies on the timing of price action (for example, a breakout of the prior session high) can produce very different results depending on whether the data vendor defines “session” consistently. Compounding this problem, different approaches to timestamping can be found even within data sets from the same provider. One product may deliver timestamps in UTC, while another is localised to the exchange’s native time zone, and a third uses an arbitrary reference time zone defined by the vendor. The researcher who loads data from multiple instruments or multiple products without verifying the timestamp convention of each is silently mixing time zones, a class of error that will not produce obviously wrong prices but will misalign bars across instruments and corrupt any cross-market signal. It would not be unreasonable to assume that, in 2026, the timestamping of financial data is a solved problem. Data providers seem almost to go out of their way to demonstrate that it is not.
Missing data: Gaps in the record, whether from holidays, half-day sessions, exchange outages, or vendor processing failures, can introduce artefacts into any indicator that relies on a continuous time series. Moving averages and volatility estimates, among others, are sensitive to missing observations.
Phantom data: The inverse of missing data, namely bars or ticks reported for periods in which no trading actually occurred. This can arise from vendor systems that generate synthetic bars to fill gaps in the record, from incorrect session boundary definitions that attribute activity to the wrong date or session, or from exchange-reported indicative prices being treated as actual trades. A strategy backtested over phantom bars will generate signals and simulated fills at prices that were never available in the market. Unlike erroneous prints, which at least reflect a real (if incorrect) data point, phantom data fabricates trading activity where none existed, and the resulting backtest entries and exits are entirely fictitious.
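
As a rough illustration of how an erroneous print might be surfaced in bar data, the sketch below flags any bar whose low sits far below the median of the surrounding closes. The bar layout, the function name flag_suspect_lows, the window of five bars and the 5% threshold are all assumptions made for the example, not a recommended calibration. A flag marks a bar for manual review; it does not prove the print was wrong.

```python
from statistics import median

def flag_suspect_lows(bars, window=5, max_dev=0.05):
    """Return indices of bars whose low is far below the median of nearby closes.

    bars: list of dicts with 'low' and 'close' keys (hypothetical layout).
    window: number of bars on each side used for the reference median.
    max_dev: fractional deviation treated as suspicious (assumed threshold).
    """
    flagged = []
    for i, bar in enumerate(bars):
        lo, hi = max(0, i - window), min(len(bars), i + window + 1)
        # Closes of the neighbouring bars, excluding the bar under test.
        neighbours = [b["close"] for j, b in enumerate(bars[lo:hi], start=lo) if j != i]
        if not neighbours:
            continue
        ref = median(neighbours)
        if ref > 0 and (ref - bar["low"]) / ref > max_dev:
            flagged.append(i)
    return flagged

# Synthetic bars: ten quiet sessions with one isolated spike low.
bars = [{"low": 99.0, "close": 100.0} for _ in range(10)]
bars[4] = {"low": 80.0, "close": 100.2}
print(flag_suspect_lows(bars))
```

Because the reference median looks at bars on both sides, this is an audit step to run on stored data, not something a live signal could compute.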

These error categories are not rare edge cases. They are routine. Data vendor infrastructure and processing pipelines are themselves significant and demonstrable sources of error. Vendors aggregate data from multiple exchange feeds, apply their own session definitions, normalise timestamps across time zones, construct continuous contracts, and backfill historical records. Each step introduces opportunities for systematic errors that propagate silently into every downstream backtest. When these errors are identified and reported to vendors, the response is frequently inadequate: errors are acknowledged but not corrected, or corrections are applied only to newly delivered data while the historical record remains contaminated. This may seem implausible, but vendors serving a retail client base that rarely audits data quality have little incentive to invest in the engineering effort required to correct historical records. Traders end up developing and evaluating strategies on data that the vendor itself knows to be defective.¹

I want to be blunt about this: the trader who assumes that commercially distributed data has been rigorously validated is making an assumption that is demonstrably false.

Cryptocurrency markets are worse still. Many crypto venues engage in wash trading: fabricated volume reported to inflate apparent liquidity. Credible analyses have estimated that a large fraction of reported crypto volume, even on major exchanges, is non-genuine. A backtest that uses reported volume for signal confirmation or liquidity estimation may be building on data that is, in a meaningful sense, fictional. This is different in kind from the erroneous prints and timestamp errors discussed above. The data is not wrong by accident. It is wrong by design.

The April 2020 WTI crude oil event, in which the May 2020 contract settled at -$37.63 per barrel, exposed a related class of data assumption that few traders had considered. Negative prices broke systems at every level: brokers could not process them, trading platforms rejected them, and data feeds that stored prices as unsigned values simply could not represent them. The event has since been incorporated into historical data sets, but vendors differ in how they handle it. Some report the negative settlement correctly; others clip it to zero or omit the session entirely. Any of these choices contaminates a backtest differently. What this exposed is that data schemas encode implicit assumptions about what prices can be, and those assumptions are occasionally wrong.

2.2 Vendor Divergence and Database Drift

The taxonomy above describes categories of error in the abstract. A more direct way to see the problem is to run an identical strategy against multiple vendors’ data and compare the results. Concretum Group did exactly this: they ran the same Opening Range Breakout strategy, with the same code, the same parameters, and the same nominal time period, against intraday data from five providers (Polygon, Interactive Brokers, Databento, IQFeed, and Alpaca). The terminal portfolio value ranged from roughly $226,000 to $726,000 depending solely on whose data the engine consumed. Identical code; threefold variation in equity.

The mechanisms behind the divergence map onto the error categories described earlier, but several are worth singling out.

Phantom highs and lows: Isolated price spikes that appeared in one vendor’s bars and nowhere else, reflecting differences in how each vendor filtered trades for inclusion.
Stale bars: Extended sequences of unchanged OHLC values, typically the result of dropped ticks or improperly aggregated data.
Tick-to-bar assignment ambiguity: Trades occurring at exact time boundaries (09:35:00.000 being the canonical case) were assigned to different one-minute bars by different vendors, and the resulting bars contained different OHLC values from vendor to vendor.
Early-close day leakage: Some vendors continued to emit intraday bars past the official close on half-day sessions while others stopped cleanly.
Venue coverage divergence: Providers that consolidated from SIP feeds saw a different trade population than providers using proprietary subsets, and the implied highs, lows, and closes of identical time intervals differed accordingly.

The amplification effect documented in the Concretum study is itself a useful diagnostic. Strategies whose stops were anchored to intraday highs and lows showed extreme dispersion across vendors. Strategies whose stops were anchored to ATR, a quantity derived from daily ranges with substantial smoothing, converged tightly across the same vendors. If a backtest’s behaviour is sensitive to which vendor’s intraday bars happened to be on the disk that day, the strategy is implicitly trading data-pipeline noise dressed up as a market signal. Running the same code against an independent source is one of the cheapest sanity checks available, and almost nobody does it.
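
A cross-vendor check of this kind need not be elaborate. The sketch below compares two sets of bars keyed by timestamp and reports bars present in only one source, plus bars whose OHLC values disagree. The function name compare_vendor_bars, the dictionary layout and the sample values are hypothetical; it is not the method used in the Concretum study.

```python
def compare_vendor_bars(a, b, tol=0.0):
    """Compare two vendors' bars keyed by timestamp.

    a, b: dicts mapping timestamp -> (open, high, low, close).
    Returns (only_in_a, only_in_b, mismatched), where mismatched lists the
    timestamps at which any OHLC field differs by more than tol.
    """
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = sorted(
        ts for ts in set(a) & set(b)
        if any(abs(x - y) > tol for x, y in zip(a[ts], b[ts]))
    )
    return only_a, only_b, mismatched

# Synthetic example: vendor_y shows a spike high at 09:30 and an extra bar.
vendor_x = {
    "09:30": (100.0, 101.0, 99.5, 100.5),
    "09:31": (100.5, 100.9, 100.2, 100.6),
}
vendor_y = {
    "09:30": (100.0, 103.0, 99.5, 100.5),
    "09:31": (100.5, 100.9, 100.2, 100.6),
    "09:32": (100.6, 100.8, 100.4, 100.7),
}
print(compare_vendor_bars(vendor_x, vendor_y))
```

A strategy anchored to intraday extremes deserves the closest look at the mismatched list, since those are the bars where the two sources would place stops differently.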

A related and equally damaging problem is temporal drift in a single vendor’s data. The Concretum authors compared their own archive of Interactive Brokers minute bars from 2023 against a fresh download from the same vendor in 2026 for the same instrument and the same dates. The 2023 copy showed continuous price movement across the session. The 2026 copy contained more than 350 out of 390 one-minute bars with zero variation in the same session. Same vendor, same query, different data. Whatever pipeline change caused the divergence, the practical consequence is that any backtest result obtained against a third-party feed is reproducible only if the dataset is archived at the time the backtest is run. Re-running the same code against the same vendor a year later may produce different results because the data has been silently rewritten in between. The retail back-tester who treats the vendor’s database as a stable historical record is making an assumption the vendor itself does not honour.
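
One way to catch this kind of degradation is to measure the share of bars in a session that show no intrabar movement, and to compare that figure between an archived copy and a fresh download. In the sketch below a bar counts as stale when its open, high, low and close are all equal, which is an assumed definition. The data is synthetic and merely mirrors the proportions described above.

```python
def stale_fraction(bars):
    """Fraction of bars whose open, high, low and close are all equal."""
    if not bars:
        return 0.0
    stale = sum(1 for o, h, l, c in bars if o == h == l == c)
    return stale / len(bars)

moving = (100.0, 100.2, 99.9, 100.1)   # a bar with ordinary intrabar range
flat = (100.0, 100.0, 100.0, 100.0)    # a bar with zero variation

archived = [moving] * 390              # stand-in for the older archive
fresh = [flat] * 350 + [moving] * 40   # stand-in for the later download

print(stale_fraction(archived), stale_fraction(fresh))
```

A jump in this ratio between two copies of the same session is a cheap signal that the vendor's history has changed underneath the backtest.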

Snapshot the dataset at the time the backtest is run, and archive it alongside the code. If the strategy depends on intraday extremes, also run it against at least one independent source. None of this is technically demanding. It is, again, a discipline issue, and it is not done because the retail ecosystem assumes the data is fine.
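
A minimal sketch of such a snapshot record follows, assuming the dataset can be held as a list of rows. It hashes a canonical serialisation of the rows and stores the digest with a few identifying fields, so a later re-run can confirm it consumed byte-identical data. The function names and manifest fields are illustrative only.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """SHA-256 over a canonical JSON serialisation of the rows."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def manifest(rows, vendor, retrieved_on, code_version):
    """Record to store next to the backtest output (field names are illustrative)."""
    return {
        "vendor": vendor,
        "retrieved_on": retrieved_on,
        "code_version": code_version,
        "row_count": len(rows),
        "sha256": dataset_fingerprint(rows),
    }

rows = [["2026-01-02T09:30", 100.0, 101.0, 99.5, 100.5]]
record = manifest(rows, vendor="example-vendor", retrieved_on="2026-02-02", code_version="abc123")
print(record["sha256"])
```

The digest only detects a change. The archived copy of the data itself is still what makes the result reproducible.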

2.3 Historical Trading Venue Transitions

An underappreciated data quality problem in futures markets arises from the historical transition in trading venues. Many futures contracts have passed through three distinct eras: a pit-only era in which all trading occurred on exchange floors via open outcry; a transitional era in which pit and electronic trading coexisted, often with different session hours, different liquidity profiles, and occasionally different price discovery characteristics; and the current electronic-only era in which the pit has been entirely eliminated.

Each era produces data with very different properties. Pit-only data typically reflects shorter trading sessions with different volatility patterns, wider effective spreads, and reporting latencies that can affect the accuracy of recorded timestamps. During the transitional period, the relationship between pit and electronic prices was not always straightforward; the pit session and the electronic session could trade at different prices, and the “official” settlement price might be derived from one venue while most executable liquidity resided in the other. The electronic-only era brought near-continuous trading hours and entirely different order book dynamics.

These eras are not interchangeable. A backtest that treats a twenty- or thirty-year futures history as a homogeneous data set is implicitly assuming that the microstructure of 1995 is comparable to that of 2025. In reality, a strategy that would have been executable in the electronic era may have been impractical in the pit era due to execution delays and wider spreads, quite apart from the impossibility of automated order placement. Conversely, strategies that exploited inefficiencies specific to open-outcry markets (such as the predictable patterns around pit opening and closing) ceased to function when the trading floor was eliminated. Failing to segment historical data by venue regime, or at minimum to acknowledge the structural breaks that venue transitions introduce, can produce backtest results that reflect no single coherent market environment.

2.4 Look-Ahead Bias and Data Asynchronicity

Look-ahead bias is among the most dangerous classes of data error: the implicit assumption that information was available at a time before it actually existed. The most obvious forms (using a future price to calculate a current signal) are easily avoided. But subtler forms pervade standard datasets and are invisible to the researcher who does not understand the provenance of the data.

Market closing times introduce a related form of asynchronicity. European equity markets close several hours before U.S. markets. Asian markets are closed before New York opens. Without relatively sophisticated data management, a strategy that uses daily closing prices across multiple regions may be implicitly assuming these prices are contemporaneous, when in fact they may be separated by half a day or more. A major U.S. market move in the final hours of the New York session will not be reflected in that day’s European or Asian close, creating spurious correlations and false arbitrage signals in the historical record. The retail back-tester working with a simple table of daily closes, one column per market, will see none of this.
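
The remedy is to carry the actual close time with every price and to ask, at each decision point, which closes had already printed. The sketch below does this with UTC timestamps. The close times, prices and the helper name last_known_close are assumptions chosen for illustration, not real session data.

```python
from datetime import datetime, timezone

def last_known_close(closes, as_of):
    """Latest (close_time, price) stamped at or before as_of, else None.

    closes: list of (close_time_utc, price) tuples sorted by time.
    """
    known = [(t, p) for t, p in closes if t <= as_of]
    return known[-1] if known else None

utc = timezone.utc
# Hypothetical closes for two regions on consecutive days.
europe = [(datetime(2025, 1, 14, 16, 30, tzinfo=utc), 500.0),
          (datetime(2025, 1, 15, 16, 30, tzinfo=utc), 505.0)]
us = [(datetime(2025, 1, 14, 21, 0, tzinfo=utc), 6000.0),
      (datetime(2025, 1, 15, 21, 0, tzinfo=utc), 5900.0)]

# A decision taken at the European close on the 15th.
decision = datetime(2025, 1, 15, 16, 30, tzinfo=utc)
print(last_known_close(europe, decision))
print(last_known_close(us, decision))   # the previous day's US close
```

A table of daily closes with one column per market would have paired the European 505.0 with the US 5900.0, a price that did not yet exist at the decision time.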

Even within a single market, the treatment of missing data introduces silent errors. Many data pipelines and export routines write a zero where a data point is missing. A price of zero and an unknown price are not the same thing. The first implies total loss; the second means no data were received. (Most spreadsheet software, incidentally, makes the same conflation.) A system that fails to distinguish between these two states can generate catastrophic false signals: a moving average calculated over a series that includes a spurious zero will produce a wildly distorted output, potentially triggering large trades based on data that never existed. Institutional data operations invest substantial effort in distinguishing between “zero” and “blank” across every field and every timestamp in their databases. The retail trader downloading a CSV file has no such safeguard.
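
The distinction can be enforced at the point of ingestion. In the sketch below a blank CSV field is parsed to None, never to zero, and the moving average refuses to emit a value for any window that contains a gap. The helper names parse_price and moving_average are hypothetical, and skipping incomplete windows is only one of several defensible policies.

```python
def parse_price(field):
    """Blank or unparseable field -> None (unknown), never 0.0."""
    field = field.strip()
    if field == "":
        return None
    try:
        return float(field)
    except ValueError:
        return None

def moving_average(values, n):
    """Simple moving average; None wherever the window is short or has a gap."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        if len(window) < n or any(v is None for v in window):
            out.append(None)
        else:
            out.append(sum(window) / n)
    return out

raw = ["100", "101", "", "103", "104", "105"]   # third field is missing
prices = [parse_price(f) for f in raw]
zero_filled = [0.0 if p is None else p for p in prices]

print(moving_average(prices, 3))
print(moving_average(zero_filled, 3))
```

The zero-filled series produces a three-bar average of 67 on a market trading near 100. The gap-aware version simply declines to report a value until the window is clean again.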

2.5 Survivorship and Selection Bias

Survivorship bias is well-documented in the academic literature on equity backtesting, but its effects extend beyond equities and beyond the commonly discussed problem of universe selection. In futures markets, contracts are periodically delisted or their specifications significantly altered. Commodity markets that no longer trade, or that have been restructured (such as the transition from open-outcry to electronic trading, which transformed market microstructure), are often absent from freely available data sets. A universe of instruments constructed from those currently trading implicitly excludes instruments that failed or were delisted, biasing results toward markets that happened to persist.

However, the more damaging form of survivorship bias operates not at the level of instrument selection but at the level of interpreting backtest output. When a strategy is tested across a large universe of instruments, the temptation to focus on the subset that performed well is almost impossible to resist. A researcher who tests a mean-reversion strategy on fifty futures contracts and finds that it is profitable on twelve of them faces a choice: report the aggregate results across all fifty (which may be mediocre or negative), or present the twelve “validating” instruments as the strategy’s target universe, implicitly discarding the thirty-eight that failed to confirm the hypothesis. The latter approach produces a survivorship-biased result that overstates expected performance, even if each individual backtest is technically correct.

This problem is amplified considerably by what might be called the algorithmic generator approach, the trading equivalent of the infinite monkey theorem. Given sufficient computational resources, it is easy to generate thousands or even millions of strategy variants by systematically permuting parameters, indicator combinations, entry and exit rules, and filter conditions. Among a sufficiently large population of random strategies, some will inevitably show impressive backtested performance purely by chance. The probability of finding at least one strategy with apparently excellent performance metrics approaches certainty as the number of variants tested increases, regardless of whether any genuine edge exists in the underlying logic.
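
The arithmetic behind that claim is short. If each no-edge variant has probability p of clearing a performance filter by luck, and the variants are treated as independent (a simplifying assumption, since real variants are usually correlated), the chance that at least one clears it is 1 - (1 - p)^N. The sketch also shows the Bonferroni threshold, named here as one conservative correction among several, not as the correction the text has in mind.

```python
def p_at_least_one(n_variants, p_single):
    """Chance that at least one of n independent no-edge variants passes."""
    return 1.0 - (1.0 - p_single) ** n_variants

def bonferroni_threshold(alpha, n_variants):
    """Per-variant significance level that holds the family-wise rate at alpha."""
    return alpha / n_variants

for n in (1, 20, 100, 1000):
    print(n, round(p_at_least_one(n, 0.05), 4))
```

With a 5% chance per variant, a hundred variants already make a false discovery all but certain.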

The survivorship bias in this context is severe: out of a million generated strategies, the few hundred that survived the performance filter are presented as discoveries, when they are in fact the expected tail outcomes of a large random sample. The strategies that emerge from such processes are frequently characterised by opaque combinations of indicators, filters, conditions, and timing rules that have no coherent theoretical basis, a phenomenon that could be referred to as “indicator soup.”² The resulting rules may appear sophisticated, but their complexity is an artefact of overfitting rather than evidence of genuine market insight. Without rigorous correction for the number of strategies tested (a correction that is almost never applied in practice), such results are meaningless. This is worse than it sounds, because the researcher is often not fully aware of the number of implicit comparisons made, particularly when iterative manual refinement substitutes for explicit combinatorial search. Even a researcher who does not deliberately generate thousands of variants may, through repeated adjustment and re-testing, effectively sample a large strategy space while believing they have tested only a handful of ideas.

A common belief is that the problems of overfitting and survivorship bias in strategy generation can be adequately addressed by reserving a separate out-of-sample period or by employing walk-forward analysis. While these techniques are valuable and represent an improvement over naive in-sample-only evaluation, they do not eliminate the problem. When a large number of strategy variants are subjected to an out-of-sample filter, the variants that pass may simply be those that happened to fit the noise in the out-of-sample period as well as the in-sample period. The out-of-sample test, in effect, becomes another selection criterion in a multi-stage filtering process, and the strategies that survive all stages are not necessarily those with genuine predictive power but rather those whose particular pattern of overfitting happened to generalise to the specific out-of-sample window chosen. Walk-forward analysis mitigates this to some degree by using multiple out-of-sample windows, but it remains vulnerable when the number of candidate strategies is large relative to the effective degrees of freedom in the data. The reassurance that “it passed out-of-sample testing” is less meaningful than it appears.

A subtler failure mode occurs when the researcher iterates between in-sample and out-of-sample periods. A model is fitted in-sample, tested out-of-sample, found wanting, revised based on what was learned from the out-of-sample failure, re-fitted, and re-tested. Each iteration effectively contaminates the out-of-sample data, converting it into a second in-sample period. The researcher believes the final model was validated on unseen data, but the data were not truly unseen: they were examined and used to inform model revisions.

This iterative contamination is nearly universal in practice and is rarely acknowledged.

The problem runs deeper still. An experienced researcher who has tested many strategies over many years develops familiarity with the major events and regime shifts in the historical record: the 1987 crash, the internet bubble, the 2008 financial crisis, the 2020 pandemic sell-off. This familiarity means that even a nominally “fresh” out-of-sample test is compromised: the researcher already knows, at least broadly, what happened during the test period and will unconsciously or consciously design models and select parameters that accommodate those known events. Institutional quantitative firms recognise this problem and take countermeasures that range from separating the strategy research function from the strategy selection function to physically withholding portions of the database from researchers, so that the researcher cannot know what data will be used for out-of-sample validation. Some firms go further, randomising which portions of the data are used for fitting versus testing, or having independent teams conduct the validation. The retail back-tester, working alone with a single dataset and full knowledge of the historical record, has no access to any of these institutional safeguards and is therefore maximally exposed to this form of look-ahead bias.

The underlying methodological failure is an inversion of the scientific method. Properly conducted research begins with a theory derived from observation, deduces testable consequences, and then seeks to falsify those consequences, that is, to find evidence that the theory is wrong. A theory that survives repeated attempts at falsification gains credibility, but it is never “proved.” The retail backtesting workflow inverts this process entirely: the researcher searches for parameters that produce attractive historical performance. That is, the researcher seeks confirmation rather than falsification.

The distinction is critical. Seeking confirmation is trivially easy in a noisy dataset with many degrees of freedom; any sufficiently flexible model can be made to fit historical data. Seeking falsification is hard, and it is the hardness that gives the surviving theories their value. A backtest that begins with the question “does this combination of parameters make money?” rather than “under what conditions would this theory fail, and does the evidence show those conditions?” is not applying the scientific method; it is engaging in a sophisticated form of confirmation bias.

A concrete and timely illustration of this failure mode can be seen in the recent proliferation of gold trading strategies among retail systematic traders. Gold reached a succession of all-time highs in 2024 and 2025, and the result has been a predictable surge of interest in developing gold-specific strategies.³ The aspiring trader, having watched gold’s ascent in real time, decides to build a trend-following or breakout system for the gold futures market. They obtain twenty years of historical data, reserve the most recent two or three years as an out-of-sample holdout, fit their model to the earlier period, and then validate it against the withheld data. The out-of-sample period, which happens to coincide with one of the strongest gold rallies in history, produces impressive results, and the trader concludes that the strategy has been rigorously validated on unseen data.

But the entire exercise is contaminated by a form of look-ahead bias that no amount of out-of-sample discipline can correct. The decision to build a gold strategy in the first place was made because gold had recently performed spectacularly well. The trader did not, in 2015 or 2018, survey the universe of tradeable futures markets and select gold on theoretical grounds; the trader selected gold in 2025, with full knowledge that it had reached record highs, and then tested a strategy designed to capture trending behaviour against a holdout period that they already knew (perhaps just subconsciously) contained a powerful trend. The out-of-sample test does not test whether the strategy can identify trends it has never seen. It tests whether a trend-following strategy makes money during a period the trader already knows was dominated by a trend. The answer is almost guaranteed to be yes, and it tells the trader almost nothing about the strategy’s forward viability.

The self-deception is compounded by a failure of counterfactual honesty. The trader must ask: had they been trading this strategy live over the full historical period, would they have persisted through the years when gold went essentially nowhere? Gold spent much of 2013 through 2019 in a broad range, delivering the kind of choppy, mean-reverting price action that is maximally punishing for trend-following systems. A live trader enduring several years of whipsaws and negligible returns would face enormous psychological pressure to abandon the strategy.

Most would.

The backtest, however, glides serenely through this period because the trader viewing the historical equity curve already knows that the payoff is coming. The backtested return includes the years of frustration as though they were costless to endure; in practice, they are anything but. The strategy that “works” over twenty years of history is, for most human traders, untradeable over the difficult middle years that make the long-term return possible.

This pattern (selecting an instrument after observing its recent success, fitting a strategy to its historical data, “validating” the strategy on a holdout period that the trader already knows was favourable, and then treating the result as evidence of a robust edge) is among the most common and most damaging mistakes made by novice systematic traders, and among the hardest to detect, because every individual step in the process appears methodologically sound. The data were split properly; the out-of-sample period was genuinely withheld; the strategy was not re-fitted after seeing the holdout results. But the fatal contamination occurred before any of these steps, at the moment the trader chose the market.

No statistical technique can correct for an instrument selection decision that was itself driven by knowledge of the outcome.

One partial countermeasure, available in some backtesting frameworks but rarely employed by retail traders, is logarithmic detrending of the price series before evaluation. Detrending mathematically removes the dominant directional trend from the historical data, leaving only the deviations around that trend for the strategy to exploit. A trend-following strategy tested on detrended gold data cannot profit simply from the secular upward move; it must demonstrate an ability to capture shorter-term directional movements that would exist regardless of the long-term trajectory. If the strategy’s performance collapses on detrended data, this is strong evidence that the apparent edge was nothing more than the underlying trend, the hindsight artefact described above. Detrending does not eliminate all forms of look-ahead bias (the instrument selection problem remains), but it removes the most obvious one: the strategy that appears to work only because the chosen market happened to go up. Despite the simplicity and diagnostic power of this technique, it is almost unknown among retail systematic traders, in part because few of the popular backtesting platforms (including legacy platforms such as TradeStation, which has no built-in detrending capability) offer it as a standard feature, and in part because the trader who has already convinced themselves of their strategy’s robustness has little motivation to apply a test that might disprove it.
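
One common variant of the technique, sketched below under stated assumptions, subtracts the average log return from every bar's log return and rebuilds the price series. The result ends at the level it started from, so a buy-and-hold position earns nothing on it. The function name detrend_log is hypothetical, the input must be strictly positive, and other formulations (for example, removing a fitted log-linear trend) exist.

```python
import math

def detrend_log(prices):
    """Remove the average log return from every bar and rebuild the series.

    prices must be strictly positive. The output starts at prices[0] and,
    by construction, finishes at the same level.
    """
    if len(prices) < 2:
        return list(prices)
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    drift = sum(log_returns) / len(log_returns)
    out = [prices[0]]
    for r in log_returns:
        out.append(out[-1] * math.exp(r - drift))
    return out

trending = [100.0, 104.0, 103.0, 110.0, 121.0]   # synthetic rising series
flat = detrend_log(trending)
print([round(p, 2) for p in flat])
```

Running the same strategy on both series and comparing the results gives a first indication of how much of the backtested return was simply the drift.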

2.6 Continuous Contract Construction

For futures-based strategies, the method of constructing continuous price series from individual contract months is a source of substantial variation in backtest results that is rarely discussed in retail contexts. The back-adjusted (or “Panama”) method preserves point-to-point price changes but produces artificial price levels that may become negative for some instruments over long histories. Ratio-adjusted (proportional) methods preserve percentage returns but alter the apparent magnitude of price moves. Unadjusted series preserve actual prices but introduce discontinuities at roll dates.

Each method produces a different equity curve for the same underlying strategy. Same trades, different numbers. More importantly, strategies that depend on absolute price levels (such as those using support and resistance concepts) may produce entirely spurious signals on back-adjusted data, since the price levels in the adjusted series never actually existed in the market. The qualification matters: it is the levels that are fictitious, not the point-to-point changes. A difference-based rule (a moving-average crossover, a breakout, a momentum signal computed on changes) reads the same series correctly, because it keys off increments that the adjustment preserves rather than off an absolute level the adjustment has shifted. A level-based rule does not, and the shift is not even stable: re-anchoring the series on the next roll moves every historical level again, so the same fixed-threshold rule can fire on different bars depending on when the continuous series was built. This anchor-dependence is a more useful diagnostic than the blanket claim that adjusted data is “unreal”: difference-based rules survive it, level-based rules do not. The choice of roll date, roll method (calendar-based versus volume-based versus open interest-based), and adjustment methodology collectively represent a set of implicit assumptions that many back-testers never examine.
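
The difference between the two adjustments, and the way a fixed threshold misfires, can be seen on a toy roll. The sketch below splices two synthetic contract series, assuming the last price of the old contract and the first price of the new one are same-day closes. The function name splice and all prices are invented for illustration.

```python
def splice(old, new, method="difference"):
    """Join two contract series at a roll.

    old[-1] and new[0] are assumed to be same-day closes of the expiring and
    the next contract. Older prices are shifted (difference) or scaled (ratio)
    so that the joined series shows no jump at the roll.
    """
    if method == "difference":
        gap = new[0] - old[-1]
        adjusted = [p + gap for p in old[:-1]]
    elif method == "ratio":
        factor = new[0] / old[-1]
        adjusted = [p * factor for p in old[:-1]]
    else:
        raise ValueError("method must be 'difference' or 'ratio'")
    return adjusted + list(new)

old = [100.0, 102.0, 101.0]   # expiring contract
new = [106.0, 107.0]          # next contract, trading 5 points higher

panama = splice(old, new, "difference")
ratio = splice(old, new, "ratio")
print(panama)
print([round(p, 2) for p in ratio])

# A fixed level of 103 was breached by traded prices but never in the adjusted series.
print(any(p < 103 for p in old), any(p < 103 for p in panama))
```

The first bar-to-bar change is 2 points in both the raw and the difference-adjusted series, and 2% in both the raw and the ratio-adjusted series. Only the levels have moved.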

A related and frequently overlooked issue is the change in contract specifications and effective notional value over time. Many futures contracts have undergone substantial changes in contract size, tick value, or margin requirements during their history. Even where specifications have remained nominally stable, the notional value of a single contract can change dramatically with price level. A backtest that trades “one contract” of the E-mini S&P 500 throughout a twenty-year history is implicitly treating a position that represented a modest notional exposure at the start of the sample as equivalent to one representing roughly five times that exposure or more at current levels. The risk characteristics of the strategy are therefore not constant across the sample period, and position sizing rules calibrated to current contract values will produce misleading results when applied retrospectively to periods when the same contract represented a very different risk exposure.

The S&P 500 futures complex provides a particularly instructive example of this distortion. When the E-mini contract (ES) was introduced in September 1997, the S&P 500 index stood at roughly 950. The full-size S&P 500 contract (SP), valued at the $250 multiplier it carried from November 1997, had a notional value of roughly $237,500.⁴ The E-mini, at one-fifth of that size with its $50 multiplier, represented approximately $47,500 of notional exposure. As the index has appreciated over the subsequent decades (recently touching 7,000), the E-mini’s notional value has grown to more than $300,000, comfortably exceeding that $237,500 figure for the full-size contract. CME Group launched the Micro E-mini (MES) in 2019 at one-tenth the E-mini’s size, intended to make the S&P 500 accessible to smaller accounts, and officially delisted the full-size SP contract in 2021 because the E-mini had completely supplanted it. Yet even the Micro E-mini, with its $5 multiplier, now carries a notional value of more than $30,000 at current index levels: smaller than the original E-mini’s $47,500 at launch, but a far cry from the sub-$10,000 position size that many retail traders assume they are taking on when they trade “the small contract.”

For the back-tester, this means that a strategy tested over twenty years of E-mini data is not operating on a consistent instrument: the contract at the start of the sample is, in economic terms, a wholly different proposition from the contract at the end.
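
The figures above are simply index level times multiplier, and the same arithmetic shows how many launch-era contracts it would take to match one contract today. The helper names below are hypothetical. The index levels and multipliers are those quoted in the text.

```python
def notional(index_level, multiplier):
    """Dollar notional of one contract."""
    return index_level * multiplier

def contracts_for_exposure(target_dollars, index_level, multiplier):
    """Whole contracts whose combined notional does not exceed the target."""
    return int(target_dollars // notional(index_level, multiplier))

es_launch = notional(950, 50)     # E-mini at launch
es_now = notional(7000, 50)       # E-mini at an index level of 7,000
mes_now = notional(7000, 5)       # Micro E-mini at the same level

print(es_launch, es_now, mes_now)
print(contracts_for_exposure(es_now, 950, 50))
```

Sizing a backtest by target notional, not by a fixed contract count, is one way to keep the exposure comparable across the sample, at the cost of introducing its own rounding effects.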

The full-size SP contract itself illustrates a further wrinkle. When the E-mini launched in September 1997, the SP contract carried a $500 multiplier, giving it a notional value near $475,000 at the prevailing index level. Just two months later, in November 1997, the CME halved the multiplier to $250 because the contract had simply grown too large for many participants. Any backtest spanning this period must account for the fact that “one contract” of the SP before and after November 1997 represents a fundamentally different economic exposure, since the post-change contract was half the size of its predecessor. Whether a given data source correctly reflects this multiplier change or silently treats the pre- and post-change contracts as identical is yet another data nuance that the back-tester must verify independently.

This issue is especially dangerous for strategies that use fixed-dollar stop-losses, a practice that is common among retail traders who calibrate their risk in terms of “the most I am prepared to lose on this trade.” A trader who backtests with a $3,000 stop over a twenty-year history is implicitly assuming that $3,000 represents the same risk tolerance throughout the sample. It does not, for two compounding reasons. First, inflation alone has substantially eroded the purchasing power of $3,000 over two decades: in real terms, $3,000 today buys considerably less than $3,000 did in 2005. Second, and more importantly for strategy mechanics, the contract’s notional value and typical daily range have grown dramatically. A $3,000 stop on an E-mini S&P position twenty years ago, when the contract’s notional value was roughly $60,000 and a typical daily range might have been 15 points ($750), afforded the strategy approximately four days’ worth of adverse movement before triggering the stop. The same $3,000 stop today, when the notional value exceeds $300,000 and a typical daily range may be 60 points ($3,000), provides barely a single day’s cushion. The backtest will show the strategy surviving adverse excursions in the early years that would trigger the stop almost immediately under current conditions. The historical equity curve is therefore constructed not from a single strategy but from a series of evolving strategies: the early portion is built from a strategy that has generous room to move and absorb volatility, while the later portion is effectively scalping with a tight stop. Yet both appear as a single continuous track record.
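The cushion that a fixed-dollar stop provides can be expressed in units of the typical daily range, which makes the drift explicit. A minimal sketch, using the illustrative figures from the paragraph above (a $50 multiplier and daily ranges of 15 and 60 index points); the function names are hypothetical.

```python
def stop_in_daily_ranges(stop_dollars, daily_range_points, multiplier):
    """How many typical daily ranges of adverse movement a dollar stop allows."""
    return stop_dollars / (daily_range_points * multiplier)

def stop_for_constant_cushion(n_ranges, daily_range_points, multiplier):
    """Dollar stop that keeps the cushion fixed at n_ranges daily ranges."""
    return n_ranges * daily_range_points * multiplier

ES_MULTIPLIER = 50  # dollars per index point for the E-mini

early = stop_in_daily_ranges(3000, 15, ES_MULTIPLIER)   # 4.0 daily ranges
recent = stop_in_daily_ranges(3000, 60, ES_MULTIPLIER)  # 1.0 daily range

# Stop needed today to match the four-range cushion of the early sample
equivalent_stop = stop_for_constant_cushion(4, 60, ES_MULTIPLIER)  # 12000
```

On these numbers, the strategy that ran with a $3,000 stop at the start of the sample would need a stop of about $12,000 today to be given the same room.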
Stops based on volatility measures such as ATR partially mitigate this problem by adapting to the prevailing range, but even ATR-based stops are distorted over long histories if the underlying relationship between range and notional value has shifted, as it has for any contract whose price level has changed a great deal. Continuous contract adjustments compound the problem: back-adjusted series accumulate roll adjustments that grow larger the further back one looks, progressively distorting the relationship between the adjusted price level and the actual trading ranges that prevailed at the time. An ATR calculated on back-adjusted data from twenty years ago, particularly when expressed as a percentage of the adjusted price, may bear little resemblance to the ranges a trader would have actually experienced.

Extreme price events create further complications. When WTI crude oil traded at negative prices in April 2020 (the May 2020 contract settling at -$37.63), several brokers were unable to process negative prices in their systems, and the same is true of many backtesting platforms. Any framework that stores prices as unsigned values, computes percentage returns, or uses logarithmic transformations will fail, or return meaningless numbers, on a negative price. Back-adjusted continuous series that pass through this event can produce negative adjusted prices for earlier contracts even if the unadjusted prices were positive, propagating the problem backward through the entire history. A backtest of any strategy spanning this period that does not specifically handle negative prices is unreliable, and the trader who has not checked whether their platform can handle this case is carrying a risk of which they may not be aware.
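A quick way to find out whether a pipeline survives a negative print is to feed it one. The sketch below assumes plain Python lists of settlement prices; the function names are illustrative, and the prior-day price of 18.27 is used only as an example input.

```python
import math

def log_return(prev_price, price):
    """Log return, defined only when both prices are strictly positive."""
    if prev_price <= 0 or price <= 0:
        raise ValueError("log return undefined for non-positive prices")
    return math.log(price / prev_price)

def point_change(prev_price, price):
    """Price change in points, which is well defined for any sign of price."""
    return price - prev_price

def first_non_positive(prices):
    """Index of the first price at or below zero, or None if all are positive."""
    for i, p in enumerate(prices):
        if p <= 0:
            return i
    return None

prices = [18.27, -37.63]  # example prior settlement, then the negative settlement

bad_index = first_non_positive(prices)  # 1
change = point_change(*prices)          # about -55.90 points

try:
    log_return(*prices)
    log_ok = True
except ValueError:
    log_ok = False                      # the log transform is rejected
```

Running a scan such as `first_non_positive` over both the unadjusted and the back-adjusted series is a cheap check, since the adjusted series can cross zero on dates when the traded contract never did.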

2.6.1 The drift is real money, not a measurement artefact

There is a subtler point here that is easy to get backwards, and getting it backwards leads to the wrong fix. The cumulative adjustment that back-adjustment piles onto historical data is often described as a distortion to be cleaned away. For absolute price levels that description is correct. For the price changes, and therefore for profit and loss, it is not. The total change of an additively back-adjusted series from the start of the history to the end equals exactly the profit and loss of holding one continuously-rolled position over that period. The roll gaps that the adjustment strips out are not noise; they are the actual cashflows of the roll. At each roll the expiring contract is closed at its price and the next is opened at its price, and the difference between the two is money that genuinely changed hands. Additive back-adjustment is, by construction, faithful to that profit and loss in points. This is precisely why it is the default method, and why “the adjusted prices never existed” — true of the levels — does not imply “the adjusted returns are fictitious.” They are not.
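The identity between the total change of the adjusted series and the rolled profit and loss can be checked on a toy example. The sketch below assumes an additive adjustment anchored to the most recent contract, with each roll executed at the roll-day close of both contracts; the segment layout and function names are illustrative, and the prices are invented.

```python
def back_adjust(segments):
    """Additively back-adjust a list of per-contract price segments.

    Each segment holds the closes of one contract while it is the held
    contract. The last close of a segment and the first close of the next
    segment fall on the same roll day.
    """
    adjusted = []
    offset = 0  # cumulative roll gap applied to older contracts
    for k in range(len(segments) - 1, -1, -1):  # newest contract first
        segment = segments[k]
        if k < len(segments) - 1:
            # Roll gap: new contract minus old contract on the roll day
            offset += segments[k + 1][0] - segment[-1]
            segment = segment[:-1]  # roll day is represented by the new contract
        adjusted = [p + offset for p in segment] + adjusted
    return adjusted

def rolled_pnl(segments):
    """Points earned by holding each contract from its first to last close."""
    return sum(seg[-1] - seg[0] for seg in segments)

# Invented prices for three contracts in contango (each new contract is higher)
segments = [
    [100, 101, 102],
    [104, 103, 105],
    [108, 110],
]

series = back_adjust(segments)         # [105, 106, 107, 106, 108, 110]
total_change = series[-1] - series[0]  # 5
pnl = rolled_pnl(segments)             # 5
```

The oldest contract actually traded at 100 but appears at 105 in the adjusted series: the level is fictitious, while the 5-point change matches the rolled position exactly.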

This matters because the drift is not random. In a market that sits persistently in contango, the roll gap carries a consistent sign, and the cumulative sum of same-signed gaps imparts a deterministic, low-noise, highly autocorrelated drift to the level series. With the series anchored to the most recent contract, persistent contango lifts the back-adjusted prices progressively higher the further back one looks, so that the series drifts downward through time; persistent backwardation does the reverse, pushing earlier prices progressively lower. That drift is the roll yield, and it is real return that a position holder earns or pays. The contracts most exposed to having an adjusted series driven through zero (low-priced instruments with long histories and a persistent curve sign, such as short-term interest-rate futures and some energy contracts) are exposed precisely because the roll yield is large and one-directional relative to their price, not because of any arithmetic accident.

The practical consequence is a denominator problem that is easy to introduce and hard to see. Any quantity computed as a percentage of the adjusted price — return volatility for position sizing, the entries of a correlation matrix, a log return — is wrong if the adjusted close is used as the denominator, because that denominator is an anchor-dependent fiction that drifts further from the real contract price the further back one looks. The fix is not to abandon additive adjustment but to never divide by the adjusted level: compute each period’s return as the change in the adjusted series divided by the actual contract price that was trading at the time, not by the back-adjusted close.⁵ In practice this means carrying the unadjusted contract price alongside the adjusted series at the resolution the strategy trades on, so the real denominator is always available. A volatility-scaling or correlation routine that silently divides by the adjusted close is carrying a bug that grows with the length of the history, and on a large multi-instrument book it corrupts both the per-instrument position sizes and the cross-instrument risk model at once.
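A minimal sketch of the two denominators, assuming the adjusted and unadjusted closes are aligned lists at the same resolution. Dividing by the previous bar’s traded price is an assumption made here for illustration and is not a description of any particular framework’s implementation; the prices are invented.

```python
import statistics

def returns_on_traded_price(adjusted, unadjusted):
    """Change in the adjusted series divided by the prior actual contract price."""
    return [
        (adjusted[t] - adjusted[t - 1]) / unadjusted[t - 1]
        for t in range(1, len(adjusted))
    ]

def returns_on_adjusted_level(adjusted):
    """The buggy variant: the adjusted close is used as the denominator."""
    return [
        (adjusted[t] - adjusted[t - 1]) / adjusted[t - 1]
        for t in range(1, len(adjusted))
    ]

# Invented example: the contract traded near 50, but a cumulative roll
# adjustment of -45 places it near 5 in the back-adjusted series.
unadjusted = [50.0, 50.5, 50.0, 51.0]
adjusted = [p - 45.0 for p in unadjusted]

good = returns_on_traded_price(adjusted, unadjusted)
bad = returns_on_adjusted_level(adjusted)

vol_good = statistics.stdev(good)
vol_bad = statistics.stdev(bad)  # roughly ten times vol_good in this example
```

A position sizer that targets a fixed volatility would, on these numbers, take roughly a tenth of the intended position in the early history if it were fed the second series.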

It is also worth noting that the continuous contract used in a backtest should be constructed in the same way that the trader will roll positions in live trading. In many cases, strategy developers test on continuous contracts without understanding the specific rules used to construct them. Care should be taken to ensure that the roll rules used in live trading are also modelled in the backtest. Furthermore, most backtests ignore roll costs, which can be material for strategies that regularly hold positions across contract roll dates.

2.7 Equity Price Adjustments: Dividends, Splits, and Consolidations

The continuous contract construction problem described above has a direct analogue in equities, though it manifests differently and is, if anything, less well understood by the retail back-tester. Equity data vendors routinely supply “adjusted” historical prices that have been retroactively modified to account for corporate actions: stock splits, reverse splits (consolidations), and dividend payments. The adjusted series is intended to produce a continuous return stream, so that a chart or backtest can treat the instrument’s history as though these events had not occurred. The intention is reasonable. The execution is a source of widespread and frequently unrecognised distortion.

Consider a stock that has undergone a 2-for-1 split. On the split date, the share price is halved and the number of outstanding shares is doubled. To prevent the split from appearing as a 50% loss in the historical record, the data vendor retroactively halves all pre-split prices. This preserves the percentage return series: a move from $100 to $110 pre-split becomes a move from $50 to $55 in the adjusted data, and the 10% return is correctly maintained. But the adjusted prices are now fictitious. No one ever traded this stock at $50 before the split; the actual market price was $100. Any strategy that depends on absolute price levels, whether through fixed-dollar stop-losses, support and resistance levels, or round-number effects, will generate signals on the adjusted series that bear no relation to the conditions that actually prevailed. This is the same class of error described above for back-adjusted futures, and it is just as damaging.

Reverse splits (consolidations) introduce the mirror-image problem. A company executing a 1-for-10 reverse split is typically doing so because its share price has fallen to very low levels, often to avoid delisting requirements. The adjusted series retroactively multiplies all pre-consolidation prices by ten, transforming what was in reality a low-priced, wide-spread, difficult-to-trade instrument into one that appears to have always traded at a respectable price level. A backtest run on the adjusted data will simulate entries and exits at these inflated historical prices, entirely concealing the fact that the actual trading environment featured penny-stock spreads, thin liquidity, and the particular microstructure pathologies that afflict very low-priced equities. The strategy’s apparent historical performance includes a period that was, in practice, untradeable at the costs and fill quality the backtest assumes.

Dividend adjustments are subtler but arguably more consequential, because they are universal rather than occasional. When a stock pays a dividend, the standard adjustment methodology reduces all pre-dividend historical prices by the dividend amount (for point-adjusted series) or by the dividend yield (for proportionally adjusted series). Over a long history, for a stock that has paid regular dividends for decades, the cumulative effect of these adjustments is enormous. Historical prices in the adjusted series can be reduced to a small fraction of their actual traded values. A stock that traded at $40 twenty years ago may appear in the adjusted series at $15 or less, once the cumulative effect of two decades of quarterly dividends has been subtracted from the historical record.

This creates several concrete problems for the back-tester. First, as with splits, the adjusted prices are fictional, and any strategy logic that references absolute price levels is operating on numbers that never existed in the market. Second, the adjustment methodology itself varies between vendors. Some vendors adjust for regular dividends but not special dividends. Some adjust for all cash distributions. Some adjust only the closing price; others adjust the full OHLC bar. The same stock, over the same period, can produce noticeably different adjusted price histories depending on the vendor and adjustment method, and consequently different backtest results. The researcher who downloads adjusted data from one source, develops a strategy, and then attempts to validate it against data from another source may find discrepancies that are not errors in either dataset but artefacts of differing adjustment methodologies.

Third, dividend-adjusted data distorts volatility measures in ways that are easy to miss. A stock whose actual price was $40 with a daily range of $1 (2.5% of price) may appear in the adjusted series at $15, but the daily range is also adjusted to approximately $0.38. In absolute terms, the range has been compressed; in percentage terms, it is preserved. This means that any volatility metric calculated in percentage or logarithmic terms (such as standard deviation of returns) will be correct, but any metric calculated in absolute terms (such as ATR in dollars, or a fixed-point stop-loss) will be distorted by the cumulative adjustment. A strategy that uses a dollar-denominated ATR to set position sizes will systematically oversize positions in the early part of the history, where the adjusted prices are artificially low, and the resulting equity curve will overstate both returns and risk in that period.
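The arithmetic in this paragraph can be made explicit. The sketch below assumes a proportional adjustment, so that every price in a bar is scaled by the same factor; the $1,000 risk budget is a hypothetical figure introduced only to show the effect on position size.

```python
actual_price, actual_range = 40.0, 1.0  # figures from the text
adjusted_price = 15.0                   # the same bar in the adjusted series

factor = adjusted_price / actual_price  # 0.375
adjusted_range = actual_range * factor  # 0.375, about $0.38

# Percentage range is preserved by the proportional adjustment
pct_actual = actual_range / actual_price        # 2.5% of price
pct_adjusted = adjusted_range / adjusted_price  # also 2.5% of price

# Dollar-based sizing is not: shares = risk budget / dollar range
risk_budget = 1000.0                            # hypothetical
shares_actual = risk_budget / actual_range      # 1000 shares
shares_adjusted = risk_budget / adjusted_range  # about 2667 shares
```

The same dollar risk budget buys well over twice as many shares on the adjusted bar, even though the percentage volatility of the stock is identical in both series.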

Fourth, and most damaging, the adjustment methodology embeds future information into past prices. The closing price for a stock on 1 January 2010, in a dividend-adjusted series downloaded today, has been reduced by the cumulative effect of every dividend declared between January 2010 and the day of the download. None of those dividends had been declared at the time the historical bar was created. A strategy that places a stop at a round number, or that triggers an entry on a breach of a fixed price threshold, is responding to a price level that was itself computed using knowledge of dividends that did not yet exist. This is look-ahead bias hidden inside the data. The contamination is structural, embedded in the historical record by future corporate actions, and it does not show up as an implementation error in the strategy code.⁶
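The dependence of a historical adjusted price on the download date can be shown in a few lines. This is a sketch of one common proportional convention, in which each later dividend scales earlier prices by one minus the dividend divided by the close before the ex-dividend date; vendors differ, and the dividend figures below are invented.

```python
import math

def adjusted_close(raw_close, later_dividends):
    """Scale a historical close by every dividend that went ex after that bar.

    later_dividends: list of (dividend, close_before_ex_date) pairs.
    """
    factor = math.prod(1 - div / prev_close for div, prev_close in later_dividends)
    return raw_close * factor

raw_close = 40.0  # the price that actually traded on the historical date

# Invented dividends paid after that date: (amount, close before ex-date)
first = (0.50, 50.0)
second = (0.60, 60.0)

seen_at_the_time = adjusted_close(raw_close, [])                # 40.0
seen_after_first = adjusted_close(raw_close, [first])           # about 39.60
seen_after_second = adjusted_close(raw_close, [first, second])  # about 39.20
```

The same historical bar carries three different values depending on when the series was downloaded; only the unadjusted 40.0 was observable on the day.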

The fix is to keep two clean series side by side: an unadjusted series for any logic that depends on absolute levels, and a total-return series for return calculations. Most retail backtesting infrastructure stores only one of the two and forces the researcher to choose. The choice is rarely an informed one.

A further complication arises from the interaction between dividend adjustments and total return calculations. A backtest on price-only (unadjusted) data will understate the returns of a dividend-paying stock, because the dividends are not captured. A backtest on dividend-adjusted data will correctly capture the total return, but only if the researcher understands that the “price” series they are using is not a price series at all; it is a total-return index expressed in price-like units. Mixing adjusted and unadjusted data within the same strategy, or applying logic designed for one to the other, produces results that are internally inconsistent. A cross-sectional strategy that ranks stocks by price, for example, will produce different rankings depending on whether the prices are adjusted or unadjusted, and the adjusted rankings will reflect cumulative dividend history as much as current market valuation, which is almost certainly not the researcher’s intention.

The researcher who downloads an adjusted price series and treats it as though it were a record of actual traded prices is making an error that is conceptually identical to that of the futures trader who treats a back-adjusted continuous contract as a record of actual futures prices. In both cases, the data have been transformed to serve a specific analytical purpose (preserving return continuity), and using them for any other purpose (absolute price comparisons, dollar-denominated risk calculations, cross-instrument rankings) produces artefacts that the backtest will silently incorporate into its results.

¹ I have encountered vendors who, when confronted with documented errors in their data, pretend that they are serious about fixing them, but never do.
² The metaphor may be too generous. A soup at least has a recipe. These strategies have the complexity of a recipe with none of the intentionality.
³ The same pattern has played out previously with Bitcoin, natural gas, and crude oil. The instrument changes; the methodological error does not.
⁴ When the E-mini launched in September 1997, the full-size SP contract actually carried a $500 multiplier, giving it a notional value closer to $475,000 at prevailing index levels. The CME halved the multiplier to $250 just two months later, in November 1997, precisely because the contract had grown too large for many participants. The $237,500 figure cited here reflects the post-adjustment value.
⁵ Carver (2023) makes this point repeatedly in the context of his own systematic futures work, and his pysystemtrade framework implements it directly: percentage returns are computed against a daily_denominator_price that holds the price of the contract actually being traded, never the back-adjusted level. The same logic applies to any volatility estimate or correlation used for risk scaling.
⁶ There is a long-running thread on r/algotrading that captures the practical confusion this causes for beginning back-testers: when to use adjusted versus unadjusted data, and what the choice implies for the strategy. The answer depends on whether the strategy reasons in returns or in absolute levels. The fact that the question is asked again every few months suggests the typical retail tutorial fails to address it at all.