Diebold-Mariano Test for Forecast Accuracy

Updated 4 August 2025
  • The Diebold-Mariano Test is a statistical procedure designed to compare forecasting models through loss differentials derived from strictly consistent scoring functions.
  • It employs robust variance estimation techniques, such as Newey–West, to account for autocorrelation in forecast errors for rigorous hypothesis testing.
  • The test is applied across fields like risk management and machine learning to enhance model selection and improve forecast accuracy evaluation.

The Diebold-Mariano Test is a formal statistical procedure developed to compare the predictive accuracy of two competing forecasting models. Its principal aim is to provide an objective method for determining whether observed differences in forecast errors are attributable to genuine improvements in model performance or are simply due to random variation. In its classical setting, the test is formulated in terms of a loss differential process derived from applying a suitable loss function to out-of-sample forecast errors. Under appropriate regularity conditions, the Diebold-Mariano (DM) test statistic is asymptotically standard normal, enabling practitioners to conduct rigorous inference on the relative predictive ability of forecasting procedures across a wide variety of domains, including risk management, econometrics, machine learning, and multivariate probabilistic forecasting.

1. Theoretical Foundation and General Structure

The Diebold-Mariano test considers two competing forecasting methods, which, for each forecast origin $t$, produce errors $e_{1,t}$ and $e_{2,t}$, or more generally, losses $L(e_{1,t})$ and $L(e_{2,t})$ under a given loss function $L$. The loss differential is then constructed:

$d_t = L(e_{1,t}) - L(e_{2,t})$

Under the null hypothesis of equal predictive accuracy, $E[d_t] = 0$. The sample mean of the loss differential is:

$\bar{d} = \frac{1}{T} \sum_{t=1}^{T} d_t$

The DM test statistic is then formed as:

$DM = \frac{\bar{d}}{\sqrt{\widehat{\operatorname{Var}}(\bar{d})}}$

where $\widehat{\operatorname{Var}}(\bar{d})$ is a consistent estimator of the long-run variance of the loss differentials, typically accounting for any autocorrelation present.

Under the null hypothesis and standard stationarity and weak dependence assumptions, $DM \sim N(0,1)$ asymptotically, supporting classical hypothesis testing for equal predictive accuracy.
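
As an illustration of these quantities, the minimal sketch below computes the loss differential under squared-error loss and the resulting statistic, assuming (for simplicity only) serially uncorrelated loss differentials; the autocorrelation-robust variance estimation described in Section 3 should be used in practice.

```python
import numpy as np

def dm_statistic_naive(e1, e2):
    """DM statistic under squared-error loss, assuming serially
    uncorrelated loss differentials (illustrative only)."""
    e1, e2 = np.asarray(e1, dtype=float), np.asarray(e2, dtype=float)
    d = e1**2 - e2**2               # loss differential d_t
    T = d.size
    d_bar = d.mean()                # sample mean of the loss differential
    var_d_bar = d.var(ddof=1) / T   # naive Var(d_bar), ignores autocorrelation
    return d_bar / np.sqrt(var_d_bar)
```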

2. Use of Strictly Consistent Scoring Functions

A central element in the empirical and theoretical justification for the DM test is the use of strictly consistent scoring functions with respect to the forecast functional of interest (e.g., mean, quantile, or expected shortfall). A scoring function $S(x, y)$ is strictly consistent for a functional $T(F)$ if, for all $x$ and probabilistic forecast distributions $F$:

$E_F[S(T(F), Y)] \leq E_F[S(x, Y)]$

with equality if and only if $x = T(F)$. This property guarantees that reporting the true functional value minimizes the expected score, and that additional or refined information sets leading to better-calibrated forecasts will realize lower average scores.
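
For example, taking $S(x, y) = (x - y)^2$ and $T(F) = E_F[Y] = \mu$, the bias-variance decomposition gives

$E_F[(x - Y)^2] = (x - \mu)^2 + \operatorname{Var}_F(Y) \geq E_F[(\mu - Y)^2],$

with equality if and only if $x = \mu$, confirming that squared error is strictly consistent for the mean.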

This principle extends to risk functionals such as Value-at-Risk (VaR) and Expected Shortfall (ES), where strictly consistent or jointly strictly consistent scoring functions are used. For instance, a scoring function strictly consistent for the $\alpha$-quantile (VaR) is:

$S(x, y) = (\mathbb{1}\{x \geq y\} - \alpha)\,[g(x) - g(y)]$

with $g$ strictly increasing.
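
A minimal sketch of this quantile score, taking $g(x) = x$ (which recovers the familiar tick/pinball loss), is given below; the forecast and return series named in the usage comment are placeholders.

```python
import numpy as np

def quantile_score(x, y, alpha, g=lambda v: v):
    """Score S(x, y) = (1{x >= y} - alpha) * (g(x) - g(y)),
    strictly consistent for the alpha-quantile when g is strictly increasing.
    With g(x) = x this is the standard tick (pinball) loss."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((x >= y).astype(float) - alpha) * (g(x) - g(y))

# Example (placeholder series): compare two 5%-VaR forecast paths against returns
# scores_1 = quantile_score(var_forecasts_model1, returns, alpha=0.05)
# scores_2 = quantile_score(var_forecasts_model2, returns, alpha=0.05)
# d = scores_1 - scores_2   # loss differential fed into the DM test
```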

The DM test is then applied to the corresponding loss differential series formed by these scores, providing a robust framework for comparative backtesting and model assessment in a manner directly aligned with the properties of the forecast object (Holzmann et al., 2014, Fissler et al., 2015, Bauer, 29 May 2025).

3. Methodological Implementation and Variations

The DM test is operationalized as follows:

  1. Construct the loss differential series $d_t$ using strictly consistent scoring functions relative to the target forecast property.
  2. Compute the sample average $\bar{d}$ and estimate the long-run variance of $\bar{d}$ (e.g., using Newey–West or spectral density methods to account for autocorrelation).
  3. Formulate the test statistic as:

$DM = \frac{\bar{d}}{\sqrt{\widehat{\sigma}_d^2 / T}}$

where $\widehat{\sigma}_d^2$ is the estimator of the long-run variance, and $T$ is the sample size (Leeuwenburg et al., 12 Jun 2024, Coroneo et al., 19 Sep 2024).

  4. Under $H_0$ (equal predictive accuracy), $DM$ is compared to the quantiles of $N(0,1)$ (or modified distributions under small-sample, autocorrelated, or spatially dependent settings).
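
Putting these steps together, the sketch below is one possible implementation, assuming a Bartlett-kernel Newey–West long-run variance with truncation lag $h-1$ for $h$-step-ahead forecasts and offering the Harvey–Leybourne–Newbold small-sample adjustment as an optional correction; it is illustrative rather than a reproduction of any cited package.

```python
import numpy as np
from scipy import stats

def dm_test(loss1, loss2, h=1, harvey_correction=False):
    """Diebold-Mariano test from two loss series (lower loss = better model).

    Uses a Bartlett-kernel (Newey-West) estimate of the long-run variance
    of the loss differential with truncation lag h - 1.
    Returns (statistic, two-sided p-value).
    """
    d = np.asarray(loss1, dtype=float) - np.asarray(loss2, dtype=float)
    T = d.size
    d_bar = d.mean()
    d_c = d - d_bar

    # Long-run variance: gamma_0 plus twice the Bartlett-weighted autocovariances
    lrv = d_c @ d_c / T
    for k in range(1, h):
        gamma_k = d_c[k:] @ d_c[:-k] / T
        lrv += 2.0 * (1.0 - k / h) * gamma_k

    stat = d_bar / np.sqrt(lrv / T)

    if harvey_correction:
        # Harvey, Leybourne & Newbold (1997) small-sample adjustment,
        # compared against a Student-t distribution with T - 1 degrees of freedom
        stat *= np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
        p_value = 2 * stats.t.sf(abs(stat), df=T - 1)
    else:
        p_value = 2 * stats.norm.sf(abs(stat))

    return stat, p_value
```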

Several practical versions of the DM test are implemented in recent statistical software, including the scores package (providing original, Harvey et al., and Hering & Genton corrections for improved robustness with autocorrelated data and finite samples) (Leeuwenburg et al., 12 Jun 2024).

In multivariate and risk management settings, extensions involve using vector or joint scoring functions (e.g., for VaR and ES together, or multi-objective setups for systemic risk forecasts), often equipped with lexicographic ordering and multidimensional test statistics (Fissler et al., 2021, Ziel et al., 2019, Fissler et al., 2015).

4. Applications Across Forecasting Domains

The DM test is prominent in a wide variety of empirical studies:

  • Financial risk management: Comparing VaR and ES forecasts using strictly or jointly consistent loss functions, with scores directly linked to risk measures such as expected shortfall (Holzmann et al., 2014, Bauer, 29 May 2025).
  • Time series econometrics: Evaluating volatility forecasts, e.g., foundation models versus classical ARFIMA/HAR/RGARCH benchmarks, with robust adjustment for autocorrelated residuals (Goel et al., 16 May 2025).
  • Multivariate probabilistic forecasting: DM test applied to loss differentials from energy scores, variogram scores, copula scores, and Dawid-Sebastiani scores to assess both marginal and dependency structure forecast quality (Ziel et al., 2019); a minimal energy-score sketch follows this list.
  • Machine learning and ensemble forecasting: Statistically comparing complex model classes (GMDH, GRNN, LSTM, GAM, etc.) or nowcasting strategies, with the DM test operationalizing outperformance in tabular ranking frameworks and selection-based combination strategies (Kumar et al., 2019, Hu et al., 2021, Duttilo et al., 17 Jul 2025, Suna et al., 2020).
  • High-frequency streaming analytics: Validating model superiority with respect to SMAPE or quadratic loss in heavy-tailed, real-time financial data (Khan et al., 2022).
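
As referenced in the multivariate bullet above, the sketch below is one way to estimate the energy score for a single multivariate observation from ensemble draws; the per-period scores of two competing models then form the loss differential series fed into the DM test exactly as in the univariate case.

```python
import numpy as np

def energy_score(samples, y):
    """Monte Carlo estimate of the energy score for one multivariate observation.

    samples : (m, d) array of draws from the forecast distribution
    y       : (d,) realized outcome
    ES ~= mean_i ||x_i - y|| - 0.5 * mean_{i,j} ||x_i - x_j||
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = 0.5 * np.linalg.norm(diffs, axis=2).mean()
    return term1 - term2
```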

5. Interpretability, Power, and Limitations

The DM test, while widely adopted, has several important interpretive features and caveats:

  • Interpretation of Statistical Significance: Significance of the DM statistic implies that one model is statistically superior in terms of the chosen loss function for the given sample and horizon. P-values reported in empirical works (e.g., for RMSE, MAPE, or energy score loss differentials) provide formal support for claims of forecast improvement.
  • Finite Sample and Power Issues: In extreme quantile applications (tail risk prediction for very small $p$), the DM test exhibits low power and is susceptible to skewness and type III errors if the loss differential series is highly asymmetric or based on short sample sizes (Bauer, 29 May 2025). Practitioners should use sufficiently long evaluation windows and be aware that standard normal or bootstrapped critical values may be biased under such conditions.
  • Handling Autocorrelation: When loss differentials are strongly autocorrelated (e.g., due to persistent errors, model overlap, or time-varying volatility), the standard normal asymptotics may fail, causing either spurious rejections or loss of power (Coroneo et al., 19 Sep 2024). Bandwidth choice in long-run variance estimation is critical, and practitioners should use autocorrelation-robust methods and, if necessary, adjust via fixed-smoothing asymptotics or bootstrap approaches.
  • Single-Pair Testing and Extensions: The DM test is inherently pairwise; multiple-model comparisons require further multiple-testing control or model confidence set methodology.

6. Practical Workflow and Empirical Illustration

A typical application proceeds as follows:

  1. Choice of Scoring Function: Select a strictly (or jointly) consistent scoring function aligned with the forecast target (e.g., squared error for mean, quantile loss for VaR, joint VaR-ES scoring for risk management).
  2. Loss Differential Calculation: Compute $d_t$ for each time step over the out-of-sample test set.
  3. Variance Estimation: Use appropriate methods (e.g., Newey–West, spectral density estimation) to calculate the long-run variance.
  4. Statistical Decision: Calculate the DM statistic and associated p-value. If significant, conclude that one model has statistically superior predictive accuracy, conditional on the loss function and sample.
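
A common way to operationalize steps 2–4 is to regress the loss differential on a constant and test whether the intercept is zero using HAC (Newey–West) standard errors; the sketch below uses statsmodels with synthetic losses purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T, h = 200, 1
# Illustrative (synthetic) losses for two models over the out-of-sample period
loss_model1 = rng.chisquare(df=1, size=T)
loss_model2 = rng.chisquare(df=1, size=T) * 1.1

d = loss_model1 - loss_model2   # loss differential d_t

# Regress d_t on a constant; the t-statistic on the intercept with
# HAC (Newey-West) standard errors is a DM-type statistic.
res = sm.OLS(d, np.ones_like(d)).fit(cov_type="HAC",
                                     cov_kwds={"maxlags": max(h - 1, 1)})
dm_stat, p_value = res.tvalues[0], res.pvalues[0]
print(f"DM statistic = {dm_stat:.3f}, p-value = {p_value:.3f}")
```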

In complex forecast evaluation frameworks, additional layers—such as aggregation over multiple validated models or dynamic selection strategies informed by DM p-values—are applied to produce robust point or distributional forecasts, as in contemporary nowcasting and ensemble forecasting studies (Corona et al., 2021, Duttilo et al., 17 Jul 2025, Suna et al., 2020). In high-dimensional contexts (e.g., gridded meteorological or climate forecasts), software such as scores exploits Dask-backed xarray arrays for distributed computation of DM statistics across spatial and temporal domains (Leeuwenburg et al., 12 Jun 2024).

7. Conceptual Extensions and Future Directions

Recent research extends the DM test to:

  • Joint and Multi-Objective Forecast Assessment: Lexicographically ordered, vector-valued loss differentials for evaluating systemic risk measure forecasts (e.g., CoVaR, CoES), emphasizing multi-objective elicitability and regulatory applications (Fissler et al., 2021).
  • Robustness under Strong Dependence: Analytical and simulation-based investigations into the diminishing power and risk of spurious results under strong autocorrelation or time-varying volatility (Coroneo et al., 19 Sep 2024, Bauer, 29 May 2025).
  • Software and Reproducibility: Increased availability of rigorously reviewed and scientifically tested implementations supporting complex data structures (NetCDF, xarray) and workflows for operational environments (Leeuwenburg et al., 12 Jun 2024).

A plausible implication is that, while the DM test remains a cornerstone tool for comparative forecast evaluation, practitioners must carefully attend to the statistical properties of their loss differential series—especially in domains with heavy tails, strong persistence, or very low event probabilities—and adopt robust estimation practices, extended sample windows, and appropriate adjustments to safeguard inferential validity.


Table: Key Elements of the Diebold-Mariano Test

| Component | Description | Reference Example |
| --- | --- | --- |
| Loss Differential | $d_t = L(e_{1,t}) - L(e_{2,t})$ | (Holzmann et al., 2014, Leeuwenburg et al., 12 Jun 2024) |
| Test Statistic | $DM = \bar{d} / \sqrt{\widehat{\operatorname{Var}}(\bar{d})}$ | (Leeuwenburg et al., 12 Jun 2024, Suna et al., 2020) |
| Scoring Function | Strictly or jointly consistent, aligned to the target (mean, quantile, ES, etc.) | (Fissler et al., 2015, Bauer, 29 May 2025) |
| Null Hypothesis | $E[d_t] = 0$ (equal predictive accuracy of models) | (Coroneo et al., 19 Sep 2024, Ziel et al., 2019) |

This structure ensures that the DM test provides a statistically sound and functionally relevant method for forecast comparison in both univariate and multivariate, as well as classical and modern, forecasting applications.