- The paper introduces a systematic, leakage-free evaluation framework that addresses past data contamination and metric limitations in TSFM research.
- It standardizes hyperparameter tuning using a rolling-window protocol, ensuring fair comparisons across statistical, ML, and deep learning models.
- The framework features rich visualization tools for diagnostic analysis, enabling clearer insight into forecast errors across diverse time-series regimes.
TempusBench: A Rigorous Evaluation Framework for Time-Series Forecasting Models
Motivation and Background
The recent proliferation of Time-Series Foundation Models (TSFMs) draws a direct parallel to the expansion of large pretrained models in NLP and vision. However, without systematic, leakage-free, and statistically comprehensive benchmarks, meaningful progress and fair comparison in TSFM development are obstructed. The state of evaluation prior to TempusBench suffered from four principal defects: (1) over-reliance on outdated and/or contaminated datasets, (2) narrow and axis-driven benchmark taxonomies, (3) the absence of systematic hyperparameter optimization across models, and (4) the lack of robust visualization tools for interpretability and diagnostic analysis.
This scenario enables spurious or misleading performance claims in the literature and stymies cumulative progress. Critically, contamination between pretraining and test data (as observed in nearly all prior TSFM benchmarks, with GIFT-Eval being a notable exception due to Moirai2) undermines any claim of zero-shot generalization skill. Moreover, prior benchmarks emphasized only superficial axes of variation, such as forecast horizon, domain, or variate type, while disregarding statistical properties crucial to real-world applications, such as non-stationarity, seasonality, volatility, sparsity, or measurement error. Lastly, the absence of controlled, consistent hyperparameter search leads to fundamentally unfair comparisons with classic statistical or machine learning approaches.
Framework Design and Innovations
TempusBench directly addresses these limitations by rethinking the entire lifecycle of TSFM evaluation, from data curation to metric aggregation and visualization.
Leakage-Free Data Curation: New datasets are carefully selected or synthesized to guarantee non-inclusion in existing TSFM pretraining corpora. Statistical and metadata transparency are strictly enforced. Notably, TempusBench's data avoids overlap with large pretraining archives (e.g., LOTSA, Chronos, Monash, and LTSF).
Expanded Benchmark Taxonomy: TempusBench defines benchmarks along both classical and underrepresented statistical axes: stationarity/non-stationarity, seasonality types (cyclical, regressive, additive, multiplicative), sparsity, data quality (noise and measurement error), data size, and various target types (continuous, count, binary, categorical), as well as the traditional axes (horizon, frequency, variate type, domain). This allows much finer-grained characterization of model failure modes and strengths.
Standardized Hyperparameter Search: All models (statistical, ML, and DL) are evaluated using a consistent rolling-window protocol with a full hyperparameter grid search and selection based strictly on validation-set error. At each window, optimal hyperparameters are selected and applied to subsequent test windows, thus serializing evaluation and avoiding look-ahead bias.
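The rolling-window protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TempusBench's implementation: the function names (`rolling_window_eval`, `moving_average_forecast`), the toy moving-average model, the single hyperparameter, and the window lengths are all hypothetical.

```python
from statistics import mean


def moving_average_forecast(history, horizon, window):
    """Toy model (hypothetical): forecast the mean of the last `window` points."""
    level = mean(history[-window:])
    return [level] * horizon


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def rolling_window_eval(series, grid, train_len, val_len, test_len):
    """For each window, pick the hyperparameter with the lowest validation MAE,
    then score it on the test block that follows. Selection only ever sees data
    that precedes the test block, so there is no look-ahead."""
    results = []
    start = 0
    while start + train_len + val_len + test_len <= len(series):
        train_end = start + train_len
        val_end = train_end + val_len
        test_end = val_end + test_len
        train = series[start:train_end]
        val = series[train_end:val_end]
        test = series[val_end:test_end]

        # Grid search: forecast from the training block, score on validation only.
        best = min(
            grid,
            key=lambda w: mae(val, moving_average_forecast(train, val_len, w)),
        )
        # Reuse the selected value on train + validation to forecast the test block.
        forecast = moving_average_forecast(series[start:val_end], test_len, best)
        results.append(
            {"window_start": start, "best": best, "test_mae": mae(test, forecast)}
        )
        start += test_len  # roll forward by one test block
    return results


# Synthetic series with period 4 (illustrative data only).
series = [float(i % 4) for i in range(40)]
scores = rolling_window_eval(series, grid=[1, 2, 4], train_len=12, val_len=4, test_len=4)
```

The key design point is that the test block never influences which grid value is chosen; a real benchmark would swap the toy model for each candidate forecaster and the single integer for its full hyperparameter grid.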
Rich Visualization and Interpretability: A TensorBoard-based interface enables practitioners and researchers to visually interrogate forecast errors and compare models both quantitatively (aggregate metrics) and qualitatively (forecast shapes), providing direct feedback on statistical regimes where models succeed or fail.
Metrics and Aggregation: Multiple point and probabilistic accuracy metrics are computed: MAPE, MAE, RMSE, MASE, CRPS, Weighted Interval Score, Quantile Score. Benchmarks are synthesized into win rates and baseline-relative skill scores, always handling missing/failed model outputs gracefully.
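One way to aggregate per-task errors into a win rate while tolerating failed runs is sketched below. The pairwise definition used here (share of task-and-opponent comparisons won, ties counted as half) is an assumption for illustration; the source does not spell out TempusBench's exact formula, and the model and task names are placeholders.

```python
import math


def win_rate(errors, model):
    """Pairwise win rate of `model`: the share of (task, opponent) comparisons
    in which `model` has the lower error, with ties counted as half a win.
    Comparisons where either score is missing (None or NaN) are skipped."""

    def valid(x):
        return x is not None and not math.isnan(x)

    wins, total = 0.0, 0
    for task_scores in errors.values():
        mine = task_scores.get(model)
        if not valid(mine):
            continue  # this model failed on the task: no comparisons to count
        for other, theirs in task_scores.items():
            if other == model or not valid(theirs):
                continue
            total += 1
            if mine < theirs:
                wins += 1.0
            elif mine == theirs:
                wins += 0.5
    return wins / total if total else float("nan")


# Hypothetical MAE values per task; None and NaN mark failed runs.
errors = {
    "task_a": {"model_x": 1.0, "model_y": 2.0, "model_z": 3.0},
    "task_b": {"model_x": 0.5, "model_y": None, "model_z": 0.4},
    "task_c": {"model_x": 2.0, "model_y": 1.0, "model_z": float("nan")},
}
rates = {m: win_rate(errors, m) for m in ("model_x", "model_y", "model_z")}
```

Skipping invalid comparisons, rather than scoring them as losses, keeps a single crashed run from dominating a model's aggregate; whether that matches the benchmark's policy is not stated in the source.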
Empirical Results and Findings
Deterministic Point Forecasting
TimesFM achieves the highest aggregate win rates across MAE, RMSE, and MASE for multivariate tasks: e.g., MAE win rate 0.9057, RMSE win rate 0.8742. However, for MAPE (scale-free), LAFN, despite its much smaller parameter count, shows the highest win rate (0.7931). Classical methods such as Croston's and ARIMA remain competitive in intermittent-demand or other suitably matched regimes, though they are generally dominated by foundation models on non-seasonal or long-horizon datasets.
Probabilistic Forecasting
On probabilistic metrics (CRPS, WIS, Quantile), the Toto model obtains perfect win rates (1.0), indicating its special proficiency with distributional accuracy, likely due to its explicit modeling of heavy tails and patch-level normalization. Moirai, LAFN, and Chronos also show robust performance, but models such as Lag-Llama and especially Moirai-MoE underperform (win rates approaching zero).
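For readers unfamiliar with the probabilistic metrics named above, the following sketch shows the quantile (pinball) loss and the Weighted Interval Score in their standard textbook forms. It is an illustration, not the benchmark's code: the function names and the example forecast values are hypothetical.

```python
def pinball_loss(y, forecast, q):
    """Quantile (pinball) loss for a single quantile level q in (0, 1)."""
    diff = y - forecast
    return q * diff if diff >= 0 else (q - 1.0) * diff


def interval_score(y, lower, upper, alpha):
    """Score of a central (1 - alpha) prediction interval: its width plus a
    penalty, scaled by 2 / alpha, when y falls outside the interval."""
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score


def weighted_interval_score(y, median, intervals):
    """WIS over K central intervals given as {alpha: (lower, upper)}.
    The median term has weight 1/2 and each interval has weight alpha / 2."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)


# Hypothetical forecast: median 10, a 50% interval and a 90% interval.
observed = 12.0
intervals = {0.5: (8.0, 11.0), 0.1: (5.0, 15.0)}
wis = weighted_interval_score(observed, 10.0, intervals)

# The same total expressed as a sum of pinball losses over the five quantiles.
quantile_forecasts = {0.05: 5.0, 0.25: 8.0, 0.5: 10.0, 0.75: 11.0, 0.95: 15.0}
pinball_total = sum(pinball_loss(observed, f, q) for q, f in quantile_forecasts.items())
```

With these weights, WIS equals the summed pinball loss over the median and the interval endpoints divided by K + 0.5, which is why the two metrics tend to rank models similarly.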
Skill Score Analysis
When compared against a strong baseline (Seasonal Naive), only a subset of models achieve positive skill (i.e., outperform the baseline). Notably, Varmax (0.3264), TimesFM (0.2237), and Croston Classic (0.1145) achieve positive skill scores on MAPE; TimesFM, Toto, Tiny Time Mixer, and Chronos are positive across point metrics. Foundation models not only dominate in raw accuracy but, crucially, outperform naive seasonal extrapolation in a leak-free setting, an achievement not observed in prior contaminated benchmarks.
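A baseline-relative skill score is commonly defined as one minus the ratio of the model's error to the baseline's error, so that positive values mean the model beats the baseline. The sketch below assumes that form, which the source does not state explicitly, and uses made-up data; `seasonal_naive` and `skill_score` are illustrative names.

```python
from statistics import mean


def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season forward for `horizon` steps."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def skill_score(model_error, baseline_error):
    """Assumed form: 1 - model / baseline. Positive means better than baseline,
    zero means equal, negative means worse."""
    return 1.0 - model_error / baseline_error


# Illustrative data: two seasons of history, one season to forecast.
history = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]
actual = [14.0, 24.0, 34.0, 44.0]
model_forecast = [13.0, 25.0, 34.0, 43.0]  # hypothetical model output

baseline_error = mae(actual, seasonal_naive(history, horizon=4, season_length=4))
model_error = mae(actual, model_forecast)
skill = skill_score(model_error, baseline_error)
```

Under this convention a value such as 0.2237 would mean the model's error is about 22% lower than the seasonal-naive error on that metric.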
Parameter Efficiency and Model Scale
LAFN, with only 0.4M parameters, achieves task-level state-of-the-art accuracy on several benchmarks, outperforming foundation models with 100 to 500x more parameters (e.g., TimesFM at 200M, Chronos at 20M, Moirai at 91M). This strong showing for LAFN indicates that efficient architectural design and model selection may, in some statistical regimes, outweigh scaling alone, resulting in substantial gains for real-world applications constrained by compute or latency.
Implications and Open Problems
TempusBench establishes a new discipline for the evaluation of TSFMs and exposes several important findings. First, consistent, leakage-free benchmarks reveal that much prior literature systematically overestimated model advances; naive models (seasonal naive, Croston) are competitive in a significant portion of the statistical landscape. Second, the scaling of model size must be justified by statistically meaningful generalization: smaller, efficient architectures (LAFN) can achieve comparable or even superior performance for certain tasks. Third, point and distributional forecast accuracy can diverge: models like Toto, which excel in probabilistic metrics, may lag in MAPE or MAE, suggesting the need for specialized architectures or training regimes tailored to end-task requirements.
TempusBench's modular pipeline and open-source release provide community infrastructure analogous to ImageNet for vision pretraining. Its rigorous control of contamination, statistical diversity, and protocol standardization are prerequisites for robust science in TSFM development.
The framework anticipates important future work:
- Extending benchmark coverage via dynamic datasets (synthetic regeneration or live-updating real-world streams) to preempt future pretraining contamination.
- Expansion to conditional forecasting settings (covariate-conditioned, scenario-based), multi-modal problems, and hierarchical series.
- Formalization of challenge sets reflecting adversarial or rare statistical subspaces.
Conclusion
TempusBench fundamentally reshapes TSFM evaluation by resolving long-standing sources of experimental bias and incompleteness. Its comprehensive curation, standardized evaluation, and extensible design provide the platform upon which future advances, both methodological and theoretical, must be measured. The insights enabled by TempusBench will accelerate progress toward truly universal time series forecasters, foster reproducible research, and clarify the relationship between model complexity and statistical generalization. This framework is poised to become the community reference for benchmarking TSFMs and for understanding the efficacy and limits of both foundation and classic forecasting models (2604.11529).