TempusBench: An Evaluation Framework for Time-Series Forecasting

Published 13 Apr 2026 in cs.LG | (2604.11529v1)

Abstract: Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a tensorboard-based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench.


Authors (13)

Summary

The paper introduces a systematic, leakage-free evaluation framework that addresses past data contamination and metric limitations in TSFM research.
It standardizes hyperparameter tuning using a rolling-window protocol, ensuring fair comparisons across statistical, ML, and deep learning models.
The framework features rich visualization tools for diagnostic analysis, enabling clearer insight into forecast errors across diverse time-series regimes.

TempusBench: A Rigorous Evaluation Framework for Time-Series Forecasting Models

Motivation and Background

The recent proliferation of Time-Series Foundation Models (TSFMs) parallels the expansion of large pretrained models in NLP and vision. However, without systematic, leakage-free, and statistically comprehensive benchmarks, meaningful progress and fair comparison in TSFM development are obstructed. Evaluation prior to TempusBench suffered from four principal defects: (1) over-reliance on outdated and/or contaminated datasets, (2) benchmark taxonomies restricted to a few axes such as forecast horizon and domain, (3) the absence of systematic hyperparameter optimization across models, and (4) the lack of visualization tools for interpretability and diagnostic analysis.

These defects enable spurious or misleading performance claims in the literature and hinder cumulative progress. Critically, contamination between pretraining and test data (as observed in nearly all prior TSFM benchmarks, with GIFT-Eval being a notable exception due to Moirai2) undermines any claim of zero-shot generalization. Moreover, prior benchmarks emphasized only a few axes of variation, such as forecast horizon, domain, or variate type, while disregarding statistical properties crucial to real-world applications, such as non-stationarity, seasonality, volatility, sparsity, or measurement error. Lastly, without a controlled, consistent hyperparameter search, comparisons with classical statistical or machine learning approaches are unfair.

Framework Design and Innovations

TempusBench directly addresses these limitations by rethinking the entire lifecycle of TSFM evaluation—from data curation to metric aggregation and visualization.

Leakage-Free Data Curation: New datasets are selected or synthesized so that they are not included in existing TSFM pretraining corpora. Transparency of statistical properties and metadata is enforced. Notably, TempusBench’s data avoids overlap with large pretraining archives (e.g., LOTSA, Chronos, Monash, and LTSF).

Expanded Benchmark Taxonomy: TempusBench defines benchmarks along both classical and underrepresented statistical axes: stationarity/non-stationarity, seasonality types (cyclical, regressive, additive, multiplicative), sparsity, data quality (noise and measurement error), data size, and various target types (continuous, count, binary, categorical), as well as the traditional axes (horizon, frequency, variate type, domain). This allows much finer-grained characterization of model failure modes and strengths.
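To make one of these axes concrete, the following sketch contrasts additive and multiplicative seasonality on a synthetic trended series. It is an illustration only: the function name `seasonal_series`, its parameters, and the sine-shaped cycle are assumptions made here and are not taken from the TempusBench code.

```python
import math

def seasonal_series(n, period, slope, amplitude, mode):
    """Generate a linearly trended series with a sine-shaped seasonal cycle.

    Hypothetical helper for illustration; not part of TempusBench.
    """
    values = []
    for t in range(n):
        trend = 10.0 + slope * t                    # arbitrary starting level of 10
        cycle = math.sin(2 * math.pi * t / period)  # seasonal pattern in [-1, 1]
        if mode == "additive":
            # Seasonal swing has a fixed size regardless of the level.
            values.append(trend + amplitude * cycle)
        elif mode == "multiplicative":
            # Seasonal swing scales with the current level of the series.
            values.append(trend * (1.0 + amplitude * cycle))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return values

additive = seasonal_series(48, 12, 1.0, 0.5, "additive")
multiplicative = seasonal_series(48, 12, 1.0, 0.5, "multiplicative")
```

At the seasonal peaks (t = 3 and t = 39), the additive series sits 0.5 above its trend both times, whereas the multiplicative series sits 6.5 and then 24.5 above it, which is the kind of regime difference a taxonomy along this axis is meant to separate.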

Standardized Hyperparameter Search: All models (statistical, ML, and DL) are evaluated using a consistent rolling-window protocol with a full hyperparameter grid search, with selection based strictly on validation-set error. At each window, the best hyperparameters are selected and applied to subsequent test windows, so evaluation proceeds strictly forward in time and avoids look-ahead bias.
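A minimal sketch of such a rolling-window protocol is shown below. It assumes a toy model (simple exponential smoothing with a single hyperparameter `alpha`), MAE as the validation error, and fixed train/validation/test segment lengths; the function names and window layout are hypothetical and may differ from what TempusBench actually implements.

```python
from statistics import fmean

def ses_forecast(history, alpha, horizon):
    """Simple exponential smoothing: flat forecast at the last smoothed level."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

def mae(actual, predicted):
    return fmean(abs(a - p) for a, p in zip(actual, predicted))

def rolling_window_eval(series, grid, train_len, val_len, test_len):
    """Select alpha on each window's validation segment, then score on its test segment.

    Illustrative protocol only; segment layout and step size are assumptions.
    """
    results = []
    start = 0
    while start + train_len + val_len + test_len <= len(series):
        train_end = start + train_len
        val_end = train_end + val_len
        train = series[start:train_end]
        val = series[train_end:val_end]
        test = series[val_end:val_end + test_len]
        # Grid search: only data before the test segment is used for selection.
        best_alpha = min(
            grid, key=lambda a: mae(val, ses_forecast(train, a, val_len))
        )
        # Refit on train + validation, then forecast the unseen test segment.
        forecast = ses_forecast(series[start:val_end], best_alpha, test_len)
        results.append((best_alpha, mae(test, forecast)))
        start += test_len  # roll the window forward in time
    return results

series = [float(i) for i in range(40)]
results = rolling_window_eval(series, [0.1, 0.5, 0.9], 12, 4, 4)
```

Because the test segment always lies after the data used for selection, no information from the future leaks into the hyperparameter choice. Applying the same grid and selection rule to every model is what makes the comparison consistent.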

Rich Visualization and Interpretability: A TensorBoard-based interface lets practitioners and researchers inspect forecast errors visually and compare models both quantitatively (aggregate metrics) and qualitatively (forecast shapes), showing the statistical regimes in which models succeed or fail.

Metrics and Aggregation: Multiple point and probabilistic accuracy metrics are computed, namely MAPE, MAE, RMSE, MASE, CRPS, Weighted Interval Score (WIS), and Quantile Score. Per-benchmark results are aggregated into win rates and baseline-relative skill scores, with missing or failed model outputs handled gracefully.
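For reference, the point metrics named above can be sketched using their standard textbook definitions. This is not TempusBench's code: the function signatures are assumptions, MAPE is returned as a fraction rather than a percentage, and MASE is scaled by the in-sample naive (or seasonal naive) error.

```python
import math
from statistics import fmean

def mae(actual, predicted):
    """Mean absolute error."""
    return fmean(abs(a - p) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(fmean((a - p) ** 2 for a, p in zip(actual, predicted)))

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; undefined if any actual is 0."""
    return fmean(abs((a - p) / a) for a, p in zip(actual, predicted))

def mase(actual, predicted, train, season=1):
    """MAE scaled by the in-sample MAE of a (seasonal) naive forecast."""
    scale = fmean(
        abs(train[i] - train[i - season]) for i in range(season, len(train))
    )
    return mae(actual, predicted) / scale

train = [1.0, 2.0, 3.0, 4.0, 5.0]
actual = [6.0, 7.0]
predicted = [5.0, 9.0]
scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "MASE": mase(actual, predicted, train),
}
```

MAPE and MASE are scale-free, so they can be averaged across series of different magnitudes, whereas MAE and RMSE remain in the units of the data.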

Empirical Results and Findings

Deterministic Point Forecasting

TimesFM achieves the highest aggregate win rates across MAE, RMSE, and MASE for multivariate tasks (e.g., MAE win rate 0.9057, RMSE win rate 0.8742). However, for MAPE, which is scale-free, LAFN shows the highest win rate (0.7931) despite its much smaller parameter count. Classical methods such as Croston’s method and ARIMA remain competitive in intermittent-demand or other suitably matched regimes, though they are generally dominated by foundation models on non-seasonal or long-horizon datasets.
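The summary does not spell out how win rates are computed, so the sketch below assumes one common convention: a pairwise win rate, where a model scores a win whenever its error on a task is lower than another model's, and comparisons involving missing or failed results (`None` or NaN) are skipped. The function and variable names are hypothetical.

```python
import math

def is_valid(x):
    """A result counts only if it exists and is not NaN."""
    return x is not None and not math.isnan(x)

def pairwise_win_rate(errors, model):
    """Fraction of valid head-to-head comparisons that `model` wins (ties count 0.5).

    `errors` maps task name -> {model name -> error or None}.
    Assumed convention for illustration; not necessarily TempusBench's definition.
    """
    wins = 0.0
    comparisons = 0
    for task_errors in errors.values():
        mine = task_errors.get(model)
        if not is_valid(mine):
            continue  # the model failed on this task: skip it entirely
        for other, theirs in task_errors.items():
            if other == model or not is_valid(theirs):
                continue  # never compare against a missing result
            comparisons += 1
            if mine < theirs:
                wins += 1.0
            elif mine == theirs:
                wins += 0.5
    return wins / comparisons if comparisons else float("nan")

errors = {
    "t1": {"A": 1.0, "B": 2.0, "C": 3.0},
    "t2": {"A": 2.0, "B": 1.0, "C": None},
    "t3": {"A": float("nan"), "B": 1.0, "C": 2.0},
}
rates = {m: pairwise_win_rate(errors, m) for m in ("A", "B", "C")}
```

Under this convention, a win rate of 1.0 means the model beat every competitor on every task where both produced a valid result, and a rate near zero means it almost never did.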

Probabilistic Forecasting

On probabilistic metrics (CRPS, WIS, and quantile score), the Toto model obtains perfect win rates (1.0), indicating particular strength in distributional accuracy, likely due to its explicit modeling of heavy tails and patch-level normalization. Moirai, LAFN, and Chronos also perform well, but models such as Lag-Llama and especially Moirai-MoE underperform (win rates approaching zero).
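As background on what these probabilistic metrics measure, the sketch below implements the standard pinball (quantile) loss and averages it over several quantile levels. It illustrates the general technique only; the helper names are assumptions, and the exact quantile levels and averaging used by TempusBench are not stated in this summary.

```python
from statistics import fmean

def pinball_loss(actual, forecast, q):
    """Pinball loss for one quantile forecast at level q (0 < q < 1)."""
    diff = actual - forecast
    # Under-forecasts are penalized by q, over-forecasts by (1 - q).
    return max(q * diff, (q - 1.0) * diff)

def mean_quantile_score(actuals, quantile_forecasts):
    """Average pinball loss over all quantile levels and time steps.

    `quantile_forecasts` maps a quantile level to a list of forecasts
    aligned with `actuals`. Hypothetical helper for illustration.
    """
    losses = [
        pinball_loss(y, f, q)
        for q, forecasts in quantile_forecasts.items()
        for y, f in zip(actuals, forecasts)
    ]
    return fmean(losses)

score = mean_quantile_score(
    [10.0, 10.0],
    {0.1: [8.0, 12.0], 0.9: [8.0, 12.0]},
)
```

Averaging the pinball loss over a dense grid of quantile levels is also a common way to approximate CRPS, which is why models that rank well on one of these metrics often rank well on the others.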

Skill Score Analysis

When compared against a strong baseline (Seasonal Naive), only a subset of models achieves positive skill (i.e., outperforms the baseline). Notably, VARMAX (0.3264), TimesFM (0.2237), and Croston Classic (0.1145) achieve positive skill scores on MAPE; TimesFM, Toto, Tiny Time Mixer, and Chronos are positive across point metrics. Foundation models not only dominate in raw accuracy but also outperform naive seasonal extrapolation in a leak-free setting, an achievement not observed in prior contaminated benchmarks.
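A baseline-relative skill score is commonly defined as one minus the ratio of the model's error to the baseline's error, so that positive values mean the model beats the baseline. The sketch below assumes that definition, which this summary does not confirm, and pairs it with a seasonal naive forecaster; both function names are hypothetical.

```python
def seasonal_naive(history, season, horizon):
    """Repeat the last observed seasonal cycle over the forecast horizon."""
    last_cycle_start = len(history) - season
    return [history[last_cycle_start + (h % season)] for h in range(horizon)]

def skill_score(model_error, baseline_error):
    """1 - model/baseline: positive beats the baseline, negative is worse.

    Assumed definition for illustration; baseline_error must be non-zero.
    """
    return 1.0 - model_error / baseline_error

baseline_forecast = seasonal_naive([1, 2, 3, 4, 1, 2, 3, 4], season=4, horizon=6)
better = skill_score(0.8, 1.0)   # model error is 20% lower than the baseline
worse = skill_score(1.5, 1.0)    # model error is 50% higher than the baseline
```

Under this reading, a MAPE skill score of 0.2237 would correspond to an error roughly 22% lower than that of the seasonal naive baseline.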

Parameter Efficiency and Model Scale

LAFN, with only 0.4M parameters, achieves task-level state-of-the-art accuracy on several benchmarks, outperforming foundation models with roughly 50 to 500 times more parameters (e.g., TimesFM at 200M, Chronos at 20M, Moirai at 91M). This result suggests that efficient architectural design and model selection may, in some statistical regimes, outweigh scaling alone, which matters for real-world applications constrained by compute or latency.
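The size gap can be checked directly from the parameter counts quoted in this paragraph. The snippet below is plain arithmetic on those figures; the variable names are chosen here for illustration.

```python
# Parameter counts as quoted in the text above.
LAFN_PARAMS = 400_000
other_models = {
    "TimesFM": 200_000_000,
    "Chronos": 20_000_000,
    "Moirai": 91_000_000,
}

# How many times larger each model is than LAFN.
ratios = {name: count / LAFN_PARAMS for name, count in other_models.items()}
# ratios == {"TimesFM": 500.0, "Chronos": 50.0, "Moirai": 227.5}
```

On these figures, the listed foundation models are between 50 and 500 times larger than LAFN.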

Implications and Open Problems

TempusBench sets a stricter standard for the evaluation of TSFMs and exposes several important findings. First, consistent, leakage-free benchmarks reveal that much prior literature systematically overestimated model advances; simple baselines (seasonal naive, Croston) are competitive in a significant portion of the statistical landscape. Second, the scaling of model size must be justified by statistically meaningful generalization: smaller, efficient architectures (LAFN) can achieve comparable or even superior performance for certain tasks. Third, point and distributional forecast accuracy can diverge: models like Toto, which excel in probabilistic metrics, may lag in MAPE or MAE, suggesting the need for specialized architectures or training regimes tailored to end-task requirements.

TempusBench’s modular pipeline and open-source release provide community infrastructure analogous to ImageNet for vision pretraining. Its rigorous control of contamination, statistical diversity, and protocol standardization are prerequisites for robust science in TSFM development.

The framework anticipates important future work:

Extending benchmark coverage via dynamic datasets (synthetic regeneration or live-updating real-world streams) to preempt future pretraining contamination.
Expanding to conditional forecasting settings (covariate-conditioned, scenario-based), multi-modal problems, and hierarchical series.
Formalizing challenge sets that reflect adversarial or rare statistical subspaces.

Conclusion

TempusBench fundamentally reshapes TSFM evaluation by resolving long-standing sources of experimental bias and incompleteness. Its comprehensive curation, standardized evaluation, and extensible design provide the platform upon which future advances—both methodological and theoretical—must be measured. The insights enabled by TempusBench will accelerate progress toward truly universal time series forecasters, foster reproducible research, and clarify the relationship between model complexity and statistical generalization. This framework is poised to become the community reference for benchmarking TSFMs and for understanding the efficacy and limits of both foundation and classic forecasting models (2604.11529).
