
Fev-Bench: Time Series Forecasting Benchmark

Updated 1 October 2025
  • Fev-Bench is a comprehensive benchmark for time series forecasting that integrates 100 tasks from 96 datasets across diverse domains.
  • It emphasizes support for covariate-rich scenarios, categorizing covariates as static, past dynamic, and known future dynamic signals.
  • The framework employs statistically rigorous aggregation metrics and bootstrapped confidence intervals to ensure reproducible model comparisons.

fev-bench is a comprehensive benchmark for time series forecasting that emphasizes empirical rigor, domain diversity, support for covariate-rich scenarios, and reproducible model comparison. It comprises 100 forecasting tasks derived from 96 unique datasets and is accompanied by the fev Python library, which provides infrastructure for standardized and extensible evaluation. fev-bench leverages statistically principled aggregation procedures, including bootstrapped confidence intervals, to analyze model performance along win-rate and skill-score dimensions and to facilitate statistically sound model selection. The benchmark has been employed to evaluate state-of-the-art pretrained, statistical, and baseline models, highlighting both areas of current strength and promising directions for future research in time series forecasting (Shchur et al., 30 Sep 2025).

1. Benchmark Construction and Domain Coverage

fev-bench consists of 100 distinct forecasting tasks across seven application domains, drawn from a variety of established sources:

  • GIFT-Eval and Monash repository datasets
  • Macroeconomic time series
  • Energy generation/consumption
  • Observability data (BOOMLET benchmark)
  • Forecasting competition datasets (Kaggle tasks: store sales, restaurant reservations, Rossmann, Walmart, etc.)
  • Healthcare, air quality, and COVID-19 tracking data

This domain diversity ensures the benchmark covers practical forecasting challenges encountered in sectors such as retail, macroeconomics, energy, healthcare, and environmental monitoring.

2. Covariate Integration: Static and Dynamic

A distinguishing feature is its explicit support for covariates:

  • Across the 100 tasks, 46 incorporate covariates, increasing the realism and complexity of the benchmark.
  • Covariates are categorized as:
    • Static (e.g., item/location identifiers)
    • Past-only dynamic (e.g., lagged variables, historic measurements)
    • Known future dynamic (e.g., holiday indicators, planned events)
    • This design addresses a critical gap in previous benchmarks, which often omit exogenous signals that are vital in real-world forecasting, particularly for tasks like retail demand prediction where promotional calendars or pricing are known in advance.

A plausible implication is that models able to leverage covariates, such as TabPFN-TS with its support for dynamic covariates, gain a marked advantage on covariate-rich tasks, an area where most recent pretrained models still require further development.
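
To make the three covariate roles concrete, the sketch below shows one way a covariate-aware task could describe its columns. The class and field names, and the example retail columns, are illustrative assumptions rather than fev-bench's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CovariateSpec:
    """Illustrative grouping of covariate columns by how a forecaster may use them.

    Field names and example columns are hypothetical, not fev-bench's schema.
    """
    static: list[str] = field(default_factory=list)          # fixed per series, e.g. item/store identifiers
    past_dynamic: list[str] = field(default_factory=list)    # observed only up to the forecast origin
    future_dynamic: list[str] = field(default_factory=list)  # known over the entire forecast horizon


# Example: a retail-demand style task where promotions and holidays are planned in advance.
retail_covariates = CovariateSpec(
    static=["item_id", "store_id"],
    past_dynamic=["units_returned", "stock_level"],
    future_dynamic=["promotion_flag", "holiday_indicator", "planned_price"],
)
```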

3. Principled Aggregation and Statistical Evaluation

fev-bench introduces robust aggregation measures and confidence assessment, which underpin model comparison:

  • Average Win Rate (Wⱼ): Fraction of task-level pairwise comparisons in which model j's error is lower than that of a randomly chosen peer, with ties counted as half a win:

W_j = \frac{1}{R(M-1)} \sum_{r=1}^{R} \sum_{k \neq j} \left[ \mathbf{1}\{E_{rj} < E_{rk}\} + \tfrac{1}{2}\,\mathbf{1}\{E_{rj} = E_{rk}\} \right]

where E_{rj} denotes model j's error on task r, R is the number of tasks, and M is the number of models.

  • Skill Score (Sⱼ): Relative improvement of each model's error versus a fixed baseline β (typically Seasonal Naive), aggregated across tasks via the geometric mean:

S_j = 1 - \left(\prod_{r=1}^{R} \operatorname{clip}\left(E_{rj}/E_{r\beta};\ \ell, u\right)\right)^{1/R}

where \operatorname{clip}(x; \ell, u) bounds extreme error ratios (\ell = 10^{-2}, u = 100).

  • Bootstrapped Confidence Intervals: Pairwise comparisons use B = 1000 bootstrap samples to estimate 95% confidence bounds for win rates and skill scores, enabling assessment of statistical significance between models.

This methodology clarifies whether observed improvements reflect genuine model advances or mere sample variation.
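
The following sketch implements the aggregation logic described above under stated assumptions: task-level errors are arranged as an (R tasks × M models) array, ties contribute half a win, skill scores clip error ratios against a baseline column, and the bootstrap resamples tasks. Function and variable names are illustrative and are not the fev library's API.

```python
import numpy as np


def average_win_rate(errors: np.ndarray) -> np.ndarray:
    """Fraction of (task, opponent) comparisons each model wins; ties count as half a win.

    errors: array of shape (R tasks, M models) with positive error values.
    """
    R, M = errors.shape
    wins = np.zeros(M)
    for j in range(M):
        others = np.delete(errors, j, axis=1)   # opponents' errors, shape (R, M-1)
        own = errors[:, [j]]                    # model j's errors, shape (R, 1)
        wins[j] = (np.sum(own < others) + 0.5 * np.sum(own == others)) / (R * (M - 1))
    return wins


def skill_score(errors: np.ndarray, baseline: np.ndarray,
                lo: float = 1e-2, hi: float = 1e2) -> np.ndarray:
    """1 minus the geometric mean of clipped error ratios versus a baseline model."""
    ratios = np.clip(errors / baseline[:, None], lo, hi)     # shape (R, M), assumes positive errors
    return 1.0 - np.exp(np.mean(np.log(ratios), axis=0))


def bootstrap_ci(errors: np.ndarray, baseline: np.ndarray,
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> np.ndarray:
    """Percentile confidence bounds for skill scores obtained by resampling tasks."""
    rng = np.random.default_rng(seed)
    R = errors.shape[0]
    samples = np.stack([
        skill_score(errors[idx], baseline[idx])
        for idx in (rng.integers(0, R, size=R) for _ in range(n_boot))
    ])
    return np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
```

As an illustration of use, with E an (R × M) error array whose first column holds the Seasonal Naive baseline, average_win_rate(E), skill_score(E, E[:, 0]), and bootstrap_ci(E, E[:, 0]) produce the two aggregate views and their uncertainty bounds.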

4. Model Evaluation and Empirical Results

fev-bench evaluates a suite of models:

  • Pretrained models: TiRex, TimesFM-2.5, Chronos-Bolt, Toto-1.0, Moirai-2.0, TabPFN-TS, Sundial
  • Statistical models: AutoETS, AutoARIMA, AutoTheta, SCUM ensemble
  • Baselines: Seasonal Naive, Naive, Drift

Key empirical findings:

| Model Category | Top Model(s) | Notable Performance |
|---|---|---|
| Pretrained | TiRex, TimesFM-2.5 | Lead overall in point (MASE) and probabilistic (SQL) accuracy |
| Covariate-rich | TabPFN-TS | Best on tasks with dynamic covariates |
| Multivariate | Toto-1.0 | Superior on tasks with multivariate targets |
  • TiRex and TimesFM-2.5 lead overall; the difference between them is not always statistically significant, but both outperform the remaining models with high confidence.
  • TabPFN-TS excels specifically in dynamic covariate environments, and Toto-1.0 outperforms in multivariate settings due to native modeling capabilities.
  • These results reveal both areas of progress and underexploited opportunities, such as covariate integration and multivariate modeling.

5. Metrics and Mathematical Formulas

The benchmark uses precise statistical measures:

  • Mean Absolute Scaled Error (MASE):

\text{MASE} = \frac{1}{NDH} \sum_{n=1}^{N} \sum_{d=1}^{D} \left[ \frac{1}{a_{n,d}} \sum_{t=T+1}^{T+H} \left| y_{n,d,t} - \hat{z}_{n,d,t} \right| \right]

where a_{n,d} is the average seasonal error:

a_{n,d} = \frac{1}{T-m} \sum_{t=m+1}^{T} \left| y_{n,d,t} - y_{n,d,t-m} \right|

  • Scaled Quantile Loss (SQL): Uses the quantile loss function \rho_q(y, \hat{z}^{(q)}) to evaluate probabilistic accuracy, distinguishing between over- and under-estimates.

These formalizations ensure both point and probabilistic accuracy are measured with mathematical rigor.
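
As a concrete illustration of the two metrics, the sketch below computes MASE for a single univariate series and the quantile (pinball) loss that underlies SQL. The exact aggregation across series and dimensions follows the formulas above; the helper names and synthetic data here are assumptions for illustration only.

```python
import numpy as np


def mase_single_series(y_hist: np.ndarray, y_true: np.ndarray,
                       y_pred: np.ndarray, m: int) -> float:
    """MASE for one series: mean absolute forecast error scaled by the
    in-sample seasonal naive error a = mean(|y_t - y_{t-m}|)."""
    a = np.mean(np.abs(y_hist[m:] - y_hist[:-m]))
    return float(np.mean(np.abs(y_true - y_pred)) / a)


def quantile_loss(y_true: np.ndarray, y_pred_q: np.ndarray, q: float) -> float:
    """Pinball loss rho_q: penalizes under-prediction by q and over-prediction by (1 - q)."""
    diff = y_true - y_pred_q
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


# Illustrative usage on synthetic data with weekly seasonality (m = 7).
rng = np.random.default_rng(0)
history = 10 + np.sin(np.arange(60) * 2 * np.pi / 7) + 0.1 * rng.standard_normal(60)
actuals = 10 + np.sin(np.arange(60, 67) * 2 * np.pi / 7)
point_forecast = np.full(7, history[-7:].mean())
print(mase_single_series(history, actuals, point_forecast, m=7))
print(quantile_loss(actuals, point_forecast, q=0.9))
```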

6. fev Python Library: Infrastructure for Reproducing Results

The fev library underpins the benchmark’s reproducibility and extensibility:

  • Tasks are specified via YAML, supporting precise configuration of forecast horizon, target/covariate assignment, and metric selection.
  • The software supports rolling window evaluation, modular design via EvaluationWindow and Task objects, and relies only on minimal dependencies (Hugging Face datasets, pydantic).
  • Direct compatibility with packages such as GluonTS, darts, AutoGluon, Nixtla, and sktime permits seamless integration into existing forecasting pipelines.
  • Separation of evaluation from model implementations allows for continual benchmarking, accommodating evolving model releases and custom task additions.

A plausible implication is the establishment of a standardized, extensible computational platform for future forecasting research.
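
The listing below sketches how rolling-window evaluation of the kind described above can be organized, with one object per window exposing the context available to the model and the ground truth used for scoring. It illustrates the concept only; it is not the fev library's actual EvaluationWindow or Task API, and the names are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Window:
    """One rolling evaluation window: context ends at the cutoff, future covers the horizon."""
    context: np.ndarray
    future: np.ndarray


def rolling_windows(series: np.ndarray, horizon: int, num_windows: int, stride: int):
    """Build `num_windows` evaluation windows walking backwards from the end of the series."""
    windows = []
    for k in range(num_windows):
        cutoff = len(series) - horizon - k * stride
        if cutoff <= 0:
            break
        windows.append(Window(context=series[:cutoff],
                              future=series[cutoff:cutoff + horizon]))
    return list(reversed(windows))  # oldest window first


# Illustrative usage: score a seasonal-naive forecaster over 3 windows with horizon 7.
series = np.sin(np.arange(200) * 2 * np.pi / 7) + np.arange(200) * 0.01
errors = []
for w in rolling_windows(series, horizon=7, num_windows=3, stride=7):
    forecast = w.context[-7:]                     # seasonal naive forecast with period 7
    errors.append(np.mean(np.abs(w.future - forecast)))
print(np.mean(errors))
```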

7. Implications and Areas for Future Development

The fev-bench paper reveals several focal points for subsequent work:

  • Covariate Utilization: Current pretrained models rarely exploit covariate signals despite their impact on forecasting performance—expanded support could yield further gains.
  • Multivariate Forecasting: The comparative success of Toto-1.0 in tasks with multivariate targets suggests a need for broader, native multivariate model integration.
  • Aggregation Methods: Continued refinement and sensitivity analysis of aggregation metrics and confidence estimation could improve benchmarking reliability.
  • Task Diversity Expansion: Integrating additional, varied time series tasks will ensure benchmarking continues to match real-world diversity and technical demand.

This suggests a trajectory where benchmarks, infrastructure, and models co-evolve toward higher fidelity and higher utility forecasting evaluation.

Summary Table of fev-bench Features

| Benchmark Aspect | Detail | Significance |
|---|---|---|
| Task Diversity | 100 tasks, 7 domains, 46 with covariates | Reflects real-world forecasting needs |
| Aggregation Methods | Average win rate, skill score, bootstrapped confidence intervals | Enables statistical hypothesis testing |
| Model Support | Pretrained, statistical, baseline, covariate-aware, and multivariate models | Comprehensive performance analysis |
| Evaluation Metrics | MASE, SQL (quantile loss); YAML-defined tasks | Rigorous, reproducible comparison |
| Software Infrastructure | fev Python library (minimal dependencies, extensible, pipeline-compatible) | Facilitates reproducible workflows |

In summary, fev-bench establishes a state-of-the-art, realistic framework for evaluating time series forecasting models, robustly addressing diverse domain requirements, covariate support, and methodological integrity. It provides both an empirical foundation for model comparison and a developmental roadmap for forecasting research (Shchur et al., 30 Sep 2025).
