
Fev-Bench: Time Series Forecasting Benchmark

Updated 1 October 2025
  • Fev-Bench is a comprehensive benchmark for time series forecasting that integrates 100 tasks from 96 datasets across diverse domains.
  • It emphasizes support for covariate-rich scenarios, categorizing covariates as static, past dynamic, and known future dynamic signals.
  • The framework employs statistically rigorous aggregation metrics and bootstrapped confidence intervals to ensure reproducible model comparisons.

fev-bench is a comprehensive benchmark for time series forecasting that emphasizes empirical rigor, domain diversity, support for covariate-rich scenarios, and reproducible model comparison. It comprises 100 forecasting tasks derived from 96 unique datasets and is accompanied by the fev Python library, which provides infrastructure for standardized and extensible evaluation. fev-bench leverages statistically principled aggregation procedures, including bootstrapped confidence intervals, to analyze model performance along win-rate and skill-score dimensions and to facilitate statistically sound model selection. The benchmark has been employed to evaluate state-of-the-art pretrained, statistical, and baseline models, highlighting both areas of current strength and promising directions for future research in time series forecasting (Shchur et al., 30 Sep 2025).

1. Benchmark Construction and Domain Coverage

fev-bench consists of 100 distinct forecasting tasks across seven application domains, drawn from a variety of established sources:

  • GIFT-Eval and Monash repository datasets
  • Macroeconomic time series
  • Energy generation/consumption
  • Observability data (BOOMLET benchmark)
  • Forecasting competition datasets (Kaggle tasks: store sales, restaurant reservations, Rossmann, Walmart, etc.)
  • Healthcare, air quality, and COVID-19 tracking data

This domain diversity ensures the benchmark covers practical forecasting challenges encountered in sectors such as retail, macroeconomics, energy, healthcare, and environmental monitoring.

2. Covariate Integration: Static and Dynamic

A distinguishing feature is its explicit support for covariates:

  • Across the 100 tasks, 46 incorporate covariates, increasing the realism and complexity of the benchmark.
  • Covariates are categorized as:
    • Static (e.g., item/location identifiers)
    • Past-only dynamic (e.g., lagged variables, historic measurements)
    • Known future dynamic (e.g., holiday indicators, planned events)
    • This design addresses a critical gap in previous benchmarks, which often omit exogenous signals that are vital in real-world forecasting, particularly for tasks like retail demand prediction where promotional calendars or pricing are known in advance.

A plausible implication is that models able to leverage covariates, such as TabPFN-TS with its support for dynamic covariates, gain a marked advantage on covariate-rich tasks, an area where most recent pretrained models still require further development.
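
To make the three covariate roles concrete, the sketch below shows one way a covariate-aware task could describe its columns. The class and field names, and the example retail columns, are illustrative assumptions rather than fev-bench's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CovariateSpec:
    """Illustrative grouping of covariate columns by how a forecaster may use them.

    Field names and example columns are hypothetical, not fev-bench's schema.
    """
    static: list[str] = field(default_factory=list)          # fixed per series, e.g. item/store identifiers
    past_dynamic: list[str] = field(default_factory=list)    # observed only up to the forecast origin
    future_dynamic: list[str] = field(default_factory=list)  # known over the entire forecast horizon


# Example: a retail-demand style task where promotions and holidays are planned in advance.
retail_covariates = CovariateSpec(
    static=["item_id", "store_id"],
    past_dynamic=["units_returned", "stock_level"],
    future_dynamic=["promotion_flag", "holiday_indicator", "planned_price"],
)
```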

3. Principled Aggregation and Statistical Evaluation

fev-bench introduces robust aggregation measures and confidence assessment, which underpin model comparison:

  • Average Win Rate (Wⱼ): Fraction of task-level pairwise comparisons in which model j's error is lower than that of a randomly chosen peer, with ties counted as half a win:

W_j = \frac{1}{R(M-1)} \sum_{r=1}^{R} \sum_{k \neq j} \left[ \mathbf{1}\{E_{rj} < E_{rk}\} + \tfrac{1}{2}\,\mathbf{1}\{E_{rj} = E_{rk}\} \right]

where E_{rj} denotes model j's error on task r, R is the number of tasks, and M is the number of models.

  • Skill Score (Sⱼ): Relative improvement of each model's error versus a fixed baseline β (typically Seasonal Naive), aggregated across tasks via the geometric mean:

S_j = 1 - \left(\prod_{r=1}^{R} \operatorname{clip}\left(E_{rj}/E_{r\beta};\ \ell, u\right)\right)^{1/R}

where \operatorname{clip}(x; \ell, u) bounds extreme error ratios (\ell = 10^{-2}, u = 100).

  • Bootstrapped Confidence Intervals: Pairwise comparisons use B = 1000 bootstrap samples to estimate 95% confidence bounds for win rates and skill scores, enabling assessment of statistical significance between models.

This methodology clarifies whether observed improvements reflect genuine model advances or mere sample variation.
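
The following sketch implements the aggregation logic described above under stated assumptions: task-level errors are arranged as an (R tasks × M models) array, ties contribute half a win, skill scores clip error ratios against a baseline column, and the bootstrap resamples tasks. Function and variable names are illustrative and are not the fev library's API.

```python
import numpy as np


def average_win_rate(errors: np.ndarray) -> np.ndarray:
    """Fraction of (task, opponent) comparisons each model wins; ties count as half a win.

    errors: array of shape (R tasks, M models) with positive error values.
    """
    R, M = errors.shape
    wins = np.zeros(M)
    for j in range(M):
        others = np.delete(errors, j, axis=1)   # opponents' errors, shape (R, M-1)
        own = errors[:, [j]]                    # model j's errors, shape (R, 1)
        wins[j] = (np.sum(own < others) + 0.5 * np.sum(own == others)) / (R * (M - 1))
    return wins


def skill_score(errors: np.ndarray, baseline: np.ndarray,
                lo: float = 1e-2, hi: float = 1e2) -> np.ndarray:
    """1 minus the geometric mean of clipped error ratios versus a baseline model."""
    ratios = np.clip(errors / baseline[:, None], lo, hi)     # shape (R, M), assumes positive errors
    return 1.0 - np.exp(np.mean(np.log(ratios), axis=0))


def bootstrap_ci(errors: np.ndarray, baseline: np.ndarray,
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> np.ndarray:
    """Percentile confidence bounds for skill scores obtained by resampling tasks."""
    rng = np.random.default_rng(seed)
    R = errors.shape[0]
    samples = np.stack([
        skill_score(errors[idx], baseline[idx])
        for idx in (rng.integers(0, R, size=R) for _ in range(n_boot))
    ])
    return np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
```

As an illustration of use, with E an (R × M) error array whose first column holds the Seasonal Naive baseline, average_win_rate(E), skill_score(E, E[:, 0]), and bootstrap_ci(E, E[:, 0]) produce the two aggregate views and their uncertainty bounds.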

4. Model Evaluation and Empirical Results

fev-bench evaluates a suite of models:

  • Pretrained models: TiRex, TimesFM-2.5, Chronos-Bolt, Toto-1.0, Moirai-2.0, TabPFN-TS, Sundial
  • Statistical models: AutoETS, AutoARIMA, AutoTheta, SCUM ensemble
  • Baselines: Seasonal Naive, Naive, Drift

Key empirical findings:

| Model Category | Top Model(s) | Notable Performance |
|---|---|---|
| Pretrained | TiRex, TimesFM-2.5 | Lead overall in point (MASE) and probabilistic (SQL) accuracy |
| Covariate-rich | TabPFN-TS | Best on tasks with dynamic covariates |
| Multivariate | Toto-1.0 | Superior on tasks with multivariate targets |
  • TiRex and TimesFM-2.5 lead overall; the difference between them is not always statistically significant, but both outperform the remaining models with high confidence.
  • TabPFN-TS excels specifically in dynamic covariate environments, and Toto-1.0 outperforms in multivariate settings due to native modeling capabilities.
  • These results reveal both areas of progress and underexploited opportunities, such as covariate integration and multivariate modeling.

5. Metrics and Mathematical Formulas

The benchmark uses precise statistical measures:

  • Mean Absolute Scaled Error (MASE):

\text{MASE} = \frac{1}{NDH} \sum_{n=1}^{N} \sum_{d=1}^{D} \left[ \frac{1}{a_{n,d}} \sum_{t=T+1}^{T+H} \left| y_{n,d,t} - \hat{z}_{n,d,t} \right| \right]

where a_{n,d} is the average seasonal error:

a_{n,d} = \frac{1}{T-m} \sum_{t=m+1}^{T} \left| y_{n,d,t} - y_{n,d,t-m} \right|

  • Scaled Quantile Loss (SQL): Uses the quantile loss function \rho_q(y, \hat{z}^{(q)}) to evaluate probabilistic accuracy, distinguishing between over- and under-estimates.

These formalizations ensure both point and probabilistic accuracy are measured with mathematical rigor.
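
As a concrete illustration of the two metrics, the sketch below computes MASE for a single univariate series and the quantile (pinball) loss that underlies SQL. The exact aggregation across series and dimensions follows the formulas above; the helper names and synthetic data here are assumptions for illustration only.

```python
import numpy as np


def mase_single_series(y_hist: np.ndarray, y_true: np.ndarray,
                       y_pred: np.ndarray, m: int) -> float:
    """MASE for one series: mean absolute forecast error scaled by the
    in-sample seasonal naive error a = mean(|y_t - y_{t-m}|)."""
    a = np.mean(np.abs(y_hist[m:] - y_hist[:-m]))
    return float(np.mean(np.abs(y_true - y_pred)) / a)


def quantile_loss(y_true: np.ndarray, y_pred_q: np.ndarray, q: float) -> float:
    """Pinball loss rho_q: penalizes under-prediction by q and over-prediction by (1 - q)."""
    diff = y_true - y_pred_q
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


# Illustrative usage on synthetic data with weekly seasonality (m = 7).
rng = np.random.default_rng(0)
history = 10 + np.sin(np.arange(60) * 2 * np.pi / 7) + 0.1 * rng.standard_normal(60)
actuals = 10 + np.sin(np.arange(60, 67) * 2 * np.pi / 7)
point_forecast = np.full(7, history[-7:].mean())
print(mase_single_series(history, actuals, point_forecast, m=7))
print(quantile_loss(actuals, point_forecast, q=0.9))
```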

6. fev Python Library: Infrastructure for Reproducing Results

The fev library underpins the benchmark’s reproducibility and extensibility:

  • Tasks are specified via YAML, supporting precise configuration of forecast horizon, target/covariate assignment, and metric selection.
  • The software supports rolling window evaluation, modular design via EvaluationWindow and Task objects, and relies only on minimal dependencies (Hugging Face datasets, pydantic).
  • Direct compatibility with packages such as GluonTS, darts, AutoGluon, Nixtla, and sktime permits seamless integration into existing forecasting pipelines.
  • Separation of evaluation from model implementations allows for continual benchmarking, accommodating evolving model releases and custom task additions.

A plausible implication is the establishment of a standardized, extensible computational platform for future forecasting research.
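
The listing below sketches how rolling-window evaluation of the kind described above can be organized, with one object per window exposing the context available to the model and the ground truth used for scoring. It illustrates the concept only; it is not the fev library's actual EvaluationWindow or Task API, and the names are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Window:
    """One rolling evaluation window: context ends at the cutoff, future covers the horizon."""
    context: np.ndarray
    future: np.ndarray


def rolling_windows(series: np.ndarray, horizon: int, num_windows: int, stride: int):
    """Build `num_windows` evaluation windows walking backwards from the end of the series."""
    windows = []
    for k in range(num_windows):
        cutoff = len(series) - horizon - k * stride
        if cutoff <= 0:
            break
        windows.append(Window(context=series[:cutoff],
                              future=series[cutoff:cutoff + horizon]))
    return list(reversed(windows))  # oldest window first


# Illustrative usage: score a seasonal-naive forecaster over 3 windows with horizon 7.
series = np.sin(np.arange(200) * 2 * np.pi / 7) + np.arange(200) * 0.01
errors = []
for w in rolling_windows(series, horizon=7, num_windows=3, stride=7):
    forecast = w.context[-7:]                     # seasonal naive forecast with period 7
    errors.append(np.mean(np.abs(w.future - forecast)))
print(np.mean(errors))
```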

7. Implications and Areas for Future Development

The fev-bench paper reveals several focal points for subsequent work:

  • Covariate Utilization: Current pretrained models rarely exploit covariate signals despite their impact on forecasting performance—expanded support could yield further gains.
  • Multivariate Forecasting: The comparative success of Toto-1.0 in tasks with multivariate targets suggests a need for broader, native multivariate model integration.
  • Aggregation Methods: Continued refinement and sensitivity analysis of aggregation metrics and confidence estimation could improve benchmarking reliability.
  • Task Diversity Expansion: Integrating additional, varied time series tasks will ensure benchmarking continues to match real-world diversity and technical demand.

This suggests a trajectory where benchmarks, infrastructure, and models co-evolve toward higher fidelity and higher utility forecasting evaluation.

Summary Table of fev-bench Features

| Benchmark Aspect | Detail | Significance |
|---|---|---|
| Task Diversity | 100 tasks, 7 domains, 46 with covariates | Reflects real-world forecasting needs |
| Aggregation Methods | Average win rate, skill score, bootstrapped confidence intervals | Enables statistical hypothesis testing |
| Model Support | Pretrained, statistical, baseline, covariate-aware, and multivariate models | Comprehensive performance analysis |
| Evaluation Metrics | MASE, SQL (quantile loss); YAML-defined tasks | Rigorous, reproducible comparison |
| Software Infrastructure | fev Python library (minimal dependencies, extensible, pipeline-compatible) | Facilitates reproducible workflows |

In summary, fev-bench establishes a state-of-the-art, realistic framework for evaluating time series forecasting models, robustly addressing diverse domain requirements, covariate support, and methodological integrity. It provides both an empirical foundation for model comparison and a developmental roadmap for forecasting research (Shchur et al., 30 Sep 2025).
