GIFT-Eval: Time Series Forecast Benchmark
- GIFT-Eval is a standardized, multi-domain benchmark suite for zero-shot time series forecasting with a massive, non-leaking pretraining archive.
- It aggregates 23 diverse datasets spanning 7 domains with varied frequencies and prediction horizons for comprehensive model evaluation.
- It employs rigorous evaluation protocols and normalized metrics against a Seasonal Naive baseline, providing actionable insights into forecasting performance.
GIFT-Eval is a standardized, multi-domain benchmark suite and evaluation protocol for general-purpose time series forecasting models, with a special focus on zero-shot evaluation of foundation models across diverse real-world settings. The benchmark addresses the lack of a unified, large-scale resource for comparing models in terms of accuracy and calibration over a broad array of domains, frequencies, and prediction horizons, while also providing a massive, non-leaking pretraining corpus to enable the development and fair evaluation of foundation architectures (Aksu et al., 2024). GIFT-Eval has become central to quantitative comparisons in the time series foundation model literature, with an evolving leaderboard and robust analytical framework.
1. Scope, Motivation, and Foundational Structure
GIFT-Eval is motivated by the trend in vision and language pretraining, where large, heterogeneous training corpora have driven the emergence of foundation models capable of zero-shot generalization. Prior to GIFT-Eval, time series lacked a single benchmark that simultaneously covers all relevant axes: (i) univariate and multivariate series, (ii) 10 sampling frequencies from secondly to yearly, (iii) a spectrum of prediction horizons, and (iv) strong safeguards against data leakage from training to evaluation (Aksu et al., 2024).
The key resource consists of two components:
- A train/test/validation benchmark of 23 datasets, collectively comprising over 144,000 time series and 177 million data points, spanning seven domains (energy, healthcare, finance/economics, sales, web/cloud ops, transport, nature).
- A ~230 billion point, 4.5 million series, “non-leaking” pretraining archive compiled from 88 open datasets, explicitly excluding all evaluation data, to facilitate the training of foundation models without overlap.
2. Dataset Composition and Coverage
The benchmark train/test data are structured as follows:
- 23 core datasets, selected for domain diversity and public accessibility, each materialized at 1-4 frequencies (e.g., hourly, daily, weekly), and designed to exercise both univariate and multivariate forecasting.
- Each dataset comes with a prescribed set of forecast horizons—short, medium, and long—expressed in native time steps (e.g., 48/480/720 for 10-min Jena Weather).
- Test splits are 10% of each series, always strictly after the training window to prevent lookahead bias.
- Datasets are partitioned to ensure no duplicate entries or derived series cross from pretraining to evaluation splits.
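The chronological split described above can be illustrated with a minimal sketch. The function name `chronological_split` and the rounding rule are assumptions for illustration, not the benchmark's own code; the only property taken from the text is that the final 10% of each series, strictly after the training window, forms the test split.

```python
import numpy as np

def chronological_split(series: np.ndarray, test_fraction: float = 0.1):
    """Split one series into a training prefix and a test suffix.

    The time axis is assumed to be axis 0, so the same slicing works for
    univariate arrays of shape (T,) and multivariate arrays of shape (T, V).
    """
    n = len(series)
    # Round down, but keep at least one test point (illustrative choice).
    test_len = max(1, int(n * test_fraction))
    split = n - test_len
    # Every test point comes strictly after every training point.
    return series[:split], series[split:]

# Toy series whose values equal their time index.
series = np.arange(100, dtype=float)
train, test = chronological_split(series)
print(len(train), len(test))
```

Because the split is a single cut along the time axis, no observation used for training can come after one used for testing, which is what rules out lookahead bias.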
The table below summarizes example datasets and settings:
| Name | Domain | Freq. | #Variates | Prediction Horizons |
|---|---|---|---|---|
| Jena Weather | Nature | 10-min | 21 | 48 / 480 / 720 |
| BizITObs-App | WebOps | 10-sec | 2 | 60 / 600 / 900 |
| ETT1 | Energy | Hourly | 7 | 48 / 480 / 720 |
| Restaurant | Sales | Daily | 1 | 30 |
| M4 Daily | Econ/Fin | Daily | 1 | 14 |
The non-leaking pretraining corpus is drawn from ten well-known archives (BuildingsBench, CloudOpsTSF, GluonTS, Monash, etc.), none of which contributes series that belong to or overlap with the 23 evaluation datasets.
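One simple way to check for exact overlap between a pretraining corpus and an evaluation set is to fingerprint each series and look for collisions. The sketch below is a hypothetical illustration, not the procedure used to build GIFT-Eval: the helper names `fingerprint` and `find_leaks` are invented here, and a hash of this kind only catches exact duplicates after rounding, not derived series such as resampled or aggregated versions.

```python
import hashlib
import numpy as np

def fingerprint(series, decimals: int = 6) -> str:
    """Hash the rounded values of a series so exact duplicates collide."""
    rounded = np.round(np.asarray(series, dtype=np.float64), decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()

def find_leaks(pretrain: dict, evaluation: dict) -> list:
    """Return (pretrain_name, eval_name) pairs whose series are identical."""
    seen = {fingerprint(s): name for name, s in pretrain.items()}
    leaks = []
    for name, s in evaluation.items():
        fp = fingerprint(s)
        if fp in seen:
            leaks.append((seen[fp], name))
    return leaks

# Toy corpora: series "b" is duplicated as evaluation series "x".
pretrain = {"a": np.array([1.0, 2.0, 3.0]), "b": np.array([4.0, 5.0])}
evaluation = {"x": np.array([4.0, 5.0]), "y": np.array([7.0, 8.0])}
leaks = find_leaks(pretrain, evaluation)
print(leaks)
```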
3. Evaluation Protocol and Metrics
GIFT-Eval mandates a zero-shot forecasting regime for foundation models: models are pretrained on the non-leaking archive, frozen, and then directly produce test forecasts given only the provided input context, with no fine-tuning permitted on the 23 benchmark series (Aksu et al., 2024).
Statistical and deep learning baselines are instead trained on the 90% train split, validated on a window held out from the end of that split, and rolled forward to forecast every window in the test split.
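The rolling evaluation over the test split can be sketched as follows. The non-overlapping stride (equal to the horizon), the function names, and the last-value forecaster are assumptions for illustration; the text does not specify the benchmark's exact windowing. The forecaster receives only the context and the horizon, mirroring the zero-shot setting in which a frozen model sees no test targets.

```python
import numpy as np
from typing import List, Tuple

def rolling_windows(series: np.ndarray, test_start: int,
                    horizon: int) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Return (context, target) pairs that tile the test region.

    Each context contains everything observed before its target window.
    """
    windows = []
    start = test_start
    while start + horizon <= len(series):
        windows.append((series[:start], series[start:start + horizon]))
        start += horizon  # assumed stride: one full horizon
    return windows

def last_value_forecaster(context: np.ndarray, horizon: int) -> np.ndarray:
    """Stand-in for a frozen model: repeat the last observed value."""
    return np.full(horizon, context[-1])

series = np.arange(100, dtype=float)
windows = rolling_windows(series, test_start=90, horizon=5)
forecasts = [last_value_forecaster(ctx, 5) for ctx, _ in windows]
print(len(windows))
```

Any callable with the same `(context, horizon)` signature could replace `last_value_forecaster`, so a pretrained model and a trained baseline can share one evaluation loop.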
Primary evaluation metrics:
- Median Mean Absolute Percentage Error (MAPE): used for accuracy of point forecasts.
- Continuous Ranked Probability Score (CRPS): used for probabilistic forecasts; approximated via quantiles to handle full probabilistic predictions.
All core metrics are normalized by the Seasonal Naive baseline (last season's value for each step) to ensure comparability across datasets and forecast horizons.
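The metrics and their normalization can be sketched as below. This is a minimal illustration under stated assumptions: the quantile-based CRPS approximation shown is the commonly used mean weighted quantile loss, which may differ in detail from the benchmark's implementation, and all function names and toy values are hypothetical.

```python
import numpy as np

def seasonal_naive(context: np.ndarray, horizon: int, season: int) -> np.ndarray:
    """Repeat the last observed season across the forecast horizon."""
    last_season = context[-season:]
    reps = -(-horizon // season)  # ceiling division
    return np.tile(last_season, reps)[:horizon]

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error (assumes no zeros in y_true)."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

def quantile_loss(y_true: np.ndarray, y_q: np.ndarray, q: float) -> float:
    """Summed pinball loss of a q-quantile forecast."""
    diff = y_true - y_q
    return float(np.sum(np.maximum(q * diff, (q - 1.0) * diff)))

def crps_from_quantiles(y_true: np.ndarray, quantile_forecasts: dict) -> float:
    """Approximate CRPS as the mean over quantile levels of the
    weighted quantile loss 2 * QL_q / sum(|y|) (assumed form)."""
    denom = np.sum(np.abs(y_true))
    losses = [2.0 * quantile_loss(y_true, f, q) / denom
              for q, f in quantile_forecasts.items()]
    return float(np.mean(losses))

# Toy example with a season length of 3.
context = np.array([10.0, 20.0, 30.0, 10.0, 20.0, 30.0])
y_true = np.array([10.0, 20.0, 40.0])
baseline = seasonal_naive(context, horizon=3, season=3)
model_pred = np.array([10.0, 20.0, 38.0])

# Normalized score: values below 1 beat Seasonal Naive.
normalized_mape = mape(y_true, model_pred) / mape(y_true, baseline)

quantiles = {0.1: np.array([8.0, 18.0, 30.0]),
             0.5: np.array([10.0, 20.0, 38.0]),
             0.9: np.array([12.0, 22.0, 45.0])}
crps = crps_from_quantiles(y_true, quantiles)
print(normalized_mape, crps)
```

Dividing by the Seasonal Naive score makes the result unitless, which is what allows scores from datasets with very different scales and horizons to be compared.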
4. Baseline Models and Benchmarking Results
Seventeen model baselines are included in the standardized benchmark analysis:
- Statistical methods: Naive, Seasonal Naive (SN), Auto ARIMA, Auto ETS, Auto Theta
- Dataset-specific deep-learning methods: DeepAR, TFT, TiDE, N-BEATS, PatchTST, DLinear, Crossformer, iTransformer
- Pretrained foundation models (zero-shot): TimesFM, Chronos-T5 (tiny/small/base), Moirai (small/base/large), VisionTS
Aggregate performance and ranking by domain are summarized in the following table (values are domain-level CRPS normalized by Seasonal Naive; lower is better):
| Domain | Seasonal Naive | PatchTST | TFT | iTransformer | Chronos | Moirai | VisionTS | Rank |
|---|---|---|---|---|---|---|---|---|
| Econ/Fin | 1.00 | 0.83 | 0.85 | 0.99 | 0.78 | 0.77 | 0.73 | 3.2 |
| Energy | 1.00 | 0.87 | 0.98 | 0.98 | 0.73 | 0.70 | 0.68 | 5.4 |
| Healthcare | 1.00 | 0.63 | 0.87 | 1.16 | 0.58 | 0.59 | 0.53 | 4.2 |
| Nature | 1.00 | 0.71 | 0.65 | 0.40 | 0.57 | 0.43 | 0.38 | 4.1 |
| Sales | 1.00 | 0.48 | 0.50 | 0.50 | 0.37 | 0.37 | 0.36 | 2.8 |
| Transport | 1.00 | 0.79 | 0.64 | 0.50 | 0.55 | 0.50 | 0.48 | 4.8 |
| Web/CloudOps | 1.00 | 0.92 | 0.74 | 0.66 | 0.74 | 0.72 | 0.52 | 4.0 |
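To read the normalized values, note that a score s corresponds to a (1 - s) × 100% change relative to Seasonal Naive, with negative values meaning the model is worse than the baseline. The sketch below applies that arithmetic to the numbers copied from the table above; the variable and function names are illustrative.

```python
# Normalized CRPS values copied from the table (Seasonal Naive = 1.00).
MODELS = ["PatchTST", "TFT", "iTransformer", "Chronos", "Moirai", "VisionTS"]
SCORES = {
    "Econ/Fin":     [0.83, 0.85, 0.99, 0.78, 0.77, 0.73],
    "Energy":       [0.87, 0.98, 0.98, 0.73, 0.70, 0.68],
    "Healthcare":   [0.63, 0.87, 1.16, 0.58, 0.59, 0.53],
    "Nature":       [0.71, 0.65, 0.40, 0.57, 0.43, 0.38],
    "Sales":        [0.48, 0.50, 0.50, 0.37, 0.37, 0.36],
    "Transport":    [0.79, 0.64, 0.50, 0.55, 0.50, 0.48],
    "Web/CloudOps": [0.92, 0.74, 0.66, 0.74, 0.72, 0.52],
}

def improvement_pct(score: float) -> float:
    """Percentage reduction in CRPS relative to Seasonal Naive."""
    return (1.0 - score) * 100.0

def report(domain: str) -> dict:
    """Map each model to its improvement over the baseline in one domain."""
    return {m: improvement_pct(s) for m, s in zip(MODELS, SCORES[domain])}

for domain in SCORES:
    print(domain, {m: round(v, 1) for m, v in report(domain).items()})
```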
Key findings:
- PatchTST is the leading deep-learning model overall.
- Chronos and Moirai are the most consistent foundation models.
- Foundation models outperform all others on low-frequency, univariate, and seasonal benchmarks.
- For high-frequency, noisy, and multivariate settings, deep learning (PatchTST, iTransformer) or statistical ensembles surpass foundation models.
5. Analytical Insights and Model Behavior
GIFT-Eval supports cross-sectional and domain-specific analysis of strengths and failure modes:
- High-frequency (secondly/minutely) and highly multivariate data favor deep or statistical models, suggesting limitations in current foundation architectures for such regimes.
- Low-frequency data (daily to yearly) with strong seasonality and trend is where foundation models excel, especially non-autoregressive masked-encoder types (Moirai).
- Autoregressive architectures (Chronos, TimesFM) exhibit accumulated forecast errors at long horizons.
- Only Moirai among foundation models handles multivariate series natively.
This suggests that future architectures should adopt richer multitask and multivariate modeling capabilities, and that improving high-frequency coverage in pretraining corpora could close remaining gaps.
6. Recommendations and Impact
The GIFT-Eval design provides actionable guidance:
- Foundation models should be pretrained on the full GIFT-Eval pretraining archive, with evaluations strictly on the non-overlapping test splits.
- Reports should always include metrics relative to the Seasonal Naive baseline.
- Further foundation model progress depends on improvements in multivariate capability, high-frequency behavior, and long-horizon prediction, as well as incorporating explicit seasonality/trend frameworks.
The public leaderboard at https://github.com/SalesforceAIResearch/gift-eval supports automatic evaluation for any conforming model, serving as an objective reference for the community.
7. Access, Usage Protocols, and Future Directions
The GIFT-Eval datasets (pretraining archive and benchmark splits) are distributed in Arrow format under an open license, alongside the evaluation code and leaderboard scoring. Researchers submit forecast outputs via a standardized API for evaluation.
The modularity and extensibility of GIFT-Eval enable the addition of new domains, frequencies, and benchmarking scenarios as the scope of time series foundation models evolves. Over time, GIFT-Eval is intended to become the community standard for assessing zero-shot and multi-domain time series forecasting (Aksu et al., 2024).