GIFT-Eval Benchmark
- GIFT-Eval Benchmark is a unified framework consolidating diverse time series datasets to ensure reproducible and fair evaluation across various forecasting models.
- It integrates 23 datasets from multiple domains with varying frequencies, supporting both univariate and multivariate forecasting tasks.
- It curates a massive non-leaking pretraining dataset and a suite of baselines to investigate model generalization and scaling in zero-shot learning.
GIFT-Eval (General Time Series Forecasting Model Evaluation) is a comprehensive benchmark and evaluation framework designed to enable rigorous, universal, and reproducible assessment of time series forecasting models, with particular emphasis on the zero-shot capabilities of foundation models. It integrates diverse datasets, a strictly non-leaking pretraining resource, and an extensive suite of baseline models, supporting in-depth quantitative and qualitative analysis across statistical, deep learning, and foundation model paradigms.
1. Motivation and Benchmark Design
GIFT-Eval addresses the absence of a gold-standard benchmark for time series forecasting across real-world domains by consolidating a broad spectrum of datasets and evaluation scenarios into a unified framework. Previous practices in the field largely depended on isolated datasets or domain-specific tests, limiting comparability and generalizability of results, especially for foundation models trained on large heterogeneous corpora. GIFT-Eval delivers a controlled experimental environment, facilitating fair comparison and precise characterization of model strengths and limitations under varying temporal, frequency, and covariate conditions.
2. Dataset Composition and Task Taxonomy
The test split of GIFT-Eval encompasses 23 datasets spanning seven major domains: Econ/Fin, Energy, Healthcare, Nature, Sales, Transport, and Web/CloudOps. The benchmark reflects marked heterogeneity along several axes:
- Frequency: Datasets span sampling rates from seconds (e.g., 10S) and minutes (e.g., 5T) through hourly, daily, weekly, monthly, quarterly, and yearly records.
- Input Structure: Both univariate (single target) and multivariate (multiple targets or covariates) sequences are present.
- Forecast Horizon: For each dataset, three prediction lengths, short, medium, and long, are defined. Medium- and long-term horizons are derived algorithmically from the short horizon by fixed multipliers (e.g., ×10 and ×15), subject to dataset properties (made concrete in the sketch below).
This organizational structure ensures GIFT-Eval models the operational diversity encountered in applied forecasting, requiring generalization across temporal and structural regimes.
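To make the horizon taxonomy concrete, the following Python sketch reconstructs the expansion rule under stated assumptions: the per-frequency short horizons in `SHORT_HORIZON` are illustrative placeholders, and the ×10/×15 multipliers follow the factors quoted above; the benchmark's exact per-dataset values may differ.

```python
# Illustrative reconstruction of GIFT-Eval-style horizon expansion.
# The short horizons per frequency are assumed values for demonstration;
# the medium/long multipliers follow the x10 / x15 factors described above.

SHORT_HORIZON = {  # hypothetical short-term horizons per sampling frequency
    "H": 48,   # hourly: two days ahead
    "D": 14,   # daily: two weeks ahead
    "W": 8,    # weekly: two months ahead
    "M": 12,   # monthly: one year ahead
}

MULTIPLIERS = {"short": 1, "medium": 10, "long": 15}

def prediction_length(freq: str, term: str) -> int:
    """Return the forecast horizon for a given frequency and term."""
    return SHORT_HORIZON[freq] * MULTIPLIERS[term]

assert prediction_length("D", "medium") == 140  # 14 * 10
```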
| Domain | # of Datasets | Predominant Frequencies |
|---|---|---|
| Econ/Fin | Multiple | Daily, Weekly, Monthly |
| Energy | Multiple | Hourly, Daily |
| Healthcare | Several | Daily, Weekly |
| Nature | Several | Monthly, Yearly |
| Sales | Several | Weekly, Daily |
| Transport | Several | Minute, Hourly, Daily |
| Web/CloudOps | Several | Second, Minute |
3. Non-Leaking Pretraining Dataset
GIFT-Eval provides a massive pretraining dataset (~230 billion data points), constructed from LOTSA and various open sources, with strict exclusion of any overlap between pretraining examples and the evaluation/test sets. This curation is critical for fair assessment of zero-shot and transfer learning abilities, ensuring observed improvements reflect model generalization rather than data memorization. The dataset covers a wide array of domains and frequencies, and is intended for foundation model pretraining under protocols that eliminate data leakage. This resource is central for the controlled comparison of pretraining effects and the investigation of scaling phenomena in time series foundation models.
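The non-leaking property can be illustrated with a hypothetical curation filter; `pretrain_pool`, `test_index`, and the (dataset, series_id) keying below are assumptions for exposition, not the benchmark's actual pipeline.

```python
# Hypothetical sketch of non-leaking curation: any candidate pretraining
# series whose (dataset, series_id) key also appears in the evaluation
# split is excluded before the pretraining corpus is assembled.

def filter_leakage(pretrain_pool, test_index):
    """Drop pretraining series that overlap with the test split.

    pretrain_pool: iterable of dicts with 'dataset' and 'series_id' keys.
    test_index: set of (dataset, series_id) tuples from the test split.
    """
    return [s for s in pretrain_pool
            if (s["dataset"], s["series_id"]) not in test_index]

test_index = {("electricity", "MT_001"), ("electricity", "MT_002")}
pool = [{"dataset": "electricity", "series_id": "MT_001"},
        {"dataset": "traffic", "series_id": "lane_7"}]
print(filter_leakage(pool, test_index))  # only the traffic series survives
```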
4. Suite of Baselines and Model Families
Seventeen baseline models are systematically evaluated, spanning three principal categories:
- Statistical Modeling Approaches: Models include "Naive," "Seasonal Naive," and a set of automated classical methods (e.g., AutoARIMA, AutoETS, AutoTheta). These baselines are fitted directly per dataset, providing essential comparative reference points with minimal or no training (see the Seasonal Naive sketch after this list).
- Deep Learning Models: Includes DeepAR, Temporal Fusion Transformer (TFT), TiDE, N-BEATS, PatchTST, DLinear, Crossformer, and iTransformer. Hyperparameter ranges are tuned as per the appendix specifications. These models are typically robust for short-term forecasting and operate using dataset-specific training paradigms, supporting both point and, in some architectures, probabilistic forecasting via adapted output heads.
- Foundation Models: Moirai, Chronos, TimesFM, and VisionTS are evaluated in a zero-shot setting. The paper explicitly addresses data leakage risks, using a Moirai variant retrained on the non-leaking pretraining dataset to demonstrate how inadvertent overlap between pretraining and evaluation data affects measured foundation-model efficacy.
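For reference, here is a minimal NumPy implementation of the Seasonal Naive baseline named in the list above; the season length and example data are illustrative assumptions.

```python
import numpy as np

def seasonal_naive(history: np.ndarray, horizon: int, season: int) -> np.ndarray:
    """Seasonal naive forecast: each future step copies the observation
    one (or more) full seasonal periods earlier in the history."""
    forecast = np.empty(horizon)
    for h in range(horizon):
        # most recent observation exactly a whole number of seasons back
        forecast[h] = history[-season + (h % season)]
    return forecast

hourly = np.arange(72, dtype=float)                  # three days of hourly data
print(seasonal_naive(hourly, horizon=6, season=24))  # repeats last day's pattern
```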
Performance is compared in aggregate and in stratified settings, dissecting results by domain, input type (univariate vs. multivariate), frequency, and prediction length. Foundation models tend to excel in lower-frequency or aggregated settings (e.g., energy datasets), while deep learning models such as PatchTST frequently outperform on high-frequency series with sharp spikes. The analysis exposes domain- and configuration-dependent performance crossovers.
5. Evaluation Metrics and Technical Formalizations
GIFT-Eval employs two main metrics for forecast evaluation:
- Mean Absolute Percentage Error (MAPE): $\mathrm{MAPE} = \frac{100}{h}\sum_{t=1}^{h} \frac{|y_t - \hat{y}_t|}{|y_t|}$, measuring point-forecast accuracy relative to target magnitude.
- Continuous Ranked Probability Score (CRPS): approximated via discrete quantile summation, $\mathrm{CRPS}(F, y) \approx \frac{1}{|\mathcal{A}|} \sum_{\alpha \in \mathcal{A}} 2\,\Lambda_\alpha\big(F^{-1}(\alpha), y\big)$, where $\Lambda_\alpha(q, y) = (\alpha - \mathbf{1}\{y < q\})(y - q)$ is the quantile loss; CRPS measures the difference between the predicted cumulative distribution function and the true target value. Both metrics are sketched in code below.
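A minimal NumPy sketch of both metrics, assuming the common pinball-loss form of the discrete CRPS approximation and an evenly spaced quantile grid; the benchmark's exact quantile levels may differ.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error over the forecast horizon."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100.0)

def crps_quantile(y_true: np.ndarray, q_preds: np.ndarray, levels: np.ndarray) -> float:
    """Approximate CRPS by averaging the doubled quantile (pinball) loss
    over a discrete grid of quantile levels.

    q_preds: array of shape (len(levels), horizon) with predicted quantiles.
    """
    y = y_true[None, :]        # broadcast targets over quantile levels
    alpha = levels[:, None]
    ql = np.where(y < q_preds, (1 - alpha) * (q_preds - y), alpha * (y - q_preds))
    return float(np.mean(2.0 * ql))

levels = np.arange(0.1, 1.0, 0.1)                    # nine quantile levels
y_true = np.linspace(1.0, 2.0, 6)
q_preds = np.stack([y_true + (a - 0.5) for a in levels])  # toy quantile fan
print(mape(y_true, y_true * 1.1))                    # 10.0
print(crps_quantile(y_true, q_preds, levels))
```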
The forecasting task is formalized as predicting the conditional distribution
$$p\big(y_{t+1:t+h} \mid y_{t-l+1:t},\, z_{t-l+1:t+h}\big),$$
where $l$ is the context (look-back) length, $h$ is the forecast horizon, and $z$ denotes available covariates.
Time series attributes, such as trend and seasonal strength, are quantified through STL decomposition of a series into trend $T_t$, seasonal $S_t$, and remainder $R_t$ components, e.g.,
$$F_{\mathrm{trend}} = \max\!\left(0,\; 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(T_t + R_t)}\right), \qquad F_{\mathrm{seasonal}} = \max\!\left(0,\; 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(S_t + R_t)}\right),$$
with results clipped into the interval $[0, 1]$. These operationalizations support systematic capture and analysis of series heterogeneity.
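A short sketch of these strength measures using statsmodels' STL implementation; the synthetic monthly series and `period=12` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def trend_seasonal_strength(series: pd.Series, period: int) -> tuple[float, float]:
    """Compute trend and seasonal strength from an STL decomposition,
    clipping both into [0, 1] as described above."""
    res = STL(series, period=period).fit()
    r, t, s = res.resid, res.trend, res.seasonal
    trend = max(0.0, 1.0 - np.var(r) / np.var(t + r))
    seasonal = max(0.0, 1.0 - np.var(r) / np.var(s + r))
    return float(trend), float(seasonal)

# Toy monthly series with a yearly cycle plus a mild upward trend.
idx = pd.date_range("2020-01-01", periods=240, freq="MS")
y = pd.Series(np.sin(2 * np.pi * np.arange(240) / 12) + 0.01 * np.arange(240), index=idx)
print(trend_seasonal_strength(y, period=12))  # both strengths near 1.0
```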
6. Qualitative and Comparative Analysis
The benchmark incorporates both quantitative summaries and qualitative visualizations. Forecast plots (e.g., for the BizITObs and Solar datasets) juxtapose typical outputs from deep learning models (PatchTST, DeepAR, N-BEATS, iTransformer) with foundation models (Moirai, Chronos, VisionTS). Notably, deep learning methods can more precisely model high-frequency spikes and periodicity, whereas foundation models often generate smoother, sometimes over-smoothed trajectories and, in certain long-term contexts, may drift from actual values. Differences are amplified between univariate and multivariate tasks, with deep learning architectures displaying greater agility under sudden regime changes and foundation models offering stronger stability but less sensitivity to localized events.
7. Availability, Reproducibility, and Community Resources
All resources underpinning GIFT-Eval—including data preprocessing pipelines, model wrappers, evaluation scripts, full dataset lists (train/test, pretraining), and detailed metadata—are made publicly accessible via the GitHub repository (https://github.com/SalesforceAIResearch/gift-eval). An online leaderboard ranks submissions across the 97 dataset configurations represented in the benchmark, fostering reproducibility and ongoing extension of the framework by the broader research community.
8. Prospects for Future Research
GIFT-Eval lays the groundwork for multiple future research directions in time series modeling:
- Optimizing the interplay between foundational pretraining and downstream adaptation to improve medium- and long-horizon accuracy.
- Advancing methods for high-frequency data, particularly reducing error accumulation in recursive forecasting.
- Enforcing rigorous data leakage controls in pretraining and transfer protocols.
- Elucidating scaling laws for time series foundation models, with results suggesting model size interacts significantly with domain and data configuration.
GIFT-Eval thereby articulates open challenges, providing a robust platform for the empirical evaluation and methodological advancement of time series forecasting models.