Time-Bench Dataset: Time Series Benchmark

Updated 3 August 2025
  • Time-Bench Dataset is a benchmark framework that evaluates machine learning models on diverse time series tasks, including forecasting, anomaly detection, and classification.
  • It aggregates datasets from various domains such as finance, healthcare, and industry, addressing real-world complexities like irregular sampling, multimodality, and temporal drift.
  • The framework promotes robust multi-task learning by integrating single-task and parameter-sharing strategies with standardized metrics like MSE, F1 score, and NDCG.

The Time-Bench Dataset refers to a family of benchmark datasets and frameworks designed to holistically assess the performance of machine learning models on time series data, with a particular focus on temporal graphs, multivariate series, multi-domain scenarios, and real-world complexities such as irregular sampling and multimodality. Drawing methodological inspiration from NLP benchmarking suites, Time-Bench encompasses a wide spectrum of time series problem domains (forecasting, anomaly detection, classification) and emphasizes standardized evaluation, diverse data sources, and advanced learning strategies.

1. Conceptual Foundations and Benchmarking Philosophy

Time-Bench’s design is grounded in the benchmarking traditions of NLP, such as GLUE and SuperGLUE, which aggregate performance across diverse, task-oriented datasets with standardized evaluation criteria (Mustafa et al., 14 Oct 2024). The methodology involves careful task definition, collating challenging domain-representative datasets, and reporting aggregate scores. In practice, Time-Bench implements:

  • Multiple time series tasks: univariate/multivariate forecasting, anomaly detection, and classification.
  • Curated, domain-diverse datasets with varying periodicities, noise, and structure.
  • Standardized metrics (e.g., Mean Squared Error, Mean Absolute Error, F1 score, Normalized Discounted Cumulative Gain) adapted to each task, facilitating direct comparison of disparate models.
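
As a minimal sketch of GLUE-style aggregation (the per-task numbers and the error-to-score mapping are illustrative assumptions, not values or conventions from the benchmark), each task metric can be mapped onto a "higher is better" scale and macro-averaged:

```python
import numpy as np

# Hypothetical per-task results; metric direction differs by task.
results = {
    "forecasting":       {"value": 0.42, "higher_is_better": False},  # MSE
    "anomaly_detection": {"value": 0.71, "higher_is_better": True},   # F1
    "classification":    {"value": 0.88, "higher_is_better": True},   # accuracy
}

def aggregate_score(results: dict) -> float:
    """Macro average in the GLUE style: map every metric onto a
    'higher is better' scale, then take the unweighted mean."""
    scores = []
    for task, r in results.items():
        v = r["value"]
        if not r["higher_is_better"]:
            # One simple convention for error metrics: score = 1 / (1 + error).
            # Skill scores against a naive baseline are an equally valid choice.
            v = 1.0 / (1.0 + v)
        scores.append(v)
    return float(np.mean(scores))

print(f"aggregate benchmark score: {aggregate_score(results):.3f}")
```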

2. Dataset Composition and Diversity

Time-Bench comprises datasets curated from multiple domains—including finance, healthcare, industry, energy, and e-commerce—thereby mirroring the complexity and heterogeneity encountered in practical time series analytics (Mustafa et al., 14 Oct 2024). Notable components include:

  • Forecasting: The M4 competition dataset (comprising 100,000 series from finance, economics, industry, and demographics) and fine-grained energy datasets such as Electricity Consuming Load.
  • Anomaly Detection: The Yahoo labeled web traffic dataset and industrial datasets (e.g., NEK in TimeSeriesBench) for evaluating point- and pattern-wise anomalies (Si et al., 16 Feb 2024).
  • Temporal Graphs: Domain-spanning collections such as the Temporal Graph Benchmark (TGB) offer social, trade, transaction, and transportation networks with both node- and edge-level temporal tasks (Huang et al., 2023).
  • Tabular Time Series: TabReD emphasizes temporal drift and feature richness using timestamped industry-grade tabular data (e.g., insurance, housing, logistics) (Rubachev et al., 27 Jun 2024).
  • Irregular Multimodal Series: Time-IMM models heterogeneous, asynchronous recordings with both numerical and textual modalities under multiple irregularity types (Chang et al., 12 Jun 2025).

This diversity ensures coverage of periodic, non-periodic, stationary, non-stationary, and irregular sampling conditions alongside multimodal augmentations.
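
One lightweight way to encode this heterogeneity is a declarative dataset registry, sketched below; the schema is an illustrative assumption rather than the benchmark's actual metadata format, and the series counts other than M4's 100,000 are placeholders:

```python
from dataclasses import dataclass
from typing import Literal

Task = Literal["forecasting", "anomaly_detection", "classification"]

@dataclass(frozen=True)
class DatasetEntry:
    name: str
    domain: str             # e.g., finance, healthcare, energy
    task: Task
    n_series: int
    regular_sampling: bool  # False for irregular/asynchronous recordings
    multimodal: bool        # True if text or other modalities accompany values

# Illustrative entries mirroring the composition described above.
REGISTRY = [
    DatasetEntry("M4", "multi-domain", "forecasting", 100_000, True, False),
    DatasetEntry("Yahoo-web-traffic", "web operations", "anomaly_detection", 367, True, False),
    DatasetEntry("Time-IMM-health", "healthcare", "forecasting", 1_000, False, True),
]

irregular = [d.name for d in REGISTRY if not d.regular_sampling]
print("irregularly sampled:", irregular)
```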

3. Task Definition and Multi-Task Learning Integration

The Time-Bench paradigm elevates benchmarking by supporting multi-task learning strategies, reflecting the interconnectedness of real-world time series problem settings (Mustafa et al., 14 Oct 2024). The dataset enables evaluation of:

  • Single-task models trained independently per task.
  • Hard parameter sharing: shared backbone with task-specific heads, optimizing a combined loss

$$\min_{\theta_s,\,\{\theta_t\}} \sum_t L_t(\theta_s, \theta_t)$$

where $\theta_s$ are the shared parameters and $\theta_t$ the task-specific parameters (Mustafa et al., 14 Oct 2024).

  • Soft parameter sharing: separately parameterized models regularized for similarity.

This architecture facilitates learning of general patterns shared across tasks, promoting robustness across the forecasting, classification, and anomaly detection subtasks; a minimal sketch of hard parameter sharing follows.
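
A minimal PyTorch sketch of hard parameter sharing, assuming arbitrary layer sizes, window length, number of classes, and an unweighted loss sum (none of which are specified by the benchmark):

```python
import torch
import torch.nn as nn

class HardSharedModel(nn.Module):
    """Shared backbone (theta_s) with one task-specific head (theta_t) per task."""
    def __init__(self, input_dim: int, hidden_dim: int = 64,
                 horizon: int = 24, n_classes: int = 5):
        super().__init__()
        # theta_s: shared encoder over a flattened input window
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # theta_t: task-specific output heads
        self.heads = nn.ModuleDict({
            "forecasting": nn.Linear(hidden_dim, horizon),
            "anomaly": nn.Linear(hidden_dim, 1),          # one logit per window
            "classification": nn.Linear(hidden_dim, n_classes),
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](self.backbone(x))

model = HardSharedModel(input_dim=96)
losses = {
    "forecasting": nn.MSELoss(),
    "anomaly": nn.BCEWithLogitsLoss(),
    "classification": nn.CrossEntropyLoss(),
}

# One backward pass over the combined loss sum_t L_t(theta_s, theta_t),
# using random stand-in batches of window length 96.
batches = {
    "forecasting": (torch.randn(32, 96), torch.randn(32, 24)),
    "anomaly": (torch.randn(32, 96), torch.randint(0, 2, (32, 1)).float()),
    "classification": (torch.randn(32, 96), torch.randint(0, 5, (32,))),
}
total_loss = sum(losses[t](model(x, t), y) for t, (x, y) in batches.items())
total_loss.backward()  # gradients reach both shared and task-specific parameters
```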

4. Evaluation Protocols and Metrics

Standardization of evaluation criteria is central to the Time-Bench approach. Key protocols and metrics include:

  • Temporal splits: Data are divided chronologically so that models are tested on future, unseen data, mitigating temporal leakage. For example, models train on data with $t_i < T_{\mathrm{val}}$ and are tested on $t_j \geq T_{\mathrm{val}}$ (Rubachev et al., 27 Jun 2024); see the sketch after this list.
  • Task-specific Metrics:
    • Forecasting: Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE).
    • Anomaly Detection: F1_best (threshold-optimized F1), AUPRC, and event-based scoring using point adjustment with reduced-length PA and severity coefficients such as $\log(k + e)$ for anomaly segments (Si et al., 16 Feb 2024).
    • Classification: Standard accuracy, F1 scores for pattern recognition.
    • Temporal graphs: Mean Reciprocal Rank (MRR) for link prediction, NDCG@10 for node affinity prediction (Huang et al., 2023).
  • Monte Carlo simulations: Used in ForecastTB to evaluate robustness across multiple randomized series segments (Bokde et al., 2020).
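
A minimal sketch of the chronological split and the threshold-optimized F1_best protocol (the column name, quantile-based cutoff, and threshold grid are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, time_col: str = "timestamp",
                   test_frac: float = 0.2):
    """Chronological split: train on t_i < T_val, test on t_j >= T_val,
    so no future information leaks into training."""
    df = df.sort_values(time_col)
    t_val = df[time_col].quantile(1.0 - test_frac)
    return df[df[time_col] < t_val], df[df[time_col] >= t_val]

def f1_best(scores: np.ndarray, labels: np.ndarray, n_thresholds: int = 100) -> float:
    """Threshold-optimized F1: sweep candidate thresholds over the score
    range and keep the best F1 achieved."""
    best = 0.0
    for thr in np.linspace(scores.min(), scores.max(), n_thresholds):
        pred = scores >= thr
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Synthetic check: true anomalies receive higher scores on average.
rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.05).astype(int)
scores = rng.normal(size=500) + 2.0 * labels
print(f"F1_best = {f1_best(scores, labels):.3f}")
```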

Automated pipelines (e.g., TGB’s loaders and experiment managers) guarantee reproducibility and comparability across submissions.

5. Handling Real-World Complexity: Irregularity, Multimodality, and Domain Drift

The benchmark suite systematically incorporates several complexities present in operational time series tasks:

  • Irregular Sampling and Multimodality: Time-IMM catalogues nine irregularity archetypes, from event-induced logging and adaptive sampling to resource constraints and technical artifacts, capturing practical intricacies in domains such as healthcare, finance, and climate (Chang et al., 12 Jun 2025). Its accompanying IMM-TSF library enables asynchronous integration of numerical and text modalities via timestamp-to-text fusion, combining recency-aware and attention-based strategies (a recency-weighting sketch follows the table below).
  • Temporal Drift and Domain Shifts: TabReD demonstrates that model rankings can shift dramatically under temporal splits as opposed to i.i.d. random splits, highlighting the necessity of continual learning and adaptation in production systems (Rubachev et al., 27 Jun 2024).

| Dataset/Benchmark | Key Challenge Modeled | Notable Feature |
|---|---|---|
| TimeSeriesBench | Industrial anomaly detection | All-in-one, zero-shot generalization |
| TGB | Temporal graphs | Node/edge prediction with "surprise index" |
| TabReD | Tabular time series | Rich features, time-based splits |
| Time-IMM | Irregular multimodal series | Nine types of irregularity; multimodal fusion |
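
As a sketch of recency-aware weighting for asynchronous text context (the exponential decay and its time constant are illustrative assumptions, not the IMM-TSF implementation):

```python
import numpy as np

def recency_weights(event_times: np.ndarray, t_now: float,
                    tau: float = 3600.0) -> np.ndarray:
    """Exponential recency decay: w_i proportional to exp(-(t_now - t_i) / tau),
    normalized to sum to 1, so older text events contribute less."""
    dt = np.clip(t_now - event_times, 0.0, None)  # seconds since each event
    w = np.exp(-dt / tau)
    return w / w.sum()

# Three text events logged 0.5 h, 2 h, and 10 h before the forecast origin.
t_now = 36_000.0
event_times = t_now - np.array([1_800.0, 7_200.0, 36_000.0])
text_embeddings = np.random.randn(3, 8)   # stand-in text embeddings

w = recency_weights(event_times, t_now)
fused_context = w @ text_embeddings       # recency-weighted context vector
print("weights:", np.round(w, 3))
```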

6. Empirical Results and Model Insights

Experiments conducted using the benchmark reveal several robust patterns:

  • Hardness and Model Transfer: Methods that excel on one domain/task may underperform elsewhere; e.g., advanced TGNNs are sensitive to dataset "surprise index" in TGB (Huang et al., 2023).
  • Simple Model Efficacy: Persistence and moving-average methods sometimes outperform deep architectures for node affinity in certain dynamic tasks (Huang et al., 2023), while MLPs and GBDTs show robust performance under realistic temporal splits in tabular data (Rubachev et al., 27 Jun 2024); a minimal baseline sketch follows this list.
  • Multimodal Gains: Time-IMM demonstrates that joint numerical–textual modeling yields up to 38.38% MSE improvement on certain datasets, with recency-aware and cross-attention fusion crucial in handling asynchronous context (Chang et al., 12 Jun 2025).
  • Forecasting Insights: In long-term forecasting, linear solver-free models often outperform deep transformers or recurrent models as lookback windows and horizon grow, due to better generalization and computational efficiency (Cyranka et al., 2023).
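
A minimal sketch of the two naive baselines referenced above (the window length and the stand-in series are arbitrary choices):

```python
import numpy as np

def persistence_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed value across the forecast horizon."""
    return np.full(horizon, history[-1])

def moving_average_forecast(history: np.ndarray, horizon: int,
                            window: int = 24) -> np.ndarray:
    """Repeat the mean of the most recent `window` observations."""
    return np.full(horizon, history[-window:].mean())

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # random-walk stand-in series
print(persistence_forecast(series, horizon=4))
print(moving_average_forecast(series, horizon=4))
```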

7. Future Directions and Community Impact

Time-Bench and related datasets are actively maintained, with frequent updates, open evaluation protocols, active encouragement of community feedback, and expansion to new domains and tasks (Huang et al., 2023). Anticipated research directions include:

  • Development of unified/all-in-one foundation models capable of generalizing across tasks and domains, especially for anomaly detection (Si et al., 16 Feb 2024).
  • Advancements in robust multi-task and continual learning for temporally and structurally drifting data (Mustafa et al., 14 Oct 2024, Rubachev et al., 27 Jun 2024).
  • Further integration of multimodal and asynchronous data modalities—beyond tabular and numerical—to mimic an even wider set of real-world deployments (Chang et al., 12 Jun 2025).
  • Deeper exploitation of task linkages and meta-learning approaches, leveraging the benchmark's comprehensive, multi-faceted design.

Time-Bench thus provides a rigorous and evolving substrate for research and practical development, enabling a nuanced understanding of model behavior under authentic, challenging temporal regimes.