
TimeSeriesBench: Unified ML Benchmark Suite

Updated 16 May 2026
  • TimeSeriesBench is a comprehensive suite of benchmarks and datasets designed to standardize and improve time-series ML evaluation in industrial and scientific settings.
  • It integrates diverse datasets—from industrial anomaly detection to semiconductor manufacturing simulations—to enable robust testing of forecasting, detection, and data infrastructure models.
  • Its transparent protocols and standardized splits support reproducibility, transfer learning, and unified evaluation, addressing challenges like concept drift and cold-start issues.

TimeSeriesBench is a suite of rigorously defined benchmarks and datasets for time-series machine learning in industrial and scientific contexts. The term refers to several distinct but thematically linked projects, each designed to advance evaluation standards, reproducibility, and robustness for specific classes of time-series tasks. The most notable instances include: (1) an industrial-grade anomaly detection benchmark and dataset for large-scale monitoring (Si et al., 2024); (2) a discrete-event simulation-based dataset for semiconductor manufacturing ML (Pendyala et al., 2024); (3) a comprehensive forecasting benchmark with long-horizon, multi-physics trajectories (Cyranka et al., 2023); and (4) a benchmarking suite for high-frequency time-series data infrastructure (Barez et al., 2023). Collectively, TimeSeriesBench frameworks address core pain points in model generalization, unified evaluation, and industrial applicability.

1. Scope and Motivations

TimeSeriesBench frameworks originate from recognized deficiencies in the evaluation of time-series ML. Existing benchmarks tend to rely on legacy data sources, per-series modeling, and misleading metrics, limiting their value for industrial deployment and foundational model assessment. The suite was motivated by:

  • Operational scalability: Enabling assessment of unified models capable of handling tens of thousands of streams—crucial in modern distributed and IoT systems (Si et al., 2024).
  • Realistic evaluation: Incorporation of event-based metrics and challenging real-world or simulated domains (e.g., semiconductor manufacturing (Pendyala et al., 2024); high-frequency trading (Barez et al., 2023)).
  • Transfer and generalization: Protocols for zero-shot and cross-domain adaptation, mirroring production cold-starts and concept drift (Si et al., 2024).
  • Benchmark fairness: Provision of standardized splits, transparent protocols, and open-source code, ensuring reproducibility and comparability across the academic and industrial ML communities (Cyranka et al., 2023).
  • Physics-informed modeling: Surrogate datasets modeled on formal discrete-event process specifications, to support downstream surrogate modeling and digital twinning (Pendyala et al., 2024).

2. Benchmark Design and Dataset Construction

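TimeSeriesBench encompasses a variety of domain-specific and domain-agnostic benchmarks. The bullets below cover, in order, the industrial anomaly detection benchmark (Si et al., 2024), the DEVS-based semiconductor manufacturing dataset (Pendyala et al., 2024), the forecasting benchmark (Cyranka et al., 2023), and the data infrastructure benchmark (Barez et al., 2023).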

  • Dataset composition: Aggregation of six standardized public anomaly detection datasets (AIOPS, WSD, Yahoo, NAB, UCR Archive, NEK), plus synthetic curves for behavior-specific testing.
  • Learning schemas:
    • Naïve: per-series historical training and testing.
    • All-in-One: unified model trained on all series.
    • Zero-Shot: train on subset A, infer on disjoint subset B, simulating deployment cold starts.
  • Evaluation protocol: All metrics are event-level, with a reduced-length correction to account for anomaly segment severity (weighting by $w_j = \ln(k_j + e)$, where $k_j$ is the length of segment $j$).
  • Splitting: Fixed train/validation/test splits; test segments without anomalies are dropped to focus the evaluation.
  • Simulated factory: Parallel DEVS-based model of a canonical semiconductor “MiniFab.” Complete, coupled DEVS specifications with atomic models for Diffusion, Implantation, Lithography machines and coordinators.
  • Variables: Factory throughput, turnaround time, machine-level processing and loading times.
  • Scenarios: 372 runs covering 93 lot-size configurations × 4 operational modes (including various repair triggers and arrival patterns).
  • Monitored signals: Multivariate time-series of stage-wise and aggregate variables; sampling at 1-minute intervals for trajectories of ~25,000 steps.
  • Availability: CSVs and serialized NumPy/Pandas tables; ∼MIT License.
  • Dataset suite: 12 datasets (energy, finance, traffic, climate, synthetic ODE/PDE, MuJoCo physics).
  • Splitting: 80/20 train/test or fixed-size splits for simulators, look-back $L$ up to 2,000, horizon $H = L$ by default.
  • Workload: Order book + trade data (20–30M rows, sub-10s granularity) for BTC-USD, ETH-USD, USDT-USD.
  • Database systems: kdb+, InfluxDB, TimescaleDB, ClickHouse.
  • Tasks: Query latency, write throughput, and storage compression, providing a protocol for reproducible infrastructure evaluation.
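The three learning schemas listed for the anomaly detection benchmark (Naïve, All-in-One, Zero-Shot) differ only in which series a model sees during training. The sketch below illustrates that difference on a dictionary of named series. The function names and the `train_frac` parameter are hypothetical and chosen for illustration; the benchmark itself uses fixed splits whose ratios are not stated here.

```python
from typing import Dict, Iterable, List, Tuple

SeriesMap = Dict[str, List[float]]


def naive_splits(series: SeriesMap, train_frac: float = 0.5) -> Dict[str, Tuple[List[float], List[float]]]:
    """Naive schema: every series trains on its own history and tests on its own future.

    train_frac is an illustrative assumption, not a value from the benchmark.
    """
    out = {}
    for name, values in series.items():
        cut = int(len(values) * train_frac)
        out[name] = (list(values[:cut]), list(values[cut:]))
    return out


def all_in_one_split(series: SeriesMap, train_frac: float = 0.5) -> Tuple[List[List[float]], SeriesMap]:
    """All-in-One schema: a single model is trained on the pooled histories of all series."""
    per_series = naive_splits(series, train_frac)
    pooled_train = [train for train, _ in per_series.values()]
    tests = {name: test for name, (_, test) in per_series.items()}
    return pooled_train, tests


def zero_shot_split(series: SeriesMap, train_names: Iterable[str]) -> Tuple[SeriesMap, SeriesMap]:
    """Zero-Shot schema: train on subset A, infer on the disjoint subset B."""
    train_set = set(train_names)
    unknown = train_set.difference(series)
    if unknown:
        raise KeyError(f"unknown series names: {sorted(unknown)}")
    train = {n: list(v) for n, v in series.items() if n in train_set}
    test = {n: list(v) for n, v in series.items() if n not in train_set}
    return train, test
```

With this layout, a zero-shot evaluation never exposes any point of a test series to the model, which is what makes it a proxy for a deployment cold start.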

3. Evaluation Metrics and Experimental Protocols

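TimeSeriesBench defines its metrics per sub-benchmark. The list below covers, in order, event-level anomaly detection metrics, forecasting error metrics, and data infrastructure metrics: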

  • Reduced-length PA: Event-based metrics weighted by length, addressing overcounting or underweighting of extended anomalies.
  • Precision/Recall/F1: $w$-weighted across detected and flagged anomaly segments.
  • AUPRC: Area under (precision, recall) curve using reduced-length adjustment.
  • MSE: $\frac{1}{n} \sum_{t=1}^n (y_t - \hat y_t)^2$
  • MAE: $\frac{1}{n} \sum_{t=1}^n |y_t - \hat y_t|$
  • MAPE: $\frac{1}{n} \sum_{t=1}^n \frac{|y_t - \hat y_t|}{|y_t| + \epsilon}$
  • $R^2$ Score, MFE: For deeper model comparisons.
  • Throughput: $T = \frac{N_{\text{bytes loaded}}}{t_{\text{ingest}}}$
  • Query latency: Mean and 95th percentile.
  • Compression ratio: $C = \frac{S_{\text{on-disk}}}{S_{\text{raw CSV}}}$
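As a rough illustration of the reduced-length weighting behind the event-level anomaly metrics above, the sketch below extracts anomaly segments from a binary label sequence, assigns each the weight $w_j = \ln(k_j + e)$, and computes a weighted recall. The rule that a segment counts as detected when at least one of its points is flagged is an assumption borrowed from the usual point-adjustment convention, and the function names are hypothetical; the benchmark's exact precision and F1 definitions are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def anomaly_segments(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each run of anomalous points."""
    segments = []
    start = None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i                      # a new segment begins
        elif not flag and start is not None:
            segments.append((start, i))    # the current segment just ended
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


def segment_weight(length: int) -> float:
    """w_j = ln(k_j + e): grows only logarithmically with the segment length k_j."""
    return math.log(length + math.e)


def weighted_event_recall(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Weighted share of true segments that contain at least one flagged point.

    Assumption: "detected" means any hit inside the segment.
    """
    total = 0.0
    hit = 0.0
    for start, end in anomaly_segments(labels):
        w = segment_weight(end - start)
        total += w
        if any(predictions[start:end]):
            hit += w
    return hit / total if total else 0.0
```

Because the weight is logarithmic, a segment of length 100 counts for roughly 3.5 times as much as a segment of length 1, rather than 100 times as much as it would under point-wise scoring.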

Modeling paradigms

  • Classical (ARIMA, AR, etc.), RNN/LSTM, TCN, TFT, VAE, and Transformer architectures are explicitly benchmarked, with standardized hyperparameters, training regimens, and loss functions.
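To make the forecasting setup concrete, the following sketch slices a series into look-back/horizon pairs with $H = L$ by default and scores a forecast with the MSE, MAE, and MAPE formulas given in the metrics list. The persistence forecaster (repeat the last observed value) is only a stand-in so the example runs; it is not one of the benchmarked models, and all function names and the `eps` default are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def make_windows(series: Sequence[float], lookback: int,
                 horizon: Optional[int] = None) -> List[Tuple[List[float], List[float]]]:
    """Slice a series into (input, target) pairs; horizon defaults to lookback (H = L)."""
    if horizon is None:
        horizon = lookback
    pairs = []
    last_start = len(series) - lookback - horizon
    for s in range(last_start + 1):        # empty when the series is too short
        x = list(series[s:s + lookback])
        y = list(series[s + lookback:s + lookback + horizon])
        pairs.append((x, y))
    return pairs


def persistence_forecast(window: Sequence[float], horizon: int) -> List[float]:
    """Placeholder model: repeat the last observed value for every future step."""
    return [window[-1]] * horizon


def mse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y: Sequence[float], y_hat: Sequence[float], eps: float = 1e-8) -> float:
    """MAPE with eps in the denominator to avoid division by zero (eps value is assumed)."""
    return sum(abs(a - b) / (abs(a) + eps) for a, b in zip(y, y_hat)) / len(y)
```

A typical use is to call `make_windows` on the test portion of a series, run the model on each input window, and average the chosen metric over all resulting pairs.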

4. Empirical Results and Model Insights

TimeSeriesBench delivers systematic, scenario-matched comparisons. Selected empirical findings:

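| Model | Anomaly F1 (AIOPS) | Throughput RMSE (DEVS MiniFab) | Forecasting MSE (Electricity) |
|-------|--------------------|--------------------------------|-------------------------------|
| AR    | 0.92 (naïve)       | -                              | -                             |
| LSTM  | 0.88 (naïve)       | 9.73e-9                        | 0.144                         |
| VAE   | 0.94 (UCR)         | -                              | -                             |
| TCN   | -                  | 2.29e-9                        | -                             |
| TFT   | -                  | 2.11e-8                        | 0.098                         |
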
  • All-in-One models outperform per-series models for high-noise spike anomalies but can underperform on pattern anomalies (notably for VAEs).
  • Zero-shot transfer achieves F1 within 2–3% of all-in-one, demonstrating practical transferability for unknown streams.
  • In forecasting, simple linear models (e.g., NLinear, ARIMA) outperform deep baselines on long-horizon, smooth, or deterministic tasks, while transformers and convolutional models excel in stochastic settings (Pendyala et al., 2024; Cyranka et al., 2023).
  • DEVS surrogates enable multivariate forecasting and digital-twin surrogacy with high $R^2$ in multistage, cascade throughput forecasting.

5. Use Cases and Extensions

TimeSeriesBench frameworks serve several high-impact roles:

  • Industrial anomaly detection: Evaluation of both local (e.g., AR-based) and global (e.g., VAE, transformer) models under operational constraints such as cold starts and unified deployment.
  • Smart manufacturing: Machine learning development for throughput forecasting, anomaly/concept-drift detection, and process behavior analysis using simulated, high-fidelity time-series (Pendyala et al., 2024).
  • Digital twins: Surrogate model training against discrete-event simulation traces.
  • Infrastructure benchmarking: Transparent, replicable evaluation of time-series data management systems, enabling informed system design choices for high-frequency domains (Barez et al., 2023).

The modular design (consistent pre-processing, fixed splits, explicit protocol definitions) underpins reproducibility and yields actionable insights for both ML model selection and industrial system operation.

6. Public Availability and Licensing

All major TimeSeriesBench datasets and codebases are openly released.

The availability of scenario configurations, full raw data, and essential analysis scripts supports longitudinal research and continuous benchmarking aligned with evolving real-world requirements.

7. Current Limitations and Future Perspectives

Current iterations of TimeSeriesBench mainly focus on univariate or modestly multivariate settings, with clear extension routes toward:

  • More complex modalities: Inclusion of images or event sequences associated with time-series, multi-modal reasoning, etc.
  • Online and incremental protocols: Streaming evaluation and continual learning settings.
  • Rich physics-informed or graph-based extensions: Especially for manufacturing and other process-driven domains.
  • Integration with foundation model benchmarks: TimeSeriesBench datasets are positioned as robust ground truth for meta-evaluation of emerging multi-task or foundation models.

By unifying scenario generation, industrial realism, and transparent protocols, TimeSeriesBench sets a reproducible standard for benchmarking ML models on time-series data across forecasting, anomaly detection, and data infrastructure (Si et al., 2024; Pendyala et al., 2024; Cyranka et al., 2023; Barez et al., 2023).
