TimeSeriesBench: Unified ML Benchmark Suite
- TimeSeriesBench is a comprehensive suite of benchmarks and datasets designed to standardize and improve time-series ML evaluation in industrial and scientific settings.
- It integrates diverse datasets—from industrial anomaly detection to semiconductor manufacturing simulations—to enable robust testing of forecasting, detection, and data infrastructure models.
- Its transparent protocols and standardized splits support reproducibility, transfer learning, and unified evaluation, addressing challenges like concept drift and cold-start issues.
TimeSeriesBench is a suite of rigorously defined benchmarks and datasets for time-series machine learning in industrial and scientific contexts. The term refers to several distinct but thematically linked projects, each designed to advance evaluation standards, reproducibility, and robustness for specific classes of time-series tasks. The most notable instances include: (1) an industrial-grade anomaly detection benchmark and dataset for large-scale monitoring (Si et al., 2024); (2) a discrete-event simulation-based dataset for semiconductor manufacturing ML (Pendyala et al., 2024); (3) a comprehensive forecasting benchmark with long-horizon, multi-physics trajectories (Cyranka et al., 2023); and (4) a benchmarking suite for high-frequency time-series data infrastructure (Barez et al., 2023). Collectively, TimeSeriesBench frameworks address core pain points in model generalization, unified evaluation, and industrial applicability.
1. Scope and Motivations
TimeSeriesBench frameworks originate from recognized deficiencies in the evaluation of time-series ML. Existing benchmarks tend to rely on legacy data sources, per-series modeling, and misleading metrics, which limits their value for industrial deployment and for assessing foundation models. The suite was motivated by:
- Operational scalability: Enabling assessment of unified models capable of handling tens of thousands of streams—crucial in modern distributed and IoT systems (Si et al., 2024).
- Realistic evaluation: Incorporation of event-based metrics and challenging real-world or simulated domains (e.g., semiconductor manufacturing (Pendyala et al., 2024); high-frequency trading (Barez et al., 2023)).
- Transfer and generalization: Protocols for zero-shot and cross-domain adaptation, mirroring production cold-starts and concept drift (Si et al., 2024).
- Benchmark fairness: Provision of standardized splits, transparent protocols, and open-source code, ensuring reproducibility and comparability across the academic and industrial ML communities (Cyranka et al., 2023).
- Physics-informed modeling: Surrogate datasets modeled on formal discrete-event process specifications, intended for downstream use in surrogate modeling and digital twinning (Pendyala et al., 2024).
2. Benchmark Design and Dataset Construction
TimeSeriesBench encompasses a variety of domain-specific and domain-agnostic benchmarks.
Industrial-Grade Anomaly Detection (Si et al., 2024)
- Dataset composition: Aggregation of six standardized public anomaly detection datasets (AIOPS, WSD, Yahoo, NAB, UCR Archive, NEK), plus synthetic curves for behavior-specific testing.
- Learning schemas:
  - Naïve: per-series historical training and testing.
  - All-in-One: unified model trained on all series.
  - Zero-Shot: train on subset A, infer on disjoint subset B, simulating deployment cold starts.
- Evaluation protocol: All metrics are event-level, with a reduced-length correction that weights each anomaly segment as a function of its length to account for segment severity.
- Splitting: Fixed train/validation/test splits; test segments without anomalies are dropped to focus evaluation on anomalous events.
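The three learning schemas differ only in which series feed training and which series are evaluated. The following is a minimal sketch under that reading, not the benchmark's own code: `Series`, `naive_plan`, `all_in_one_plan`, and `zero_shot_plan` are hypothetical names introduced here for illustration, and each series is assumed to be already split chronologically.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Series:
    """Hypothetical container: one monitored stream, already split in time."""
    name: str
    train: Tuple[float, ...]  # historical portion
    test: Tuple[float, ...]   # held-out portion


# A plan is a list of (training series names, evaluated series name) pairs.
# Each pair stands for one model fit followed by one evaluation.
Plan = List[Tuple[Tuple[str, ...], str]]


def naive_plan(series: Iterable[Series]) -> Plan:
    """One model per series: train on its own history, test on its own future."""
    return [((s.name,), s.name) for s in series]


def all_in_one_plan(series: Iterable[Series]) -> Plan:
    """One unified model trained on every series, then tested on each one."""
    items = list(series)
    names = tuple(s.name for s in items)
    return [(names, s.name) for s in items]


def zero_shot_plan(series: Iterable[Series], train_names: Set[str]) -> Plan:
    """Train on subset A, evaluate only on the disjoint subset B (cold start)."""
    items = list(series)
    subset_a = tuple(s.name for s in items if s.name in train_names)
    return [(subset_a, s.name) for s in items if s.name not in train_names]
```

For example, with three streams `cpu`, `mem`, and `disk`, calling `zero_shot_plan(data, {"cpu", "mem"})` yields a single evaluation on `disk` by a model that never saw it during training.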
Semiconductor Manufacturing Surrogate Data (Pendyala et al., 2024)
- Simulated factory: Parallel DEVS-based model of a canonical semiconductor “MiniFab,” with complete, coupled DEVS specifications and atomic models for Diffusion, Implantation, and Lithography machines, plus coordinators.
- Variables: Factory throughput, turnaround time, machine-level processing and loading times.
- Scenarios: 372 runs covering 93 lot-size configurations × 4 operational modes (including various repair triggers and arrival patterns).
- Monitored signals: Multivariate time-series of stage-wise and aggregate variables; sampling at 1-minute intervals for trajectories of ~25,000 steps.
- Availability: CSVs and serialized NumPy/Pandas tables; MIT License.
Long-Term Forecasting (Cyranka et al., 2023)
- Dataset suite: 12 datasets (energy, finance, traffic, climate, synthetic ODE/PDE, MuJoCo physics).
- Splitting: 80/20 train/test splits, or fixed-size splits for simulators; look-back windows of up to 2,000 steps with a fixed default forecast horizon.
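A chronological 80/20 split followed by sliding-window extraction can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and because the default horizon is not stated here, `horizon` is left as a required argument rather than given a default.

```python
from typing import List, Sequence, Tuple

Window = Tuple[List[float], List[float]]  # (look-back inputs, horizon targets)


def chronological_split(values: Sequence[float],
                        train_frac: float = 0.8) -> Tuple[List[float], List[float]]:
    """Split without shuffling so the test part lies strictly in the future."""
    cut = int(len(values) * train_frac)
    return list(values[:cut]), list(values[cut:])


def make_windows(values: Sequence[float], look_back: int, horizon: int) -> List[Window]:
    """Slide over the series, pairing each look-back window with its horizon."""
    windows: List[Window] = []
    last_start = len(values) - look_back - horizon
    # range() is empty when the series is shorter than look_back + horizon.
    for start in range(last_start + 1):
        split = start + look_back
        windows.append((list(values[start:split]),
                        list(values[split:split + horizon])))
    return windows
```

Windows are built separately inside each part, so no training window reaches into test data.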
High-Frequency Database Benchmark (Barez et al., 2023)
- Workload: Order book + trade data (20–30M rows, sub-10s granularity) for BTC-USD, ETH-USD, USDT-USD.
- Database systems: kdb+, InfluxDB, TimescaleDB, ClickHouse.
- Tasks: Query latency, write throughput, and storage compression, providing a protocol for reproducible infrastructure evaluation.
3. Evaluation Metrics and Experimental Protocols
TimeSeriesBench specifies its metrics explicitly, tailored to each sub-benchmark:
Anomaly Detection (Si et al., 2024)
- Reduced-length PA: Event-based metrics weighted by length, addressing overcounting or underweighting of extended anomalies.
- Precision/Recall/F1: Length-weighted across detected and flagged anomaly segments.
- AUPRC: Area under the precision-recall curve, computed with the reduced-length adjustment.
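Event-level scoring can be illustrated with a small sketch. Several choices here are assumptions rather than the benchmark's definition: a true segment counts as detected if any point inside it is flagged, every flagged point outside a true segment counts as one false positive, and the per-segment weight is a caller-supplied function of segment length, since the exact reduced-length weighting is not reproduced in this text. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def segments(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs for each contiguous run of 1s."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def event_f1(labels: Sequence[int], preds: Sequence[int],
             weight: Callable[[int], float] = lambda length: 1.0
             ) -> Tuple[float, float, float]:
    """Event-level precision, recall, and F1 with a length-dependent weight.

    weight(length) is an assumption: the default counts every segment once.
    """
    tp = fn = 0.0
    covered = set()
    for start, end in segments(labels):
        w = weight(end - start)
        covered.update(range(start, end))
        if any(preds[start:end]):   # one hit anywhere detects the whole event
            tp += w
        else:
            fn += w
    # Assumption: each flagged point outside true segments is one false positive.
    fp = float(sum(1 for i, p in enumerate(preds) if p and i not in covered))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Passing a different `weight` changes how much a long anomaly contributes relative to a short one, which is the effect the reduced-length correction is meant to control.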
Forecasting (Pendyala et al., 2024, Cyranka et al., 2023)
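- MSE: $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, where $y_i$ is the observed value, $\hat{y}_i$ the forecast, and $n$ the number of forecast points.
- MAE: $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert$
- MAPE: $\frac{100\%}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i-\hat{y}_i}{y_i}\right\rvert$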
- $R^2$ score, MFE: For deeper model comparisons.
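The three point-error metrics follow their standard textbook definitions and can be computed in a few lines. This is a plain-Python sketch with hypothetical function names; MAPE is reported in percent and assumes no observed value is zero.

```python
from typing import Sequence


def mse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean squared error: average of squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error: average magnitude of residuals."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; undefined if any y is 0."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)
```

Because MAPE divides by the observed value, it is unstable on series that pass through or near zero, which is one reason to report it alongside MSE and MAE rather than alone.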
Infrastructure (Barez et al., 2023)
- Throughput: Write rate, i.e., records ingested per unit time.
- Query latency: Mean and 95th percentile.
- Compression ratio: Size of the raw data relative to its stored size.
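The infrastructure metrics can be gathered with a small timing harness. The sketch below is an assumption-laden illustration, not the benchmark's scripts: all function names are hypothetical, the 95th percentile uses the nearest-rank method, throughput is taken as rows per second, and the compression ratio is taken as raw bytes divided by stored bytes (some tools report the inverse).

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def time_call(fn: Callable[[], object], repeats: int) -> List[float]:
    """Run fn repeatedly and return per-call wall-clock latencies in seconds."""
    samples: List[float] = []
    for _ in range(repeats):
        started = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - started)
    return samples


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(latencies: Sequence[float]) -> Dict[str, float]:
    """Mean and 95th-percentile latency, the two figures reported above."""
    return {"mean": sum(latencies) / len(latencies),
            "p95": percentile(latencies, 95)}


def throughput(rows_written: int, elapsed_s: float) -> float:
    """Assumed definition: rows ingested per second."""
    return rows_written / elapsed_s


def compression_ratio(raw_bytes: int, stored_bytes: int) -> float:
    """Assumed definition: raw size over stored size (higher is better)."""
    return raw_bytes / stored_bytes
```

In practice `fn` would wrap a single query against the system under test; reporting the 95th percentile alongside the mean exposes tail latency that an average alone would hide.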
Modeling paradigms
- Classical models (AR, ARIMA, etc.), RNN/LSTM, TCN, TFT, VAE, and Transformer architectures are benchmarked explicitly, with standardized hyperparameters, training regimens, and loss functions.
4. Empirical Results and Model Insights
TimeSeriesBench delivers systematic, scenario-matched comparisons. Selected empirical findings:
| Model | Anomaly F1 (dataset, schema) | Throughput RMSE (DEVS MiniFab) | Forecasting MSE (Electricity) |
|---|---|---|---|
| AR | 0.92 (AIOPS, naïve) | – | – |
| LSTM | 0.88 (AIOPS, naïve) | 9.73e-9 | 0.144 |
| VAE | 0.94 (UCR) | – | – |
| TCN | – | 2.29e-9 | – |
| TFT | – | 2.11e-8 | 0.098 |
- All-in-One models outperform per-series models for high-noise spike anomalies but can underperform on pattern anomalies (notably for VAEs).
- Zero-shot transfer achieves F1 within 2–3% of all-in-one, demonstrating practical transferability for unknown streams.
- In forecasting, simple linear models (e.g., NLinear, ARIMA) outperform deep baselines on long-horizon, smooth, or deterministic tasks, while transformers and convolutional models excel in stochastic settings (Pendyala et al., 2024, Cyranka et al., 2023).
- DEVS surrogates enable multivariate forecasting and digital-twin surrogacy, with high $R^2$ in multistage, cascade throughput forecasting.
5. Use Cases and Extensions
TimeSeriesBench frameworks serve several high-impact roles:
- Industrial anomaly detection: Evaluation of both local (e.g., AR-based) and global (e.g., VAE, transformer) models, under operational constraints such as cold-start and unified-deployment.
- Smart manufacturing: Machine learning development for throughput forecasting, anomaly/concept-drift detection, and process behavior analysis using simulated, high-fidelity time-series (Pendyala et al., 2024).
- Digital twins: Surrogate model training against discrete-event simulation traces.
- Infrastructure benchmarking: Transparent, replicable evaluation of time-series data management systems, enabling informed system design choices for high-frequency domains (Barez et al., 2023).
The modular design (consistent pre-processing, fixed splits, explicit protocol definitions) underpins reproducibility and yields actionable insights for both ML model selection and industrial system operation.
6. Public Availability and Licensing
All major TimeSeriesBench datasets and codebases are openly released:
- DEVS-based semiconductor benchmark, code, and data — MIT License (Pendyala et al., 2024)
- Anomaly detection datasets, leaderboard, protocol, code via EasyTSAD — free academic and commercial use with citation (Si et al., 2024)
- Long-term forecasting configuration, data, and models (Cyranka et al., 2023)
- Database benchmark scripts, data tables (Barez et al., 2023)
The availability of scenario configurations, full raw data, and essential analysis scripts supports longitudinal research and continuous benchmarking aligned with evolving real-world requirements.
7. Current Limitations and Future Perspectives
Current iterations of TimeSeriesBench mainly focus on univariate or modestly multivariate settings, with clear extension routes toward:
- More complex modalities: Inclusion of images or event sequences associated with time-series, multi-modal reasoning, etc.
- Online and incremental protocols: Streaming evaluation and continual learning settings.
- Rich physics-informed or graph-based extensions: Especially for manufacturing and other process-driven domains.
- Integration with foundation model benchmarks: TimeSeriesBench datasets are positioned as robust ground truth for meta-evaluation of emerging multi-task or foundation models.
TimeSeriesBench, by unifying scenario generation, industrial realism, and transparent protocolization, sets a reproducible standard for benchmarking ML models on time-series data across forecasting, anomaly detection, and data infrastructure (Si et al., 2024, Pendyala et al., 2024, Cyranka et al., 2023, Barez et al., 2023).