
UCR Archive: Time Series Benchmark

Updated 25 February 2026
  • The UCR Archive is a benchmark repository that curates diverse univariate and multivariate time series datasets for standardized classification research.
  • It employs fixed splits, z-normalization, and baseline classifiers such as 1-NN with DTW to ensure rigorous and reproducible algorithmic evaluation.
  • Its evolution and widespread adoption have spurred advances in algorithmic taxonomy and empirical practices across domains from bio-signals to motion capture.

The UCR Archive is a foundational benchmark collection for time series classification (TSC) research, originally assembled at the University of California, Riverside (UCR) and later expanded in collaboration with the University of East Anglia (UEA). It systematically curates both univariate and multivariate time series datasets, standardizes data formats, and prescribes rigorous evaluation protocols, catalyzing the development and comparative assessment of TSC algorithms. Since its initial release in 2002 with 16 univariate datasets, the archive has grown to 128 univariate and 30 multivariate tasks as of 2018, covering domains from motion capture to bio-signals. The UCR Archive is near-universally adopted for benchmarking in TSC and has shaped both the methodology and taxonomy of empirical algorithmic evaluation in the field (Bagnall et al., 2018, Dau et al., 2018).

1. Historical Evolution and Milestones

The UCR Archive’s trajectory is marked by strategic expansions and growing influence:

  • 2002: Eamonn Keogh released the UCR Archive with 16 univariate, z-normalized datasets to provide a common benchmark, positioned as an analogue to the UCI Machine Learning Repository but purpose-built for time series (Dau et al., 2018).
  • 2005–2014: Steady community-driven growth, culminating in 45 datasets.
  • 2015 Expansion: Archive grew to 85 datasets, incorporating image-to-series conversions (Leaf, Yoga), audio (InsectWingbeatSound), and scientific signals (StarLightCurves).
  • 2018 Major Expansion: Increased to 128 datasets, addressing requests for longer/variable-length series, datasets with clearer provenance, and “tiny-train” scenarios (Bagnall et al., 2018, Dau et al., 2018).
  • 2018 Multivariate Launch: The first multivariate TSC (MTSC) collection was published—30 datasets spanning six domains, addressing limitations of the previously dominant (but small and overlapping) Baydogan collection (Bagnall et al., 2018).

The archive’s influence is pervasive: nearly every major TSC algorithm since 2015 is empirically evaluated on all archive datasets, and formal, non-parametric statistics (Friedman, Nemenyi/Bonferroni) are routinely applied for cross-dataset comparison (Bagnall et al., 2018, Middlehurst et al., 2023).

2. Dataset Organization, Scope, and Domains

The UCR Archive’s dataset composition is intentional and diverse, supporting broad benchmarking:

  • Univariate Archive (128 problems): Domains include digitized spectrographs, ECG/EEG readings, wearable-sensor gestures, and image outlines. Series lengths and number of classes (binary to multi-class) vary widely.
  • Multivariate Archive (30 problems as of 2018): Six domains include Human Activity Recognition (9), Motion (4), ECG (3), EEG/MEG (6), Audio Spectra (5), and “Other” (3) (Bagnall et al., 2018).

Representative Multivariate Datasets:

Dataset          Train/Test      Dim.   Length   Classes
BasicMotions     40/40           6      100      4
Cricket          108/72          6      1,197    12
EigenWorms       128/131         6      17,984   5
InsectWingbeat   30,000/20,000   200    78       10
LSST             2,459/2,466     6      36       14
  • Data Standardization: All multivariate data are preprocessed so that, within each problem, all series have equal length, and series with missing values are excluded. Datasets are formatted as Weka multi-instance ARFF files, each case containing one relational attribute per dimension (Bagnall et al., 2018); see the layout sketch below.
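
For orientation, a common in-memory convention for such standardized data is a 3-D array of cases × dimensions × time; the sketch below uses the BasicMotions training shape from the table above with synthetic values.

```python
import numpy as np

# Synthetic stand-in with the BasicMotions training shape:
# 40 cases, 6 dimensions, series length 100.
X = np.random.default_rng(0).normal(size=(40, 6, 100))

# Per-problem guarantees described above:
assert X.ndim == 3                  # cases x dimensions x time
assert not np.isnan(X).any()        # no missing values
n_cases, n_dims, length = X.shape   # one common length per problem
```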

The curated dataset diversity is intended to alleviate overfitting to a specific problem type or domain and to enable robust, comparative analytics.

3. Protocols: Splitting, Preprocessing, and Evaluation

Evaluation protocols in the UCR Archive are designed for maximum reproducibility and rigor:

  • Predefined Splits: Each dataset includes a fixed train/test split; these must be reported directly in algorithm evaluations (Bagnall et al., 2018, Dau et al., 2018).
  • Normalization: z-normalization (zero mean, unit variance) per series is strongly recommended; both normalized and unnormalized baseline results are provided.
  • Baseline Classifiers: For each problem, three reference baselines are reported (a minimal sketch follows this list):
    • 1-NN with Euclidean Distance (ED)
    • 1-NN with Dynamic Time Warping (DTW), unconstrained
    • 1-NN with constrained DTW (window width optimized via leave-one-out CV)
  • Multivariate Baselines: Three distances—Euclidean dimension-independent (EDI), DTW dimension-independent (DTWI), and DTW dimension-dependent (DTWD)—are supplied, evaluated on both raw and z-normalized data (Bagnall et al., 2018).
  • Statistical Comparison: Non-parametric tests, such as Wilcoxon signed-rank, Friedman, and Nemenyi post-hoc, are standard for reporting algorithmic improvement across datasets; critical difference diagrams visualize significance (Middlehurst et al., 2023, Bagnall et al., 2018, Dau et al., 2018). A sketch of these tests appears at the end of this section.
  • Practical Advice: Authors are urged to test simple baseline variants (e.g., k-NN, smoothing, window parameters) before attributing gains to novel algorithmic contributions (Dau et al., 2018).
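
To make these baselines concrete, here is a minimal NumPy sketch of 1-NN classification under z-normalization, (optionally constrained) DTW, and leave-one-out window tuning. It assumes equal-length univariate series stored as 2-D arrays; it is illustrative only, and optimized DTW implementations in libraries such as dtaidistance or aeon should be preferred in practice.

```python
import numpy as np

def z_normalize(x):
    """Per-series z-normalization (zero mean, unit variance)."""
    return (x - x.mean()) / (x.std() + 1e-8)

def dtw_distance(a, b, window=None):
    """Dynamic-programming DTW between two 1-D series; `window` is the
    Sakoe-Chiba band half-width (None = unconstrained)."""
    n, m = len(a), len(b)
    w = m if window is None else max(window, abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def one_nn_dtw(X_train, y_train, X_test, window=None):
    """Predict each test series by its nearest training series under DTW."""
    preds = []
    for x in X_test:
        dists = [dtw_distance(x, t, window) for t in X_train]
        preds.append(y_train[int(np.argmin(dists))])
    return np.array(preds)

def best_window(X_train, y_train, candidates):
    """Leave-one-out CV on the training split to pick the band width,
    mirroring the constrained-DTW baseline described above.
    X_train and y_train are NumPy arrays."""
    best_w, best_acc = None, -1.0
    idx = np.arange(len(X_train))
    for w in candidates:
        preds = [
            one_nn_dtw(X_train[idx != i], y_train[idx != i],
                       X_train[i:i + 1], w)[0]
            for i in idx
        ]
        acc = float(np.mean(np.asarray(preds) == y_train))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```

Setting window=0 recovers the Euclidean baseline and window=None the unconstrained DTW baseline, so the three reference baselines differ only in this one parameter.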

A key outcome is the high reproducibility and comparability of all TSC algorithmic research based on this archive.
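
To illustrate the statistical protocol, here is a toy example with SciPy. The accuracy matrix below is arbitrary illustration, not real results; Nemenyi post-hoc groupings are available in third-party packages such as scikit-posthocs (an assumption, not exercised here).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, wilcoxon

# Toy accuracy matrix: rows = datasets, columns = classifiers.
acc = np.array([
    [0.81, 0.84, 0.86],
    [0.72, 0.75, 0.74],
    [0.90, 0.93, 0.95],
    [0.65, 0.70, 0.69],
    [0.88, 0.87, 0.91],
])

# Friedman test: do the classifiers differ anywhere across datasets?
stat, p = friedmanchisquare(*acc.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Pairwise Wilcoxon signed-rank between classifiers 0 and 2.
w_stat, w_p = wilcoxon(acc[:, 0], acc[:, 2])
print(f"Wilcoxon stat = {w_stat:.3f}, p = {w_p:.4f}")

# Average ranks per classifier (rank 1 = most accurate on a dataset),
# the quantity plotted in critical difference diagrams.
ranks = rankdata(-acc, axis=1)
print("average ranks:", ranks.mean(axis=0))
```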

4. Algorithmic Taxonomy and Benchmarking Advances

The UCR Archive has directly shaped the empirically-driven taxonomy of TSC algorithms (Middlehurst et al., 2023). Eight categories now dominate experimental work:

  • Distance Based: 1-NN DTW, Elastic Ensemble, Proximity Forest (PF), GRAIL
  • Interval Based: TSF, RISE, CIF, DrCIF, QUANT
  • Shapelet Based: STC, RDST, RSF, MrSEQL
  • Dictionary Based: BOSS, cBOSS, WEASEL, TDE
  • Convolution Based: ROCKET, MiniROCKET, MultiROCKET, Hydra (a simplified ROCKET-style sketch follows this list)
  • Feature Based: Catch22, TSFresh, FreshPRINCE
  • Deep Learning: ResNet, InceptionTime, H-InceptionTime, LiteTime
  • Hybrid Ensembles: HC1/HC2 (HIVE-COTE), TS-CHIEF, RIST
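
To give a flavor of the convolution-based category, below is a deliberately simplified ROCKET-style transform: random dilated kernels whose maximum activation and proportion of positive values (PPV) become features for a linear classifier. All parameter choices here are illustrative assumptions; the real ROCKET samples thousands of kernels with random lengths, paddings, and dilations.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernels(n_kernels=100):
    """Random 1-D kernels with random weights, bias, and dilation
    (a much-simplified version of ROCKET's kernel sampling)."""
    kernels = []
    for _ in range(n_kernels):
        weights = rng.normal(size=rng.choice([7, 9, 11]))
        bias = rng.uniform(-1, 1)
        dilation = int(2 ** rng.integers(0, 4))
        kernels.append((weights, bias, dilation))
    return kernels

def rocket_features(X, kernels):
    """Two features per kernel: max activation and PPV. Assumes each
    series is longer than the largest dilated receptive field."""
    feats = np.zeros((len(X), 2 * len(kernels)))
    for i, x in enumerate(X):
        for k, (w, b, d) in enumerate(kernels):
            span = (len(w) - 1) * d  # dilated receptive field minus one
            conv = np.array([
                np.dot(x[s : s + span + 1 : d], w) + b
                for s in range(len(x) - span)
            ])
            feats[i, 2 * k] = conv.max()
            feats[i, 2 * k + 1] = (conv > 0).mean()
    return feats
```

The resulting feature matrix is typically fed to a linear model such as scikit-learn's RidgeClassifierCV.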

Recent bake-off studies have established average ranks and significance groupings: HC2 (HIVE-COTE v2.0) is the most accurate on average, followed by MultiROCKET+Hydra, and both sit in the top clique. Convolution-based and hybrid pipelines thus deliver state-of-the-art accuracy and efficiency.

Algorithm Category       Example Best Algorithm   Ave. Rank (112 datasets)
Hybrid                   HC2 (HIVE-COTE v2.0)     ≈1.00
Convolution+Dictionary   MultiROCKET+Hydra        ≈2.00
Shapelet                 RDST                     ≈3.00
Deep Learning            H-InceptionTime          ≈4.00

These empirical hierarchies are data-driven and reproducible due to the archive (Middlehurst et al., 2023).

5. Critiques, Artifacts, and the UCR Augmented Benchmark

A substantial body of work critiques the canonical use of the UCR Archive for TSC, revealing that many datasets do not require temporal information for classification. Zhang et al. showed that, for up to 50% of UCR datasets, classifiers perform equally well or nearly as well after temporal order is randomly permuted; these datasets are “effectively tabular” (Zhang et al., 26 Mar 2025).

  • Permutation Testing Methodology: Accuracy before and after a random permutation of the time indices is compared using paired difference statistics and the Wilcoxon signed-rank test. For a significant subset of datasets, permutation had no impact on classifier performance (Zhang et al., 26 Mar 2025).
  • UCR Augmented Benchmark: To address this artifact, a misalignment augmentation is introduced: each series is randomly padded with a Gaussian random walk at both ends, forcing reliance on temporal order (a sketch of both procedures follows this list). Algorithms are then evaluated at five augmentation strengths across 105 equal-length datasets.
  • Empirical Impact: Under this protocol, the performance of phase-dependent/tabular classifiers (e.g., Rotation Forest, CIF) declines sharply, while phase-independent or misalignment-tolerant methods (STC, MiniROCKET) are robust.
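
A minimal sketch of both procedures (pad lengths and noise scale are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(42)

def permute_time(X_train, X_test):
    """Destroy temporal order with one shared random permutation of the
    time indices, applied identically to both splits; if accuracy is
    unchanged, the problem is effectively tabular. Assumes equal-length
    series in 2-D arrays."""
    perm = rng.permutation(X_train.shape[-1])
    return X_train[..., perm], X_test[..., perm]

def random_walk_pad(x, max_pad=50, sigma=1.0):
    """Pad a 1-D series with Gaussian random walks on both ends,
    anchored at its endpoints, forcing classifiers to cope with
    misalignment."""
    left_n, right_n = rng.integers(0, max_pad + 1, size=2)
    left = (x[0] + np.cumsum(rng.normal(0.0, sigma, left_n)))[::-1]
    right = x[-1] + np.cumsum(rng.normal(0.0, sigma, right_n))
    return np.concatenate([left, x, right])
```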

This paradigm shift suggests that robust assessment of time series classifiers should report results not only on the original archive but also on UCR Augmented, or equivalent protocols that explicitly test temporal dependence (Zhang et al., 26 Mar 2025).

6. Access, Usage, and Community Practices

The UCR Archive, including the multivariate extension, is publicly available: the univariate collection is hosted at the UCR time series classification page (www.cs.ucr.edu/~eamonn/time_series_data_2018/), and the combined UEA/UCR repository at www.timeseriesclassification.com distributes the datasets, fixed splits, baseline results, and accompanying code. Reporting accuracy on the predefined splits and publishing full per-dataset results are expected practice. These community norms ensure transparency and reliable accumulation of empirical evidence in time series research.
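
For a concrete starting point, the 2018 univariate splits are distributed as tab-separated text files named <Dataset>_TRAIN.tsv and <Dataset>_TEST.tsv, with the class label in the first column; a minimal loader sketch is:

```python
import numpy as np

def load_ucr_split(path):
    """Load one univariate UCR split: column 0 is the class label,
    the remaining columns are the time series values."""
    data = np.loadtxt(path, delimiter="\t")
    return data[:, 1:], data[:, 0]

# X_train, y_train = load_ucr_split("GunPoint_TRAIN.tsv")
# X_test, y_test = load_ucr_split("GunPoint_TEST.tsv")
```

Dedicated readers for the .ts and ARFF formats are also provided by community toolkits such as sktime and aeon.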

7. Influence, Limitations, and Future Directions

The UCR Archive's impact is multifaceted:

  • It has standardized TSC evaluation and motivated rigorous statistical methodology.
  • Empirical algorithmic progress, such as ROCKET-family pipelines and hybrid ensembles, is quantifiable and transparent (Middlehurst et al., 2023).
  • Critiques of “tabular artifacts” have motivated new benchmarks (UCR Augmented) and nuanced interpretation of result tables, especially for phase-dependence and temporal feature utilization (Zhang et al., 26 Mar 2025).

A plausible implication is that future expansion will need to emphasize datasets for which temporal structure is definitively class-relevant and further extend multivariate and real-world, noisy scenarios.

The UCR Archive remains a canonical resource driving both methodological advances and meta-research on fair empirical practice within the time series community (Bagnall et al., 2018, Dau et al., 2018, Middlehurst et al., 2023, Zhang et al., 26 Mar 2025).
