DrCIF: Diverse Representation Canonical Interval Forest
- The paper introduces DrCIF, which unifies and extends TSF and RISE by extracting phase-dependent features across multiple time series representations.
- DrCIF computes diverse features from raw, first difference, and periodogram data using a mix of 7 classical and 22 catch22 statistics for robust discrimination.
- DrCIF employs an ensemble of unpruned time-series trees with randomized interval selection, leading to notable accuracy improvements on UCR and UEA benchmarks.
The Diverse Representation Canonical Interval Forest (DrCIF) is an interval-based ensemble classifier for time series classification, introduced as a core component of the HIVE-COTE 2.0 meta-ensemble. DrCIF unifies and extends the principle of extracting discriminatory phase-dependent features from time series intervals by leveraging multiple transformations, an enlarged and diverse feature pool, and a randomized forest-based learning structure. Its design synthesizes strengths of previous interval classifiers—most notably TSF and RISE—and surpasses them in both accuracy and representational richness by targeting local features across raw, differenced, and frequency domains, utilizing both classical summary statistics and the comprehensive catch22 feature suite (Middlehurst et al., 2021).
1. Motivation and Context
HIVE-COTE’s central thesis is that combining classifiers built on diverse time series representations maximizes classification accuracy due to the complementary discriminatory information encoded across domains. In HIVE-COTE 1.0, interval-based constituents included the Time Series Forest (TSF), which utilizes random intervals with classic summary features, and the Random Interval Spectral Ensemble (RISE), which extracts features from spectral representations. DrCIF replaces both by integrating the strengths of these approaches with substantial extensions: it captures local, phase-sensitive features at multiple scales, across both the time and frequency domains, thus greatly enriching the candidate feature set available to interval-based trees. This design is motivated by empirical observations that representations such as the raw series, its first difference, and its periodogram characterize distinct aspects of data, each useful for discrimination in specific contexts (Middlehurst et al., 2021).
2. Core Representations and Feature Extraction
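DrCIF operates on three distinct representations of each series $x$ of length $m$ (for each dimension $1$ to $d$ in multivariate settings):
- Raw series: $x = (x_1, \dots, x_m)$
- First difference: $x' = (x_2 - x_1, \dots, x_m - x_{m-1})$ (of length $m - 1$)
- Periodogram: the squared magnitudes of the discrete Fourier transform of $x$ (of length roughly $m/2$)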
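The three representations can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the reference implementation: the function names are hypothetical, the periodogram uses a naive DFT, and its `1/m` scaling is an assumption (implementations differ in normalization).

```python
import cmath
import math

def first_difference(x):
    """Return the first-order difference series (length len(x) - 1)."""
    return [b - a for a, b in zip(x, x[1:])]

def periodogram(x):
    """Return scaled squared DFT magnitudes for the non-negative frequencies.

    Naive O(m^2) DFT, kept dependency-free for illustration; yields
    len(x) // 2 + 1 values for a real-valued series.
    """
    m = len(x)
    out = []
    for f in range(m // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * f * t / m) for t in range(m))
        out.append(abs(coeff) ** 2 / m)  # 1/m scaling is an assumption
    return out

def representations(x):
    """Map a series to the three views that intervals are drawn from."""
    return {
        "raw": list(x),
        "diff": first_difference(x),
        "periodogram": periodogram(x),
    }
```

In practice an FFT routine would replace the naive transform; the point here is only that each view has its own effective length.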
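From each representation, $k$ random intervals are selected per base tree. Each interval is defined by a start point $s$ and a length $l$, with $1 \le s \le m' - l + 1$ and $l_{\min} \le l \le l_{\max}$, where $m'$ is the effective length of the representation and $l_{\min}$, $l_{\max}$ bound the permitted interval length. In the multivariate case, an interval is also randomly assigned to a dimension $j \in \{1, \dots, d\}$, where $d$ is the number of dimensions.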
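A minimal sketch of this interval draw follows. The function name, the 0-based start index, and the default bounds (`min_length=3`, maximum of half the representation length) are assumptions chosen for illustration, not values taken from the text.

```python
import random

def sample_interval(rep_length, min_length=3, max_length=None, n_dims=1, rng=random):
    """Draw (start, length, dim) so the interval fits inside the representation.

    start is 0-based; the default length bounds are illustrative assumptions.
    """
    if rep_length < min_length:
        raise ValueError("representation shorter than the minimum interval length")
    if max_length is None:
        max_length = max(min_length, rep_length // 2)
    max_length = min(max_length, rep_length)
    length = rng.randint(min_length, max_length)   # inclusive on both ends
    start = rng.randint(0, rep_length - length)    # last valid start keeps the interval in range
    dim = rng.randrange(n_dims)                    # dimension for multivariate series
    return start, length, dim
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing trees.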
Within each interval, DrCIF computes a subset of $a$ features drawn randomly from a candidate pool of 29 features:
- 7 classical features: mean, standard deviation, least-squares slope, median, interquartile range, minimum, and maximum.
- 22 catch22 features: a canonical subset representing measures of autocorrelation, entropy, distributional characteristics, and fluctuation properties (see Lubba et al., 2019 for detailed definitions).
As a result, each tree extracts $3 \cdot k \cdot a$ features per series: $a$ features for each of $k$ intervals in each of the three representations.
| Representation | Interval Source | Feature Types |
|---|---|---|
| Raw series | $x$ (length $m$) | 7 classic, 22 catch22 |
| First difference | $x'$ (length $m - 1$) | 7 classic, 22 catch22 |
| Periodogram | periodogram of $x$ (length roughly $m/2$) | 7 classic, 22 catch22 |
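The seven classical summaries are simple to compute on a single interval. The sketch below uses only the Python standard library; the function names are hypothetical, and the choice of population standard deviation and the "inclusive" quantile method are assumptions, since the text does not specify them. The 22 catch22 features are omitted because they need a dedicated library.

```python
import statistics

def slope(values):
    """Least-squares slope of values against their 0-based index."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den if den else 0.0

def classic_features(series, start, length):
    """Compute the seven classical summaries on series[start:start + length]."""
    seg = series[start:start + length]
    q1, _, q3 = statistics.quantiles(seg, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(seg),
        "std": statistics.pstdev(seg),   # population std: an assumption
        "slope": slope(seg),
        "median": statistics.median(seg),
        "iqr": q3 - q1,
        "min": min(seg),
        "max": max(seg),
    }
```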
3. Forest Construction and Training Procedure
DrCIF employs an ensemble of $r$ unpruned “time-series trees,” with each tree trained on randomly subsampled features extracted from randomly chosen intervals of all three representations. Each node split is chosen to maximize information gain over the features extracted for that tree. The impurity function can be either Gini impurity,
$$G(p) = 1 - \sum_{c=1}^{C} p_c^2$$
or entropy,
$$H(p) = -\sum_{c=1}^{C} p_c \log_2 p_c$$
where $p = (p_1, \dots, p_C)$ is the class frequency vector in a node with $C$ classes.
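To make the split criterion concrete, here is a small sketch of Gini impurity, entropy, and the information gain of a single threshold split. The function names and the `<=` threshold convention are assumptions for illustration.

```python
import math
from collections import Counter

def class_frequencies(labels):
    """Relative frequency of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_frequencies(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_frequencies(labels) if p > 0)

def information_gain(labels, feature_values, threshold, impurity=entropy):
    """Impurity reduction from splitting on feature_values <= threshold."""
    left = [y for v, y in zip(feature_values, labels) if v <= threshold]
    right = [y for v, y in zip(feature_values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(labels) - weighted
```

For a balanced two-class node, a threshold that separates the classes perfectly removes all impurity, so the gain equals the parent impurity.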
Key hyperparameters and their defaults:
- $r$ (number of trees): 500
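- $k$ (intervals per representation per tree): $4 + \sqrt{m'}\,\sqrt{d}/3$, where $m'$ is the representation length and $d$ the number of dimensions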
- $a$ (features per tree): 10
The trees are grown without pruning, using only the features produced by each tree's random attribute subsample and random interval selection. Classification is performed by majority vote over all $r$ trees.
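The majority-vote step can be sketched as follows. The names are hypothetical, and each "tree" is modeled as any callable that maps a series to a class label, which is a simplification for illustration.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label; ties go to the label encountered first."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, series):
    """Collect one vote per tree and return the winning class label."""
    votes = [tree(series) for tree in trees]
    return majority_vote(votes)
```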
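Pseudocode for training a DrCIF tree:
Given a training set $D = \{(x_i, y_i)\}_{i=1}^{n}$:
- Draw a random subset $A$ of $a$ features from the 29 candidates.
- For each representation (raw series, first difference, periodogram), repeat $k$ times:
  - Randomly select $(s, l, j)$ for interval position, length, and dimension.
  - For each series $x_i$ and feature $f \in A$, compute $f$ over the interval of length $l$ starting at $s$ in dimension $j$ of that representation of $x_i$.
- Construct an unpruned tree on the resulting $n \times 3ka$ feature matrix and the labels $y_i$, splitting nodes by information gain.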
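The feature-matrix part of the procedure above can be sketched as below. Several things are assumptions made to keep the example short: the function and variable names are hypothetical, the input is univariate, the representations are precomputed and passed in as `views`, and the feature pool is a five-entry stand-in for the real 29 candidates. Fitting the tree itself is left to any decision-tree learner.

```python
import random
import statistics

# Stand-in pool of classic summaries only; the real pool has 29 features.
FEATURE_POOL = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "median": statistics.median,
    "min": min,
    "max": max,
}

def build_tree_features(views, n_intervals, n_attributes, min_length=3, rng=random):
    """Build the feature matrix for one tree.

    views maps a representation name to a list of equal-length series
    (one per training case). Returns (columns, matrix), where matrix has
    one row per case and n_views * n_intervals * n_attributes columns.
    """
    # One attribute subset per tree, shared by all of its intervals.
    attributes = rng.sample(sorted(FEATURE_POOL), n_attributes)
    n_series = len(next(iter(views.values())))
    matrix = [[] for _ in range(n_series)]
    columns = []
    for name, series_list in views.items():
        rep_length = len(series_list[0])
        for _ in range(n_intervals):
            length = rng.randint(min_length, rep_length)
            start = rng.randint(0, rep_length - length)
            for attr in attributes:
                columns.append((name, start, length, attr))
                func = FEATURE_POOL[attr]
                for row, series in zip(matrix, series_list):
                    row.append(func(series[start:start + length]))
    return columns, matrix
```

Keeping the `columns` descriptors alongside the matrix matters at prediction time, because the same intervals and attributes must be recomputed on each unseen series.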
4. Computational Complexity and Implementation
Let $m$ be the time series length, $n$ the number of series, $r$ the number of trees, $k$ the intervals per representation, and $a$ the attributes per tree. The dominant computational cost in DrCIF arises from feature extraction:
- Feature extraction per tree: 1
- Tree construction per tree: 2
- Total training time: 3
Memory requirements are dominated by storage for a single feature matrix (4) and a single tree during construction; total ensemble storage is 5.
Key efficiency optimizations include:
- Randomized interval selection, avoiding exhaustive $O(m^2)$ search over all possible intervals
- Attribute subsampling ($a$ of the 29 candidate features) per tree
- Vectorized computation and reuse of intermediate statistics for classic summaries (means, variances)
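One common way to reuse intermediate statistics across intervals is to precompute prefix sums, after which the mean and variance of any interval cost constant time. This is a swapped-in illustration of the idea, not a description of how any particular DrCIF implementation does it; the function names are hypothetical.

```python
from itertools import accumulate

def prefix_sums(series):
    """Cumulative sums of values and squared values, each with a leading 0."""
    s1 = [0.0] + list(accumulate(series))
    s2 = [0.0] + list(accumulate(v * v for v in series))
    return s1, s2

def interval_mean_var(s1, s2, start, length):
    """Mean and population variance of series[start:start + length] in O(1)."""
    end = start + length
    total = s1[end] - s1[start]
    total_sq = s2[end] - s2[start]
    mean = total / length
    var = max(total_sq / length - mean * mean, 0.0)  # clamp rounding noise
    return mean, var
```

The sum-of-squares formula can lose precision on series with a large mean relative to their spread, so centering the series first is a sensible precaution.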
5. Empirical Performance and Benchmarks
DrCIF demonstrates superior empirical performance among interval-based classifiers. On 112 univariate UCR datasets, averaged over 30 stratified resamples, DrCIF outperforms TSF, CIF, RISE, STSF, and similar interval ensembles. Statistical comparisons using pairwise Wilcoxon signed-rank tests with Holm correction at 8 show DrCIF as the top-ranked interval classifier: test-set accuracy is approximately 1–1.5 percentage points higher than CIF and 2–3 points higher than TSF, with 9. DrCIF’s contributions are central to HIVE-COTE 2.0’s performance, enabling the meta-ensemble to surpass all leading single-representation algorithms (including ROCKET, InceptionTime, TS-CHIEF, and HIVE-COTE 1.0) on both univariate (UCR) and multivariate (UEA) benchmarks (Middlehurst et al., 2021).
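For readers unfamiliar with the Holm correction used in such comparisons, the step-down procedure is short enough to sketch. The function name is hypothetical, and the default `alpha=0.05` is a conventional choice assumed here for illustration; the p-values would come from the pairwise Wilcoxon signed-rank tests, which are not implemented in this sketch.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans aligned with p_values: True where the
    corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested at alpha / m, the next at alpha / (m - 1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject
```

With the made-up inputs `[0.01, 0.04, 0.03]`, only the first hypothesis is rejected: 0.01 passes its threshold of 0.05/3, but 0.03 exceeds 0.05/2, which stops the procedure.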
6. Significance and Role in Meta-Ensembles
DrCIF exemplifies the design principle that leveraging multiple transformed views of time series and an expanded interval-feature space produces improved discrimination and robustness. Its unification of time, difference, and spectral domain features with diverse summary statistics enables more informative splits within its trees, which ultimately translates to robust majority-vote classification. Within HIVE-COTE 2.0, DrCIF’s strengths are critical to the ensemble’s overall accuracy improvements, as it effectively replaces and improves upon both phase-dependent and spectral interval constituents previously used.
This suggests that interval ensembles like DrCIF provide a highly effective mechanism for exploiting temporal locality and multi-domain redundancy in supervised time series classification, especially when equipped with diverse and empirically validated feature sets (Middlehurst et al., 2021).