Hydra: Competing Convolutional Kernels
- The paper introduces Hydra, a transform-based algorithm that unifies dictionary-style pattern counting with ROCKET-like global pooling through competitive kernel groups.
- It employs random 1D convolutional kernels to extract hard and soft counts, efficiently summarizing local patterns in time series data.
- Empirical results demonstrate Hydra’s superior accuracy and computational efficiency on benchmark and large-scale datasets compared to traditional methods.
Hydra is a transform-based algorithm for time series classification that fuses dictionary methods and random convolutional kernel approaches. Its defining mechanism is the competition among randomly initialized 1D convolutional kernels grouped into fixed-size sets, which enables efficient extraction of local pattern counts and summary statistics from time series data. Hydra is computationally frugal and can interpolate, via a single key hyperparameter, between traditional dictionary-based pattern counting and the global pooling strategies of random-kernel methods such as ROCKET. This architecture achieves state-of-the-art classification accuracy on diverse benchmarks while remaining feasible for large-scale datasets (Dempster et al., 2022; Maniar, 7 Dec 2025; Vargas et al., 2023).
1. Core Algorithmic Principles
Hydra constructs groups of competing convolutional kernels, each group containing one-dimensional filters of fixed length (typically $\ell = 9$). Kernels are initialized randomly, with weights drawn i.i.d. from the standard normal distribution $\mathcal{N}(0, 1)$ and normalized by mean subtraction, so that each kernel's weights sum to zero. Kernels within a group are applied in parallel to the input series, and at each time point $t$ the strongest kernel (by response magnitude) is selected as the "winner" for that position.
Mathematically, for an input series $X \in \mathbb{R}^{n}$, kernel $W_{g,k}$ (group $g$, kernel $k$) of length $\ell$, and dilation $d$, the raw convolutional response at time $t$ is:

$$Z_{g,k}[t] = \sum_{j=0}^{\ell-1} W_{g,k}[j] \, X[t + j \cdot d]$$

The winning kernel index in group $g$ at time $t$ is:

$$k^{*}_{g}[t] = \arg\max_{k} \, Z_{g,k}[t]$$
Two families of features are extracted for each kernel:
- Hard-count: the number of time points at which kernel $k$ wins its group, $c^{\mathrm{hard}}_{g,k} = \sum_{t} \mathbf{1}\left[k^{*}_{g}[t] = k\right]$
- Soft-count: the sum of the winning responses, $c^{\mathrm{soft}}_{g,k} = \sum_{t} Z_{g,k}[t] \cdot \mathbf{1}\left[k^{*}_{g}[t] = k\right]$
The final feature vector concatenates all counts for all kernels.
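To make the grouped competition and counting concrete, here is a minimal NumPy sketch of the transform for a single univariate series. It is illustrative only: the function name `hydra_transform`, the fixed dilation list, and the per-series loop are assumptions, not the reference implementation.

```python
import numpy as np

def hydra_transform(X, g=64, k=8, length=9, dilations=(1, 2, 4, 8), seed=0):
    """Illustrative single-series Hydra-style transform (hard + soft counts)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    features = []
    for d in dilations:
        span = (length - 1) * d          # receptive field of a dilated kernel
        if span >= n:
            continue
        positions = n - span             # number of valid time points
        # index matrix selecting the dilated window at every position
        idx = np.arange(positions)[None, :] + d * np.arange(length)[:, None]
        windows = X[idx]                 # shape (length, positions)
        for _ in range(g):
            # random, mean-centred kernels for this group
            W = rng.standard_normal((k, length))
            W -= W.mean(axis=1, keepdims=True)
            Z = W @ windows              # responses, shape (k, positions)
            winner = Z.argmax(axis=0)    # winning kernel index at each time point
            hard = np.bincount(winner, minlength=k).astype(float)
            soft = np.zeros(k)
            np.add.at(soft, winner, Z[winner, np.arange(positions)])
            features.extend([hard, soft])
    return np.concatenate(features)
```

The sketch omits two details described later in this article: counts taken over the minimal responses as well as the maximal ones, and the application of part of the kernel groups to first-order differences of the series.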
2. Architectural Relationship to Dictionary and ROCKET Methods
Hydra explicitly unifies two major time series classification paradigms:
- Dictionary methods (e.g., BOSS, WEASEL, TDE): These count the frequencies of symbolic patterns ("words") over sliding windows. Hydra generalizes this by treating each group of kernels as a dictionary and recording, for each position, the index of the kernel with maximal activation. The hard count is analogous to word frequency, while soft counts aggregate activation magnitude.
- ROCKET-family methods: These transform the input via thousands of random kernels, then apply global pooling (max pooling, proportion of positive values [PPV]). In Hydra, setting $k = 1$ (a single kernel per group) recovers ROCKET-like behavior: each kernel acts independently, soft sums compute average pooling, and hard counts compute PPV. Higher $k$ moves toward dictionary-style richness.
The transition from global pooling (ROCKET) to local pattern counting (dictionary) is thus parameterized by the number of kernels per group, $k$, with the overall kernel budget held fixed for architectural consistency.
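A small sketch of the limiting case may help: with a single kernel per group ($k = 1$) the kernel trivially "wins" everywhere, so (assuming ReLU clipping of responses, and using an undilated kernel for simplicity) the hard count reduces to ROCKET's proportion-of-positive-values statistic and the soft count reduces to sum/average pooling.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(500)                  # toy series
W = rng.standard_normal(9)
W -= W.mean()                                 # mean-centred random kernel

Z = np.convolve(X, W[::-1], mode="valid")     # cross-correlation = undilated response

# k = 1: the single kernel wins at every position, so the counts collapse to
# global pooling statistics over its response.
hard_count = np.count_nonzero(np.maximum(Z, 0))  # with clipping: number of positive responses
ppv = hard_count / len(Z)                        # ROCKET-style proportion of positive values
soft_count = Z.sum()                             # soft sum = average pooling times len(Z)
```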
3. Hyperparameterization and Computational Details
Hydra's key hyperparameters are:
- Number of dilations $D$: typically on the order of $\log_2 n$ for a series of length $n$, with dilations taken as powers of two
- Number of groups per dilation $g$: default $g = 64$
- Kernels per group $k$: default $k = 8$
- Kernel length $\ell$: default $\ell = 9$
The total number of kernels per dilation is $g \times k$ (512 under the defaults). The computational cost of the transform for a time series of length $n$ is $O(n \cdot D \cdot g \cdot k \cdot \ell)$, which is practically linear in $n$, $g$, and $k$, because $D$ scales only logarithmically with $n$ while $g$, $k$, and $\ell$ are constants in most regimes.
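The resulting feature dimensionality can be estimated from these defaults. The helper below is a back-of-the-envelope assumption for illustration (the released implementation may choose dilations and allocate groups to first differences differently); it follows this section's description that hard and soft counts give two features per kernel and that first differences double the effective dilations.

```python
import math

def hydra_feature_count(n, g=64, k=8, length=9, use_diff=True):
    """Rough feature-vector size for a series of length n (illustrative only)."""
    # dilations 2^0, 2^1, ... while the dilated kernel still fits in the series
    D = int(math.log2((n - 1) / (length - 1))) + 1
    branches = 2 if use_diff else 1          # raw series + first differences
    return branches * D * g * k * 2          # 2 = hard count + soft count per kernel

print(hydra_feature_count(512))              # e.g. ~12,288 features for n = 512
```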
Step-by-step, the processing pipeline is:
- Initialize kernels and normalize.
- Apply convolution for all groups, dilations, and kernels.
- At each time point within each group, record hard/soft counts for the maximal (and, optionally, minimal) responses.
- Concatenate all counts into a single feature vector.
- Fit a linear classifier (ridge regression or logistic regression) using this feature vector as input.
No kernel learning takes place; only the final classifier's weights are optimized.
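A minimal end-to-end usage sketch follows, assuming the illustrative `hydra_transform` helper from Section 1 and synthetic stand-in data (real series and labels would be used in practice); only the ridge classifier is fitted, and the kernels stay random.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

# Synthetic stand-in data: 100 training and 40 test series of length 256.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((100, 256)), rng.integers(0, 2, 100)
X_test, y_test = rng.standard_normal((40, 256)), rng.integers(0, 2, 40)

# Fixed random transform applied per series; no kernel weights are learned.
F_train = np.stack([hydra_transform(x) for x in X_train])
F_test = np.stack([hydra_transform(x) for x in X_test])

# Only the linear classifier's weights are optimized.
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(F_train, y_train)
print("test accuracy:", clf.score(F_test, y_test))
```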
4. Comparative Performance and Empirical Analysis
Hydra exhibits competitive accuracy and efficiency relative to established methods. On the UCR 112-dataset archive (30 resamples, single CPU core) (Dempster et al., 2022):
| Method | Total Time | Mean Rank | Datasets Outperformed |
|---|---|---|---|
| HYDRA | ~36 min | Lowest | TDE (73/110), MrSQM (69/111) |
| Rocket | ~1 hr | | vs HYDRA: 56 won, 53 lost |
| MultiRocket | ~30 min | | |
| TDE | ~22 hr | Higher | |
Feature fusion (HYDRA + MultiRocket) achieves parity with HIVE-COTE 2, an ensemble method costing 500× more computation. On three large UCR datasets, HYDRA independently outperforms Rocket/MiniRocket in accuracy; combination with MultiRocket yields the highest observed accuracy.
ATM event-log studies (Vargas et al., 2023) confirm HYDRA's edge:
| Method | Accuracy ± SD | Balanced Acc ± SD | Time / fold |
|---|---|---|---|
| HYDRA+Ridge | 0.759±0.048 | 0.693±0.033 | 6.5±2.9 s |
| MiniROCKET+Ridge | 0.729±0.042 | 0.664±0.024 | 23.3±7.3 s |
| InceptionTime | 0.711±0.060 | 0.539±0.041 | 227.7±48.2 s |
Wilcoxon signed-rank tests with Bonferroni correction confirm HYDRA's superiority over MiniROCKET, ROCKET, and InceptionTime for AUC, balanced accuracy, F1, and minimum sensitivity.
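For reference, a paired test of this kind can be sketched as follows; the per-fold scores and the number of comparisons below are invented for illustration and are not values from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold balanced-accuracy scores for two methods on the same folds.
hydra_scores      = np.array([0.70, 0.68, 0.71, 0.66, 0.73, 0.69, 0.72, 0.67, 0.70, 0.71])
minirocket_scores = np.array([0.66, 0.65, 0.68, 0.64, 0.69, 0.66, 0.67, 0.63, 0.66, 0.67])

stat, p = wilcoxon(hydra_scores, minirocket_scores)   # paired, non-parametric test
n_comparisons = 12                                    # e.g. 3 baselines x 4 metrics (assumed)
alpha = 0.05 / n_comparisons                          # Bonferroni-corrected threshold
print(f"p = {p:.4f}, significant: {p < alpha}")
```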
On large-scale MONSTER datasets (up to 1.17M samples; Maniar, 7 Dec 2025), HYDRA achieves a mean accuracy of 0.7594, with training time of ~0.1 s per 1,000 samples and inference time of ~0.1–0.2 ms per series.
5. Adaptive Representation and Ablation Insights
Empirical investigations reveal further architectural nuances:
- Optimal kernel grouping: Best performance around the default configuration of $g = 64$ groups and $k = 8$ kernels per group (with the total kernel count held fixed).
- Hard vs. soft counting: The combination of hard counts on minima and soft sums on maxima improves accuracy compared to either statistic alone.
- First-order differences: Including first differences of the series ($X'[t] = X[t+1] - X[t]$) effectively doubles the number of dilations and consistently improves accuracy (see the sketch after this list).
- Clipping: Applying ReLU to responses is crucial only for $k = 1$ (PPV recovery); at the optimal $k$ its effect is negligible.
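As referenced in the differences item above, a short sketch (again using the illustrative `hydra_transform` helper from Section 1, not the reference code) shows the first-difference branch being concatenated with the raw-series features:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                      # toy series

# Raw-series features plus first-difference features, concatenated.
features = np.concatenate([hydra_transform(x), hydra_transform(np.diff(x))])
```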
The group-based competition in Hydra distills discriminative patterns into low-dimensional summary counts, enabling efficient learning with linear models despite the fixed, random kernel basis.
6. Meta-Learning and Ensemble Strategies
Hydra is regularly combined with complementary algorithms such as Quant (hierarchical interval quantiles) to improve ensemble performance on massive datasets (Maniar, 7 Dec 2025). Feature-concatenation (e.g., stacking Hydra logits with Quant features) enables novel decision boundaries exceeding the theoretical oracle bound, though prediction-combination ensembles capture only 11% of oracle potential. Actual ensemble gains are limited by the current meta-learning gap; ExtraTrees meta-learners exploit Hydra+Quant features more efficiently than linear Ridge models.
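A sketch of the feature-concatenation strategy with a non-linear meta-learner follows; the feature matrices below are random placeholders standing in for Hydra logits and Quant features, and the specific ExtraTrees settings are assumptions rather than the study's configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_train, n_classes = 1000, 10

# Placeholder meta-features: per-class Hydra logits and Quant interval features.
hydra_logits   = rng.standard_normal((n_train, n_classes))
quant_features = rng.standard_normal((n_train, 50))
y_train = rng.integers(0, n_classes, n_train)

# Stack the two representations and fit a non-linear meta-learner over them.
stacked = np.hstack([hydra_logits, quant_features])
meta = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=0)
meta.fit(stacked, y_train)
```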
Oracle analyses indicate that Hydra’s correct predictions are unique for approximately 5% of test instances; error correlation with Quant is moderate (mean 0.421), confirming complementary strengths.
7. Limitations and Future Research Directions
Hydra's fixed random kernels preclude direct adaptation to data-specific structures and rely solely on the expressivity of grouped competition statistics. The Ridge classifier may underfit interactions between feature counts; non-linear meta-learners mitigate this to a degree. Meta-learning approaches for ensemble integration remain suboptimal—current methods are unable to fully exploit instance-level and temporal context.
Potential enhancement pathways include learning kernel weights via back-propagation, enriching meta-features with instance-level statistics, and designing deep stacking architectures to capture inter-method dependencies (Maniar, 7 Dec 2025). A plausible implication is that learnable kernel-adaptive Hydra variants may further close the accuracy gap with computationally intensive methods while conserving efficiency.
Hydra’s transform-based pattern competition mechanism forms a compact and efficient time series feature extractor, offering flexible control over representational fidelity and scale. As meta-learning and ensemble integration strategies evolve, Hydra is likely to remain a central component in the broader landscape of scalable time series classification algorithms.