Multi-Head Attention Autoencoders (DeepSupp)
- The paper presents DeepSupp, which integrates dynamic correlation matrix construction with multi-head attention to robustly identify evolving support levels in financial data.
- It utilizes a symmetric encoder-decoder structure with unsupervised clustering to extract latent market regimes and structural patterns.
- Extensive evaluations on S&P 500 data show improved support accuracy and market regime sensitivity compared to traditional methods.
Multi-Head Attention Autoencoders (DeepSupp) are a class of permutation-invariant, attention-based unsupervised models for discovering structural patterns in high-dimensional time series, primarily designed for detecting dynamic support levels in financial data. The architecture integrates dynamic feature correlation analysis, multi-head attention, bottlenecked autoencoding, and unsupervised clustering in latent space to select support price thresholds reflective of evolving market microstructure relationships (Kriuk et al., 22 Jun 2025).
1. Dynamic Correlation Matrix Construction
The initial representation for DeepSupp involves transforming raw financial time series into a dynamic sequence of correlation matrices. For each time point $t$, a feature vector is extracted whose components include the volume-weighted average price at time $t$, a price-movement-adjusted volume term, and a normalized measure of recent volume activity.
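A sliding window of length $w$ yields a matrix $X_t \in \mathbb{R}^{w \times m}$ of past feature vectors, where $m$ is the number of features. For each feature pair $(i, j)$, the Spearman rank correlation is computed:

$$\rho_{ij} = 1 - \frac{6 \sum_{n=1}^{w} d_n^2}{w\,(w^2 - 1)},$$

where $d_n$ is the difference between the ranks of features $i$ and $j$ at position $n$ of the window. This procedure generates a symmetric $m \times m$ correlation matrix per window, employing zero-padding or learnable projections where necessary. This representation serves as the input to the autoencoder.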
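The windowed rank-correlation step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy four-feature input, and the window length `w=20` are assumptions, and the rank-difference formula is applied under the assumption that a window contains no tied values.

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of a 1-D array (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman_matrix(window):
    """window: (w, m) array of w past feature vectors -> (m, m) matrix."""
    w, m = window.shape
    R = np.column_stack([ranks(window[:, j]) for j in range(m)])
    corr = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            diff = R[:, i] - R[:, j]  # rank differences per position
            rho = 1.0 - 6.0 * np.sum(diff ** 2) / (w * (w ** 2 - 1))
            corr[i, j] = corr[j, i] = rho
    return corr

def rolling_spearman(features, w):
    """features: (T, m) -> stacked (T - w + 1, m, m) correlation matrices."""
    T = features.shape[0]
    return np.stack([spearman_matrix(features[t - w:t]) for t in range(w, T + 1)])

# Toy data (hypothetical): feature 1 is a monotone transform of feature 0,
# feature 2 is its mirror image, feature 3 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
features = np.column_stack([x, 2 * x + 1, -x, rng.normal(size=50)])
mats = rolling_spearman(features, w=20)
```

In this toy input the first two features are perfectly rank-correlated and the third is perfectly anti-correlated with them, which makes the symmetric structure of each window's matrix easy to inspect.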
2. Attention-Based Autoencoder Architecture
The DeepSupp attention-driven autoencoder employs a symmetric encoder-decoder structure centered on multi-head self-attention. Each correlation matrix is treated as a set of 32 tokens, each row serving as an input token with embedding dimension $32$.
The encoder sequence consists of:
- A multi-head attention layer (4 heads, each of dimension $d_k = 32/4 = 8$ under the standard even split) operating without explicit positional encoding, leveraging the inherent symmetry and permutation invariance of correlation matrices.
- A token pooling operation (e.g., mean pooling) that reduces the attention output to a single $32$-dimensional vector, followed by two fully connected layers with ReLU activation that compress it to a $16$-dimensional latent vector $z$.
The decoder reverses this process, expanding the latent representation back through an MLP to $32$ dimensions, broadcasting to 32 tokens, and reconstructing the matrix via attention.
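The shape flow through the bottleneck can be sketched as below, with the attention layers left out so that only pooling, compression, expansion, and broadcasting are shown. This is a hedged illustration with untrained random weights: the hidden width `D_HIDDEN = 24`, the placement of the ReLU between the two layers, and all function names are assumptions not given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, D_MODEL, D_LATENT = 32, 32, 16
D_HIDDEN = 24  # hypothetical: the hidden width is not stated in the text

def relu(x):
    return np.maximum(x, 0.0)

def dense(d_in, d_out):
    """Random (untrained) weights and zero bias for one fully connected layer."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out)), np.zeros(d_out)

W1, b1 = dense(D_MODEL, D_HIDDEN)
W2, b2 = dense(D_HIDDEN, D_LATENT)
W3, b3 = dense(D_LATENT, D_HIDDEN)
W4, b4 = dense(D_HIDDEN, D_MODEL)

def encode(tokens):
    """tokens: (32, 32) encoder attention output -> (16,) latent vector."""
    pooled = tokens.mean(axis=0)            # mean pooling over the 32 tokens
    return relu(pooled @ W1 + b1) @ W2 + b2

def decode(z):
    """z: (16,) -> (32, 32) token matrix that would feed the decoder attention."""
    v = relu(z @ W3 + b3) @ W4 + b4         # MLP back to 32 dimensions
    return np.tile(v, (N_TOKENS, 1))        # broadcast the vector to 32 tokens

tokens = rng.normal(size=(N_TOKENS, D_MODEL))
z = encode(tokens)
out = decode(z)
perm = rng.permutation(N_TOKENS)
z_perm = encode(tokens[perm])               # same tokens, shuffled order
```

Because mean pooling ignores token order, `z_perm` matches `z` up to floating-point error, which is one way to see where the permutation invariance of the latent code comes from.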
3. Multi-Head Attention Mechanism
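Each multi-head attention layer splits the input $X$ into $h$ heads, with per-head projections

$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$$

for each $i = 1, \dots, h$ and $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$, where $d$ is the token dimension and $d_k$ the per-head dimension. Each head computes scaled dot-product attention:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i.$$

The outputs are concatenated and projected via $W^O$, followed by layer normalization and addition of the input residual. Absence of positional encodings supports the model’s requirement for permutation invariance in analyzing correlation matrices [(Kriuk et al., 22 Jun 2025), Fig. 2].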
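A minimal NumPy sketch of multi-head self-attention without positional encoding follows. It is illustrative only: the weights are random, the per-head projection matrices are stored as column blocks of single $32 \times 32$ matrices (equivalent to separate per-head matrices), and the ordering "add residual, then normalize" is an assumption, since the text does not fix the order.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (n, d) tokens. Returns (n, d) output and (n_heads, n, n) weights."""
    n, d = X.shape
    d_k = d // n_heads

    def split(M):
        # (n, d) -> (n_heads, n, d_k): one slice of the feature axis per head
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # scaled dot products
    A = softmax(scores, axis=-1)                      # each row sums to 1
    heads = A @ V                                     # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # concatenate heads
    out = layer_norm(X + concat @ Wo)                 # assumed: residual, then norm
    return out, A

rng = np.random.default_rng(0)
d, h = 32, 4
X = rng.normal(size=(32, d))
Wq, Wk, Wv, Wo = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
out, A = multi_head_self_attention(X, Wq, Wk, Wv, Wo, h)

# Shuffling the input tokens shuffles the output rows in the same way.
perm = rng.permutation(32)
out_p, _ = multi_head_self_attention(X[perm], Wq, Wk, Wv, Wo, h)
```

Strictly speaking, the attention layer on its own is permutation-equivariant (`out_p` equals `out[perm]`); invariance of the final representation arises once the tokens are pooled.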
4. Training Objective and Latent Space Clustering
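The model is trained to minimize the mean-squared reconstruction loss between the original correlation matrices $C_t$ and their reconstructions $\hat{C}_t$:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N n^2} \sum_{t=1}^{N} \lVert C_t - \hat{C}_t \rVert_F^2,$$

where $N$ is the number of windows and each matrix is $n \times n$. A weight-norm regularization penalty is applied during optimization with Adam. After training, the $16$-dimensional latent representations form the empirical basis for unsupervised clustering.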
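DBSCAN is employed with a fixed neighborhood radius $\varepsilon$ and minimum-samples threshold, clustering the latent codes across time. For each cluster $k$, with $\mathcal{C}_k$ the set of original time indices assigned to it and $p_t$ the price at time $t$, the median price is computed as the $k$-th support level:

$$s_k = \mathrm{median}\{\, p_t : t \in \mathcal{C}_k \,\}.$$

Support levels are sorted in ascending order.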
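The mapping from cluster assignments to support levels can be sketched as below. The sketch assumes that cluster labels have already been obtained from DBSCAN on the latent codes (for example with scikit-learn's `DBSCAN(...).fit_predict`, which marks noise points with `-1`) and that each latent code is aligned with one price. The function name, the choice of which price to use, and the toy numbers are hypothetical.

```python
import numpy as np

def support_levels(prices, labels):
    """Median price per cluster, sorted ascending; label -1 (noise) is skipped."""
    prices = np.asarray(prices, dtype=float)
    labels = np.asarray(labels)
    levels = [np.median(prices[labels == k]) for k in np.unique(labels) if k != -1]
    return sorted(float(v) for v in levels)

# Toy example (hypothetical values): two clusters and one noise point.
prices = [101.0, 100.0, 102.0, 95.0, 96.0, 150.0]
labels = [0, 0, 0, 1, 1, -1]
levels = support_levels(prices, labels)  # [95.5, 101.0]
```

Using the median rather than the mean keeps each level insensitive to a few extreme prices inside a cluster, and dropping the noise label prevents isolated points from producing spurious levels.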
5. Multi-Head Attention Specialization and Market Regime Extraction
Visualization of the attention weights for the four heads [(Kriuk et al., 22 Jun 2025), Fig. 3] reveals distinct modes of specialization:
| Head | Observed Pattern Type | Suggested Market Role |
|---|---|---|
| 1 | Smooth, short-term linear patterns | Local momentum |
| 2 | Similar local patterns with parameter shift | Subtle slow regime response |
| 3 | Bimodal, block patterns | Market regime or block transitions |
| 4 | Sparse, power-law distributions | Crisis memory/tail events |
Empirically, each head is associated with unique statistical distributions (exponential, Gaussian, bimodal, power-law) of attention weights. This suggests that the multi-head architecture captures an array of market dynamics, from local mean reversion to structural regime shifts and volatility clusters. A plausible implication is that such specialization enables the model to robustly identify support levels even amid heterogeneous market conditions.
6. Empirical Performance and Benchmarking
Extensive evaluation on S&P 500 tickers using two years of recorded price and volume data demonstrates that DeepSupp’s multi-head attention autoencoder outperforms six baseline algorithms (HMM, local minima, fractal, Fibonacci, moving averages, quantile regression) across six financial metrics: Support Accuracy, Price Proximity, Volume Confirmation, Market Regime Sensitivity, Support Hold Duration, and False Breakout Rate. The reported overall performance score highlights consistently balanced and low-variance results [(Kriuk et al., 22 Jun 2025), Table 1].
7. Architectural and Methodological Significance
By composing a rolling-window, high-dimensional correlation tensor and employing an attention-driven, permutation-invariant autoencoder, DeepSupp enables robust unsupervised learning of recurring structural features in financial time series. Its attention heads serve as implicit detectors for dynamics at multiple time scales and event scales, offering interpretability as well as improved empirical performance relative to standard linear and single-head models. A plausible implication is that this architecture can generalize beyond support detection to broader domains requiring structure discovery in evolving multivariate signals. No ablation study of the attention mechanism is presented, but the analysis of attention patterns qualitatively supports the benefit of multi-head specialization.