
Multi-Head Attention Autoencoders (DeepSupp)

Updated 19 January 2026
  • The paper presents DeepSupp, which integrates dynamic correlation matrix construction with multi-head attention to robustly identify evolving support levels in financial data.
  • It utilizes a symmetric encoder-decoder structure with unsupervised clustering to extract latent market regimes and structural patterns.
  • Extensive evaluations on S&P 500 data show improved support accuracy and market regime sensitivity compared to traditional methods.

Multi-Head Attention Autoencoders (DeepSupp) are a class of permutation-invariant, attention-based unsupervised models for discovering structural patterns in high-dimensional time series, designed primarily for detecting dynamic support levels in financial data. The architecture integrates dynamic feature correlation analysis, multi-head attention, bottlenecked autoencoding, and unsupervised clustering in latent space to select support price thresholds that reflect evolving market microstructure relationships (Kriuk et al., 22 Jun 2025).

1. Dynamic Correlation Matrix Construction

The initial representation for DeepSupp involves transforming raw financial time series into a dynamic sequence of correlation matrices. For each time point $t$, a feature vector

$$\mathbf F_t = [\mathrm{Close}_t,\;\mathrm{VWAP}_t,\;\mathrm{Volume}_t,\;\mathrm{PriceChangeVolume}_t,\;\mathrm{VolumeRatio}_t]$$

is extracted, with $\mathrm{VWAP}_t$ denoting the time-$t$ volume-weighted average price, $\mathrm{PriceChangeVolume}_t$ capturing price-movement-adjusted volume, and $\mathrm{VolumeRatio}_t$ providing a normalized measure of recent volume activity.

A sliding window of length $n=32$ yields a matrix of past feature vectors $\{\mathbf F_{t-31}, \ldots, \mathbf F_t\}$. For each feature pair $(i,j)$, the Spearman rank correlation is computed:

$$\rho_{ij}^{(t)} = 1 - \frac{6\sum_{k=1}^{n} d_k^2}{n(n^2-1)}$$

where $d_k$ is the rank difference at position $k$. This procedure generates a symmetric $32 \times 32$ correlation matrix $\mathbf C_t$ per window, employing zero-padding or learnable projections where necessary. This representation serves as the input to the autoencoder.
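The windowed Spearman computation can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the rank-difference formula is exact only when a window contains no tied values, and the top-left zero-padding layout is an assumption, since the source does not say how the five-feature correlations are arranged inside the $32 \times 32$ matrix.

```python
import numpy as np

def spearman_matrix(window: np.ndarray) -> np.ndarray:
    """Spearman correlation between the columns of an (n, f) window.

    Uses rho = 1 - 6 * sum(d_k^2) / (n * (n^2 - 1)), which assumes that
    no column contains tied values.
    """
    n, _ = window.shape
    # argsort of argsort yields the 0-based rank of every entry per column.
    ranks = np.argsort(np.argsort(window, axis=0), axis=0).astype(float)
    # d[k, i, j] is the rank difference of features i and j at position k.
    d = ranks[:, :, None] - ranks[:, None, :]
    return 1.0 - 6.0 * (d ** 2).sum(axis=0) / (n * (n ** 2 - 1))

def rolling_correlations(features: np.ndarray, n: int = 32) -> np.ndarray:
    """Stack one correlation matrix per full window of length n."""
    T = features.shape[0]
    return np.stack(
        [spearman_matrix(features[t - n + 1 : t + 1]) for t in range(n - 1, T)]
    )

def zero_pad(c: np.ndarray, size: int = 32) -> np.ndarray:
    """Assumed layout: place the (f, f) block top-left, zeros elsewhere."""
    out = np.zeros((size, size))
    f = c.shape[0]
    out[:f, :f] = c
    return out

# Synthetic stand-in for the five engineered features over 40 time steps.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 5))
C = rolling_correlations(features)   # shape (9, 5, 5): one matrix per window
C_padded = zero_pad(C[0])            # shape (32, 32)
```

Each matrix has a unit diagonal and is symmetric by construction, because the rank difference of a feature with itself is zero and squaring removes the sign of $d_k$.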

2. Attention-Based Autoencoder Architecture

The DeepSupp attention-driven autoencoder employs a symmetric encoder-decoder structure centered on multi-head self-attention. Each $32 \times 32$ correlation matrix is treated as a set of 32 tokens, each row serving as an input token with embedding dimension $d_\mathrm{model}=32$.

The encoder sequence consists of:

  • A multi-head attention layer (4 heads, $d_k=8$ per head) operating without explicit positional encoding, leveraging the inherent symmetry and permutation invariance of correlation matrices.
  • A token pooling operation (e.g., mean pooling) that reduces the attention output to a single $32$-dimensional vector, which is then mapped through two fully connected layers with ReLU activation, compressing it to a $16$-dimensional latent vector $\mathbf z_t^{(\mathrm{bottle})}$.

The decoder reverses this process, expanding the latent representation back through an MLP to $32$ dimensions, broadcasting to 32 tokens, and reconstructing the matrix via attention.
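The pooling, bottleneck, and broadcast steps can be sketched in NumPy to make the shape flow concrete. This is only an illustration with random weights: the attention layers on either side are omitted, the hidden width of 24 and the placement of the ReLU between the two dense layers are assumptions not stated in the source, and all names are hypothetical.

```python
import numpy as np

# D_HIDDEN is an assumed value; the source gives only the 32 -> 16 endpoints.
D_MODEL, D_HIDDEN, D_LATENT, N_TOKENS = 32, 24, 16, 32

def relu(x):
    return np.maximum(x, 0.0)

def init_params(rng):
    def dense(n_in, n_out):
        w = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
        return w, np.zeros(n_out)
    return {
        "enc1": dense(D_MODEL, D_HIDDEN),
        "enc2": dense(D_HIDDEN, D_LATENT),
        "dec1": dense(D_LATENT, D_HIDDEN),
        "dec2": dense(D_HIDDEN, D_MODEL),
    }

def encode(tokens, params):
    """Mean-pool (N_TOKENS, D_MODEL) tokens, then compress to D_LATENT."""
    pooled = tokens.mean(axis=0)                 # (32,)
    w1, b1 = params["enc1"]
    w2, b2 = params["enc2"]
    return relu(pooled @ w1 + b1) @ w2 + b2      # (16,)

def decode(z, params):
    """Expand the latent code to D_MODEL and broadcast it to every token."""
    w1, b1 = params["dec1"]
    w2, b2 = params["dec2"]
    v = relu(z @ w1 + b1) @ w2 + b2              # (32,)
    return np.tile(v, (N_TOKENS, 1))             # (32, 32), identical rows

rng = np.random.default_rng(0)
params = init_params(rng)
tokens = rng.normal(size=(N_TOKENS, D_MODEL))    # stand-in for attention output
z = encode(tokens, params)
decoder_tokens = decode(z, params)
```

Because mean pooling ignores token order, the latent code is unchanged when the input tokens are shuffled. After broadcasting, every decoder token starts out identical, so in the described architecture it is the decoder's attention stage that must restore row-specific structure.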

3. Multi-Head Attention Mechanism

Each multi-head attention layer splits the input into $h=4$ heads, with per-head projections:

$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$$

for each $i=1, \ldots, 4$ and $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{32 \times 8}$. Each head computes scaled dot-product attention:

$$\mathrm{Attention}(Q_i,K_i,V_i) = \mathrm{softmax} \left( \frac{Q_i K_i^T}{\sqrt{d_k}} \right) V_i$$

The outputs are concatenated and projected via $W^O \in \mathbb{R}^{32 \times 32}$, followed by layer normalization and addition of the input residual. The absence of positional encodings supports the model’s requirement for permutation invariance in analyzing correlation matrices [(Kriuk et al., 22 Jun 2025), Fig. 2].
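The attention layer described above can be sketched in NumPy with the stated dimensions (4 heads, $d_k=8$, $d_\mathrm{model}=32$). The weights are random and all names are hypothetical. Two details are assumptions: layer normalization is applied after the residual sum (post-norm), since the source leaves the order ambiguous, and it has no learnable gain or bias.

```python
import numpy as np

D_MODEL, N_HEADS = 32, 4
D_K = D_MODEL // N_HEADS  # 8 dimensions per head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_params(rng):
    scale = 1.0 / np.sqrt(D_MODEL)
    shape = (N_HEADS, D_MODEL, D_K)
    return {
        "W_q": rng.normal(scale=scale, size=shape),
        "W_k": rng.normal(scale=scale, size=shape),
        "W_v": rng.normal(scale=scale, size=shape),
        "W_o": rng.normal(scale=scale, size=(D_MODEL, D_MODEL)),
    }

def multi_head_attention(X, p):
    """X has shape (tokens, D_MODEL); returns (output, attention weights)."""
    Q = np.einsum("td,hdk->htk", X, p["W_q"])   # (heads, tokens, D_K)
    K = np.einsum("td,hdk->htk", X, p["W_k"])
    V = np.einsum("td,hdk->htk", X, p["W_v"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(D_K)   # (heads, tokens, tokens)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                 # (heads, tokens, D_K)
    # Concatenate the heads along the feature axis, then project with W_o.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], D_MODEL)
    out = concat @ p["W_o"]
    return layer_norm(X + out), weights                 # assumed post-norm

rng = np.random.default_rng(0)
params = init_params(rng)
X = rng.normal(size=(32, D_MODEL))      # stand-in for one correlation matrix
Y, attn = multi_head_attention(X, params)
```

Without positional encodings the layer is equivariant to token order: shuffling the input rows shuffles the output rows in the same way and changes nothing else. A subsequent order-insensitive pooling step then makes the pooled result invariant to that ordering.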

4. Training Objective and Latent Space Clustering

The model is trained to minimize mean-squared reconstruction loss between the original and reconstructed correlation matrices:

$$\mathcal L_{\mathrm{rec}} = \frac{1}{T} \sum_{t=1}^T \left\| \mathbf C_t - \widehat{\mathbf C}_t \right\|_F^2$$

with $L_2$ regularization on weights applied during optimization with Adam. After training, the $16$-dimensional latent representations $\mathbf z_t^{(\mathrm{bottle})}$ form the empirical basis for unsupervised clustering.
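As a small worked illustration, the reconstruction objective can be written in NumPy as the average squared Frobenius norm over a batch of matrices. The function names and the regularization coefficient `lam` are hypothetical; the source does not report the strength of the $L_2$ term.

```python
import numpy as np

def reconstruction_loss(C, C_hat):
    """Mean over t of the squared Frobenius norm ||C_t - C_hat_t||_F^2.

    C and C_hat both have shape (T, rows, cols).
    """
    diff = C - C_hat
    return float(np.mean(np.sum(diff ** 2, axis=(1, 2))))

def l2_penalty(weights, lam=1e-4):
    """Sum of squared weights scaled by lam (lam is an assumed value)."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

# Three 2x2 matrices that are each off by 1 in every entry:
# every squared Frobenius norm is 4, so the mean is 4 as well.
C = np.zeros((3, 2, 2))
C_hat = np.ones((3, 2, 2))
loss = reconstruction_loss(C, C_hat)
```

Note that the sum runs over all matrix entries before averaging over time, so the loss scales with matrix size; it is not a per-entry mean.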

DBSCAN is employed with $\epsilon=0.1$ and $\mathrm{min\_samples} = 0.1\,T$, clustering the latent codes across time. For each cluster $C_k$, the median price of the corresponding original time indices is computed as the $k$-th support level:

$$S_k = \mathrm{median}\{ P_t : t \in C_k \}$$

Support levels $\{S_k\}$ are sorted in ascending order.
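The step from cluster assignments to support levels can be sketched as follows, assuming the cluster labels have already been computed. The function name is hypothetical, and treating label `-1` as noise to be skipped is an assumption borrowed from the scikit-learn convention; the source does not say how unclustered points are handled.

```python
import numpy as np

def support_levels(prices, labels, noise_label=-1):
    """Median price per cluster, sorted ascending.

    prices[t] is the price at time t and labels[t] is the cluster id of the
    latent code at time t. Points labelled noise_label are ignored (assumed).
    """
    prices = np.asarray(prices, dtype=float)
    labels = np.asarray(labels)
    levels = [
        np.median(prices[labels == k])
        for k in np.unique(labels)
        if k != noise_label
    ]
    return np.sort(np.array(levels, dtype=float))

# Toy example: cluster 1 sits near 11, cluster 0 near 51, one noise point.
prices = [10.0, 11.0, 12.0, 50.0, 52.0, 30.0]
labels = [1, 1, 1, 0, 0, -1]
levels = support_levels(prices, labels)
```

Labels of this form are what scikit-learn's `DBSCAN(eps=0.1, min_samples=int(0.1 * T)).fit_predict(Z)` returns for a `(T, 16)` array `Z` of latent codes, with `-1` marking noise points.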

5. Multi-Head Attention Specialization and Market Regime Extraction

Visualization of the attention weights for the four heads (Fig. 3 in (Kriuk et al., 22 Jun 2025)) reveals distinct modes of specialization:

| Head | Observed Pattern Type | Suggested Market Role |
|------|-----------------------|-----------------------|
| 1 | Smooth, short-term linear patterns | Local momentum |
| 2 | Similar local patterns with parameter shift | Subtle slow regime response |
| 3 | Bimodal, block patterns | Market regime or block transitions |
| 4 | Sparse, power-law distributions | Crisis memory/tail events |

Empirically, each head is associated with unique statistical distributions (exponential, Gaussian, bimodal, power-law) of attention weights. This suggests that the multi-head architecture captures an array of market dynamics, from local mean reversion to structural regime shifts and volatility clusters. A plausible implication is that such specialization enables the model to robustly identify support levels even amid heterogeneous market conditions.

6. Empirical Performance and Benchmarking

Extensive evaluation on S&P 500 tickers using two years of recorded price and volume data demonstrates that DeepSupp’s multi-head attention autoencoder outperforms six baseline algorithms (HMM, local minima, fractal, Fibonacci, moving averages, and quantile regression) across six financial metrics: Support Accuracy, Price Proximity, Volume Confirmation, Market Regime Sensitivity, Support Hold Duration, and False Breakout Rate. The overall performance score is $0.554 \pm 0.039$, highlighting consistently balanced and low-variance results [(Kriuk et al., 22 Jun 2025), Table 1].

7. Architectural and Methodological Significance

By composing a rolling-window, high-dimensional correlation tensor and employing an attention-driven, permutation-invariant autoencoder, DeepSupp enables robust unsupervised learning of recurring structural features in financial time series. Its attention heads serve as implicit detectors for dynamics at multiple time scales and event scales, offering interpretability as well as improved empirical performance relative to standard linear and single-head models. A plausible implication is that this architecture can generalize beyond support detection to broader domains requiring structure discovery in evolving multivariate signals. No ablation study of the attention heads is presented, but analysis of attention patterns supports the qualitative benefit of multi-head specialization.

References (1)
