HVDA: Horizontal-Vertical Detail Attention
- HVDA is a novel module that integrates horizontal recalibration and vertical channel gating to efficiently enhance transformer token dependencies.
- Horizontal attention employs learned re-weighting of multi-head outputs before projection to prioritize informative features.
- Vertical attention applies channel-wise gating post-projection to refine feature representation with minimal computational overhead.
Horizontal–Vertical Detail Attention (HVDA) is an architectural module that augments self-attention in Transformers with two orthogonal attention strategies. Horizontal attention recalibrates the multi-head outputs before projection; vertical attention re-weights feature channels after projection via explicit channel-wise modeling. Both mechanisms aim to enrich feature representation and token dependency modeling at negligible computational and parameter cost, and both can be integrated modularly into standard Transformer blocks (Yu et al., 2022).
1. Formal Definitions and Module Motivation
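Let $X \in \mathbb{R}^{n \times d}$ denote the sequence of $n$ input tokens, each with $d$-dimensional features. Multi-head Scaled Dot-Product Attention (SDPA) produces $h$ parallel head outputs $H_i \in \mathbb{R}^{n \times d_v}$, $i = 1, \dots, h$, with typical dimensionalities $d_k = d_v = d/h$ per head.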
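- Horizontal attention introduces a learned re-weighting vector $\alpha_t \in \mathbb{R}^{h}$ (per token $t$) to emphasize "more informative" heads before the usual linear projection, replacing the concatenation and projection step with:
$\mathrm{Concat}(\tilde{H}_1, \dots, \tilde{H}_h)\, W^{O}$, where $\tilde{H}_i = \mathrm{diag}(\alpha_{:,i})\, H_i$ scales the rows of head $i$ by its per-token weights, instead of $\mathrm{Concat}(H_1, \dots, H_h)\, W^{O}$.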
- Vertical attention recalibrates the $d$ channels of $Y$ (the projected multi-head output) via a channel-wise gating vector $g$, yielding rescaled output $\tilde{Y} = Y \odot g$.
The stated motivation is to enhance representation distinctiveness, model informative head outputs more selectively, and capture inter-channel dependencies (Yu et al., 2022).
2. Mathematical Formulation
The HVDA module formalizes attention recalibration via explicit learned transformations.
2.1 Horizontal Attention
Given 4 and 5:
- For each head 6:
7
8
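- Stack the per-head scores $s_i \in \mathbb{R}^{n}$ across heads: $S = [s_1, \dots, s_h] \in \mathbb{R}^{n \times h}$.
- Compute softmax over heads:
$\alpha_{t,i} = \dfrac{\exp(S_{t,i})}{\sum_{j=1}^{h} \exp(S_{t,j})}$
with $\sum_{i=1}^{h} \alpha_{t,i} = 1$ for each token $t$.
- Re-weight each head: $\tilde{H}_i = \mathrm{diag}(\alpha_{:,i})\, H_i$
- Concatenate $\tilde{H}_i$: $Z = \mathrm{Concat}(\tilde{H}_1, \dots, \tilde{H}_h)$
- Project: $Y = Z\, W^{O}$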
Key matrices:
| Symbol | Shape | Role |
|---|---|---|
| 8 | 9 | Transform 0 |
| 1 | 2 | Transform 3 |
| 4 | 5 | Final linear head score |
| 6 | 7 | Head score bias |
| 8 | 9 | Output projection |
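The softmax-over-heads, re-weighting, concatenation, and projection steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the scoring network is abstracted away, so `scores` is a hypothetical stand-in for the learned per-token, per-head scores, and the names `horizontal_reweight`, `heads`, and `w_out` are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def horizontal_reweight(heads, scores, w_out):
    """heads: (n, h, d_v); scores: (n, h); w_out: (h * d_v, d)."""
    n, h, d_v = heads.shape
    alpha = softmax(scores, axis=1)        # (n, h); each row sums to 1
    weighted = heads * alpha[:, :, None]   # scale every head per token
    concat = weighted.reshape(n, h * d_v)  # concatenate heads along features
    return concat @ w_out, alpha           # project to the model dimension

rng = np.random.default_rng(0)
n, h, d_v, d = 5, 4, 8, 32
heads = rng.normal(size=(n, h, d_v))
scores = rng.normal(size=(n, h))  # ASSUMPTION: placeholder for learned head scores
w_out = rng.normal(size=(h * d_v, d))
y, alpha = horizontal_reweight(heads, scores, w_out)
```

With all scores equal, every head receives weight $1/h$ and the result reduces to the standard concatenate-and-project output scaled by $1/h$, which is a convenient sanity check.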
2.2 Vertical Attention
Given 0 and 1:
- Compute squeezed representation:
2
- Gating vector:
3
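- Channel-wise recalibration:
$\tilde{Y} = Y \odot g$
where $Y$ is the projected multi-head output, $g$ is the gating vector, and $\odot$ denotes element-wise multiplication.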
Key matrices:
| Symbol | Shape | Role |
|---|---|---|
| 5, 6 | 7 | Linear projections |
| 8 | 9 | Mapping to gating vector |
| 0 | 1 | Channel recalibration bias |
| 2 | 3 (e.g., 4) | Channel squeeze dimension |
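As a rough illustration of squeeze, gate, and recalibrate, the sketch below uses a squeeze-and-excitation style bottleneck applied to each token. This is an assumption made for illustration, not the formulation from Yu et al. (2022): the ReLU bottleneck, the sigmoid gate, the two-matrix layout, and the names `vertical_gate`, `w_down`, `w_up`, and `b` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vertical_gate(y, w_down, w_up, b):
    """y: (n, d); w_down: (d, r); w_up: (r, d); b: (d,)."""
    # ASSUMPTION: ReLU bottleneck as the "squeezed" representation.
    squeezed = np.maximum(y @ w_down, 0.0)  # (n, r)
    # ASSUMPTION: sigmoid gate, so every channel weight lies in (0, 1).
    gate = sigmoid(squeezed @ w_up + b)     # (n, d)
    return y * gate, gate                   # channel-wise recalibration

rng = np.random.default_rng(0)
n, d, r = 5, 32, 8                          # r is the squeeze dimension
y = rng.normal(size=(n, d))
w_down = rng.normal(size=(d, r)) * 0.1
w_up = rng.normal(size=(r, d)) * 0.1
b = np.zeros(d)
y_tilde, gate = vertical_gate(y, w_down, w_up, b)
```

Because the assumed gate lies strictly between 0 and 1, each channel is attenuated rather than amplified; a different gate non-linearity would change that behavior.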
3. Integration within Transformer Architectures
HVDA is integrated into the standard Transformer block as a modular augmentation or replacement of the conventional multi-head attention sublayer. The following steps comprise the forward pass:
- Multi-head SDPA: Compute $h$ parallel head outputs $H_1, \dots, H_h$ from the input $X$.
- Horizontal Attention: If enabled, compute per-token, per-head weights $\alpha$ and recalibrate $H_i$. Concatenate and project to obtain $Y$.
- Vertical Attention: If enabled, compute channel-wise gating $g$ and element-wise modulate $Y$ to yield $\tilde{Y}$.
- Residual and Layer Normalization: Apply $X' = \mathrm{LayerNorm}(X + \tilde{Y})$.
- Feed-forward Sublayer: Apply standard feed-forward and residual structure on $X'$.
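The steps above can be sketched as a single NumPy function. This is a hedged illustration rather than the pseudocode from the original source: the two mechanisms are passed in as optional callables (`head_weights` and `channel_gate`, both hypothetical names) so that each can be switched off independently, the layer normalization omits learned scale and shift, and the post-normalization ordering is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learned scale or shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sdpa_heads(x, w_q, w_k, w_v):
    """x: (n, d); w_q, w_k, w_v: (h, d, d_k). Returns heads of shape (n, h, d_k)."""
    q = np.einsum("nd,hdk->hnk", x, w_q)
    k = np.einsum("nd,hdk->hnk", x, w_k)
    v = np.einsum("nd,hdk->hnk", x, w_v)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]), axis=-1)
    return (att @ v).transpose(1, 0, 2)

def hvda_block(x, p, head_weights=None, channel_gate=None):
    heads = sdpa_heads(x, p["w_q"], p["w_k"], p["w_v"])   # multi-head SDPA
    if head_weights is not None:                           # horizontal (optional)
        heads = heads * head_weights(x, heads)[:, :, None]
    y = heads.reshape(x.shape[0], -1) @ p["w_o"]           # concatenate + project
    if channel_gate is not None:                           # vertical (optional)
        y = y * channel_gate(y)
    x1 = layer_norm(x + y)                                 # residual + layer norm
    ffn = np.maximum(x1 @ p["w_1"], 0.0) @ p["w_2"]        # feed-forward sublayer
    return layer_norm(x1 + ffn)

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
p = {
    "w_q": rng.normal(size=(h, d, d // h)) * 0.1,
    "w_k": rng.normal(size=(h, d, d // h)) * 0.1,
    "w_v": rng.normal(size=(h, d, d // h)) * 0.1,
    "w_o": rng.normal(size=(d, d)) * 0.1,
    "w_1": rng.normal(size=(d, 4 * d)) * 0.1,
    "w_2": rng.normal(size=(4 * d, d)) * 0.1,
}
x = rng.normal(size=(n, d))

# Hypothetical stand-ins for the learned horizontal and vertical modules.
uniform_heads = lambda x, heads: np.full(heads.shape[:2], 1.0 / heads.shape[1])
half_gate = lambda y: np.full(y.shape[1], 0.5)

out_plain = hvda_block(x, p)
out_hvda = hvda_block(x, p, head_weights=uniform_heads, channel_gate=half_gate)
```

Passing `None` for either callable recovers the corresponding baseline path, which mirrors the "if enabled" wording of the steps.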
The original source (Yu et al., 2022) summarizes this forward pass in concise pseudocode.
4. Hyperparameters and Configuration
Key hyperparameters for deploying HVDA are:
- $h$: Number of SDPA heads, typically unchanged from the baseline Transformer configuration.
- $r$: "Channel squeeze" dimension used in vertical attention, smaller than the model dimension $d$.
- $d_k$, $d_v$: Key/query and value dimensions per head, commonly set as $d_k = d_v = d/h$.
No constraints are imposed on the standard architectural parameters of the Transformer aside from the introduction of the extra weights referenced above.
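One way to hold these hyperparameters in code is a small configuration object. The sketch below is illustrative only: the class name `HVDAConfig`, its field names, and the default values (including the squeeze dimension of 32) are assumptions and are not taken from the original report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HVDAConfig:
    d_model: int = 512           # token feature dimension (illustrative default)
    num_heads: int = 8           # number of SDPA heads (illustrative default)
    squeeze_dim: int = 32        # vertical-attention squeeze dimension (assumed value)
    use_horizontal: bool = True  # enable horizontal attention in this block
    use_vertical: bool = True    # enable vertical attention in this block

    def __post_init__(self):
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 0 < self.squeeze_dim <= self.d_model:
            raise ValueError("squeeze_dim must be in (0, d_model]")

    @property
    def head_dim(self) -> int:
        # Common choice: key/query and value dimensions equal d_model / num_heads.
        return self.d_model // self.num_heads

cfg = HVDAConfig()
vertical_only = HVDAConfig(use_horizontal=False)
```

Keeping the two enable flags separate reflects that the horizontal and vertical mechanisms can be toggled independently.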
5. Computational Overhead and Complexity
HVDA is characterized by minimal computational and storage overhead:
- The core cost of multi-head SDPA remains 2 per block, requiring 3 parameters.
- Horizontal attention introduces approximately 4 computations per token, with parameter increment of approximately 5 for 6.
- Vertical attention incurs an additional 7 cost per token and about 8 extra parameters for 9.
- Both modules affect only 0 storage. Empirically, the observed floating-point operations (FLOPs) and parameter overhead are less than 1 of the baseline (Yu et al., 2022).
A table summarizing complexity increments:
| Module | Time Complexity | Param Increase (if 2) |
|---|---|---|
| Vanilla SDPA | 3 | 4 |
| + Horizontal | 5 per token | 6 |
| + Vertical | 7 per token | 8 |
This suggests that HVDA can be incorporated into existing architectures with negligible relative resource increase.
6. Modularity and Applicability
HVDA is described as highly modular, enabling insertion into a wide variety of Transformer models to yield performance gains in supervised learning tasks. The augmentation is compatible with vanilla Transformers and does not require modification to the SDPA core or positional encoding. The mechanisms for horizontal and vertical attention can be enabled or disabled independently in each block, facilitating flexible architectural experimentation (Yu et al., 2022).
7. Empirical Observations and Generalization
The authors report that Transformers equipped with HVDA modules generalize well across different supervised tasks, with only minor increases in computational cost or parameter count. A reference implementation is provided in the supplementary material of the original report, enabling straightforward integration and replication. A plausible implication is that these selective re-weighting and channel recalibration mechanisms may enhance feature expressivity and robustness without compromising efficiency (Yu et al., 2022).