
HVDA: Horizontal-Vertical Detail Attention

Updated 27 June 2026
  • HVDA is a novel module that integrates horizontal recalibration and vertical channel gating to efficiently enhance transformer token dependencies.
  • Horizontal attention employs learned re-weighting of multi-head outputs before projection to prioritize informative features.
  • Vertical attention applies channel-wise gating post-projection to refine feature representation with minimal computational overhead.

Horizontal–Vertical Detail Attention (HVDA) is an architectural module designed to augment self-attention mechanisms in Transformers by incorporating two orthogonal attention strategies: horizontal attention, which recalibrates multi-head outputs before projection, and vertical attention, which re-weights feature channels post-projection via explicit channel-wise modeling. Both mechanisms aim to enrich feature representation and token dependency modeling with negligible computational and parameter overhead, and can be modularly integrated into standard Transformer blocks (Yu et al., 2022).

1. Formal Definitions and Module Motivation

Let $X \in \mathbb{R}^{n \times D}$ denote the sequence of $n$ input tokens, each with $D$-dimensional features. Multi-head Scaled Dot-Product Attention (SDPA) produces $M$ parallel head outputs $H_1, \ldots, H_M$, $H_m \in \mathbb{R}^{n \times D_v}$, with typical dimensionalities $D_k = D_v = D/M$ per head.

  • Horizontal attention introduces a learned re-weighting vector $\alpha = [\alpha_1, \ldots, \alpha_M] \in \mathbb{R}^{n \times M}$ (per token) to emphasize "more informative" heads before the usual linear projection, replacing the concatenation and projection step with:

$$\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M)\, W^M$$

instead of $\text{Concat}(H_1, \ldots, H_M)\, W^M$.

  • Vertical attention recalibrates the $D$ channels of the projected multi-head output via a channel-wise gating vector, yielding a rescaled output.

The stated motivation is to enhance representation distinctiveness, model informative head outputs more selectively, and capture inter-channel dependencies (Yu et al., 2022).

2. Mathematical Formulation

The HVDA module formalizes attention recalibration via explicit learned transformations.

2.1 Horizontal Attention

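Given the multi-head SDPA outputs $H_1, \ldots, H_M$ defined above:

  • For each head $m = 1, \ldots, M$, compute a per-token head score using learned linear transforms, a final linear scoring layer, and a bias.
  • Stack the scores across heads into an $n \times M$ matrix.
  • Compute a softmax over heads to obtain $\alpha \in \mathbb{R}^{n \times M}$, with $\sum_{m=1}^{M} \alpha_{i,m} = 1$ for each token $i$.
  • Re-weight each head: $\alpha_m \cdot H_m$, where the row of $H_m$ belonging to token $i$ is scaled by $\alpha_{i,m}$.
  • Concatenate the re-weighted heads: $\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M)$
  • Project: $\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M)\, W^M$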

Key matrices, by role:

  • Two learned transform matrices used in the head-score computation
  • Final linear head-score weights
  • Head-score bias
  • Output projection $W^M$
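
The normalization and re-weighting steps can be illustrated with a minimal sketch in plain Python. It assumes the per-head scores are already available as an $n \times M$ array: the `scores` argument is a hypothetical stand-in for the learned scoring layers, whose exact form is not reproduced here. The names `softmax`, `matmul`, and `horizontal_attention` are illustrative and not taken from the reference implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def horizontal_attention(heads, scores, w_out):
    """Re-weight head outputs per token, then concatenate and project.

    heads  : list of M head outputs, each n x D_v
    scores : n x M per-token head scores (assumed given; hypothetical input)
    w_out  : (M * D_v) x D output projection, playing the role of W^M
    """
    alpha = [softmax(token_scores) for token_scores in scores]  # rows sum to 1
    concat = []
    for i, weights in enumerate(alpha):
        row = []
        for weight, head in zip(weights, heads):
            row.extend(weight * v for v in head[i])  # alpha_{i,m} * H_m[i]
        concat.append(row)
    return alpha, matmul(concat, w_out)

# Toy example: M = 2 heads, n = 2 tokens, D_v = 1, identity projection.
heads = [[[1.0], [2.0]], [[3.0], [4.0]]]
scores = [[0.0, 0.0], [0.0, math.log(3.0)]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
alpha, projected = horizontal_attention(heads, scores, w_out)
```

With equal scores, every head receives weight $1/M$ for that token; unequal scores shift weight toward the higher-scoring head before the projection is applied.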

2.2 Vertical Attention

Given the projected multi-head output with $D$ channels:

  • Compute a squeezed (reduced-dimension) representation of the channels.
  • Compute the gating vector from the squeezed representation.
  • Channel-wise recalibration: multiply the projected output element-wise by the gating vector.

Key matrices, by role:

  • Two linear projections
  • Mapping to the gating vector
  • Channel recalibration bias
  • Channel squeeze dimension
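
As an illustration of channel-wise gating, the following sketch uses a squeeze-and-excitation-style bottleneck gate applied per token. This particular form (a ReLU bottleneck followed by a sigmoid) is an assumption chosen for concreteness and may differ from the equations in Yu et al. (2022); `w_down`, `w_up`, `bias`, and the function names are hypothetical.

```python
import math

def matvec(vec, mat):
    """Row vector of length k times a k x c matrix (nested lists)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*mat)]

def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def vertical_attention(y, w_down, w_up, bias):
    """Gate each channel of the projected output y (n x D).

    w_down : D x r squeeze projection (r is the channel squeeze dimension)
    w_up   : r x D mapping from the squeezed vector to gate logits
    bias   : length-D bias added to the gate logits
    Assumed form: gate_i = sigmoid(relu(y_i @ w_down) @ w_up + bias).
    """
    gates, out = [], []
    for token in y:
        squeezed = [max(0.0, s) for s in matvec(token, w_down)]
        logits = matvec(squeezed, w_up)
        gate = [sigmoid(u + b) for u, b in zip(logits, bias)]
        gates.append(gate)
        out.append([g * v for g, v in zip(gate, token)])  # element-wise rescale
    return gates, out

# Toy example: n = 1 token, D = 4 channels, squeeze dimension r = 2.
y = [[1.0, -2.0, 3.0, 0.5]]
w_down = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
w_up = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
gates, gated = vertical_attention(y, w_down, w_up, bias)
```

Because each gate value lies strictly between 0 and 1, the output keeps the shape of `y` while individual channels are attenuated by different amounts.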

3. Integration within Transformer Architectures

HVDA is integrated into the standard Transformer block as a modular augmentation or replacement of the conventional multi-head attention sublayer. The following steps comprise the forward pass:

  1. Multi-head SDPA: Compute $M$ parallel head outputs $H_1, \ldots, H_M$ from the input $X$.
  2. Horizontal Attention: If enabled, compute per-token, per-head weights $\alpha$ and recalibrate each head output $H_m$. Concatenate and project to obtain the attention output.
  3. Vertical Attention: If enabled, compute the channel-wise gating vector and element-wise modulate the projected output to yield the recalibrated output.
  4. Residual and Layer Normalization: Apply the residual connection and layer normalization to the result.
  5. Feed-forward Sublayer: Apply the standard feed-forward and residual structure on the normalized output.

The original source (Yu et al., 2022) summarizes this forward pass in concise pseudocode.
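
The forward pass can also be sketched as a dependency-free Python function with independent switches for the two mechanisms. This is an illustrative reading of the steps listed above, not the authors' pseudocode. It assumes a post-norm residual layout, layer normalization without learned gain or bias, precomputed weights `alpha` ($n \times M$) and `gate` ($n \times D$), and a caller-supplied `ffn`; all names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(rows, eps=1e-5):
    """Normalize each token to zero mean and unit variance (no learned scale)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def hvda_block(x, heads, w_out, ffn, alpha=None, gate=None):
    """One Transformer block with optional horizontal/vertical attention.

    x     : n x D block input
    heads : M head outputs from multi-head SDPA, each n x D_v (step 1)
    alpha : n x M head weights, or None to disable horizontal attention
    gate  : n x D channel gates, or None to disable vertical attention
    ffn   : callable mapping an n x D matrix to an n x D matrix
    """
    # Step 2: optionally re-weight heads, then concatenate and project.
    concat = []
    for i in range(len(x)):
        row = []
        for m, head in enumerate(heads):
            weight = alpha[i][m] if alpha is not None else 1.0
            row.extend(weight * v for v in head[i])
        concat.append(row)
    y = matmul(concat, w_out)
    # Step 3: optionally modulate channels element-wise.
    if gate is not None:
        y = [[g * v for g, v in zip(g_row, y_row)] for g_row, y_row in zip(gate, y)]
    # Step 4: residual connection and layer normalization (post-norm assumed).
    h = layer_norm(add(x, y))
    # Step 5: feed-forward sublayer with its own residual and normalization.
    return layer_norm(add(h, ffn(h)))

def feed_forward(h):
    """Placeholder feed-forward: ReLU of a doubling map (illustration only)."""
    return [[max(0.0, 2.0 * v) for v in row] for row in h]

# Toy example: n = 2 tokens, D = 4, M = 2 heads with D_v = 2.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
heads = [[[1.0, 0.0], [2.0, 1.0]], [[0.0, 3.0], [1.0, 1.0]]]
w_out = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
baseline = hvda_block(x, heads, w_out, feed_forward)
with_hvda = hvda_block(x, heads, w_out, feed_forward,
                       alpha=[[0.5, 0.5], [0.25, 0.75]],
                       gate=[[1.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]])
```

Passing `alpha=None` and `gate=None` reduces the function to a plain multi-head attention sublayer followed by the feed-forward sublayer, which makes it easy to compare the two mechanisms separately.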

4. Hyperparameters and Configuration

Key hyperparameters for deploying HVDA are:

  • $M$: Number of SDPA heads, typically unchanged from the baseline Transformer configuration.
  • Channel squeeze dimension: the reduced dimension used in vertical attention.
  • $D_k$, $D_v$: Key/query and value dimensions per head, commonly set as $D_k = D_v = D/M$.

No constraints are imposed on the standard architectural parameters of the Transformer aside from the introduction of the extra weights referenced above.

5. Computational Overhead and Complexity

HVDA is characterized by minimal computational and storage overhead:

  • The core cost and parameter count of multi-head SDPA per block are unchanged from the baseline.
  • Horizontal attention adds a small per-token computation for the head weights, with a correspondingly small parameter increment.
  • Vertical attention adds a small per-token cost for the channel gate, with a small number of extra parameters.
  • Both modules add only minor storage overhead. Empirically, the observed floating-point operations (FLOPs) and parameter overhead are a small fraction of the baseline (Yu et al., 2022).

The source tabulates these increments by module: vanilla SDPA serves as the baseline, and horizontal and vertical attention each add a per-token time cost and a parameter increase on top of it.

This suggests that HVDA can be incorporated into existing architectures with negligible relative resource increase.

6. Modularity and Applicability

HVDA is described as highly modular, enabling insertion into a wide variety of Transformer models to yield performance gains in supervised learning tasks. The augmentation is compatible with vanilla Transformers and does not require modification to the SDPA core or positional encoding. The mechanisms for horizontal and vertical attention can be enabled or disabled independently in each block, facilitating flexible architectural experimentation (Yu et al., 2022).

7. Empirical Observations and Generalization

The authors demonstrate that Transformers equipped with HVDA modules exhibit high generalization capability across different supervised tasks, with only minor increases in computational cost or parameter count. The code for reference implementation is provided in the supplementary material of the original report, enabling straightforward integration and replication. A plausible implication is that these selective re-weighting and channel recalibration mechanisms may enhance feature expressivity and robustness without compromising efficiency (Yu et al., 2022).
