HVDA: Horizontal-Vertical Detail Attention

Updated 27 June 2026

HVDA is a novel module that integrates horizontal recalibration and vertical channel gating to efficiently enhance transformer token dependencies.
Horizontal attention employs learned re-weighting of multi-head outputs before projection to prioritize informative features.
Vertical attention applies channel-wise gating post-projection to refine feature representation with minimal computational overhead.

Horizontal–Vertical Detail Attention (HVDA) is an architectural module designed to augment self-attention mechanisms in Transformers by incorporating two orthogonal attention strategies: horizontal attention, which recalibrates multi-head outputs before projection, and vertical attention, which re-weights feature channels post-projection via explicit channel-wise modelling. Both mechanisms aim to enrich feature representation and token dependency modeling with negligible computational and parameter overhead, and can be modularly integrated into standard Transformer blocks (Yu et al., 2022).

1. Formal Definitions and Module Motivation

Let $X \in \mathbb{R}^{n \times D}$ denote the sequence of $n$ input tokens, each with $D$-dimensional features. Multi-head Scaled Dot-Product Attention (SDPA) produces $M$ parallel head outputs $H_1,\ldots,H_M$, with $H_m \in \mathbb{R}^{n \times D_v}$ and typical per-head dimensionalities $D_k = D_v = D/M$.

Horizontal attention introduces learned re-weighting coefficients $\alpha = [\alpha_1, \ldots, \alpha_M] \in \mathbb{R}^{n \times M}$, one weight per token and head, to emphasize "more informative" heads before the usual linear projection. The concatenation and projection step becomes

$\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M) W^M$

instead of $\text{Concat}(H_1, \ldots, H_M) W^M$, where $\alpha_m \in \mathbb{R}^{n}$ scales each token's row of $H_m$.
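The re-weighted concatenation can be illustrated with a minimal pure-Python sketch. It assumes that $\alpha$ is already given and that $\alpha_m \cdot H_m$ means scaling each token's row of $H_m$ by that token's weight for head $m$. The helper names (`reweight_and_concat`, `matmul`) and the toy values are illustrative and not taken from the source.

```python
def reweight_and_concat(heads, alpha):
    """Return Concat(alpha_1 * H_1, ..., alpha_M * H_M).

    heads: list of M matrices, each n x D_v (list of rows).
    alpha: n x M matrix of per-token head weights.
    """
    out = []
    for i in range(len(alpha)):
        row = []
        for m, head in enumerate(heads):
            # scale token i's features in head m by alpha[i][m]
            row.extend(alpha[i][m] * v for v in head[i])
        out.append(row)
    return out


def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy example: n = 2 tokens, M = 2 heads, D_v = 2.
H1 = [[1.0, 2.0], [3.0, 4.0]]
H2 = [[5.0, 6.0], [7.0, 8.0]]
alpha_toy = [[0.25, 0.75], [1.0, 0.0]]

# Identity stands in for the output projection W^M (an assumption for clarity).
W_M = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

concat = reweight_and_concat([H1, H2], alpha_toy)
projected = matmul(concat, W_M)
```

With the second token's weights set to 1 and 0, that token's output keeps only the first head's features, which is the selective emphasis the re-weighting is meant to provide.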

Vertical attention recalibrates the $D$ channels of the projected multi-head output via a channel-wise gating vector, yielding a rescaled output of the same shape.

The stated motivation is to enhance representation distinctiveness, model informative head outputs more selectively, and capture inter-channel dependencies (Yu et al., 2022).

2. Mathematical Formulation

The HVDA module formalizes attention recalibration via explicit learned transformations.

2.1 Horizontal Attention

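Starting from the head outputs $H_1, \ldots, H_M$, with $H_m \in \mathbb{R}^{n \times D_v}$:

For each head $m$, compute a per-token score using two learned transforms followed by a final linear head-score layer with a bias.
Stack the scores across heads into an $n \times M$ matrix.
Compute a softmax over heads to obtain $\alpha \in \mathbb{R}^{n \times M}$, with $\sum_{m=1}^{M} \alpha_{i,m} = 1$ for each token $i$.
Re-weight each head: $\alpha_m \cdot H_m$.
Concatenate: $\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M)$.
Project: $\text{Concat}(\alpha_1 \cdot H_1, \ldots, \alpha_M \cdot H_M) W^M$.

Key matrices: two learned transforms that feed the head score, the final linear head-score weight, the head-score bias, and the output projection $W^M$.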
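The softmax-over-heads step can be sketched as follows. The raw head scores are taken as given here, because this sketch does not attempt to reproduce how the source computes them. The function name `softmax_over_heads` and the example scores are hypothetical.

```python
import math


def softmax_over_heads(scores):
    """Normalize raw head scores per token.

    scores: n x M matrix (one row per token, one column per head).
    Returns an n x M matrix whose rows are positive and sum to 1.
    """
    weights = []
    for row in scores:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two tokens, three heads: the first token scores all heads equally,
# the second prefers head 0.
example_scores = [[0.0, 0.0, 0.0], [2.0, 0.0, -1.0]]
head_weights_example = softmax_over_heads(example_scores)
```

Because the normalization runs across heads rather than across tokens, each token distributes a unit budget of weight over the $M$ heads independently of the other tokens.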

2.2 Vertical Attention

Given the projected multi-head output, vertical attention proceeds in three steps:

Compute a squeezed representation by linearly projecting down to the channel squeeze dimension.
Map the squeezed representation to a channel-wise gating vector, with a channel recalibration bias.
Apply channel-wise recalibration by multiplying the projected output element-wise with the gating vector.

Key matrices: two linear projections of equal shape, a mapping to the gating vector, and a channel recalibration bias. The channel squeeze dimension sets the width of the squeezed representation.
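The squeeze, gate, and recalibrate steps can be illustrated with a generic squeeze-and-gate sketch. This is not the source's exact formulation. It assumes a single down-projection (the section lists two linear projections), a ReLU after the squeeze, and a sigmoid gate, and all names (`vertical_gate`, `W_down`, `W_up`, `b_up`) are hypothetical.

```python
import math


def vec_times_mat(vec, mat):
    """Multiply a length-a vector by an a x b matrix (list of rows)."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]


def vertical_gate(Y, W_down, W_up, b_up):
    """Per-token channel gating.

    Y: n x D projected attention output.
    W_down: D x D_s squeeze projection (assumed single projection).
    W_up: D_s x D mapping to the gating vector.
    b_up: length-D gating bias.
    """
    out = []
    for y in Y:
        squeezed = [max(0.0, v) for v in vec_times_mat(y, W_down)]  # ReLU is an assumption
        logits = [u + b for u, b in zip(vec_times_mat(squeezed, W_up), b_up)]
        gate = [1.0 / (1.0 + math.exp(-t)) for t in logits]  # sigmoid is an assumption
        out.append([g * v for g, v in zip(gate, y)])  # element-wise recalibration
    return out


# Toy example: D = 2 channels, squeeze dimension D_s = 1.
Y_toy = [[1.0, 2.0]]
gated = vertical_gate(Y_toy, W_down=[[0.0], [0.0]], W_up=[[1.0, 1.0]], b_up=[0.0, 0.0])
```

With zero squeeze weights and zero bias the gate is 0.5 for every channel, so the output is half the input. Under these assumptions, trained weights would move individual channel gates toward 0 or 1.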

3. Integration within Transformer Architectures

HVDA is integrated into the standard Transformer block as a modular augmentation or replacement of the conventional multi-head attention sublayer. The following steps comprise the forward pass:

Multi-head SDPA: Compute $M$ parallel head outputs $H_1, \ldots, H_M$ from the input $X$.
Horizontal Attention: If enabled, compute per-token, per-head weights $\alpha$ and recalibrate each head as $\alpha_m \cdot H_m$. Concatenate and project to obtain the projected multi-head output.
Vertical Attention: If enabled, compute the channel-wise gating vector and element-wise modulate the projected output to yield the recalibrated output.
Residual and Layer Normalization: Apply the residual connection and layer normalization to the attention output.
Feed-forward Sublayer: Apply the standard feed-forward and residual structure on the result.

The original source (Yu et al., 2022) gives concise pseudocode for this forward pass.
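As a rough illustration that is not taken from the source, the attention part of the forward pass described above can be sketched with the SDPA, head-scoring, projection, and gating stages passed in as callables. The function name, its parameters, and the stub callables in the usage example are all hypothetical. The residual, layer normalization, and feed-forward steps are left out.

```python
def hvda_attention_sublayer(X, sdpa_heads, head_weights, project, channel_gate,
                            use_horizontal=True, use_vertical=True):
    """Compose the attention stages; every callable is supplied by the caller.

    sdpa_heads(X)          -> list of M matrices, each n x D_v
    head_weights(X, heads) -> n x M per-token head weights
    project(concat)        -> n x D projected output
    channel_gate(Y)        -> n x D gating values
    """
    n = len(X)
    heads = sdpa_heads(X)
    if use_horizontal:
        alpha = head_weights(X, heads)
        # scale each token's row of head m by alpha[i][m]
        heads = [[[alpha[i][m] * v for v in head[i]] for i in range(n)]
                 for m, head in enumerate(heads)]
    concat = [[v for head in heads for v in head[i]] for i in range(n)]
    Y = project(concat)
    if use_vertical:
        gate = channel_gate(Y)
        Y = [[g * y for g, y in zip(g_row, y_row)]
             for g_row, y_row in zip(gate, Y)]
    return Y


# Stub stages for a toy run (stand-ins, not real attention).
X_toy = [[1.0, 2.0], [3.0, 4.0]]
stub_sdpa = lambda X: [X, X]                               # two identical "heads"
stub_weights = lambda X, heads: [[0.5, 0.5] for _ in X]    # uniform head weights
stub_project = lambda C: [[r[0] + r[2], r[1] + r[3]] for r in C]
stub_gate = lambda Y: [[1.0] * len(r) for r in Y]          # gate that passes everything

with_hvda = hvda_attention_sublayer(X_toy, stub_sdpa, stub_weights,
                                    stub_project, stub_gate)
plain = hvda_attention_sublayer(X_toy, stub_sdpa, stub_weights, stub_project,
                                stub_gate, use_horizontal=False, use_vertical=False)
```

Keeping the two flags independent mirrors the description that each mechanism can be switched on or off separately.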

4. Hyperparameters and Configuration

Key hyperparameters for deploying HVDA are:

$M$: Number of SDPA heads, typically unchanged from the baseline Transformer configuration.
Channel squeeze dimension: Width of the squeezed representation used in vertical attention.
$D_k$, $D_v$: Key/query and value dimensions per head, commonly set as $D_k = D_v = D/M$.

No constraints are imposed on the standard architectural parameters of the Transformer aside from the introduction of the extra weights referenced above.
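One way to hold these settings together is a small configuration object, sketched below. The class name `HVDAConfig`, its field names, and the example values (including the squeeze dimension of 64) are illustrative assumptions and not values reported by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HVDAConfig:
    model_dim: int               # D, feature dimension per token
    num_heads: int               # M, number of SDPA heads
    squeeze_dim: int             # channel squeeze dimension for vertical attention
    use_horizontal: bool = True  # the two mechanisms toggle independently
    use_vertical: bool = True

    def __post_init__(self):
        if self.model_dim % self.num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Per-head dimension under the common choice D_k = D_v = D / M."""
        return self.model_dim // self.num_heads


cfg = HVDAConfig(model_dim=512, num_heads=8, squeeze_dim=64)
```

Deriving `head_dim` from the other two fields keeps the common setting $D_k = D_v = D/M$ consistent without storing it separately.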

5. Computational Overhead and Complexity

HVDA is characterized by minimal computational and storage overhead:

The core cost of multi-head SDPA is unchanged.
Horizontal attention adds a small per-token computation and a small number of parameters for the head-scoring weights.
Vertical attention adds a further per-token cost and extra parameters for the channel-gating weights.
Empirically, the observed floating-point operations (FLOPs) and parameter overhead are a small fraction of the baseline (Yu et al., 2022).

This suggests that HVDA can be incorporated into existing architectures with negligible relative resource increase.

6. Modularity and Applicability

HVDA is described as highly modular, enabling insertion into a wide variety of Transformer models to yield performance gains in supervised learning tasks. The augmentation is compatible with vanilla Transformers and does not require modification to the SDPA core or positional encoding. The mechanisms for horizontal and vertical attention can be enabled or disabled independently in each block, facilitating flexible architectural experimentation (Yu et al., 2022).

7. Empirical Observations and Generalization

The authors report that Transformers equipped with HVDA modules generalize well across different supervised tasks, with only minor increases in computational cost or parameter count. A reference implementation is provided in the supplementary material of the original report to support integration and replication. A plausible implication is that these selective re-weighting and channel recalibration mechanisms may enhance feature expressivity and robustness without compromising efficiency (Yu et al., 2022).


References

Yu et al. (2022). Horizontal and Vertical Attention in Transformers.
