
Knocking-Heads Attention (KHA)

Updated 5 February 2026
  • KHA is a multi-head attention enhancement that uses shared, diagonally-initialized projections to enable direct cross-head feature interaction.
  • It preserves head-specific expressiveness by mitigating the low-rank bottleneck that occurs when increasing the number of attention heads.
  • Empirical evaluations show improved training stability and reduced loss with minimal computational overhead, making it versatile for various attention architectures.

Knocking-Heads Attention (KHA) is an enhancement for multi-head attention (MHA) architectures, enabling lightweight and direct cross-head feature interactions prior to the scaled dot-product operation. KHA addresses the representational trade-offs inherent in classical MHA as the number of attention heads increases; specifically, the loss of per-head expressiveness due to the reduction in per-head dimensionality. By introducing shared, diagonally-initialized projection matrices across all heads, KHA preserves head specificity at initialization while allowing feature-level communication, leading to improved expressiveness and training stability with minimal computational overhead (Zhou et al., 27 Oct 2025).

1. Motivation and Context

Classical MHA architectures, foundational to modern LLMs, decompose their input X ∈ ℝ^{L×d} into n attention heads of dimension d_k = d/n. An increased number of heads is believed to offer finer-grained relational modeling, yet each head consequently operates in a lower-rank subspace, causing a loss in representational power, a phenomenon highlighted in the literature as the low-rank bottleneck. Furthermore, traditional MHA and prominent variants such as Grouped-Query Attention (GQA) and Grouped-Tied Attention (GTA) handle each head in isolation, concatenating outputs without inter-head feature mixing. Methods like Talking-Heads Attention introduce mixing at the attention-logit or post-softmax level, but at the expense of quadratic overhead and limited scalability. CollabHead applies large shared projections, but at the cost of specialization and increased FLOPs.

Knocking-Heads Attention circumvents these drawbacks by integrating a shared, lightweight projection into the per-head feature pipeline, adding only O(d²/n) computation. This arrangement maintains head specialization while enabling beneficial cross-head communication (Zhou et al., 27 Oct 2025).

2. Formal Description and Architectural Details

Given X ∈ ℝ^{L×d} and n heads, KHA first applies standard head-specific projections:

Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V \qquad (W_i^* \in \mathbb{R}^{d \times d_k})

In KHA-Linear, three shared matrices T^Q, T^K ∈ ℝ^{d_k×d_k} and T^V ∈ ℝ^{d_v×d_v} are introduced and applied after the head projections but before the attention computation:

\tilde{Q}_i = Q_i T^Q, \quad \tilde{K}_i = K_i T^K, \quad \tilde{V}_i = V_i T^V

Attention proceeds as in standard MHA but on these transformed representations, and the same T^* matrices are reused across all heads, enforcing feature-level coupling. For inference, T^* can be fused into W_i^*, incurring no additional inference cost.
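Because the shared transform is a linear map applied after a linear projection, it can be folded into each head's weights offline. A minimal numpy sketch of this fusion, with toy shapes and random stand-in weights (all sizes here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, d_k = 8, 64, 16  # toy sequence length, model dim, per-head dim (assumed)

W_q = rng.standard_normal((d, d_k))    # head-specific projection W_i^Q
T_q = rng.standard_normal((d_k, d_k))  # shared cross-head transform T^Q
X = rng.standard_normal((L, d))

# Training-time path: project per head, then apply the shared transform.
Q_tilde = (X @ W_q) @ T_q

# Inference-time path: fuse T^Q into the head projection once, offline.
W_fused = W_q @ T_q
Q_fused = X @ W_fused
```

By matrix associativity the two paths coincide, which is why the fused form costs nothing extra at inference.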

In KHA-MLP, the shared linear transform T^V is replaced with a small, gated MLP:

\tilde{V}_i = \mathrm{MLP}(V_i) = 2\,(V_i W^{\mathrm{up}} \odot \sigma(V_i W^{\mathrm{gate}}))\, W^{\mathrm{down}}

where W^{up}, W^{gate}, W^{down} ∈ ℝ^{d_v×d_v} are shared across all heads and σ(·) denotes the sigmoid function.
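The gated transform above is straightforward to express directly; the sketch below applies one shared weight set to every head's values, using toy sizes and random weights (all shapes and scales here are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kha_mlp(V, W_up, W_gate, W_down):
    # 2 * (V W_up ⊙ σ(V W_gate)) W_down, with one weight set shared by all heads
    return 2.0 * ((V @ W_up) * sigmoid(V @ W_gate)) @ W_down

rng = np.random.default_rng(0)
L, d_v, n_heads = 8, 16, 4  # toy shapes (assumed)
W_up = rng.standard_normal((d_v, d_v)) * 0.1
W_gate = rng.standard_normal((d_v, d_v)) * 0.1
W_down = rng.standard_normal((d_v, d_v)) * 0.1

# The same three matrices transform every head's value matrix.
V_heads = [rng.standard_normal((L, d_v)) for _ in range(n_heads)]
V_tilde = [kha_mlp(V, W_up, W_gate, W_down) for V in V_heads]
```

Since the MLP keeps the d_v × d_v shape, it drops into the value pipeline without changing any downstream dimensions.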

3. Initialization and Training Dynamics

KHA employs diagonal initialization to preserve head-specific inductive bias. For KHA-Linear, T^Q, T^K, and T^V are initialized as identity matrices plus small positive diagonal perturbations; for KHA-MLP, W^{up} and W^{down} are initialized as diagonal matrices and W^{gate} as zeros, ensuring the initial forward pass is equivalent to unmodified MHA. As training progresses, gradients activate the off-diagonal elements and W^{gate}, progressively enabling cross-head feature blending without abrupt functional drift (Zhou et al., 27 Oct 2025).
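The initialization-time equivalence can be checked numerically. The sketch below uses a pure identity for the linear case (omitting the small diagonal perturbation for an exact check) and the zero-gate diagonal initialization for the MLP case; the key observation is that σ(0) = 0.5, so the factor of 2 in the MLP cancels the gate exactly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 16
V = rng.standard_normal((8, d))  # one head's values (toy sizes, assumed)

# KHA-Linear init: identity (small diagonal perturbation omitted here).
T_v = np.eye(d)
linear_init_out = V @ T_v  # identical to the untransformed values

# KHA-MLP init: diagonal W_up / W_down, zero W_gate.
W_up, W_down, W_gate = np.eye(d), np.eye(d), np.zeros((d, d))
mlp_init_out = 2.0 * ((V @ W_up) * sigmoid(V @ W_gate)) @ W_down
# sigmoid(0) = 0.5, so 2 * (V ⊙ 0.5) = V and the forward pass matches plain MHA.
```

Off-diagonal weights and the gate start contributing only once gradients move them away from this initialization.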

4. Integration with Attention Variants

KHA’s head-sharing projection is inserted directly after the per-head linear maps for Q, K, and V, making it compatible with any softmax-based multi-head framework, including MHA, GQA, GTA, multi-query attention (MQA), and multi-head latent attention (MLA). For example, within GQA, T^V is applied to each group’s V projection; within GTA, to the shared V projection. No architectural changes are required beyond the introduction of the shared projection.
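As a sketch of the GQA case, the snippet below applies one shared T^V to every KV group's values and maps query heads onto groups; the shapes and the head-to-group assignment are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k, n_q_heads, n_kv_groups = 8, 16, 8, 2  # toy GQA configuration (assumed)

T_v = rng.standard_normal((d_k, d_k)) * 0.1   # single shared T^V for all groups

# One value matrix per KV group; each passes through the same shared transform.
V_groups = [rng.standard_normal((L, d_k)) for _ in range(n_kv_groups)]
V_tilde = [V @ T_v for V in V_groups]

# Query heads read the transformed values of their assigned group.
group_of_head = [h * n_kv_groups // n_q_heads for h in range(n_q_heads)]
```

Because the transform sits between the group projection and the attention matmul, nothing else in the GQA layer has to change.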

A concise pseudocode for KHA-Linear:

def KHA_Attention(X):
    for each head i in 1...n:
        Q_i = X @ W_i^Q
        K_i = X @ W_i^K
        V_i = X @ W_i^V

        tilde_Q_i = Q_i @ T^Q
        tilde_K_i = K_i @ T^K
        tilde_V_i = V_i @ T^V

        O_i = softmax(tilde_Q_i @ tilde_K_i^T / sqrt(d_k)) @ tilde_V_i

    return concat(O_1...O_n) @ W^O
(Zhou et al., 27 Oct 2025)
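The pseudocode can be made concrete in a few lines of numpy. The sketch below uses toy shapes, random small-scale weights, and identity-initialized T matrices; all of these are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kha_linear_attention(X, W_q, W_k, W_v, W_o, T_q, T_k, T_v):
    """KHA-Linear: per-head projections, shared T transforms, softmax attention."""
    d_k = W_q[0].shape[1]
    heads = []
    for i in range(len(W_q)):
        Q = X @ W_q[i] @ T_q  # the shared T^* matrices couple head features
        K = X @ W_k[i] @ T_k
        V = X @ W_v[i] @ T_v
        A = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
L, d, n = 8, 32, 4
d_k = d // n
X = rng.standard_normal((L, d))
W_q = [rng.standard_normal((d, d_k)) * 0.1 for _ in range(n)]
W_k = [rng.standard_normal((d, d_k)) * 0.1 for _ in range(n)]
W_v = [rng.standard_normal((d, d_k)) * 0.1 for _ in range(n)]
W_o = rng.standard_normal((d, d)) * 0.1
T_q, T_k, T_v = np.eye(d_k), np.eye(d_k), np.eye(d_k)  # diagonal-style init

out = kha_linear_attention(X, W_q, W_k, W_v, W_o, T_q, T_k, T_v)
```

With identity T matrices this reduces exactly to standard MHA, matching the initialization behavior described in Section 3.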

5. Computational and Parameter Overhead

KHA introduces negligible resource requirements relative to standard MHA. For KHA-Linear, the additional training computation is 6Ld²/n FLOPs per layer, corresponding to only 0.55% of full-layer compute and 1.17% of MHA’s FLOPs for typical configurations (L = 2048, d = 1024, n = 32). Parameter overhead is O(d²/n), under 1% of attention-block parameters. At inference, the projections may be statically fused, resulting in zero additional runtime cost (Zhou et al., 27 Oct 2025).
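A back-of-envelope check reproduces the 1.17% figure, assuming the usual 2mnk FLOP count per matmul and counting MHA as four d × d projections plus the score and value matmuls (this accounting convention is an assumption, since the paper's exact tally is not spelled out here):

```python
# Quoted configuration: L = 2048, d = 1024, n = 32 heads.
L, d, n = 2048, 1024, 32

# KHA-Linear overhead: three shared d_k x d_k matmuls across n heads per layer.
kha_flops = 6 * L * d**2 // n

# Standard MHA per layer: Q/K/V/O projections (8Ld^2) + score and value matmuls (4L^2 d).
mha_flops = 8 * L * d**2 + 4 * L**2 * d

ratio = kha_flops / mha_flops
print(f"KHA overhead: {kha_flops:,} FLOPs = {100 * ratio:.2f}% of MHA")  # ≈ 1.17%
```

Under these assumptions the overhead works out to roughly 0.4 GFLOPs per layer against about 34 GFLOPs for the attention block, consistent with the quoted percentage.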

6. Empirical Evaluation and Results

KHA demonstrated substantial improvements in large-scale training regimes. A 6.1B-parameter Mixture-of-Experts (MoE) model (1.01B parameters active per token, with GQA using 32 query heads and 4 KV groups, head size 128) was trained on 1T high-quality tokens using FSDP-based training with the Adam optimizer. KHA, notably the MLP variant applied to V, reduced both the occurrence and severity of loss spikes, delivering a consistent 0.015 reduction in loss in the stabilized regime.

Results summary:

| Setting | KHA Effect (Loss / Downstream) |
| --- | --- |
| More KV heads (1→4, Table 2) | KHA-MLP lowers loss by up to 0.024; KHA-Linear shows a smaller benefit |
| Projection ablation (Table 3) | V projection is most effective; MLP > linear; gating alone is harmful |
| Attention variant compatibility | Loss reductions across MHA, GQA, GTA, MQA, and MLA (−0.010 to −0.020) |
| Downstream task metrics (Table 5) | RACE +4.32, Code +3.90, Math +1.62, Overall +1.26 points (across 30+ tasks) |

Empirical results indicate KHA is most effective when cross-head transformations are applied to V via a shared MLP, while transformations on Q and K yield only marginal gains (Zhou et al., 27 Oct 2025).

7. Interpretation, Limitations, and Prospective Extensions

KHA’s improvements arise from two factors: (1) increased expressiveness via statistical strength-sharing among heads, mitigating low-rank constraints; (2) implicit regularization, as diagonal initialization stabilizes early training and reduces catastrophic gradient spikes.

Limitations include variable gains across tasks, with minor impact on general-knowledge benchmarks. Further research could address task-specific gating or sparsity, richer nonlinear sharing architectures (such as intra-MLP cross-head attention), and adaptive per-layer or per-token sharing to optimize specialization versus collaboration. Extending the diagonal-initialized head-sharing principle to other neural components, including convolutional, recurrent, or retrieval-augmented modules, remains an open avenue (Zhou et al., 27 Oct 2025).
