Similarity-Guided Per-Head Reuse

  • The paper introduces a method that identifies and reuses similar attention heads in multi-head architectures to reduce computational and memory costs.
  • It employs similarity metrics like cosine similarity, CKA, and total variation to cluster or pair heads for strategies such as linear fusion and orthonormal alignment.
  • Empirical results demonstrate significant efficiency gains in LLMs, achieving up to 75% KV cache savings with only minor accuracy drops on tasks such as NLU and QA.

Similarity-guided per-head reuse refers to a family of methods that exploit inter-head redundancy in multi-head attention (MHA) architectures by identifying and reusing similar heads, thereby reducing parameter, memory, or computational cost. Techniques under this paradigm employ similarity metrics and clustering or pairing strategies to determine which heads (or their parameters, activations, or attention patterns) can be efficiently shared or merged. Recent research demonstrates that despite the functional diversity of attention heads, systematic similarities exist and can be leveraged for both compression and acceleration in LLMs.

1. Motivations and Redundancy Analysis

Transformers with MHA contain numerous attention heads per layer, designed to capture diverse token interactions. However, empirical analyses reveal substantial redundancy both within and across layers. For example, in BERT and ViT architectures, the best-matched heads across adjacent layers often have similarity $S \approx 0.8$–$0.9$ based on the total variation of their attention distributions, with even the 5th-best matches maintaining $S \approx 0.6$–$0.7$ (Bhojanapalli et al., 2021). Similarly, experiments in LLMs show that value and key head weights, as well as their induced attention patterns, cluster into a small number of functional types (Chen et al., 3 Jun 2024, Peng et al., 26 May 2025).
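For illustration, here is a minimal sketch of the TV-based similarity score that underlies the numbers quoted above, following the formula given in Section 2; it assumes two row-stochastic attention matrices of equal shape, and the toy usage is illustrative only:

```python
import numpy as np

def tv_similarity(A: np.ndarray, B: np.ndarray) -> float:
    """S(A, B) = 1 minus the mean (over query positions) total-variation
    distance between the two attention distributions (rows sum to 1)."""
    assert A.shape == B.shape
    tv_per_row = 0.5 * np.abs(A - B).sum(axis=-1)   # TV distance per query position
    return float(1.0 - tv_per_row.mean())

# Toy usage: two nearly identical 4x4 attention maps score close to 1.
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(4), size=4)
B = 0.9 * A + 0.1 * rng.dirichlet(np.ones(4), size=4)
print(tv_similarity(A, B))  # high similarity, close to 1
```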

Redundancy manifests as:

  • Cross-layer redundancy: best-matched heads in adjacent layers produce highly similar attention distributions.
  • Intra-layer redundancy: key and value head weights, together with their induced attention patterns, cluster into a small number of functional types.

These findings motivate methods that systematically share, fuse, or prune heads, guided by measured similarity.

2. Similarity Metrics and Grouping Strategies

Similarity-guided per-head reuse depends crucially on selecting an appropriate metric, as the criterion underpins grouping, sharing, or merging.

  • Cosine similarity of concatenated Q/K projection weights (Cao et al., 19 Feb 2024), used for identifying head pairs most amenable to weight sharing. Ablations show that cosine similarity on $[W^q \| W^k]$ outperforms metrics based solely on Q, K, or V for attention head reuse (see the sketch following this list).
  • Centered Kernel Alignment (CKA) (Chen et al., 3 Jun 2024) provides a robust measure of functional similarity between projection matrices: $\mathrm{CKA}(W_i, W_j) = \frac{\|W_i^T W_j\|_F^2}{\sqrt{\|W_i^T W_i\|_F^2 \cdot \|W_j^T W_j\|_F^2}}$.
  • Total Variation (TV) distance for attention matrices (Bhojanapalli et al., 2021): $S(A,B) = 1 - \frac{1}{n}\sum_{p=1}^n \frac{1}{2}\|A[p,:] - B[p,:]\|_1$.
  • Jensen–Shannon (JS) divergence between softmax-normalized, block-averaged attention patterns, used for attention map clustering (Peng et al., 26 May 2025).
  • Cache-based Euclidean or cosine similarity on real activations during inference, which more faithfully represents functional overlap than weights alone (Jin et al., 30 Dec 2024).
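A minimal sketch of the two weight-based metrics above, assuming per-head projection matrices of shape (d_model, d_head); the function and variable names are illustrative, not taken from the cited papers:

```python
import numpy as np

def qk_cosine_similarity(Wq_i, Wk_i, Wq_j, Wk_j):
    """Cosine similarity of the concatenated, flattened Q/K projections [W^q || W^k]."""
    a = np.concatenate([Wq_i.ravel(), Wk_i.ravel()])
    b = np.concatenate([Wq_j.ravel(), Wk_j.ravel()])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cka(Wi, Wj):
    """Linear CKA between two projection matrices, per the formula above."""
    num = np.linalg.norm(Wi.T @ Wj, "fro") ** 2
    den = np.linalg.norm(Wi.T @ Wi, "fro") * np.linalg.norm(Wj.T @ Wj, "fro")
    return float(num / den)
```

A full pipeline would evaluate such a metric over all head pairs in a layer to form the H×H similarity matrix described in Section 4.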

Selection and assignment are typically performed per layer: given the pairwise similarity matrix, heads are greedily paired or clustered, subject to a global sharing budget or a fusion-loss criterion (see Section 4).

3. Per-Head Reuse Mechanisms in Model Architecture

The architectural adaptation for similarity-guided per-head reuse can be categorized as follows:

(a) Parameter-level Sharing and Merging

  • Direct Sharing: Identified similar heads directly share all Q/K/V projection matrices (no retraining), as in DirectShare (Cao et al., 19 Feb 2024).
  • Similarity-constrained Alignment and Retraining: PostShare adds a regularizer to the training objective to minimize $L_{\text{share}} = \sum_{(i,j)\in S}\left\|\,[W^q_i \| W^k_i] - [W^q_j \| W^k_j]\,\right\|_F^2$, enforcing Q/K proximity. Shared weights are then indexed by both heads during inference (Cao et al., 19 Feb 2024).
  • Linear Fusion: DHA fuses heads within a cluster into a single head via a learned linear combination of their weights, $W_k^{\text{fused}} = \sum_j \omega_j W_k^{(j)}$ (analogously for $W_v$), with fusion weights initialized as one-hot and then co-optimized with a Lagrangian objective to minimize fusion loss (Chen et al., 3 Jun 2024). A minimal sketch of this fusion step follows this list.
  • Orthonormal Alignment: Heads within a sharing group are first aligned in an orthogonal subspace (via (Generalized) Procrustes analysis) before averaging or merging, to minimize loss from functional misalignment (Jin et al., 30 Dec 2024).
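To make the fusion step concrete, here is a minimal sketch of cluster-wise linear fusion of key-projection weights with one-hot initialization of the coefficients; the names (fuse_heads, omega) are illustrative assumptions, and DHA's Lagrangian co-optimization is omitted:

```python
import numpy as np

def fuse_heads(W_list, omega=None):
    """Fuse a cluster of per-head projection matrices into one head.

    W_list : list of arrays, each (d_model, d_head), e.g. key projections W_k^(j).
    omega  : fusion coefficients; defaults to one-hot on the first head,
             mirroring the one-hot initialization used before co-optimization.
    """
    W = np.stack(W_list)                      # (n_heads_in_cluster, d_model, d_head)
    if omega is None:
        omega = np.zeros(len(W_list))
        omega[0] = 1.0                        # one-hot init: start from a single donor head
    return np.tensordot(omega, W, axes=1)     # sum_j omega_j * W^(j)

# Toy usage: fuse three key heads with learned-looking coefficients.
rng = np.random.default_rng(0)
cluster = [rng.standard_normal((16, 4)) for _ in range(3)]
W_fused = fuse_heads(cluster, omega=np.array([0.5, 0.3, 0.2]))
print(W_fused.shape)  # (16, 4)
```

In DHA, the coefficients are then co-optimized with the rest of the network under a Lagrangian penalty so that the fused head approximates the function of the original cluster.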

(b) Attention Score and Pattern Reuse

  • Score Copying: Some heads use attention scores computed by prior heads; e.g., in layer $l$, certain heads copy $A^{(l)}_h := A^{(l-1)}_{h'}$ (Bhojanapalli et al., 2021). See the sketch after this list.
  • Sparse Pattern Sharing: Patterns of sparsity (e.g., block masks covering the top-$\gamma$ attention mass) are computed for “donor” heads in clusters and shared with all other heads in that cluster. At inference, if $d_{\text{sim}} < \tau$, the head adopts the pivotal sparse pattern, reducing attention block computation (Peng et al., 26 May 2025).
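A minimal sketch of score copying in a single attention head, assuming the previous layer's attention probabilities are passed in; this is an illustrative reimplementation, not the Reuse Transformer code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(x, Wq, Wk, Wv, reused_probs=None):
    """One attention head; if reused_probs is given, skip the Q/K score
    computation entirely and reuse the previous layer's attention map."""
    if reused_probs is None:
        q, k = x @ Wq, x @ Wk
        probs = softmax(q @ k.T / np.sqrt(Wq.shape[1]))   # (n, n)
    else:
        probs = reused_probs                               # A^(l)_h := A^(l-1)_h'
    return probs @ (x @ Wv), probs

# Toy usage: a head in layer l reuses the attention map produced in layer l-1.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 4)) for _ in range(3))
_, probs_prev = head_attention(x, Wq, Wk, Wv)              # "donor" head, layer l-1
out, _ = head_attention(x, Wq, Wk, Wv, reused_probs=probs_prev)
print(out.shape)  # (5, 4)
```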

(c) Mask-based Pruning

  • L₀ Mask Training: Heads are gradually pruned by introducing trainable binary masks $z_{k,j}$; as $z_{k,j} \to 0$, the head’s projection is replaced by the mean of the group. The process continues until all but one head per group is masked (i.e., per-group sharing as in GQA) (Jin et al., 30 Dec 2024). A minimal sketch of this interpolation appears below.
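The sketch below shows the mask-gated interpolation between a head's own projection and the group mean; it assumes the gates z are produced elsewhere (e.g., by a hard-concrete L0 relaxation, which is omitted here), and the names are illustrative:

```python
import numpy as np

def masked_group_projection(W_group, z):
    """Interpolate each head's projection toward the group mean as its gate -> 0.

    W_group : (n_heads, d_model, d_head) projections of one sharing group.
    z       : (n_heads,) gates in [0, 1]; z=1 keeps the original head,
              z=0 replaces it with the shared (mean) projection.
    """
    W_mean = W_group.mean(axis=0, keepdims=True)      # shared group projection
    z = z.reshape(-1, 1, 1)
    return z * W_group + (1.0 - z) * W_mean

# Toy usage: after training, all but one gate per group has closed (GQA-style sharing).
rng = np.random.default_rng(0)
W_group = rng.standard_normal((4, 16, 4))
z_final = np.array([1.0, 0.0, 0.0, 0.0])
W_shared = masked_group_projection(W_group, z_final)
```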

4. End-to-End Transformation Procedures

A canonical similarity-guided per-head reuse pipeline consists of the following steps (a condensed code sketch follows the list):

  1. Similarity Measurement: Compute an $H \times H$ similarity matrix using the chosen metric (CKA, cosine, TV, JS, etc.) on projection parameters or cached activations.
  2. Head Grouping/Pairing: Partition heads per layer into groups or pairs for sharing, respecting global budgets or minimizing fusion-loss.
  3. Parameter Alignment (Optional): Align heads within each group to a shared subspace via Procrustes analysis if head merging will occur, especially for cache or KV reuse (Jin et al., 30 Dec 2024).
  4. Weight Fusion/Sharing: Fuse group members' parameters or assign full sharing hooks, possibly via a progressive or staged transformation with retraining or mask annealing.
  5. Retraining/Continued Pretraining: Fine-tune the model post-sharing (often for substantially less than 1% of original pretraining tokens) to recover any lost accuracy (Chen et al., 3 Jun 2024, Cao et al., 19 Feb 2024).
  6. Inference Mapping: At inference, both members of a sharing group index into the same weights, donor pattern, or merged cache entry.
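The following condensed sketch covers steps 1–4 for a single layer: heads are paired greedily by weight cosine similarity and each pair is merged after an orthogonal Procrustes alignment. It is a schematic composite of the cited methods under simplifying assumptions, not any one paper's implementation:

```python
import numpy as np

def cosine_matrix(W):
    """Step 1: (H, H) cosine-similarity matrix over flattened per-head weights."""
    F = W.reshape(W.shape[0], -1)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T

def greedy_pairs(S, budget):
    """Step 2: pick the `budget` most similar disjoint head pairs."""
    H = S.shape[0]
    order = sorted(((S[i, j], i, j) for i in range(H) for j in range(i + 1, H)), reverse=True)
    used, pairs = set(), []
    for s, i, j in order:
        if len(pairs) == budget:
            break
        if i not in used and j not in used:
            pairs.append((i, j)); used.update((i, j))
    return pairs

def align_and_merge(Wi, Wj):
    """Steps 3-4: rotate Wj onto Wi via orthogonal Procrustes, then average."""
    U, _, Vt = np.linalg.svd(Wj.T @ Wi)
    R = U @ Vt                                  # argmin over orthogonal R of ||Wj R - Wi||_F
    return 0.5 * (Wi + Wj @ R)

# Toy usage on 8 random key heads (16x4 each), merging the 2 most similar pairs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16, 4))
S = cosine_matrix(W)
for i, j in greedy_pairs(S, budget=2):
    W[i] = W[j] = align_and_merge(W[i], W[j])   # step 6: both heads index the same weights
```

Retraining or mask annealing (step 5) would follow this transformation to recover any lost accuracy.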

Notably, methods such as DHA (Chen et al., 3 Jun 2024) and PostShare (Cao et al., 19 Feb 2024) emphasize preserving representational function throughout the transformation. Sparse Pattern Sharing (Peng et al., 26 May 2025) maintains a dynamic global dictionary of cluster-wise donor patterns for efficient pattern distribution.

5. Empirical Outcomes and Efficiency-Performance Tradeoffs

Similarity-guided per-head reuse achieves significant resource reductions with minor loss in model quality, provided redundancy is carefully measured and mitigated. Key empirical results include:

| Method | Typical Head Budget | Performance Recovery | Memory/Compute Savings | Unique Properties |
| --- | --- | --- | --- | --- |
| PostShare (Cao et al., 19 Feb 2024) | 30% (sharing ratio) | ~87.5% of NLU/QA base; +8–12 pts vs. naive sharing | Linear in sharing ratio (QKV) | Pairwise cosine similarity, post-training, supports LLaMA 13B |
| DHA (Chen et al., 3 Jun 2024) | 25%–50% | 97.6% downstream on LLMs (> GQA) | 75% KV cache; 0.25% of pretraining budget | CKA clustering, linear fusion, adaptive K/V group sizes |
| Head Alignment + GQA (Jin et al., 30 Dec 2024) | 12.5%–50% | 0.2–1.7% higher at high compression | Up to 87.5% KV cache | Generalized Procrustes alignment before mask pruning |
| Reuse Transformer (Bhojanapalli et al., 2021) | K = 4–12, P = 6–12 | MT/GLUE/ViT nearly baseline; slight ΔBLEU | 10–20% compute, 6–18% memory | Static assignment; per-layer & per-head schedules |
| SharePrefill (Peng et al., 26 May 2025) | < H/16 donor heads | Perplexity & accuracy within 1.0 | 2–5× latency reduction | Pattern clustering; blockwise pattern sharing |

For instance, PostShare at $\gamma = 0.3$ on Llama 2-7B recovers 87.5% of base task accuracy after retraining, compared to ≈79% for direct sharing. Even at 50% sharing on GPT-2-small with no retraining, only a 4-point BLEU drop is observed, indicating that the approach also transfers to small models (Cao et al., 19 Feb 2024). DHA recovers >97% of full-model performance on challenging LLM tasks with a 75% head-group reduction and uses only 0.25% of the original pre-training budget (Chen et al., 3 Jun 2024). Alignment-based GQA conversion yields up to 1.7% absolute improvement over naive pooling at 75% compression, especially when grouping by value-cache similarity and pre-aligning via Procrustes (Jin et al., 30 Dec 2024). In long-context inference, SharePrefill achieves a 2–5× speedup with only 1–4 donor heads per layer and minimal accuracy loss (Peng et al., 26 May 2025).

6. Model Variants, Limitations, and Practical Considerations

Variants span head-pairing, group-based fusion (linear or via pooling), full-score copying, and pattern sharing. Practical insights and limitations include:

  • No universal clustering: Grouping is typically per-layer, with no cross-layer sharing (Cao et al., 19 Feb 2024).
  • Retraining sensitivity: Post-sharing fine-tuning must be carefully scheduled; downstream tasks peak at different retraining steps, requiring tuning of $\lambda$ and epoch count to avoid overfitting (Cao et al., 19 Feb 2024).
  • Static vs. dynamic schedules: Static selection of reuse heads (by index or position) is simpler but less adaptive than cluster- or similarity-guided strategies (Bhojanapalli et al., 2021).
  • Feed-forward redundancy untapped: Most work focuses on MHA blocks. FFN sublayers dominate parameter count but have not been systematically studied for similarity-guided sharing (Cao et al., 19 Feb 2024).
  • KV alignment for RoPE: For compatibility with rotary position embedding, orthogonal group alignment is performed in 2×2 blocks, preserving functional equivalence (Jin et al., 30 Dec 2024).
  • Cache/activation similarity preferred: Measure similarity on activation caches rather than raw weights to capture true functional proximity (Jin et al., 30 Dec 2024); see the sketch after this list.
  • Dynamic adaptivity: Future extensions could enable on-the-fly, per-example reuse decisions, but most current methods operate with statically assigned sharing groups (Bhojanapalli et al., 2021).
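A minimal sketch of cache-based similarity measurement as recommended above, assuming per-head value-cache activations have already been collected on a small calibration set (the collection hook is model-specific and omitted):

```python
import numpy as np

def cache_cosine_similarity(value_cache):
    """Pairwise cosine similarity between per-head value caches.

    value_cache : (H, T, d_head) value activations gathered on calibration
                  prompts; heads are compared on what they actually emit,
                  not on their raw weights.
    """
    F = value_cache.reshape(value_cache.shape[0], -1)   # flatten time x dim per head
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T                                      # (H, H) similarity matrix

# Toy usage: 8 heads, 32 cached positions of dimension 4.
rng = np.random.default_rng(0)
S = cache_cosine_similarity(rng.standard_normal((8, 32, 4)))
print(S.shape)  # (8, 8)
```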

7. Extensions and Future Directions

Proposed extensions to similarity-guided per-head reuse include:

  • Dynamic adaptivity: on-the-fly, per-example reuse decisions at inference time, in place of statically assigned sharing groups (Bhojanapalli et al., 2021).
  • FFN sharing: extending similarity-guided sharing beyond MHA to feed-forward sublayers, which dominate parameter count but remain largely unexplored (Cao et al., 19 Feb 2024).
  • Cross-layer grouping: relaxing the per-layer restriction of current grouping schemes (Cao et al., 19 Feb 2024).

These developments affirm the central role of measured similarity in uncovering functional redundancy and guiding efficient architectural transformations in attention-based models.
