Similarity-Guided Per-Head Reuse
- The paper introduces a method that identifies and reuses similar attention heads in multi-head architectures to reduce computational and memory costs.
- It employs similarity metrics like cosine similarity, CKA, and total variation to cluster or pair heads for strategies such as linear fusion and orthonormal alignment.
- Empirical results demonstrate significant efficiency gains in LLMs, achieving up to 75% KV cache savings with only a minor accuracy drop on tasks such as NLU and QA.
Similarity-guided per-head reuse refers to a family of methods that exploit inter-head redundancy in multi-head attention (MHA) architectures by identifying and reusing similar heads, thereby reducing parameter, memory, or computational cost. Techniques under this paradigm employ similarity metrics and clustering or pairing strategies to determine which heads (or their parameters, activations, or attention patterns) can be efficiently shared or merged. Recent research demonstrates that despite the functional diversity of attention heads, systematic similarities exist and can be leveraged for both compression and acceleration in LLMs.
1. Motivations and Redundancy Analysis
Transformers with MHA contain numerous attention heads per layer, designed to capture diverse token interactions. However, empirical analyses reveal substantial redundancy both within and across layers. For example, in BERT and ViT architectures, the best-matched heads across adjacent layers show high similarity in the total variation distance between their attention distributions, and even the 5th-best matches remain substantially similar (Bhojanapalli et al., 2021). Similarly, experiments in LLMs show that value and key head weights, as well as their induced attention patterns, cluster into a small number of functional types (Chen et al., 3 Jun 2024, Peng et al., 26 May 2025).
Redundancy manifests as:
- Overlapping attention distributions across heads, with slow decay of similarity with respect to layer distance (Bhojanapalli et al., 2021).
- Highly similar projection matrices among heads, measurable with CKA or cosine similarity (Chen et al., 3 Jun 2024, Cao et al., 19 Feb 2024).
- Consistent block-level sparsity patterns in attention maps, invariant to diverse inputs (Peng et al., 26 May 2025).
- Functional equivalence among groups of heads after orthogonal alignment (Jin et al., 30 Dec 2024).
These findings motivate methods that systematically share, fuse, or erase redundancy at the head-level, guided by measured similarity.
2. Similarity Metrics and Grouping Strategies
Similarity-guided per-head reuse depends crucially on selecting an appropriate metric, as the criterion underpins grouping, sharing, or merging.
- Cosine similarity of concatenated Q/K projection weights (Cao et al., 19 Feb 2024), used for identifying head pairs most amenable to weight sharing. Ablations show that cosine similarity on the concatenated Q/K weights outperforms metrics based solely on Q, K, or V for attention head reuse.
- Centered Kernel Alignment (CKA) (Chen et al., 3 Jun 2024), which provides a robust measure of functional similarity between projection matrices: $\mathrm{CKA}(X, Y) = \frac{\|X^\top Y\|_F^2}{\|X^\top X\|_F \, \|Y^\top Y\|_F}$ for centered matrices $X$, $Y$.
- Total Variation (TV) distance between heads' attention distributions (Bhojanapalli et al., 2021): $d_{\mathrm{TV}}(p, q) = \tfrac{1}{2} \sum_i |p_i - q_i|$, computed per query position and averaged.
- Jensen–Shannon (JS) divergence between softmax-normalized, block-averaged attention patterns, used for attention map clustering (Peng et al., 26 May 2025).
- Cache-based Euclidean or cosine similarity on real activations during inference, which more faithfully represents functional overlap than weights alone (Jin et al., 30 Dec 2024).
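A minimal sketch of two of these metrics, assuming hypothetical per-head weight tensors `Wq`, `Wk` of shape `[num_heads, d_model, d_head]`; the exact preprocessing used in the cited works may differ:

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(Wq: torch.Tensor, Wk: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between heads, computed on the flattened
    concatenation of each head's Q and K projection weights.
    Wq, Wk: [num_heads, d_model, d_head] (hypothetical layout)."""
    flat = torch.cat([Wq.flatten(1), Wk.flatten(1)], dim=1)  # [H, 2*d_model*d_head]
    flat = F.normalize(flat, dim=1)
    return flat @ flat.T                                      # [H, H]

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two projection matrices X, Y: [d_model, d_head]."""
    X = X - X.mean(dim=0, keepdim=True)   # center columns
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (X.T @ Y).norm() ** 2           # Frobenius norms
    den = (X.T @ X).norm() * (Y.T @ Y).norm()
    return num / den

# Example: similarity over 8 random heads of width 64 in a 512-dim model.
H, d_model, d_head = 8, 512, 64
Wq, Wk = torch.randn(H, d_model, d_head), torch.randn(H, d_model, d_head)
S = cosine_sim_matrix(Wq, Wk)
cka_01 = linear_cka(Wq[0], Wq[1])
```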
Selection and assignment:
- Pairwise maximum similarity: Heads are paired with the most similar others, up to a user-defined sharing ratio (Cao et al., 19 Feb 2024).
- Clustering: Heads partitioned into groups maximizing intra-group similarity, e.g., via simulated annealing (Chen et al., 3 Jun 2024), hierarchical clustering (Peng et al., 26 May 2025), or value-cache distance minimization (Jin et al., 30 Dec 2024).
- Static schedule: Reused heads statically assigned by index or position (Bhojanapalli et al., 2021).
- Adaptive group allocation: Budgets for key and value head groups set via dynamic programming proportional to preservation of fusion loss (Chen et al., 3 Jun 2024).
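The pairwise maximum-similarity strategy above can be sketched with a simple greedy assignment; the greedy order and the exact definition of the sharing ratio here are illustrative assumptions, not the cited method's implementation:

```python
import torch

def greedy_pairing(S: torch.Tensor, sharing_ratio: float) -> dict:
    """Greedily pair the most similar heads until roughly `sharing_ratio`
    of heads participate in sharing. S: [H, H] similarity matrix.
    Returns {receiver_head: donor_head}."""
    H = S.shape[0]
    S = S.clone()
    S.fill_diagonal_(float("-inf"))          # a head cannot pair with itself
    num_pairs = int(sharing_ratio * H / 2)   # each pair involves two heads
    pairs, used = {}, set()
    for _ in range(num_pairs):
        # Exclude heads that are already part of a pair, then take the argmax.
        masked = S.clone()
        for h in used:
            masked[h, :] = float("-inf")
            masked[:, h] = float("-inf")
        idx = torch.argmax(masked)           # argmax over the flattened matrix
        i, j = divmod(idx.item(), H)
        if masked[i, j] == float("-inf"):
            break                            # no eligible pairs remain
        pairs[j] = i                         # head j reuses head i's weights
        used.update((i, j))
    return pairs

# Example usage with the similarity matrix S from the previous sketch:
# pairs = greedy_pairing(S, sharing_ratio=0.3)
```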
3. Per-Head Reuse Mechanisms in Model Architecture
The architectural adaptation for similarity-guided per-head reuse can be categorized as follows:
(a) Parameter-level Sharing and Merging
- Direct Sharing: Identified similar heads directly share all Q/K/V projection matrices (no retraining), as in DirectShare (Cao et al., 19 Feb 2024).
- Similarity-constrained Alignment and Retraining: PostShare adds a regularization term to the training objective that pulls the paired heads' Q/K projection weights toward each other, enforcing proximity before sharing. The shared weights are then indexed by both heads during inference (Cao et al., 19 Feb 2024).
- Linear Fusion: DHA fuses heads within a cluster into a single head via a learned linear combination of their weights, e.g. $W_K^{\text{fused}} = \sum_{i \in \mathcal{C}} \alpha_i W_K^{(i)}$ (analogously for $W_V$), with the fusion weights $\alpha_i$ initialized as one-hot and then co-optimized under a Lagrangian objective that minimizes the fusion loss (Chen et al., 3 Jun 2024).
- Orthonormal Alignment: Heads within a sharing group are first aligned in an orthogonal subspace (via (Generalized) Procrustes analysis) before averaging or merging to minimize loss from functional misalignment (Jin et al., 30 Dec 2024).
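A minimal sketch of orthonormal alignment followed by merging, for two hypothetical value projections `W_a`, `W_b` of shape `[d_model, d_head]`: the orthogonal Procrustes solution comes from an SVD, and the fusion here is a fixed 50/50 average, whereas the cited methods learn or anneal the mixing weights:

```python
import torch

def orthogonal_procrustes(W_src: torch.Tensor, W_ref: torch.Tensor) -> torch.Tensor:
    """Return W_src rotated by the orthogonal matrix R (d_head x d_head)
    that minimizes ||W_src @ R - W_ref||_F."""
    # Classic solution: R = U @ Vh from the SVD of W_src^T @ W_ref.
    U, _, Vh = torch.linalg.svd(W_src.T @ W_ref)
    return W_src @ (U @ Vh)

def fuse_heads(weights, alphas):
    """Linear fusion of aligned per-head weights with mixing coefficients."""
    return sum(a * W for a, W in zip(alphas, weights))

# Example: align head B's value projection to head A, then average.
d_model, d_head = 512, 64
W_a, W_b = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
W_b_aligned = orthogonal_procrustes(W_b, W_a)
W_fused = fuse_heads([W_a, W_b_aligned], alphas=[0.5, 0.5])
```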
(b) Attention Score and Pattern Reuse
- Score Copying: Some heads reuse attention scores computed by earlier heads; e.g., certain heads in a layer copy the attention scores of heads from the preceding layer instead of recomputing them (Bhojanapalli et al., 2021).
- Sparse Pattern Sharing: Sparsity patterns (e.g., block masks covering most of the attention mass) are computed for "donor" heads in each cluster and shared with all other heads in that cluster. At inference, the remaining heads in a cluster adopt the cluster's pivotal sparse pattern, reducing attention block computation (Peng et al., 26 May 2025).
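A sketch of blockwise pattern sharing under simple assumptions: attention maps are averaged into blocks, a donor head's mask keeping the highest-mass blocks is computed once, and other heads in the cluster reuse it. The block size, coverage fraction, and function names are illustrative, not the cited work's settings:

```python
import torch

def block_average(attn: torch.Tensor, block: int) -> torch.Tensor:
    """Average an attention map [L, L] into [L//block, L//block] blocks."""
    nb = attn.shape[0] // block
    return attn[: nb * block, : nb * block].reshape(nb, block, nb, block).mean(dim=(1, 3))

def top_mass_block_mask(block_attn: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of blocks covering a fraction p of total attention mass."""
    flat = block_attn.flatten()
    order = torch.argsort(flat, descending=True)
    csum = torch.cumsum(flat[order], dim=0) / flat.sum()
    keep = order[: int((csum < p).sum().item()) + 1]
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[keep] = True
    return mask.reshape(block_attn.shape)

# Donor head computes its mask once; other heads in the cluster reuse it,
# restricting their attention computation to blocks where the mask is True.
L, block = 256, 32
donor_attn = torch.softmax(torch.randn(L, L), dim=-1)
donor_mask = top_mass_block_mask(block_average(donor_attn, block))
```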
(c) Mask-based Pruning
- L₀ Mask Training: Heads are gradually pruned by introducing trainable binary masks; as a head's mask is driven toward zero, its projection is replaced by the mean of its group. The process continues until all but one head per group is masked (i.e., per-group sharing as in GQA) (Jin et al., 30 Dec 2024).
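A sketch of the mask-based interpolation idea, where a relaxed gate in (0, 1) blends a head's own value projection with the group mean; a plain sigmoid-parameterized scalar gate is used here for simplicity rather than the hard-concrete L₀ gate of the cited work:

```python
import torch
import torch.nn as nn

class GatedGroupHead(nn.Module):
    """Value projection that interpolates between the head's own weights
    and the group-mean weights as the gate is driven toward zero."""
    def __init__(self, W_head: torch.Tensor, W_group_mean: torch.Tensor):
        super().__init__()
        self.W_head = nn.Parameter(W_head.clone())
        self.register_buffer("W_group_mean", W_group_mean.clone())
        self.gate_logit = nn.Parameter(torch.tensor(2.0))  # starts near gate ≈ 0.88

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = torch.sigmoid(self.gate_logit)                  # relaxed mask in (0, 1)
        W = m * self.W_head + (1.0 - m) * self.W_group_mean
        return x @ W

# Training would add a sparsity penalty on the gate (e.g. an L0 surrogate) so
# that gates collapse to 0 and all but one head per group end up sharing weights.
```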
4. End-to-End Transformation Procedures
A canonical similarity-guided per-head reuse pipeline consists of the following:
- Similarity Measurement: Compute similarity matrix using chosen metric (CKA, cosine, TV, JS, etc.) on projection parameters or cache activations.
- Head Grouping/Pairing: Partition heads per layer into groups or pairs for sharing, respecting global budgets or minimizing fusion-loss.
- Parameter Alignment (Optional): Align heads within each group to a shared subspace via Procrustes analysis if head merging will occur, especially for cache or KV reuse (Jin et al., 30 Dec 2024).
- Weight Fusion/Sharing: Fuse group members' parameters or assign full sharing hooks, possibly via a progressive or staged transformation with retraining or mask annealing.
- Retraining/Continued Pretraining: Fine-tune the model post-sharing (often for substantially less than 1% of original pretraining tokens) to recover any lost accuracy (Chen et al., 3 Jun 2024, Cao et al., 19 Feb 2024).
- Inference Mapping: At inference, both members of a sharing group index into the same weights, donor pattern, or merged cache entry.
Notably, methods such as DHA (Chen et al., 3 Jun 2024) and PostShare (Cao et al., 19 Feb 2024) emphasize preserving representational function throughout the transformation. Sparse Pattern Sharing (Peng et al., 26 May 2025) maintains a dynamic global dictionary of cluster-wise donor patterns for efficient pattern distribution.
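Putting the steps together, a high-level orchestration might look like the sketch below. The helper names (`cosine_sim_matrix`, `greedy_pairing`, `orthogonal_procrustes`, `fuse_heads`) refer to the illustrative functions sketched in earlier sections and are not part of any published implementation:

```python
import torch

def convert_layer_to_shared_heads(Wq, Wk, Wv, sharing_ratio=0.3):
    """Sketch of the per-layer pipeline: measure similarity, pair heads,
    align, fuse, and return a receiver -> shared-weight mapping."""
    S = cosine_sim_matrix(Wq, Wk)                  # 1. similarity measurement
    pairs = greedy_pairing(S, sharing_ratio)       # 2. head grouping/pairing
    shared = {}
    for receiver, donor in pairs.items():
        aligned = orthogonal_procrustes(Wv[receiver], Wv[donor])        # 3. alignment
        shared[receiver] = fuse_heads([Wv[donor], aligned], [0.5, 0.5]) # 4. fusion
    # 5. a short fine-tuning / continued-pretraining phase would follow here
    # 6. at inference, receiver heads index into shared[receiver] instead of Wv[receiver]
    return shared
```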
5. Empirical Outcomes and Efficiency-Performance Tradeoffs
Similarity-guided per-head reuse achieves significant resource reductions with minor loss in model quality, provided redundancy is carefully measured and mitigated. Key empirical results include:
| Method | Typical Head Budget | Performance Recovery | Memory/Compute Savings | Unique Properties |
|---|---|---|---|---|
| PostShare (Cao et al., 19 Feb 2024) | 30% (sharing ratio) | ~87.5% of NLU/QA base; +8–12 pts vs. naive sharing | Linear in sharing ratio (QKV) | Pairwise cosine-sim, post-training, supports LLAMA 13B |
| DHA (Chen et al., 3 Jun 2024) | 25%–50% | 97.6% of downstream performance; > GQA | 75% KV cache; 0.25% pretraining budget | CKA clustering, linear fusion, adaptive K/V group sizes |
| Head Alignment + GQA (Jin et al., 30 Dec 2024) | 12.5%–50% | Up to 1.7% higher than naive pooling at high compression | Up to 87.5% KV cache | Generalized Procrustes alignment before mask pruning |
| Reuse Transformer (Bhojanapalli et al., 2021) | — | ML/GLUE/ViT nearly baseline; slight BLEU drop | 10–20% compute, 6–18% memory | Static assignment; per-layer & per-head schedules |
| SharePrefill (Peng et al., 26 May 2025) | 1–4 donor heads per layer | Perplexity & accuracy within ~1 point | 2–5× latency reduction | Pattern clustering; blockwise pattern sharing |
For instance, PostShare at a 30% sharing ratio on Llama 2-7B recovers 87.5% of base task accuracy after retraining, compared to ≈79% for direct sharing. A roughly 4-point BLEU drop is observed for 50% sharing on GPT-2-small without any retraining, suggesting the approach also transfers to smaller models (Cao et al., 19 Feb 2024). DHA recovers >97% of full model performance on challenging LLM tasks with 75% head group reduction and uses only 0.25% of the original pre-training budget (Chen et al., 3 Jun 2024). Alignment-based GQA conversion yields up to 1.7% absolute improvement versus naive pooling at 75% compression, especially when grouping by value-cache similarity and pre-aligning by Procrustes (Jin et al., 30 Dec 2024). In long-context inference, SharePrefill achieves 2–5× speedup with only 1–4 donor heads per layer and minimal accuracy loss (Peng et al., 26 May 2025).
6. Model Variants, Limitations, and Practical Considerations
Variants span head-pairing, group-based fusion (linear or via pooling), full-score copying, and pattern sharing. Practical insights and limitations include:
- No universal clustering: Grouping is typically per-layer, with no cross-layer sharing (Cao et al., 19 Feb 2024).
- Retraining sensitivity: Post-sharing fine-tuning must be carefully scheduled; downstream tasks peak at different retraining steps, requiring tuning of the training hyperparameters (e.g., epoch count) to avoid overfitting (Cao et al., 19 Feb 2024).
- Static vs. dynamic schedules: Static selection of reuse heads (by index or position) is simpler but less adaptive than cluster- or similarity-guided strategies (Bhojanapalli et al., 2021).
- Feed-forward redundancy untapped: Most work focuses on MHA blocks. FFN sublayers dominate parameter count but have not been systematically studied for similarity-guided sharing (Cao et al., 19 Feb 2024).
- KV alignment for RoPE: For compatibility with rotary position embedding, orthogonal group alignment is performed in 2×2 blocks, preserving functional equivalence (Jin et al., 30 Dec 2024); see the sketch after this list.
- Cache/activation similarity preferred: Measure similarity on activation caches rather than raw weights to capture true functional proximity (Jin et al., 30 Dec 2024).
- Dynamic adaptivity: Future extensions could enable on-the-fly, per-example reuse decisions, but most current methods operate with statically assigned sharing groups (Bhojanapalli et al., 2021).
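On the RoPE point above, restricting alignment to 2×2 rotations keeps the alignment matrix commuting with RoPE's per-pair rotations. A minimal sketch that builds such a block-diagonal rotation from per-pair angles (the angle-fitting step is omitted and the construction is illustrative):

```python
import torch

def blockwise_rotation(angles: torch.Tensor) -> torch.Tensor:
    """Build a block-diagonal orthogonal matrix of 2x2 rotations, one per
    RoPE frequency pair. angles: [d_head // 2]."""
    d = angles.numel() * 2
    R = torch.zeros(d, d)
    c, s = torch.cos(angles), torch.sin(angles)
    for i in range(angles.numel()):
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c[i], -s[i]
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s[i], c[i]
    return R  # commutes with RoPE's own per-pair rotations

# Example: rotate a K projection with only 2x2 rotations, keeping the result
# compatible with rotary position embeddings.
d_head = 64
angles = torch.zeros(d_head // 2)   # in practice, fitted to minimize alignment error
W_k = torch.randn(512, d_head)
W_k_aligned = W_k @ blockwise_rotation(angles)
```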
7. Extensions and Future Directions
Proposed extensions to similarity-guided per-head reuse include:
- Learnable gating or routing: Adaptive gating networks can interpolate between fresh and reused heads per example (Bhojanapalli et al., 2021).
- Offline clustering for more expressive assignments: Hierarchical or distance-threshold clustering, possibly with knowledge distillation to guide group formation (Peng et al., 26 May 2025).
- Dynamic head budgets: Allocate sharing ratios per layer according to local fusion or alignment loss (Chen et al., 3 Jun 2024).
- Extending fusion/sharing to cross-attention blocks, FFNs, or multi-modal architectures.
- Integration with sparse attention and blockwise computation: As in SharePrefill, combining per-head pattern sharing with blockwise pruning for long-context LLMs (Peng et al., 26 May 2025).
These developments affirm the central role of measured similarity in uncovering functional redundancy and guiding efficient architectural transformations in attention-based models.