
Cross-Layer and Head-Wise Reuse in Transformers

Updated 23 February 2026
  • Cross-layer and head-wise reuse are strategies in transformer models that reuse redundant attention computations and key/value projections to enhance efficiency.
  • Techniques like Multi-Query Attention, Grouped-Query Attention, and dynamic head budgets reduce memory and compute demands while maintaining performance.
  • Empirical studies report up to 9× cache reduction and minimal accuracy drops, making these innovations critical for optimizing large language models.

Cross-layer and head-wise reuse refers to a range of architectural and algorithmic innovations in transformer models that aim to reduce redundancy and improve computational and memory efficiency by reusing certain computations. These techniques exploit empirical observations that attention patterns, key/value projections, and outputs are both highly redundant across adjacent layers (cross-layer) and often across heads within the same layer (head-wise). Recent advances leverage these observations in both forward computation of attention maps and efficient management of key-value caches used during inference.

1. Redundancy of Attention Across Layers and Heads

Transformer models, particularly in the context of LLMs, repeatedly compute similar attention patterns across multiple layers and heads. Systematic analyses using metrics such as total-variation similarity and Jensen-Shannon (JS) divergence reveal that layerwise attention maps and key/value projections are highly correlated: adjacent layers often exhibit cosine similarity near 1 for attention matrices and JS divergence below 0.05 for their distributions (Mu et al., 2024; Bhojanapalli et al., 2021). The same holds, to a lesser extent, across heads within a layer, particularly when head roles are similar or overlapping. This redundancy motivates strategies that share, align, or reuse attention-related computations to reduce resource demands.
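The JS divergence used in these redundancy analyses can be computed directly on rows of attention matrices. A minimal numpy sketch (function name and the example distributions are illustrative, not taken from the cited papers):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions,
    e.g. matching rows of attention matrices from adjacent layers.
    Values near 0 indicate near-identical attention patterns."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions yield a divergence of 0, and the measure is symmetric, which makes it convenient for the pairwise layer comparisons described above.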

2. Head-wise Reuse: MQA, GQA, and Dynamic Head Budgets

Head-wise reuse first emerged in attention block design with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). In standard multi-head attention (MHA), each query head has its own key/value projection. MQA collapses all key/value heads into a single shared set, while GQA partitions heads into G groups, each sharing key/value projections within the group. This reduces KV-cache memory and projection compute by a factor of H/G relative to MHA, where H is the number of query heads (Brandon et al., 2024). Head-wise reuse is also critical in dynamic budget allocation for cache compression: LAVa (Shen et al., 11 Sep 2025) introduces dynamic head budgets, ranking all cache entries across heads by a principled score derived from layer attention-output loss and retaining only the most informative entries, which is particularly important for extraction tasks.

| Method | K/V Sharing Granularity | Cache Reduction Factor | Notable Applications |
|---|---|---|---|
| MHA | None (per head) | 1× | Standard Transformers |
| MQA | All heads within a layer | H× | LLM decoding |
| GQA | Groups of heads | H/G× | LLM decoding |
| LAVa | Per-head, dynamic | Dynamic, ≈9× at 128k context | Extraction, cache eviction |
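The MHA/MQA/GQA spectrum above can be captured in one function parameterized by the number of KV heads. The following numpy sketch is illustrative (shapes and names are assumptions, not from the cited papers); setting `n_kv_heads == n_q_heads` recovers MHA, `n_kv_heads == 1` recovers MQA, and intermediate values give GQA:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Grouped-Query Attention sketch: n_q_heads query heads share
    n_kv_heads key/value heads, shrinking the KV cache by a factor
    of n_q_heads / n_kv_heads relative to standard MHA."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per KV head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # smaller KV cache
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # which shared KV head this query head reuses
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)
```

Note that only `k` and `v` shrink; the query projection and the attention computation itself are unchanged, which is why these methods primarily target cache memory rather than FLOPs.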

3. Cross-layer Reuse: Attention and KV Cache

Cross-layer reuse extends sharing from within a layer to across multiple layers. One prominent direction is Cross-Layer Attention (CLA) (Brandon et al., 2024), which reuses key/value (KV) projections across consecutive layers as opposed to recomputing them in every layer. This is typically structured by partitioning layers into groups of size s, with the first layer in each group producing KV embeddings and the rest sharing these representations. When combined with MQA or GQA, this delivers a multiplicative reduction in KV cache size (e.g., CLA2 + MQA yields 2H× reduction over baseline MHA).
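The grouping scheme described above amounts to a simple layer-to-layer sharing map. A minimal sketch (function names are illustrative):

```python
def cla_kv_source(n_layers, s):
    """Cross-Layer Attention (CLA) sharing schedule sketch: layers are
    partitioned into groups of size s; the first layer of each group
    computes and caches KV embeddings, and the remaining s-1 layers
    reuse them. Returns, for each layer, the index of the layer whose
    KV cache it reads."""
    return [(i // s) * s for i in range(n_layers)]

def kv_cache_layers(n_layers, s):
    """Only group-leader layers write to the KV cache, so total cache
    size shrinks by a factor of s (multiplied further by H when the
    leader layers themselves use MQA)."""
    return sorted(set(cla_kv_source(n_layers, s)))
```

For example, with 8 layers and s = 2, only layers 0, 2, 4, and 6 produce KV entries, halving the cache; stacking MQA on those layers gives the 2H× reduction cited above.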

A distinct direction is the reuse of attention maps themselves, as in Reuse Transformers (Bhojanapalli et al., 2021), which empirically validates the redundancy of layerwise attention scores and formalizes copying top-K attention maps from earlier layers for use as-is in subsequent layers' heads. LiSA (Mu et al., 2024) applies and generalizes this concept to attention matrices, adding alignment and low-rank correction to counteract deviations, especially in shallow layers.

| Approach | Cross-Layer Granularity | Empirical Accuracy Loss | Typical Reduction |
|---|---|---|---|
| CLA (KV sharing) | Groups of s layers (s = 2–4) | <0.1 PPL | 2× with s = 2 + MQA |
| Reuse Transformer | Selected heads per layer | None or slight gain | ≈P·K compute |
| LiSA | All but some layers | <1–3% | 53–84% FLOPs, 6× cache |

4. Alignment and Correction: Overcoming Head and Layer Sensitivity

Directly sharing attention maps or KV projections can break performance when semantic head roles differ across layers, or in early (shallow) layers, which are more sensitive to perturbation. LiSA (Mu et al., 2024) introduces a small learned feed-forward realignment network over the head dimension that softly permutes, rescales, and matches heads between consecutive layers. For residual differences, a low-rank correction attention map is synthesized by projecting inputs into a much lower-dimensional subspace (e.g., r = d_k / 6), incurring negligible additional cost while stabilizing shallow-layer outputs. Empirically, omitting this correction in shallow layers leads to severe degradation (>20-point accuracy drop). For purely feed-forward reuse, perfect head alignment can only be guaranteed if head ordering is fixed throughout training (as in the LiSA+ direct-sharing regime).

5. Cache Compression and Dynamic Budget Allocation

A major application of cross-layer and head-wise reuse is cache compression for LLM inference at long sequence lengths. Beyond static sharing, LAVa (Shen et al., 11 Sep 2025) formulates cache eviction as minimization of attention-output loss. It introduces a value-scaled attention-based score for each candidate K/V entry and uses cross-layer entropy of these scores to allocate cache budgets dynamically per layer (cross-layer budget), while also jointly ranking all candidates across heads within a layer (dynamic head budget). This yields a unified online framework that adapts to model and task context. Experimental findings indicate dynamic layer budgets are crucial for generation tasks, while dynamic head budgets are critical for extraction/RAG-style applications.
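The scoring-and-budgeting pipeline can be sketched in a few lines. The score and the entropy-based budget rule below are rough analogues in the spirit of LAVa, not its exact formulas; all names are illustrative.

```python
import numpy as np

def value_scaled_scores(attn, v):
    """Score each cached (head, position) entry by accumulated
    attention weight times the value vector's norm, so entries that
    contribute little to the attention output rank lowest.
    attn: (H, T_q, T_kv) weights; v: (H, T_kv, d_head) values."""
    v_norm = np.linalg.norm(v, axis=-1)   # (H, T_kv)
    return attn.sum(axis=1) * v_norm      # (H, T_kv)

def keep_top_entries(scores, budget):
    """Jointly rank all (head, position) entries within a layer and
    keep the top `budget` (the dynamic head budget idea)."""
    flat = scores.ravel()
    keep = np.argsort(flat)[::-1][:budget]
    mask = np.zeros_like(flat, dtype=bool)
    mask[keep] = True
    return mask.reshape(scores.shape)

def layer_budgets(layer_scores, total_budget):
    """Allocate per-layer budgets from the entropy of each layer's
    score distribution: higher entropy (less concentrated importance)
    earns a larger share of the total cache budget."""
    ents = []
    for s in layer_scores:
        p = s.ravel() / s.ravel().sum()
        ents.append(-(p * np.log(p + 1e-12)).sum())
    w = np.array(ents) / np.sum(ents)
    return np.round(w * total_budget).astype(int)
```

Because both ranking steps operate on scores already produced during attention, the eviction decision can run online during decoding without a separate calibration pass.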

6. Empirical Impact and Practical Trade-offs

Cross-layer and head-wise reuse delivers substantial savings in memory, compute, and even inference latency with negligible or minimal degradation in model quality. LiSA, for example, applied to LLaMA2-7B and LLaMA3-8B, compresses the Q/K projections by 6×, eliminates full attention computation in 53–84% of layers, and increases throughput by 19.5–32.3%, while retaining 97%+ downstream benchmark performance (Mu et al., 2024). CLA2+MQA halves inference KV cache size at <0.5% perplexity cost and Pareto-dominates previous MQA/GQA and standard architectures (Brandon et al., 2024). Reuse Transformers match or exceed baseline transformers in BERT, ViT, and T5, reducing FLOPs and parameters by 8–10% with identical or improved accuracy (Bhojanapalli et al., 2021). LAVa achieves ≈9× cache reduction at long contexts with minimal loss (Shen et al., 11 Sep 2025).

7. Extensions, Limitations, and Future Directions

Current cross-layer and head-wise reuse strategies are modular and can be applied as plug-ins to pre-trained models or combined with other efficiency techniques (e.g., MoE routing, multimodal fusion). Alignment and low-rank correction paradigms are especially promising for cross-modal transformers and mixture-of-experts settings. When training from scratch with fixed head order, more aggressive sharing becomes lossless, removing the need for realignment networks (Mu et al., 2024).

However, overly aggressive cross-layer sharing (e.g., s>2 in CLA) can degrade model capacity, especially on harder tasks. Shallow-layer sensitivity remains a challenge—early transformer layers are more fragile to shared representations and require careful correction. A plausible implication is that model-specific calibration of sharing parameters (e.g., fraction of layers, correction rank) is necessary to balance efficiency with adaptation to downstream distributional shifts.


Representative research includes "Cross-layer Attention Sharing for LLMs" (Mu et al., 2024), "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (Brandon et al., 2024), "LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation" (Shen et al., 11 Sep 2025), and "Leveraging redundancy in attention with Reuse Transformers" (Bhojanapalli et al., 2021). These works establish that cross-layer and head-wise reuse techniques constitute a principled and empirically validated axis of transformer efficiency improvement for large-scale deep learning systems.
