Modular Attention Reuse in Efficient Transformer Models
- Modular Attention Reuse is a framework that partitions, caches, or shares attention computations to reduce redundant operations and memory overhead in neural architectures.
- It leverages methods such as prompt KV caching, head/layer score reuse, and dynamic hybrid policies to significantly accelerate inference while maintaining accuracy.
- Empirical results demonstrate substantial gains, including up to 60× latency reduction in prompt encoding and 46% throughput improvement in long-context inference.
Modular attention reuse encompasses algorithmic and architectural techniques for reducing the computational and memory demands of attention mechanisms by partitioning, caching, or sharing attention computations at the module, layer, or expert level. This strategy has become foundational for efficient inference in LLMs, long-context transformers, efficient vision transformers, fast diffusion sampling, and parameter-efficient adaptation. Across these settings, modular attention reuse leverages empirical redundancy, frequent structural overlap, or explicit modularization in input or model structure to bypass redundant computation, enable memory-efficient sharing, or improve parameter efficiency without sacrificing accuracy.
1. Modular Attention Reuse in Prompt Encoding
Prompt Cache exemplifies modular attention reuse in LLMs. It precomputes attention key–value (KV) states for user-defined prompt “modules”, which are reusable blocks of text such as system prompts, templates, or factual documents, and caches them for direct reuse during inference. A schema written in Prompt Markup Language (PML) formally defines the modules, their parameters, and how they compose. At runtime, input prompts are marked up by module, and the server constructs the full KV cache for the prompt by retrieving and concatenating module-level KV states, then encoding only the uncached segments. This architecture requires only minor extensions to standard KV-caching infrastructure and yields substantial improvements in time-to-first-token (TTFT) latency, up to 8× for GPU inference and 60× for CPU inference on long-context tasks, without modifying model weights or degrading accuracy. Modules can contain parameters to permit flexible user customization; these are slotted in at prompt time. The method amortizes the quadratic cost of self-attention over recurring prompt segments, with memory overhead proportional to the KV state per token per module (Gim et al., 2023).
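The retrieve-and-concatenate step described above can be illustrated with a small toy. This is a minimal sketch, not the Prompt Cache implementation: `ModuleKVCache`, `toy_encode`, and the segment format are hypothetical names introduced here, the "KV state" is a placeholder pair of floats per token, and the sketch ignores position IDs and PML parsing, both of which a real system has to handle.

```python
from typing import Callable, Dict, List, Tuple

Token = str
KV = List[Tuple[float, float]]  # toy per-token (key, value) state


class ModuleKVCache:
    """Toy store of precomputed KV states, keyed by prompt-module name."""

    def __init__(self, encode: Callable[[List[Token]], KV]) -> None:
        self.encode = encode            # hypothetical stand-in for a model prefill pass
        self.store: Dict[str, KV] = {}
        self.tokens_encoded = 0         # counts tokens that were actually encoded

    def _encode(self, tokens: List[Token]) -> KV:
        self.tokens_encoded += len(tokens)
        return self.encode(tokens)

    def register(self, name: str, tokens: List[Token]) -> None:
        """Precompute and cache the KV states of one module."""
        self.store[name] = self._encode(tokens)

    def build(self, segments: List[Tuple[str, object]]) -> KV:
        """Assemble a prompt's KV cache from ("module", name) and ("text", tokens) segments."""
        kv: KV = []
        for kind, payload in segments:
            if kind == "module":
                kv.extend(self.store[payload])      # reuse cached states
            else:
                kv.extend(self._encode(payload))    # encode only the uncached text
        return kv


def toy_encode(tokens: List[Token]) -> KV:
    # Deterministic placeholder; a real system would run the transformer here.
    return [(float(len(t)), float(sum(map(ord, t)))) for t in tokens]


cache = ModuleKVCache(toy_encode)
cache.register("system", ["You", "are", "helpful"])
prompt_kv = cache.build([("module", "system"), ("text", ["Hi", "there"])])
```

Every later prompt that includes the `system` module pays only for its uncached text, which is where the amortization comes from.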
2. Layerwise and Headwise Attention Score Reuse
Empirical analysis of Transformers reveals that attention score matrices are strongly redundant across layers and heads, especially between adjacent layers. The Reuse Transformer exploits this observation by selectively reusing the attention scores of certain heads from previous layers, which avoids repeated computation and activation storage for those heads. A chosen proportion of heads in specified layers directly inherits previously computed attention maps and skips its own score projection and softmax calculation. The savings in compute and memory scale with the amount of reuse; for example, reusing half the heads in 10 of 12 layers in BERT-Base reduces both FLOPs and parameter count, with negligible impact on pretraining or downstream task accuracy. The approach generalizes to both language and vision models and offers a direct architectural route to exploiting intrinsic redundancy for efficiency (Bhojanapalli et al., 2021).
| Model | Reuse Type | Efficiency Gain | Accuracy Impact |
|---|---|---|---|
| Prompt Cache | Module KV states | 8× (GPU) to 60× (CPU) TTFT ↓ | pts worst |
| Reuse Transf. | Head/layer scores | $7$– FLOPs ↓ | typical |
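The head-level score reuse described in Section 2 can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Reuse Transformer code: `layer_with_reuse` and its dictionary-based head format are hypothetical, and the choice to reuse the first `num_reused` heads is arbitrary. Reused heads carry only a value matrix, since they never compute scores of their own.

```python
import math
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(q: Matrix, k: Matrix) -> Matrix:
    """Scaled dot-product attention probabilities for one head."""
    scale = math.sqrt(len(q[0]))
    return [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]


def apply_probs(probs: Matrix, v: Matrix) -> Matrix:
    """Weighted sum of value rows for every query position."""
    return [[sum(p * vj[c] for p, vj in zip(row, v)) for c in range(len(v[0]))]
            for row in probs]


def layer_with_reuse(heads: List[Dict[str, Matrix]],
                     prev_probs: Optional[List[Matrix]],
                     num_reused: int) -> Tuple[List[Matrix], List[Matrix]]:
    """The first `num_reused` heads inherit the previous layer's probabilities."""
    outputs, probs_out = [], []
    for h, head in enumerate(heads):
        if prev_probs is not None and h < num_reused:
            probs = prev_probs[h]                            # reuse: no q/k, no softmax
        else:
            probs = attention_probs(head["q"], head["k"])    # fresh computation
        probs_out.append(probs)
        outputs.append(apply_probs(probs, head["v"]))
    return outputs, probs_out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0], [2.0]]
layer1 = [{"q": q, "k": k, "v": v}, {"q": k, "k": q, "v": v}]
out1, probs1 = layer_with_reuse(layer1, None, 0)
layer2 = [{"v": [[3.0], [4.0]]}, {"q": q, "k": k, "v": v}]   # head 0 has no q/k at all
out2, probs2 = layer_with_reuse(layer2, probs1, 1)
```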
3. Dynamic Hybrid Layer Modular Attention in Long-Context LLMs
HyLRA introduces modular attention reuse at the layer level for efficient long-context inference. Through empirical sparsity profiling, HyLRA classifies each Transformer layer as “sensitive” (requiring fresh dense attention to prevent output distortion) or “tolerant” (its top-$k$ attention indices are highly similar to those of earlier layers, which enables index reuse). Using a dynamic programming policy computed offline, HyLRA interleaves full-compute and reuse-based (sparse) attention per layer. At inference, sensitive layers compute and store their top-$k$ indices, which tolerant layers then reuse for sparse attention without recalculating them. This hybrid approach yields up to 46% throughput improvement on long-context benchmarks while largely preserving accuracy, outperforming fixed-pattern sparse methods and supporting the principle that modular, policy-driven attention reuse can balance efficiency and fidelity (Ai et al., 31 Jan 2026).
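The interplay between dense layers that produce indices and later layers that consume them can be sketched for a single query vector. This is a minimal illustration, not the HyLRA implementation: the function names and the `"dense"`/`"reuse"` labels are hypothetical, and the per-layer policy is simply passed in rather than derived by the offline dynamic program.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec = List[float]


def _softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _scores(q: Vec, keys: Sequence[Vec]) -> List[float]:
    return [sum(a * b for a, b in zip(q, key)) / math.sqrt(len(q)) for key in keys]


def _mix(weights: Sequence[float], values: Sequence[Vec]) -> Vec:
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]


def dense_layer(q: Vec, keys: List[Vec], values: List[Vec], k: int) -> Tuple[Vec, List[int]]:
    """Full attention; also returns the indices of the k highest-scoring keys."""
    scores = _scores(q, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return _mix(_softmax(scores), values), sorted(top)


def reuse_layer(q: Vec, keys: List[Vec], values: List[Vec], idx: List[int]) -> Vec:
    """Sparse attention restricted to indices inherited from an earlier layer."""
    sub_keys = [keys[i] for i in idx]
    sub_vals = [values[i] for i in idx]
    return _mix(_softmax(_scores(q, sub_keys)), sub_vals)


def run(policy: List[str], layers: list, k: int) -> Tuple[List[Vec], Optional[List[int]]]:
    """policy: one "dense" or "reuse" label per layer (assumed to be given)."""
    idx, outs = None, []
    for kind, (q, keys, values) in zip(policy, layers):
        if kind == "dense" or idx is None:
            out, idx = dense_layer(q, keys, values, k)   # refresh the stored indices
        else:
            out = reuse_layer(q, keys, values, idx)      # skip index selection
        outs.append(out)
    return outs, idx


layer = ([1.0, 0.0],
         [[10.0, 0.0], [0.0, 1.0], [9.0, 0.0], [-5.0, 0.0]],
         [[1.0], [2.0], [3.0], [4.0]])
outs, idx = run(["dense", "reuse"], [layer, layer], k=2)
```

In this toy the two layers share identical inputs, so the sparse output stays very close to the dense one; in a real model the quality of reuse depends on how similar the top-$k$ sets of neighbouring layers actually are.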
4. Attention Map Reuse in Generative Diffusion Models
In diffusion-based generative models, the repeated computation of attention maps through the iterative sampling trajectory is both computationally expensive and empirically redundant. “Fast Sampling Through the Reuse Of Attention Maps In Diffusion Models” introduces strategies in which, at selected timesteps, cached attention maps are reused rather than freshly recomputed. ODE perturbation theory demonstrates that late-stage reuse minimizes image quality loss, motivating heuristics such as the “HURRY” policy (reuse in later steps) and PSNR-targeted policies (PHAST). These methods, which require neither model retraining nor distillation, deliver $20$– faster image generation at comparable or superior sample fidelity (measured by PSNR, CLIP, and FID) relative to few-step baselines. Ablations confirm that optimal performance is obtained by intelligently allocating reuse to steps with low ODE sensitivity, especially in late sampling (Hunter et al., 2023).
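A late-step reuse schedule of the kind described above can be sketched as a sampling loop that caches the most recent attention map. This is a toy under stated assumptions, not the paper's code: `late_reuse_schedule`, `sample`, `toy_map`, and `toy_step` are hypothetical, the "attention map" is a single float, and the schedule simply reuses during the final steps rather than reproducing the HURRY or PHAST policies exactly.

```python
from typing import Callable, List, Optional, Tuple


def late_reuse_schedule(num_steps: int, num_reuse: int) -> List[bool]:
    """Per-step flags: True = recompute the attention map, False = reuse the cache.

    Reuse is allocated to the final `num_reuse` sampling steps.
    """
    return [step < num_steps - num_reuse for step in range(num_steps)]


def sample(x: float,
           schedule: List[bool],
           compute_map: Callable[[float, int], float],
           denoise_step: Callable[[float, float, int], float]) -> Tuple[float, int]:
    """Run the sampler and report how many attention maps were computed."""
    cached: Optional[float] = None
    computed = 0
    for step, recompute in enumerate(schedule):
        if recompute or cached is None:      # the first step always has to compute
            cached = compute_map(x, step)
            computed += 1
        x = denoise_step(x, cached, step)    # the update uses fresh or cached attention
    return x, computed


def toy_map(x: float, step: int) -> float:
    # Placeholder for an attention map; a real model returns a tensor per layer.
    return 1.0 / (1.0 + abs(x))


def toy_step(x: float, attn: float, step: int) -> float:
    # Placeholder denoising update.
    return x - 0.1 * attn * x


schedule = late_reuse_schedule(num_steps=10, num_reuse=4)
final_x, maps_computed = sample(1.0, schedule, toy_map, toy_step)
```

Here six of the ten steps compute a map and the last four reuse it, so the attention cost drops while the number of sampling steps stays fixed.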
5. Modular Attention Structures for Parameter-Efficient Fine-Tuning
LoRA-Mixer demonstrates a modular, expert-based attention reuse paradigm in parameter-efficient transfer learning. It decomposes the attention projection matrices (query, key, value, and output) into frozen backbone weights plus a mixture of low-rank updates (LoRA experts), so that each token’s attention projection is adapted dynamically through a learned routing distribution over experts. During training, a soft router mixes all experts differentiably; at inference, a sparse top-$k$ mixture is selected for each input. This modular integration allows direct reuse of any number of pretrained LoRA adapters, with effective cross-model and cross-domain transfer and minimal fine-tuning overhead. The Specialization-Balance Loss encourages stable, data-efficient router adaptation and balanced expert load. LoRA-Mixer achieves significant parameter efficiency, using only a fraction of the adaptation parameters, while outperforming state-of-the-art modular adaptation baselines on a range of classification and reasoning benchmarks (Li et al., 17 Jun 2025).
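The routed projection described above has the general form $y = W_0 x + \sum_i g_i \, B_i A_i x$, with $W_0$ frozen and $(A_i, B_i)$ the low-rank factors of expert $i$. The sketch below illustrates that form for a single token. It is not the LoRA-Mixer implementation: `topk_gate` and `mixed_projection` are hypothetical names, the router logits are supplied by hand, and renormalizing the kept gates with a softmax is an assumption made here for simplicity.

```python
import math
from typing import List, Sequence, Tuple

Vec = List[float]
Mat = List[List[float]]


def matvec(m: Mat, v: Sequence[float]) -> Vec:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs: Sequence[float]) -> List[float]:
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gate(logits: Sequence[float], k: int) -> List[float]:
    """Keep the k largest router logits and renormalize them (an assumption here)."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    gates = [0.0] * len(logits)
    for i, p in zip(keep, probs):
        gates[i] = p
    return gates


def mixed_projection(x: Vec, w0: Mat, experts: List[Tuple[Mat, Mat]],
                     gates: Sequence[float]) -> Vec:
    """y = W0 x + sum_i g_i * B_i (A_i x), skipping experts with a zero gate."""
    y = matvec(w0, x)                         # frozen backbone projection
    for (a, b), g in zip(experts, gates):
        if g == 0.0:
            continue                          # sparse inference: expert not evaluated
        delta = matvec(b, matvec(a, x))       # rank-r update B (A x)
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y


x = [1.0, 2.0]
w0 = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0, 0.0]], [[1.0], [0.0]]),    # rank-1 expert acting on dimension 0
           ([[0.0, 1.0]], [[0.0], [1.0]])]    # rank-1 expert acting on dimension 1
sparse_y = mixed_projection(x, w0, experts, topk_gate([5.0, 1.0], k=1))
soft_y = mixed_projection(x, w0, experts, topk_gate([5.0, 1.0], k=2))
```

Because the backbone weights stay frozen and each expert is a separate pair of small matrices, previously trained adapters can be dropped into the `experts` list without touching `w0`.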
6. Shared and Grouped Attention for Efficient Vision Transformers
UniForm's Reuse Attention targets the memory and computational bottleneck of multi-head attention in vision transformers by consolidating the computation of the attention matrix across all heads within a block. Instead of a separate softmax attention matrix per head, a single matrix is computed and reused for all heads. Each head then applies its own value projection and an optional depthwise convolution before multiplying by the shared attention matrix. This modular grouping reduces memory movement compared to standard MHA, enabling real-time inference on edge devices with reduced latency and near parity in top-1 accuracy relative to baseline ViTs and even Flash Attention. Ablation studies confirm negligible accuracy loss from full sharing and highlight the flexibility of mixing grouping and multi-scale value projections (Yeom et al., 2024).
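The core of this sharing scheme, one attention matrix applied to several per-head value matrices, can be sketched as follows. This is an illustrative toy rather than UniForm's implementation: `reuse_attention` is a hypothetical name, the per-head value projections are assumed to have been applied already, and the optional depthwise convolution is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention_matrix(q: Matrix, k: Matrix) -> Matrix:
    """One scaled dot-product softmax matrix, computed once per block."""
    scale = math.sqrt(len(q[0]))
    return [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]


def reuse_attention(q: Matrix, k: Matrix, head_values: List[Matrix]) -> List[Matrix]:
    """Apply the single shared attention matrix to every head's value matrix."""
    probs = shared_attention_matrix(q, k)      # computed once, not once per head
    outputs = []
    for v in head_values:                      # only the values differ between heads
        outputs.append([[sum(p * vj[c] for p, vj in zip(row, v))
                         for c in range(len(v[0]))] for row in probs])
    return outputs


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_head0 = [[1.0], [2.0]]
v_head1 = [[2.0], [4.0]]                       # twice head 0, to make the sharing visible
head_outputs = reuse_attention(q, k, [v_head0, v_head1])
```

Since both heads use the same attention weights, doubling the values doubles the output exactly; the saving comes from storing and moving one attention matrix instead of one per head.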
7. Modular Co-Attention and Blockwise Reuse in Multimodal Architectures
In deep modular co-attention networks (MCAN) for vision–language reasoning, modular attention reuse takes the form of cascaded self-attention (SA) and guided-attention (GA) blocks organized as Modular Co-Attention (MCA) layers. These layers can be stacked serially (with independent parameters per block) or composed in encoder–decoder style, where the final question encoding is reused to modulate image attention through all downstream layers. Empirical results show that this modular reuse of self-attended representations yields steady accuracy gains—up to $1$– on VQA-v2—while allowing fine control of parameter count by varying depth. Modular design fosters transfer to other multimodal tasks and supports integration in both unimodal and multimodal settings (Yu et al., 2019).
In summary, modular attention reuse is a unifying framework for exploiting computational and representational redundancies across prompts, layers, attention heads, timesteps, or adaptation modules. Instantiations range from prompt-module KV caching for LLM serving (Gim et al., 2023), layer/head score reuse (Bhojanapalli et al., 2021), policy-driven hybrid sparse reuse (Ai et al., 31 Jan 2026), step-timed attention map caching in diffusion models (Hunter et al., 2023), expert modularization in adaptation (Li et al., 17 Jun 2025), and grouped matrix sharing in ViTs (Yeom et al., 2024) to serial attention block reuse in MCANs (Yu et al., 2019). Across these domains, modular attention reuse accelerates inference, reduces memory use, and supports parameter-efficient adaptation while preserving, and sometimes improving, task accuracy.