
Shared KV Attention Mechanisms

Updated 15 November 2025
  • Shared KV attention is a technique that reuses and compresses key/value tensors across tokens, layers, or requests to mitigate memory bottlenecks in transformer models.
  • It employs methods like cross-request batching, cross-layer sharing, and head compression to transform memory-bound operations into compute-efficient ones.
  • Empirical evaluations show significant throughput improvements and memory reductions in LLM inference with minimal accuracy trade-offs.

Shared Key-Value (KV) Attention refers to a class of memory- and bandwidth-efficient attention mechanisms for LLMs that reuse, compress, or otherwise share key and value tensors across tokens, requests, layers, or even models. These approaches address the principal memory bottleneck in standard transformer inference: the requirement to maintain a growing, per-layer, per-token KV-cache whose size and access cost scale linearly with the number of layers and the context length. By leveraging contextual or structural redundancy—through mechanisms such as shared contexts, chunking, cross-layer cache reuse, or dynamic routing—shared KV attention methods transform memory-bound operations into compute-bound or batch-parallel operations, often yielding dramatic improvements in throughput and scalability.

1. Motivation and Foundational Problem

Autoregressive LLM inference requires storing all previously computed key (K) and value (V) projections for every input token at every attention layer. For a sequence of length $L$, hidden dimension $d$, number of layers $H$, and batch size $B$, the total KV-cache memory is $O(B \cdot L \cdot H \cdot d)$. As context and batch size increase, this leads to:

  • Memory bottleneck: The KV-cache may dominate overall GPU memory use, restricting supported batch size or input length.
  • Bandwidth inefficiency: Standard per-step decoding performs memory-bound GEMV operations for each new token, leaving GPU tensor cores under-utilized.
  • Redundancy: In many production settings (multi-tenant LLM services, few-shot prompting, multi-step reasoning, multi-agent workflows), large portions of the context—such as prompts, prefixes, or background documents—are shared identically across many requests or generation branches.
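
Plugging illustrative numbers into the $O(B \cdot L \cdot H \cdot d)$ bound makes the bottleneck concrete. The settings below (batch 8, 32K context, 32 layers, hidden dimension 4096, fp16) are assumed for the sake of the estimate and are not taken from any of the cited papers:

B, L, H, d = 8, 32_768, 32, 4096                      # batch size, context length, layers, hidden dim (assumed)
bytes_per_elem = 2                                    # fp16/bf16 storage
kv_cache_bytes = 2 * B * L * H * d * bytes_per_elem   # factor 2: one K and one V vector per token per layer
print(f"KV cache: {kv_cache_bytes / 2**30:.0f} GiB")  # -> 128 GiB at these settings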

The foundational opportunity exploited by shared KV attention is that when multiple queries attend to the same context (identical $K, V$), the attention computation $\mathrm{softmax}(QK^\top)V$ can be batched or shared, potentially turning many separate memory-bound vector operations into fewer or single compute-bound matrix multiplications.
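
As a minimal illustration of this effect (a NumPy sketch of the general idea, not any paper's kernel), the per-request GEMVs against a shared context collapse into a single GEMM over the stacked queries:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d, L, n = 64, 1024, 32                                 # head dim, shared-context length, concurrent requests
K, V = np.random.randn(L, d), np.random.randn(L, d)    # identical K, V for every request
Q = np.random.randn(n, d)                              # one decode-step query per request

# Per-request path: n memory-bound GEMVs, each re-reading K and V from memory.
out_gemv = np.stack([softmax(q @ K.T / np.sqrt(d)) @ V for q in Q])

# Shared-KV path: one compute-bound GEMM over the batched queries.
out_gemm = softmax(Q @ K.T / np.sqrt(d)) @ V

assert np.allclose(out_gemv, out_gemm)                 # identical results, far better arithmetic intensity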

2. Mechanisms and Architectural Variants

Shared KV attention mechanisms span a spectrum depending on the unit and dimension of sharing:

A. Cross-Request and Prefix Sharing

  • MoSKA (Rhee et al., 8 Nov 2025): Batches all concurrent queries whose prefix (e.g., system prompt, document chunk) is identical and processes their attention to the shared segment via a single GEMM. The unique context per request is processed independently.
  • ChunkAttention (Ye et al., 23 Feb 2024): Segments K/V into fixed-size chunks along the sequence, indexes chunks in a trie to efficiently share prefix chunks across multiple sequences, and runs chunk-wise attention where possible; a minimal trie sketch follows this list.
  • DeFT (Yao et al., 30 Mar 2024): For tree-structured decoding (e.g., speculative, beam, multi-branch inference), identifies and groups shared prefixes, fusing all queries that attend to a given shared KV block and executing attention only once per KV.
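
A minimal sketch of the chunk-trie idea (an assumed data layout for illustration, not ChunkAttention's actual implementation): fixed-size chunks of token IDs key a trie whose nodes hold cached K/V blocks, so requests with a common prefix resolve to, and batch over, the same blocks.

from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    kv_block: object = None                        # cached (K, V) tensors for this chunk, filled on first use
    children: dict = field(default_factory=dict)   # maps the next chunk (tuple of token IDs) to a ChunkNode

def lookup_or_insert(root: ChunkNode, token_ids, chunk_size: int = 64):
    """Walk the trie chunk by chunk; shared prefixes resolve to the same nodes and hence the same KV blocks."""
    node, blocks = root, []
    for i in range(0, len(token_ids), chunk_size):
        chunk = tuple(token_ids[i:i + chunk_size])
        node = node.children.setdefault(chunk, ChunkNode())
        blocks.append(node)                        # attention over these shared blocks can be batched across requests
    return blocks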

B. Cross-Layer and Model-Internal Sharing

  • Shared Attention (SA) (Liao et al., 13 Jul 2024): Shares the entire attention matrix $A$ (not just K/V) across a block of consecutive layers, exploiting the empirical isotropy of per-layer attention matrices in the mid-to-late transformer layers.
  • KVSharer (Yang et al., 24 Oct 2024): Shares complete K/V caches between dissimilar layers, as determined by their average cache representations, reducing per-token KV memory by grouping layers without retraining.
  • Cross-Layer Attention (CLA) (Brandon et al., 21 May 2024): Reuses the K/V projections of the first layer in each group of $s$ consecutive layers across all layers in that group, greatly reducing the number of distinct K/V caches; a minimal sketch follows this list.
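
A minimal runnable sketch of CLA-style sharing (toy single-head layers, no masking or MLP; the class and interface are illustrative assumptions): only the first layer of each group of $s$ layers projects and caches K/V, and the remaining layers in the group read the same buffers.

import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Toy attention-only decoder layer used purely for illustration."""
    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

def forward_with_cla(layers, x, s=2):
    """Cross-layer KV sharing: one K/V projection per group of s consecutive layers."""
    kv_cache, d = {}, x.size(-1)
    for i, layer in enumerate(layers):
        group = i // s
        if i % s == 0:
            kv_cache[group] = (layer.k_proj(x), layer.v_proj(x))   # project and store K/V once per group
        k, v = kv_cache[group]                                      # later layers in the group reuse the buffers
        attn = torch.softmax(layer.q_proj(x) @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        x = attn @ v
    return x

out = forward_with_cla([ToyLayer(64) for _ in range(4)], torch.randn(2, 16, 64), s=2)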

C. KV Head Compression and MoE Routing

  • Grouped/Multi-Query Attention (Yu et al., 11 Jun 2024, Song et al., 16 Jun 2025): Reduces KV cache size by grouping or sharing K/V heads across multiple query heads; a minimal grouping sketch follows this list. MixSGA (Song et al., 16 Jun 2025) further introduces token-level dynamic MoE routing to allocate fine or coarse KV detail adaptively according to token importance.
  • Multi-matrix Factorization Attention (MFA / MFA-KR) (Hu et al., 26 Dec 2024): Employs low-rank factorization of the Q and K projections; the MFA-KR variant additionally reuses keys, reconstructing the value cache from the key cache via a learned linear transformation and sharing buffers at the basis level.
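
A minimal GQA-style grouping sketch (hypothetical shapes, PyTorch for illustration): with $h$ query heads and $g < h$ cached K/V heads, each K/V head is broadcast to $h/g$ query heads, shrinking the per-token cache by a factor of $h/g$.

import torch
import torch.nn.functional as F

B, T, h, g, d_head = 2, 128, 16, 4, 64         # assumed shapes; g K/V heads serve h query heads
q = torch.randn(B, h, T, d_head)
k = torch.randn(B, g, T, d_head)               # only g key heads are ever cached
v = torch.randn(B, g, T, d_head)               # only g value heads are ever cached

# Broadcast each cached K/V head to its group of h // g query heads, then run standard attention.
k_exp = k.repeat_interleave(h // g, dim=1)
v_exp = v.repeat_interleave(h // g, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp)   # shape (B, h, T, d_head)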

D. Shared KV in Multi-Agent and Communication Settings

  • KVComm (Shi et al., 2 Oct 2025): Shares selected layer-wise KV pairs between LLMs in multi-agent protocols, identifying the most informative layers for sharing via attention-mass scoring and a Gaussian prior over layer depth.

3. Mathematical Underpinnings

All shared KV attention mechanisms exploit the linear algebraic structure of attention:

$$\text{Standard:}\qquad \text{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

  • Batching Over Queries: When $n$ queries share the same $K, V$, perform $Q_{\text{batch}} K^\top$ as a single GEMM instead of $n$ separate GEMVs, then apply the softmax row-wise.
  • KV Head Grouping: With $g \ll h$ key/value heads (GQA), per-token cache size shrinks proportionally, since all query heads in a group share the same K/V head.
  • Cross-Layer Sharing: If adjacent (or selected) layers share K/V (or even $A$), the number of stored K/V caches per token is reduced by the group/sharing factor $s$; a worked cache-size estimate follows this list.
  • MoE Token Routing: Each token is routed to a KV group (expert) that may use a coarser KV representation (fewer unique heads) for less important tokens, while sharing projection parameters to avoid parameter explosion.
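
As a worked estimate combining the head-grouping and cross-layer factors (splitting $d$ into $h$ heads of size $d_h$ is a bookkeeping assumption, not notation taken from a specific paper):

$$\underbrace{2\,B\,L\,H\,h\,d_h}_{\text{baseline K and V cache}} \;\longrightarrow\; 2\,B\,L\cdot\frac{H}{s}\cdot g\,d_h \;=\; \frac{g}{h}\cdot\frac{1}{s}\times(\text{baseline})$$

For example, $g/h = 1/4$ with $s = 2$ shrinks the KV cache eightfold before any quantization or token-level routing is applied.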

4. Implementation Frameworks and Algorithms

A representative shared KV attention framework typically involves:

| Step | MoSKA / ChunkAttention | SA / CLA / KVSharer | mixSGA / Grouped KV |
| --- | --- | --- | --- |
| 1. Identify sharing | Detect shared prefix or chunk alignments | Partition layers into groups | Score tokens for grouping |
| 2. Routing | MoE top-$k$ or trie walk | Static layer-to-group assignment | Per-token router network |
| 3. Share/batch | Batch queries per shared block, run a fused GEMM | Share the K/V (or $A$) buffer per group | Group K/V by head |
| 4. Compute output | $A_\text{batch} = \mathrm{softmax}(QK^\top)V$ | Use shared K/V (or $A$) within the group | Fused attention |
| 5. Restore unique path | Unique Q follows per-request path | Exit to unique layers as needed | Scatter outputs by routing |

Pseudocode for MoSKA shared KV batching:

for i in range(B):                                        # route each request's query to its top-k shared contexts
    idx[i] = topk(scorer(Q[i], E), k)                     # E: descriptors of the shared-context blocks/experts
for j in range(M):                                        # for each shared KV block j
    members = [i for i in range(B) if j in idx[i]]        # requests whose routing selected block j
    Q_batch = stack([Q[i] for i in members])              # batch all queries that attend to the shared K_j, V_j
    A_batch = softmax(Q_batch @ K[j].T / sqrt(d)) @ V[j]  # one GEMM replaces many per-request GEMVs
    scatter_results(A_batch, members)                     # write each output row back to its originating request

SA/CLA mechanisms require modifying the attention module to accept, for a block of $S$ layers, either

  • a shared attention matrix $A_{\text{shared}}$ (for SA), or
  • a shared $K, V$ cache across the layers in the group (for CLA/KVSharer); a minimal interface sketch follows.
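
The sketch below is one hypothetical shape such an interface could take (single-head, PyTorch, no masking or dropout); it illustrates the idea rather than any paper's implementation. A call can reuse a previous layer's softmaxed attention map (SA) or a previous layer's K/V (CLA/KVSharer), and otherwise computes and exposes its own.

import torch

def shared_attention_step(x, Wq, Wk, Wv, shared_A=None, shared_kv=None):
    """One attention step that can consume a shared attention map (SA) or a shared K/V pair (CLA/KVSharer)."""
    if shared_A is not None:                       # SA: reuse another layer's attention matrix, keep own V
        return shared_A @ (x @ Wv), shared_A, shared_kv
    if shared_kv is None:                          # producer layer: project K/V and expose them for reuse
        shared_kv = (x @ Wk, x @ Wv)
    k, v = shared_kv                               # consumer layers read the shared buffers instead
    A = torch.softmax((x @ Wq) @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    return A @ v, A, shared_kv

d = 32
x = torch.randn(2, 8, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
y1, A1, kv1 = shared_attention_step(x, Wq, Wk, Wv)                       # producer layer
y2, _, _ = shared_attention_step(y1, Wq, Wk, Wv, shared_kv=kv1)          # CLA/KVSharer-style K/V reuse
y3, _, _ = shared_attention_step(y2, Wq, Wk, Wv, shared_A=A1)            # SA-style attention-map reuse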

In mixSGA, all routing and head-averaging operations are constructed to be fully differentiable during training, but become argmax/hard mask at inference.
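
A minimal sketch of this train-soft / infer-hard routing pattern (a hypothetical router head, not the mixSGA authors' code):

import torch
import torch.nn.functional as F

def route(router_logits: torch.Tensor, training: bool) -> torch.Tensor:
    """Per-token weights over KV groups: a differentiable soft mixture in training, a hard mask at inference."""
    if training:
        return F.softmax(router_logits, dim=-1)            # gradients flow through all KV groups
    hard = F.one_hot(router_logits.argmax(dim=-1), num_classes=router_logits.size(-1))
    return hard.to(router_logits.dtype)                    # argmax selects exactly one KV group per token

weights = route(torch.randn(2, 16, 4), training=False)     # e.g., 4 KV groups, one chosen per token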

5. Quantitative Impacts and Trade-Offs

Empirical studies across proposals demonstrate major resource and performance improvements, with trade-offs in final accuracy tightly controlled by design choices.

Throughput and Efficiency Gains

  • MoSKA (Rhee et al., 8 Nov 2025): Achieves up to 538.7× the throughput of FlashAttention and 85× that of ChunkAttention when serving 1M–16M-token shared contexts.
  • ChunkAttention (Ye et al., 23 Feb 2024): 3.2–4.8× kernel speedup and 1.6× end-to-end throughput with 70–90% less KV memory when prompts are shared.
  • DeFT (Yao et al., 30 Mar 2024): Eliminates up to 80% of redundant KV I/O, yielding 2–3.6× speedups in tree-structured inference.

Memory Reduction

  • CLA2+MQA (Brandon et al., 21 May 2024): Halves KV-cache memory relative to state-of-the-art MQA, with negligible increase in perplexity (PPL).
  • KVSharer (Yang et al., 24 Oct 2024): Up to 30% layer-wise cache reduction with >95% of baseline performance; combined with intra-layer methods, up to 45% overall reduction.
  • MFA-KR (Hu et al., 26 Dec 2024): The key-reuse variant achieves 56% to 93.7% memory savings (by decreasing $r_K$), incurring only <0.3 PPL or 0.2 BLEU degradation in moderate settings.

Trade-offs

  • Sharing ratio vs. performance: Aggressive sharing/compression yields larger speed/memory gains but risks accuracy degradation on high-variance or early layers.
  • Placement of sharing: SA/CLA methods perform best when sharing is applied to late, more isotropic layers; sharing K/V between dissimilar layers (KVSharer) sometimes preserves outputs better than sharing between similar ones.
  • Routing and compute overhead: MoE strategies (MoSKA, mixSGA) pay a token-level routing overhead but amortize it quickly at scale.
  • Fine-tuning requirement: Most projection- or grouping-based schemes recover or approach original accuracy after a single epoch of LoRA or similar lightweight adaptation.

6. Application Domains and Hardware Considerations

Shared KV attention underpins several applications:

  • Production LLM serving: Multi-tenant and high-context deployments, where system prompts, background docs, or user prefixes recur across requests.
  • Branching inference: Speculative decoding, beam search, tree-of-thoughts, and multi-path reasoning, where multiple generations share large prefixes.
  • Model communication: Multi-agent LLM setups where KV sharing becomes an effective protocol, outperforming natural language or direct hidden state exchange (Shi et al., 2 Oct 2025).
  • Disaggregated serving hardware: MoSKA exemplifies splitting inference workloads across compute-bound shared accelerator nodes and memory-bound unique nodes, each hardware-specialized.

GPU-based implementations benefit from maximizing arithmetic intensity (GEMM batching), minimizing global memory traffic, and fusing the $QK^\top$, softmax, and value-weighting kernels into a single pass. Balanced partitioning (e.g., DeFT’s flattened tree-splitting, ChunkAttention’s two-phase kernel) is critical to saturate all GPU streaming multiprocessors.

7. Outlook and Limitations

These advances in shared KV attention have substantially expanded the efficiency envelope for LLM inference:

  • Multi-factor and adaptive sharing—cross-layer, cross-request, and per-token MoE routing—can be composed for multiplicative gains.
  • Plug-and-play layer-wise sharing (KVSharer) enables training-free deployment, but may be sensitive to the calibration set and architecture.
  • Generalization to novel architectures or modalities (e.g., cross-architecture sharing, audio/vision transformer models) warrants further study.
  • Limits of isotropy/importance-based sharing: Unique or high-attention-variance positions (early layers, arithmetic reasoning) are less amenable to aggressive sharing.
  • Systemic integration: Future directions include dynamic, per-example sharing strategies, integration with external memory/retrieval systems, and deeper exploitation of hardware-software co-design.

In summary, shared KV attention mechanisms constitute a central principle for scaling the inference efficiency of autoregressive transformers. Leveraging contextual redundancy along token, layer, and request axes, these approaches systematically transform the dominant memory bottleneck into highly parallelizable, compute-efficient operations, enabling both real-world deployment at scale and new research directions in efficient LLM architectures.
