Shared KV Attention Mechanisms
- Shared KV attention is a technique that reuses and compresses key/value tensors across tokens, layers, or requests to mitigate memory bottlenecks in transformer models.
- It employs methods like cross-request batching, cross-layer sharing, and head compression to transform memory-bound operations into compute-efficient ones.
- Empirical evaluations show significant throughput improvements and memory reductions in LLM inference with minimal accuracy trade-offs.
Shared Key-Value (KV) Attention refers to a class of memory- and bandwidth-efficient attention mechanisms for LLMs that reuse, compress, or otherwise share the key and value tensors across tokens, requests, layers, or even models. These approaches address the principal memory bottleneck in standard transformer inference: the requirement to maintain a growing, per-layer, per-token KV-cache whose size and access cost scale linearly with the number of layers and the context length. By leveraging contextual or structural redundancy—through mechanisms such as shared contexts, chunking, cross-layer cache reuse, or dynamic routing—shared KV attention methods transform memory-bound operations into compute-bound or batch-parallel operations, often yielding dramatic improvements in throughput and scalability.
1. Motivation and Foundational Problem
Autoregressive LLM inference requires storing all previously computed key (K) and value (V) projections for every input token at every attention layer. For a sequence of length $L$, hidden dimension $d$, number of layers $N$, and batch size $B$, the total KV-cache memory scales as $O(2 \cdot B \cdot L \cdot N \cdot d)$, counting both K and V at every layer (a worked example of this footprint follows the list below). As context and batch size increase, this leads to:
- Memory bottleneck: The KV-cache may dominate overall GPU memory use, restricting supported batch size or input length.
- Bandwidth inefficiency: Standard per-step decoding performs memory-bound GEMV operations for each new token, leaving GPU tensor cores under-utilized.
- Redundancy: In many production settings (multi-tenant LLM services, few-shot prompting, multi-step reasoning, multi-agent workflows), large portions of the context—such as prompts, prefixes, or background documents—are shared identically across many requests or generation branches.
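For a sense of scale, a brief worked example (assuming, purely for illustration, a 7B-parameter-class configuration with $N = 32$ layers, hidden dimension $d = 4096$, and 16-bit KV storage; these figures are not drawn from the cited papers):

$$
\underbrace{2}_{K,\,V} \times N \times d \times 2\,\text{bytes} = 2 \times 32 \times 4096 \times 2\,\text{B} = 512\,\text{KB per token},
\qquad 4096 \text{ tokens} \;\Rightarrow\; \approx 2\,\text{GB per sequence}.
$$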
The foundational opportunity exploited by shared KV attention is that when multiple queries attend to the same context (identical K,V), the attention computation can be batched or shared, potentially turning many separate memory-bound vector operations into fewer or single compute-bound matrix multiplications.
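The effect can be made concrete with a minimal NumPy sketch (shapes and names are illustrative only, not taken from any cited system): $B$ decode-step queries attending to one identical shared context are computed first as $B$ separate matrix-vector products, then as a single batched matrix-matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d, L_shared, B = 64, 1024, 8           # head dim, shared-context length, concurrent requests
K = np.random.randn(L_shared, d)        # keys of the shared context (identical for all requests)
V = np.random.randn(L_shared, d)        # values of the shared context
Q = np.random.randn(B, d)               # one decode-step query per request

# Memory-bound view: B separate GEMVs, re-reading K and V once per request.
out_gemv = np.stack([softmax(q @ K.T / np.sqrt(d)) @ V for q in Q])

# Compute-bound view: one GEMM over the stacked queries; K and V are read once.
out_gemm = softmax(Q @ K.T / np.sqrt(d)) @ V

assert np.allclose(out_gemv, out_gemm)  # same result, far less memory traffic in the batched form
```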
2. Mechanisms and Architectural Variants
Shared KV attention mechanisms span a spectrum depending on the unit and dimension of sharing:
A. Cross-Request and Prefix Sharing
- MoSKA (Rhee et al., 8 Nov 2025): Batches all concurrent queries whose prefix (e.g., system prompt, document chunk) is identical and processes their attention to the shared segment via a single GEMM. The unique context per request is processed independently.
- ChunkAttention (Ye et al., 23 Feb 2024): Segments K/V into fixed-size chunks along the sequence, indexes the chunks in a prefix trie so that prefix chunks are shared efficiently across multiple sequences, and runs chunk-wise attention where possible (a minimal trie sketch follows this list).
- DeFT (Yao et al., 30 Mar 2024): For tree-structured decoding (e.g., speculative, beam, multi-branch inference), identifies and groups shared prefixes, fusing all queries that attend to a given shared KV block and executing attention only once per KV.
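The bookkeeping behind prefix-chunk sharing can be sketched as a trie keyed by token chunks; the following is a simplified illustration of the idea, not ChunkAttention's actual data structure (chunk size, eviction, and the attention kernels themselves are elided).

```python
from dataclasses import dataclass, field

CHUNK = 4  # tokens per chunk (illustrative; real systems use much larger chunks)

@dataclass
class ChunkNode:
    tokens: tuple                        # token ids covered by this chunk
    kv: object = None                    # K/V tensors for this chunk, computed once and reused
    children: dict = field(default_factory=dict)
    ref_count: int = 0                   # number of live sequences sharing this chunk

class PrefixTrie:
    def __init__(self):
        self.root = ChunkNode(tokens=())

    def insert(self, token_ids):
        """Walk/extend the trie chunk by chunk; shared prefixes reuse existing nodes (and their K/V)."""
        node, path = self.root, []
        for i in range(0, len(token_ids) - len(token_ids) % CHUNK, CHUNK):
            chunk = tuple(token_ids[i:i + CHUNK])
            if chunk not in node.children:               # first sequence with this prefix chunk:
                node.children[chunk] = ChunkNode(chunk)  # its K/V will be computed exactly once
            node = node.children[chunk]
            node.ref_count += 1
            path.append(node)
        return path  # chain of shared chunks whose cached K/V this sequence can reuse

trie = PrefixTrie()
a = trie.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])    # two full chunks plus a partial tail
b = trie.insert([1, 2, 3, 4, 5, 6, 7, 8, 42])   # shares both full chunks with `a`
assert a[0] is b[0] and a[1] is b[1]
```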
B. Cross-Layer and Model-Internal Sharing
- Shared Attention (SA) (Liao et al., 13 Jul 2024): Shares the entire attention matrix (not just K/V) across a block of consecutive layers, exploiting the empirical isotropy of per-layer attention matrices in the mid-to-late transformer layers.
- KVSharer (Yang et al., 24 Oct 2024): Shares complete K/V caches between dissimilar layers, as determined by their average cache representations, reducing per-token KV memory by grouping layers without retraining.
- Cross-Layer Attention (CLA) (Brandon et al., 21 May 2024): Reuses the K/V projections computed at the first layer of each group of consecutive layers across all layers in that group, greatly reducing the number of distinct K/V caches.
C. KV Head Compression and MoE Routing
- Grouped/Multi-Query Attention (Yu et al., 11 Jun 2024, Song et al., 16 Jun 2025): Reduces KV cache size by grouping or sharing K/V heads across multiple query heads (a head-grouping sketch follows this list). MixSGA (Song et al., 16 Jun 2025) further introduces token-level dynamic MoE routing to allocate fine or coarse KV detail adaptively according to token importance.
- Multi-matrix Factorization Attention (MFA / MFA-KR) (Hu et al., 26 Dec 2024): Employs low-rank factorization of the Q and K projections; the MFA-KR variant adds key re-use, reconstructing the value cache from the key cache via a learned linear transformation so the two share buffers at the basis level.
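A minimal sketch of the head-grouping effect, assuming a generic GQA layout (not the specific formulation of any cited paper): only $h_{kv}$ K/V heads are stored, and each is broadcast to its group of query heads at attention time, so the per-token cache shrinks by $h_q / h_{kv}$.

```python
import numpy as np

B, L, d_head = 2, 16, 64
h_q, h_kv = 32, 4                        # 32 query heads share 4 K/V heads (groups of 8)
group = h_q // h_kv

Q = np.random.randn(B, h_q, L, d_head)
K = np.random.randn(B, h_kv, L, d_head)  # cached: h_kv heads instead of h_q
V = np.random.randn(B, h_kv, L, d_head)  # per-token KV cache shrinks by h_q / h_kv = 8x

# Broadcast each stored K/V head to the `group` query heads that share it (no extra cache kept).
K_exp = np.repeat(K, group, axis=1)      # (B, h_q, L, d_head)
V_exp = np.repeat(V, group, axis=1)

scores = Q @ K_exp.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (B, h_q, L, L); causal mask omitted
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                    # row-wise softmax
out = weights @ V_exp                                        # (B, h_q, L, d_head)
```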
D. Shared KV in Multi-Agent and Communication Settings
- KVComm (Shi et al., 2 Oct 2025): Shares selected layer-wise KV pairs between LLMs in multi-agent protocols, identifying the most informative layers for sharing via attention-mass scoring and a Gaussian prior over layer depth.
3. Mathematical Underpinnings
All shared KV attention mechanisms exploit the linear algebraic structure of attention:
- Batching Over Queries: When $n$ queries share the same $(K, V)$, the $n$ separate GEMVs $q_i K^\top$ can be stacked into a single GEMM $Q K^\top$ with $Q \in \mathbb{R}^{n \times d}$, followed by a row-wise softmax.
- KV Head Grouping: If the number of key/value heads $h_{kv}$ is smaller than the number of query heads $h_q$ (GQA), per-token cache size shrinks proportionally (by a factor of $h_q / h_{kv}$), as all query heads in a group share the same K/V head.
- Cross-Layer Sharing: If adjacent (or selected) layers share K/V (or even the attention matrix itself), then the number of stored K/V caches per sequence per token is reduced by the group/sharing factor $g$ (a combined estimate follows this list).
- MoE Token Routing: Each token is routed to a KV group (expert), which may use a coarser KV representation (fewer unique heads) for less important tokens, but shares projection parameters to avoid parameter explosion.
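Assuming, as a simplification, that these factors compose multiplicatively (interactions between methods are not guaranteed), the stored KV footprint with element size $s$ bytes can be estimated as

$$
M_{\mathrm{KV}} \;\approx\; \underbrace{2\,B\,L\,N\,d\,s}_{\text{baseline: K and V at every layer}} \times \underbrace{\frac{h_{kv}}{h_q}}_{\text{head grouping}} \times \underbrace{\frac{1}{g}}_{\text{cross-layer sharing}} .
$$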
4. Implementation Frameworks and Algorithms
A representative shared KV attention framework typically involves:
| Step | MoSKA / ChunkAttention | SA / CLA / KVSharer | MixSGA / Grouped KV |
|---|---|---|---|
| 1. Identify Sharing | Detect shared prefix or chunk alignments | Partition layers into groups | Score tokens for grouping |
| 2. Routing | MoE top-$k$ routing or trie walk | Static layer-to-group assignment | Per-token router network |
| 3. Share/Batch | Batch queries per shared block, run fused GEMM | Share K/V (or attention-matrix) buffer per group | Group K/V by head |
| 4. Compute Output | Batched attention over the shared segment (single GEMM) | Use shared K/V (or attention matrix) within the group | Fused attention per KV group |
| 5. Restore Unique | Unique Q → per-request path | Exit to unique (non-shared) layers as needed | Scatter outputs by routing |
Pseudocode for MoSKA shared KV batching:
```
# 1. Routing: each request i picks its top-k shared KV blocks ("experts")
for i in 1..B:
    idx[i] = topk(scorer(Q[i], E), k)      # E: routing embeddings of the M shared blocks

# 2. Batching: for each shared block j, fuse all routed queries into one GEMM
for j in 1..M:
    Q_batch_j = stack(Q[i] for i in 1..B if j in idx[i])
    A_batch_j = softmax(Q_batch_j @ K_j.T / sqrt(d)) @ V_j
    scatter_results(A_batch_j)             # 3. scatter outputs back to the owning requests
```
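A runnable, deliberately naive Python rendition of the same pattern is sketched below; the dot-product scorer, the per-block embeddings `E`, and the top-$k$ routing are placeholders standing in for MoSKA's actual router, and no claim is made that this matches its fused kernels.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

B, M, d, L, k = 8, 4, 64, 256, 2      # requests, shared KV blocks, head dim, block length, top-k
Q = np.random.randn(B, d)             # one decode-step query per request
K = np.random.randn(M, L, d)          # keys of each shared context block
V = np.random.randn(M, L, d)          # values of each shared context block
E = np.random.randn(M, d)             # placeholder per-block routing embeddings

# 1. Routing: each request picks its top-k shared blocks (placeholder scorer: dot product).
scores = Q @ E.T                                       # (B, M)
idx = np.argsort(-scores, axis=1)[:, :k]               # (B, k)

# 2-3. Batch and compute: one batched (GEMM-based) attention per shared block.
out = np.zeros((B, M, d))
for j in range(M):
    members = [i for i in range(B) if j in idx[i]]     # requests routed to block j
    if not members:
        continue
    Q_batch = Q[members]                               # (m, d)
    A_batch = softmax(Q_batch @ K[j].T / np.sqrt(d)) @ V[j]   # (m, d)
    out[members, j] = A_batch                          # 4. scatter back to the owning requests
```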
SA/CLA mechanisms require modifying the attention module to accept, for a block of layers, either
- a shared attention matrix $A$ (for SA), or
- a shared $(K, V)$ cache across the layers in the group (for CLA/KVSharer); a minimal interface sketch follows.
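One way to expose such sharing in code is an attention step that optionally consumes a shared attention matrix or a shared $(K, V)$ cache instead of computing its own. The sketch below is schematic, with assumed argument names (`shared_attn`, `shared_kv`) and, for brevity, one weight set reused across layers; it is not the released implementation of SA, CLA, or KVSharer.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_attention(x, Wq, Wk, Wv, shared_attn=None, shared_kv=None):
    """Single-head attention for one layer in a sharing group.

    shared_attn: reuse a previous layer's softmax(QK^T) matrix (SA-style).
    shared_kv:   reuse another layer's (K, V) cache (CLA/KVSharer-style).
    """
    q = x @ Wq
    k, v = shared_kv if shared_kv is not None else (x @ Wk, x @ Wv)
    if shared_attn is not None:
        attn = shared_attn                       # skip this layer's QK^T and softmax entirely
    else:
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn, (k, v)                # hand the shareable by-products to later layers

# Usage: compute once at the group's first layer, reuse downstream
# (a real model would have distinct weights per layer; one set is used here for brevity).
L, d = 16, 64
x = np.random.randn(L, d)
W = [np.random.randn(d, d) for _ in range(3)]
y0, attn0, kv0 = layer_attention(x, *W)                 # first layer of the group
y1, _, _ = layer_attention(y0, *W, shared_kv=kv0)       # CLA-style: reuse the (K, V) cache
y2, _, _ = layer_attention(y0, *W, shared_attn=attn0)   # SA-style: reuse the attention matrix
```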
In MixSGA, all routing and head-averaging operations are constructed to be fully differentiable during training, but collapse to argmax/hard-mask selection at inference.
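The soft-to-hard routing pattern can be illustrated generically: during training the router emits a differentiable softmax mixture over KV groups, while at inference it collapses to a hard argmax selection. This is a sketch of the general pattern (with made-up shapes and no straight-through estimator), not MixSGA's exact router.

```python
import numpy as np

def route(token_logits, training):
    """Token-to-KV-group routing: soft mixture in training, hard argmax mask at inference."""
    if training:
        z = token_logits - token_logits.max(-1, keepdims=True)
        w = np.exp(z)
        return w / w.sum(-1, keepdims=True)       # differentiable soft assignment over groups
    one_hot = np.zeros_like(token_logits)
    one_hot[np.arange(len(token_logits)), token_logits.argmax(-1)] = 1.0
    return one_hot                                # each token commits to exactly one KV group

logits = np.random.randn(5, 3)                    # 5 tokens, 3 KV groups (experts)
soft = route(logits, training=True)               # dense mixing weights, sum to 1 per token
hard = route(logits, training=False)              # one-hot selection used at inference
assert np.allclose(soft.sum(-1), 1.0) and np.allclose(hard.sum(-1), 1.0)
```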
5. Quantitative Impacts and Trade-Offs
Empirical studies across proposals demonstrate major resource and performance improvements, with trade-offs in final accuracy tightly controlled by design choices.
Throughput and Efficiency Gains
- MoSKA (Rhee et al., 8 Nov 2025): Achieves up to throughput vs. FlashAttention and vs. ChunkAttention when serving $1$M–$16$M token shared contexts.
- ChunkAttention (Ye et al., 23 Feb 2024): $3.2$– kernel speedup and end-to-end throughput with less KV memory when prompts are shared.
- DeFT (Yao et al., 30 Mar 2024): Eliminates up to of redundant KV I/O, leading to $2$– speedups in tree-structured inference.
Memory Reduction
- CLA2+MQA (Brandon et al., 21 May 2024): Halves the KV cache memory relative to state-of-the-art MQA, with negligible increase in Perplexity (PPL).
- KVSharer (Yang et al., 24 Oct 2024): Up to layer-wise cache reduction with of baseline performance; when combined with intra-layer methods, up to overall reduction.
- MFA-KR (Hu et al., 26 Dec 2024): Key-reuse variant achieves to memory savings (by decreasing ), incurring only PPL or $0.2$ BLEU degradation for moderate settings.
Trade-offs
- Sharing ratio vs. performance: Aggressive sharing/compression yields larger speed/memory gains but risks accuracy degradation on high-variance or early layers.
- Placement of sharing: SA/CLA methods perform best when sharing is applied to late, isotropic layers; sharing K/V between dissimilar layers (KVSharer) sometimes preserves output quality better than sharing between similar ones.
- Routing and compute overhead: MoE strategies (MoSKA, MixSGA) pay a token-level routing overhead but amortize it quickly at scale.
- Fine-tuning requirement: Most projection- or grouping-based schemes recover or approach original accuracy after a single epoch of LoRA or similar lightweight adaptation.
6. Application Domains and Hardware Considerations
Shared KV attention underpins several applications:
- Production LLM serving: Multi-tenant and high-context deployments, where system prompts, background docs, or user prefixes recur across requests.
- Branching inference: Speculative decoding, beam search, tree-of-thoughts, and multi-path reasoning, where multiple generations share large prefixes.
- Model communication: Multi-agent LLM setups where KV sharing becomes an effective protocol, outperforming natural language or direct hidden state exchange (Shi et al., 2 Oct 2025).
- Disaggregated serving hardware: MoSKA exemplifies splitting inference workloads across compute-bound shared accelerator nodes and memory-bound unique nodes, each hardware-specialized.
GPU-based implementations benefit from maximizing arithmetic intensity (GEMM batching), minimizing global memory traffic, and fusing QK/softmax/V kernels. Balanced partitioning (e.g., DeFT’s flattened tree-splitting, ChunkAttention’s two-phase kernel) is critical to saturate all GPU streaming multiprocessors.
7. Outlook and Limitations
These advances in shared KV attention have immediately expanded the efficiency envelope for LLMs:
- Multi-factor and adaptive sharing—cross-layer, cross-request, and per-token MoE routing—can be composed for multiplicative gains.
- Plug-and-play layer-wise sharing (KVSharer) enables training-free deployment, but may be sensitive to the calibration set and architecture.
- Generalization to novel architectures or modalities (e.g., cross-architecture sharing, audio/vision transformer models) warrants further study.
- Limits of isotropy/importance-based sharing: Unique or high-attention-variance positions (early layers, arithmetic reasoning) are less amenable to aggressive sharing.
- Systemic integration: Future directions include dynamic, per-example sharing strategies, integration with external memory/retrieval systems, and deeper exploitation of hardware-software co-design.
In summary, shared KV attention mechanisms constitute a central principle for scaling the inference efficiency of autoregressive transformers. Leveraging contextual redundancy along token, layer, and request axes, these approaches systematically transform the dominant memory bottleneck into highly parallelizable, compute-efficient operations, enabling both real-world deployment at scale and new research directions in efficient LLM architectures.