KV-Cache Fusion in Transformers

Updated 29 March 2026
  • KV-Cache Fusion is a family of techniques that merge similar key-value cache entries to reduce memory usage and boost Transformer inference efficiency.
  • It employs methods such as block-level fusion, adaptive token merging, layer-wise sharing, and low-rank projections to compress caches with minimal accuracy loss.
  • Empirical results show up to 4.38x cache compression and 2-4x throughput improvements, benefiting long-context and high-concurrency inference scenarios.

KV-Cache Fusion refers to a broad family of methods designed to reduce the memory and computational footprint of key-value (KV) caches in Transformer-based LLMs by exploiting similarities or redundancies within or across cache entries. These approaches aim to mitigate the rapidly growing cache costs that arise from the persistence of KV tensors across layers, tokens, requests, and retrieval-augmented input structures—especially critical in long-context or highly concurrent inference scenarios. KV-Cache Fusion encompasses clustering and block-level fusion, layer-wise cache sharing, adaptive merging within sequences, low-rank tensor decomposition, and fusion strategies tailored for retrieval-augmented generation (RAG).

1. Theoretical Foundation and Motivation

During inference, Transformer LLMs accumulate for each active session a KV-cache recording keys and values for every previous position and for every layer. The total memory required scales linearly with the number of concurrent requests ($B$), layers ($L$), maximum sequence length ($S$), attention heads ($H$), and head dimension ($d$): $M = 2 \cdot B \cdot L \cdot S \cdot H \cdot d$ elements, where the factor of 2 accounts for storing both keys and values. For long inputs and high concurrency, this memory demand dominates GPU resources and severely limits throughput.
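As a worked instance of the formula above, the following sketch converts the element count into bytes. The configuration values and the 2-byte (16-bit) element size are illustrative assumptions, not figures from the cited papers, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(batch, layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed for the KV cache: M = 2 * B * L * S * H * d elements."""
    # The leading 2 counts one key tensor and one value tensor per position.
    elements = 2 * batch * layers * seq_len * kv_heads * head_dim
    return elements * bytes_per_elem


# Illustrative configuration (assumed): 32 requests, 32 layers, 8192 tokens,
# 8 cached heads of dimension 128, 16-bit elements.
total = kv_cache_bytes(batch=32, layers=32, seq_len=8192, kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")  # 32.0 GiB
```

Doubling any single factor (batch size, context length, and so on) doubles the total, which is why long-context and high-concurrency serving are the regimes where the cache dominates.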

KV-Cache Fusion responds to this bottleneck by identifying highly similar or redundant cache blocks—whether across requests, chunks, layers, or tokens—and combining them into shared, compressed, or fused representations. The principal goals are to reduce the number of unique cache entries, decrease memory and I/O costs, and maintain decoding quality and latency under various serving loads (Kampeas et al., 6 Jan 2026, Yang et al., 2024, Wang et al., 2024, Lin et al., 3 Dec 2025).

2. Block-Level and Joint Encoding Fusion

Joint encoding of KV-cache blocks ("Fast Fusion") leverages similarities across requests (Batch Fast Fusion, BFF) or within input chunks (Chunks Fast Fusion, CFF) by fusing blocks whose normalized directions have cosine similarity exceeding a threshold $u$. Blocks assigned to the same cluster are replaced by a shared unit direction and associated scalar norms. The standard block table (pointer-based lookup) is retained, and no tensor kernel rewrites are required; attention computations alias to shared blocks via pointer redirection.
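A minimal sketch of this idea, assuming each block is flattened to a nonzero vector and clustered greedily against existing cluster representatives. The greedy assignment, the function names `fuse_blocks` and `read_block`, and the toy data are assumptions for illustration; the cited work may cluster differently.

```python
import math


def fuse_blocks(blocks, u):
    """Greedily alias blocks whose directions have cosine similarity >= u.

    Returns (directions, table): `directions` holds one unit vector per cluster,
    and `table[i] = (cluster_id, norm)` is the pointer-style entry for block i.
    Assumes every block is a nonzero vector.
    """
    directions, table = [], []
    for vec in blocks:
        norm = math.sqrt(sum(x * x for x in vec))
        unit = [x / norm for x in vec]
        for cid, rep in enumerate(directions):
            if sum(a * b for a, b in zip(unit, rep)) >= u:
                table.append((cid, norm))  # alias to the existing shared direction
                break
        else:
            directions.append(unit)  # first block of a new cluster
            table.append((len(directions) - 1, norm))
    return directions, table


def read_block(directions, table, i):
    """Reconstruct block i as its scalar norm times the shared unit direction."""
    cid, norm = table[i]
    return [norm * x for x in directions[cid]]


blocks = [[1.0, 0.0, 0.0], [2.0, 0.02, 0.0], [0.0, 1.0, 0.0], [0.0, 3.0, 0.03]]
directions, table = fuse_blocks(blocks, u=0.99)
print(len(blocks), "blocks ->", len(directions), "shared directions")
```

Because readers only go through `table`, the code that consumes blocks does not need to know whether a block is stored uniquely or aliased, which mirrors the pointer-redirection property described above.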

The compression/quality tradeoff is analyzed via a Gaussian-kernel model of similarity scores and Poisson rate estimation:

  • A higher fusion threshold $u$ lowers distortion but reduces the compression ratio.
  • The attention drift is bounded by $\|s' - s\|_1 \le 2\epsilon + O(\epsilon^2)$, with $\epsilon = \max_t \|k_t\|\,\|q\|/\sqrt{d} \cdot \sqrt{2(1-u)}$.
  • Empirical results on Llama3.1-8B and Qwen2.5-72B show up to $4.38\times$ cache compression with […] accuracy loss and […] throughput improvement under high concurrency (Kampeas et al., 6 Jan 2026).
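The drift bound can be checked numerically. The sketch below replaces each key with a vector of the same norm whose cosine similarity to the original is exactly $u$, then compares the resulting change in softmax attention weights with $\epsilon$. The closed form $e^{2\epsilon} - 1$ used in the comparison is an assumption of this sketch: it follows from a standard softmax perturbation argument and expands to $2\epsilon + O(\epsilon^2)$. The dimensions, seed, and helper names are likewise illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rotate_to_cosine(k, u, rng):
    """Return a vector with the same norm as k and cosine similarity u to k."""
    norm = math.sqrt(sum(x * x for x in k))
    unit = [x / norm for x in k]
    r = [rng.gauss(0, 1) for _ in k]
    proj = sum(a * b for a, b in zip(r, unit))
    ortho = [a - proj * b for a, b in zip(r, unit)]  # Gram-Schmidt step
    onorm = math.sqrt(sum(x * x for x in ortho))
    side = math.sqrt(1 - u * u)
    return [norm * (u * a + side * b / onorm) for a, b in zip(unit, ortho)]


rng = random.Random(0)
d, n, u = 16, 8, 0.9999
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
fused = [rotate_to_cosine(k, u, rng) for k in keys]


def logits(ks):
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]


s, s_fused = softmax(logits(keys)), softmax(logits(fused))
drift = sum(abs(a - b) for a, b in zip(s, s_fused))

q_norm = math.sqrt(sum(x * x for x in q))
max_k = max(math.sqrt(sum(x * x for x in k)) for k in keys)
eps = max_k * q_norm / math.sqrt(d) * math.sqrt(2 * (1 - u))
print(f"drift={drift:.4f}  eps={eps:.4f}  bound={math.exp(2 * eps) - 1:.4f}")
```

Lowering `u` in this sketch increases `eps` and therefore loosens the bound, which is the distortion side of the tradeoff in the first bullet.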

3. Adaptive Token-Level Fusion

Adaptive merging within a sequence (KVMerger) addresses intra-sequence redundancy by greedily clustering adjacent tokens (within each layer) whose key vectors exhibit pairwise cosine similarity above a threshold. For each "merging set," a Gaussian-kernel weighting scheme computes a fused representative (pivoted at the token with maximal attention) as a weighted combination of the states in the set, with merge weights given by the kernel.

This method enforces data-independent, persistent sparsity and achieves compression by discarding non-pivotal tokens. It delivers substantial memory savings (down to 35% of original cache) with negligible accuracy loss, outperforming token eviction-based techniques, especially on long contexts (Wang et al., 2024).
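The two steps, forming merging sets and fusing each set around a pivot, can be sketched as follows. The exact kernel form $\exp(-\|k_p - k_i\|^2 / 2\sigma^2)$, the normalization of the weights, the value of `sigma`, and all function names are assumptions made for illustration rather than the published formulation. The example fuses keys only; in this sketch the same weights could be reused for the corresponding values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merging_sets(keys, threshold):
    """Group consecutive tokens while each key stays similar to its predecessor."""
    sets, current = [], [0]
    for i in range(1, len(keys)):
        if cosine(keys[i - 1], keys[i]) >= threshold:
            current.append(i)
        else:
            sets.append(current)
            current = [i]
    sets.append(current)
    return sets


def gaussian_merge(vectors, idxs, attn, sigma):
    """Fuse the vectors in idxs around the pivot (highest attention score)."""
    pivot = max(idxs, key=lambda i: attn[i])
    weights = []
    for i in idxs:
        dist2 = sum((a - b) ** 2 for a, b in zip(vectors[pivot], vectors[i]))
        weights.append(math.exp(-dist2 / (2 * sigma ** 2)))  # assumed kernel form
    total = sum(weights)
    dim = len(vectors[pivot])
    merged = [sum(w * vectors[i][j] for w, i in zip(weights, idxs)) / total
              for j in range(dim)]
    return pivot, merged


keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
attn = [0.1, 0.4, 0.1, 0.3, 0.1]  # hypothetical aggregated attention per token
sets = merging_sets(keys, threshold=0.95)
compressed = [gaussian_merge(keys, s, attn, sigma=1.0) for s in sets]
print(sets, "->", len(compressed), "cache entries")
```

Each merging set collapses to a single entry stored at the pivot position, so the cache shrinks from one entry per token to one entry per set.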

4. Layer-Wise and Cross-Layer KV-Cache Fusion

Layer-wise sharing and fusion target redundancy across layers. In KVSharer, certain layers are selected (by maximizing Euclidean dissimilarity in their mean KV vectors) to reuse the KV cache from others. This can halve or further reduce cache memory with minimal degradation, achieving up to […] generation acceleration and maintaining perplexity within […] of baseline on various LLMs. This method composes naturally with within-layer token pruning and quantization strategies (Yang et al., 2024).
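A rough sketch of the selection step, assuming each layer is summarized by a mean KV vector and that layer pairs are considered from most to least dissimilar. The greedy rule, the constraint that a layer is either a source or a reuser but not both, and the names `rank_layer_pairs` and `choose_sharing` are assumptions of this sketch; a practical system would also validate output quality before committing to a sharing plan.

```python
import math


def rank_layer_pairs(layer_means):
    """Return (distance, i, j) for all layer pairs, most dissimilar first."""
    pairs = []
    for i in range(len(layer_means)):
        for j in range(i + 1, len(layer_means)):
            pairs.append((math.dist(layer_means[i], layer_means[j]), i, j))
    return sorted(pairs, reverse=True)


def choose_sharing(layer_means, num_shared):
    """Map up to num_shared layers to a source layer whose cache they reuse."""
    share_from = {}
    for _, src, dst in rank_layer_pairs(layer_means):
        if len(share_from) == num_shared:
            break
        # Skip pairs that would chain reuse: neither layer may already reuse a
        # cache, and the reusing layer may not already serve as a source.
        if dst in share_from or src in share_from or dst in share_from.values():
            continue
        share_from[dst] = src
    return share_from


layer_means = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]  # toy statistics
share_from = choose_sharing(layer_means, num_shared=2)
stored = len(layer_means) - len(share_from)
print(share_from, f"-> {stored}/{len(layer_means)} layers keep their own cache")
```

With two of four layers reusing another layer's cache, only half of the per-layer caches are stored, which corresponds to the halving mentioned above.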

Cross-layer fusion with data-driven weights (FusedKV) further refines this paradigm by constructing the top-layer KV cache as a learnable, feature-wise linear combination of post-RoPE keys from the bottom- and mid-level layers. Empirical analysis shows keys favor mid-levels while values are best reconstructed from early layers. FusedKV and its I/O-optimized variant (FusedKV-Lite, which reuses only one source per K/V) attain a […] memory reduction and can slightly improve perplexity over the baseline Transformer; throughput penalties are minor or amortized on memory-bound pipelines (Lin et al., 3 Dec 2025).
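The feature-wise combination can be illustrated with a few lines of code. Here `alpha` and `beta` stand in for learned per-feature weights; their values, the vector sizes, and the function name `fuse_keys` are hypothetical, and the sketch ignores batching and heads.

```python
def fuse_keys(k_bottom, k_mid, alpha, beta):
    """Feature-wise combination: out[j] = alpha[j] * k_bottom[j] + beta[j] * k_mid[j]."""
    return [a * kb + b * km for a, b, kb, km in zip(alpha, beta, k_bottom, k_mid)]


k_bottom = [1.0, 2.0]   # cached key from a bottom layer (toy values)
k_mid = [3.0, 4.0]      # cached key from a mid-level layer (toy values)
alpha = [0.25, 0.5]     # hypothetical learned weights
beta = [0.75, 0.5]
k_top = fuse_keys(k_bottom, k_mid, alpha, beta)
print(k_top)  # [2.5, 3.0]
```

Since `k_top` is derived on the fly from entries that are already cached, the top layer needs no cache storage of its own in this sketch; the cost is the extra read of the source entries.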

5. Dimension-Reduction and Low-Rank Fusion

Low-rank projection approaches (Palu) compress the KV-cache along the hidden dimension by decomposing the key and value projection matrices via SVD (joint, group, or per-head) and caching only the down-projected latents per token. Keys and values are reconstructed on demand for attention scoring. Rank allocation is guided by Fisher information, and Walsh-Hadamard transforms mitigate quantization outliers. This method can deliver […]–[…] compression, up to […] end-to-end speedup, and near-baseline perplexity, especially when paired with moderate head grouping (Chang et al., 2024).
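The caching pattern can be sketched with a projection that is exactly low-rank by construction. In practice the factors would come from a truncated SVD, so the reconstruction would only be approximate; the matrices `A` and `B`, their sizes, and the `matmul` helper below are assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical rank-2 factorization W = A @ B of a 4x4 key projection.
A = [[1, 0], [0, 1], [1, 1], [2, -1]]   # d_model x r, down-projection
B = [[1, 2, 0, 1], [0, 1, 3, -1]]       # r x d_head, up-projection
W = matmul(A, B)

x = [[1, 2, 3, 4]]                      # one token's hidden state
latent = matmul(x, A)                   # cached: r numbers instead of d_head
k_reconstructed = matmul(latent, B)     # rebuilt on demand for attention

print(latent, k_reconstructed)
print("cache size ratio r / d_head =", len(latent[0]) / len(W[0]))
```

The cache stores `latent` only, so per-token memory scales with the rank `r` rather than the head dimension, at the cost of one extra small matrix product when keys are needed.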

6. Fusion for Retrieval-Augmented Generation

Fusion strategies for RAG systems address the subtleties of cache sharing with retrieved document chunks.

  • Naive chunk-level caching misses all cross-chunk attention, producing a substantial drop in output quality.
  • Prefix–Scale–Recompute (PSR) [Editor's term] fuses prior methods: (i) absorbs chunk head sinks during prefill, (ii) rescales softmax temperature and attention to cached blocks, and (iii) performs selective recomputation of tokens with high […] scores. This tripartite fusion closes most of the accuracy gap to full prefill and saves over […] compute (Cestola et al., 3 Mar 2026).
  • FusionRAG optimizes offline chunk fusion (embedding cross-chunk context into each pre-cached chunk via top-$k$ neighbor fusion) and online, question-guided recomputation of the most critical tokens. With as little as […] recomputation, it recovers near-oracle generation quality and achieves […]–[…] TTFT reductions (Wang et al., 19 Jan 2026).
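Both approaches share a selective-recomputation step: most cached entries are reused, and only a small, high-priority subset of tokens is recomputed with full context. A minimal sketch, assuming a per-token importance score is already available; how that score is computed differs between methods and is not modeled here, and the function names, the `recompute` callback, and the budget rule are hypothetical.

```python
def select_for_recompute(scores, fraction):
    """Indices of the highest-scoring tokens, limited to `fraction` of the sequence."""
    budget = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def refresh_cache(cached_kv, scores, fraction, recompute):
    """Reuse cached entries except for the selected tokens, which are recomputed."""
    chosen = set(select_for_recompute(scores, fraction))
    return [recompute(i) if i in chosen else kv for i, kv in enumerate(cached_kv)]


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.0, 0.4, 0.15, 0.25]  # toy importance scores
cached_kv = [f"c{i}" for i in range(len(scores))]                 # stand-ins for KV entries
refreshed = refresh_cache(cached_kv, scores, fraction=0.2, recompute=lambda i: f"r{i}")
print(refreshed)
```

The `fraction` parameter is the knob that trades prefill compute against how much cross-chunk attention is restored.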

7. Implementation, Integration, and Performance

A recurrent theme is the avoidance of kernel rewrites and preservation of standard dense tensor layouts. Fused blocks are integrated via indirection tables or pointer manipulation, keeping lookup/update at $O(1)$ and fully compatible with typical GPU-accelerated attention kernels. Overheads from merging/fusion steps are amortized, typically remaining below […] of decode time for batch sizes in realistic server settings (Kampeas et al., 6 Jan 2026). Adaptive thresholding allows a dynamic tradeoff between memory and quality, and most fusion schemes are compatible with quantization or token-level sparsity.

Empirically, across methods and benchmarks, KV-Cache Fusion delivers memory reductions of up to $4.38\times$ and throughput gains of $2$–$4\times$, with […] accuracy or perplexity penalty. In RAG-specific tasks, quality recovery can approach […] of full-attention performance at one-tenth of the compute cost (Kampeas et al., 6 Jan 2026, Yang et al., 2024, Wang et al., 2024, Lin et al., 3 Dec 2025, Cestola et al., 3 Mar 2026, Wang et al., 19 Jan 2026).


Representative KV-Cache Fusion techniques and properties

| Technique | Fusion Scope | Typical Compression | Accuracy Impact |
|---|---|---|---|
| Joint block encoding | Blocks across requests/chunks | Up to $4.38\times$ | […] F1 drop |
| Adaptive merging | Tokens within sequence | […]–[…] | Negligible |
| Layer-wise sharing | Whole Transformer layers | […] | […] PPL drop |
| Low-rank projection | Hidden dim (per head/group) | […]–[…] | […]–[…] acc. drop |
| RAG (FusionRAG/PSR) | Cross-chunk fusion, selective recompute | […]–[…] (TTFT speedup) | […] F1 gap vs full prefill |
