KV-Cache Fusion in Transformers

Updated 29 March 2026

KV-Cache Fusion is a family of techniques that merge similar key-value cache entries to reduce memory usage and boost Transformer inference efficiency.
It employs methods such as block-level fusion, adaptive token merging, layer-wise sharing, and low-rank projections to compress caches with minimal accuracy loss.
Empirical results show up to 4.38x cache compression and 2-4x throughput improvements, benefiting long-context and high-concurrency inference scenarios.

KV-Cache Fusion refers to a broad family of methods designed to reduce the memory and computational footprint of key-value (KV) caches in Transformer-based LLMs by exploiting similarities or redundancies within or across cache entries. These approaches aim to mitigate the rapidly growing cache costs that arise from the persistence of KV tensors across layers, tokens, requests, and retrieval-augmented input structures—especially critical in long-context or highly concurrent inference scenarios. KV-Cache Fusion encompasses clustering and block-level fusion, layer-wise cache sharing, adaptive merging within sequences, low-rank tensor decomposition, and fusion strategies tailored for retrieval-augmented generation (RAG).

1. Theoretical Foundation and Motivation

Transformer LLMs, during inference, accumulate for each active session a KV-cache recording keys and values for every previous position and for every layer. The total number of cached elements scales linearly with the number of concurrent requests ( $B$ ), layers ( $L$ ), maximum sequence length ( $S$ ), KV heads per layer ( $H$ ), and head dimension ( $d$ ): $M = 2 \cdot B \cdot L \cdot S \cdot H \cdot d$ , where the factor 2 accounts for storing both keys and values; multiplying by the bytes per element gives the footprint in bytes. For long inputs and high concurrency, this memory demand dominates GPU resources and severely limits throughput.
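As a rough worked example of the formula $M = 2 \cdot B \cdot L \cdot S \cdot H \cdot d$ (with $H$ the number of KV heads per layer), the sketch below evaluates it for one illustrative configuration. The function names, the chosen shape (32 layers, 8 KV heads, head dimension 128, an 8192-token context), and the 2-byte element width are assumptions made for illustration, not figures taken from the cited papers.

```python
def kv_cache_elements(B, L, S, H, d):
    """Number of cached elements: M = 2 * B * L * S * H * d (keys plus values)."""
    return 2 * B * L * S * H * d


def kv_cache_bytes(B, L, S, H, d, bytes_per_element=2):
    """Footprint in bytes, assuming a fixed element width (2 bytes for fp16/bf16)."""
    return kv_cache_elements(B, L, S, H, d) * bytes_per_element


# Assumed, illustrative configuration: 32 layers, 8192 tokens,
# 8 KV heads per layer, head dimension 128.
single = kv_cache_bytes(B=1, L=32, S=8192, H=8, d=128)
batch = kv_cache_bytes(B=64, L=32, S=8192, H=8, d=128)

print(single / 2**30)  # GiB for one request
print(batch / 2**30)   # GiB for 64 concurrent requests
```

Under these assumptions a single request occupies 1 GiB and 64 concurrent requests occupy 64 GiB, which is the linear growth in $B$ and $S$ that fusion methods target.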

KV-Cache Fusion responds to this bottleneck by identifying highly similar or redundant cache blocks—whether across requests, chunks, layers, or tokens—and combining them into shared, compressed, or fused representations. The principal goals are to reduce the number of unique cache entries, decrease memory and I/O costs, and maintain decoding quality and latency under various serving loads (Kampeas et al., 6 Jan 2026, Yang et al., 2024, Wang et al., 2024, Lin et al., 3 Dec 2025).

2. Block-Level and Joint Encoding Fusion

Joint encoding of KV-cache blocks ("Fast Fusion") leverages similarities across requests (Batch Fast Fusion, BFF) or within input chunks (Chunks Fast Fusion, CFF) by fusing blocks whose normalized direction cosine similarity exceeds a threshold $u$ . Blocks assigned to the same cluster are replaced by a shared unit direction and associated scalar norms. The standard block table (pointer-based lookup) is retained, and no tensor kernel rewrites are required; attention computations alias to shared blocks via pointer redirection.
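The fusion step can be sketched as follows. This is a minimal illustration that assumes greedy first-fit clustering against one representative direction per cluster; the clustering procedure of the cited work may differ, and the names `fuse_blocks`, `read_block`, and the `(cluster, norm)` table layout are hypothetical.

```python
import math


def unit_and_norm(v):
    """Split a flattened block into its unit direction and scalar norm."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        return list(v), 0.0  # an all-zero block keeps a zero direction
    return [x / n for x in v], n


def fuse_blocks(blocks, u):
    """Greedy first-fit fusion: a block joins the first cluster whose
    representative direction has cosine similarity >= u with it."""
    directions = []  # one shared unit direction per cluster
    table = []       # per block: (cluster index, scalar norm)
    for v in blocks:
        unit, norm = unit_and_norm(v)
        for c, rep in enumerate(directions):
            if sum(a * b for a, b in zip(unit, rep)) >= u:
                table.append((c, norm))
                break
        else:
            directions.append(unit)
            table.append((len(directions) - 1, norm))
    return directions, table


def read_block(directions, table, i):
    """Resolve block i through the table: shared direction times its own norm."""
    c, norm = table[i]
    return [norm * x for x in directions[c]]


blocks = [[1.0, 0.0], [2.0, 0.02], [0.0, 3.0]]
directions, table = fuse_blocks(blocks, u=0.99)
print(len(directions), [c for c, _ in table])
print(read_block(directions, table, 1))
```

In this toy input the first two blocks point in nearly the same direction and share one stored direction, while each keeps its own norm. Only the table entry changes for a fused block, which mirrors the pointer-redirection idea described above.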

The compression/quality tradeoff is analyzed via a Gaussian-kernel model of similarity scores and Poisson rate estimation:

A higher fusion threshold $u$ lowers distortion but reduces the compression ratio.
The attention drift is bounded by $\|s' - s\|_1 \le 2\epsilon + O(\epsilon^2)$ , with $\epsilon = \max_t \|k_t\|\|q\|/\sqrt{d} \cdot \sqrt{2(1-u)}$ .
Empirical results on Llama3.1-8B and Qwen2.5-72B show up to $4.38\times$ cache compression with minimal accuracy loss and improved throughput under high concurrency (Kampeas et al., 6 Jan 2026).
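The drift bound can be checked numerically on a toy case. The sketch below assumes that fusion changes only key directions (norms are kept) and that every fused direction has cosine similarity at least $u$ with the original, so each attention logit moves by at most $\epsilon$. Under that assumption the softmax drift is at most $e^{2\epsilon} - 1$, which expands to $2\epsilon + O(\epsilon^2)$; this closed form is a standard softmax perturbation argument supplied here for illustration, not a statement taken from the cited paper. All numeric values are arbitrary.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def drift_epsilon(key_norms, q_norm, d, u):
    """epsilon = max_t ||k_t|| * ||q|| / sqrt(d) * sqrt(2 * (1 - u))."""
    return max(key_norms) * q_norm / math.sqrt(d) * math.sqrt(2.0 * (1.0 - u))


def attention_weights(q, key_norms, key_angles, d):
    """Softmax attention of a 2-D query over keys given as (norm, angle)."""
    logits = [n * (q[0] * math.cos(a) + q[1] * math.sin(a)) / math.sqrt(d)
              for n, a in zip(key_norms, key_angles)]
    return softmax(logits)


u, d = 0.99, 2
q = (0.6, 0.8)  # unit-norm query
key_norms = [1.0, 1.5, 0.5, 2.0]
key_angles = [0.1, 0.9, 1.7, 2.5]
# Largest rotation the threshold allows: cosine similarity exactly u.
shift = math.acos(u)
fused_angles = [a + (shift if i % 2 == 0 else -shift)
                for i, a in enumerate(key_angles)]

s = attention_weights(q, key_norms, key_angles, d)
s_fused = attention_weights(q, key_norms, fused_angles, d)
drift = sum(abs(a - b) for a, b in zip(s, s_fused))

eps = drift_epsilon(key_norms, math.hypot(*q), d, u)
print(drift, math.exp(2 * eps) - 1)  # observed drift, then the bound
```

Raising $u$ toward 1 shrinks $\sqrt{2(1-u)}$ and therefore $\epsilon$, which is the distortion side of the tradeoff listed above.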

3. Adaptive Token-Level Fusion

Adaptive merging within a sequence (KVMerger) addresses intra-sequence redundancy by greedily clustering adjacent tokens (within each layer) whose key vectors exhibit pairwise cosine similarity above a threshold. For each "merging set," a Gaussian-kernel weighting scheme computes a fused representative, pivoted at the token with maximal attention, as a weighted combination of the states in the set, with merge weights given by the kernel. This method enforces data-independent, persistent sparsity and achieves compression by discarding non-pivotal tokens. It delivers substantial memory savings (down to 35% of original cache) with negligible accuracy loss, outperforming token eviction-based techniques, especially on long contexts (Wang et al., 2024).
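A minimal sketch of this kind of merging is given below. The exact kernel, bandwidth, and normalisation are assumptions: the code uses $g_i = \exp(-\|k_p - k_i\|^2 / (2\sigma^2))$ around the pivot state $k_p$ and normalises the weights to sum to 1. The greedy rule (a token joins the current set only if it is similar to every member) and the function names are likewise hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merging_sets(keys, threshold):
    """Group consecutive token indices, left to right, while each new key
    stays above the similarity threshold with every key in the current set."""
    if not keys:
        return []
    sets, current = [], [0]
    for i in range(1, len(keys)):
        if all(cosine(keys[i], keys[j]) >= threshold for j in current):
            current.append(i)
        else:
            sets.append(current)
            current = [i]
    sets.append(current)
    return sets


def gaussian_merge(states, pivot, sigma=1.0):
    """Fuse one merging set with Gaussian-kernel weights centred on the pivot."""
    p = states[pivot]
    g = [math.exp(-sum((a - b) ** 2 for a, b in zip(p, s)) / (2 * sigma ** 2))
         for s in states]
    total = sum(g)
    weights = [x / total for x in g]
    fused = [sum(w * s[j] for w, s in zip(weights, states)) for j in range(len(p))]
    return fused, weights


keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
attention = [0.1, 0.4, 0.3, 0.2]  # hypothetical per-token attention mass
for idxs in merging_sets(keys, 0.99):
    # The pivot is the member of the set that receives the most attention.
    pivot = max(range(len(idxs)), key=lambda j: attention[idxs[j]])
    fused, weights = gaussian_merge([keys[i] for i in idxs], pivot)
    print(idxs, fused, weights)
```

Each merging set collapses to a single cached vector, so a sequence of four tokens in this toy input is stored as two entries.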

4. Layer-Wise and Cross-Layer KV-Cache Fusion

Layer-wise sharing and fusion target redundancy across layers. In KVSharer, certain layers are selected (by maximizing Euclidean dissimilarity in their mean KV vectors) to reuse the KV cache from others. This can halve or further reduce cache memory with minimal degradation, accelerating generation while keeping perplexity close to baseline on various LLMs. This method composes naturally with within-layer token pruning and quantization strategies (Yang et al., 2024).
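The selection step can be illustrated with a short sketch. It assumes one mean KV vector per layer is already available, ranks layer pairs by Euclidean distance (most dissimilar first), and lets the later layer of each accepted pair reuse the earlier layer's cache. The no-chaining rule, the function name `sharing_map`, and the toy vectors are assumptions; any quality check performed by the actual method is omitted.

```python
import math
from itertools import combinations


def sharing_map(layer_means, num_shared):
    """Return {target_layer: source_layer} for up to num_shared reused caches."""
    pairs = sorted(
        combinations(range(len(layer_means)), 2),
        key=lambda p: math.dist(layer_means[p[0]], layer_means[p[1]]),
        reverse=True,  # most dissimilar pairs first
    )
    share = {}
    for src, dst in pairs:
        if len(share) == num_shared:
            break
        # No chains: a target may not already be a target or a source,
        # and a source must still hold its own cache.
        if dst in share or src in share or dst in share.values():
            continue
        share[dst] = src
    return share


# Hypothetical mean KV vectors for four layers.
layer_means = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [0.5, 0.0]]
print(sharing_map(layer_means, num_shared=2))
```

With two of four layers reusing another layer's cache, only half of the per-layer caches are stored, matching the halving described above.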

Cross-layer fusion with data-driven weights (FusedKV) further refines this paradigm by constructing the top-layer KV cache as a learnable, feature-wise linear combination of post-RoPE keys from the bottom- and mid-level layers. Empirical analysis shows keys favor mid-levels while values are best reconstructed from early layers. FusedKV and its I/O-optimized variant (FusedKV-Lite, which reuses only one source per K/V) reduce cache memory and can slightly improve perplexity over the baseline Transformer; throughput penalties are minor or amortized on memory-bound pipelines (Lin et al., 3 Dec 2025).
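A feature-wise linear combination of this kind can be sketched for a single token as below. The weights would be learned during training; here they are fixed, made-up numbers. The function names, the weight dictionary layout, and the choice of sources in the Lite variant (keys from a middle layer, values from a bottom layer, following the empirical observation above) are assumptions for illustration.

```python
def fuse_features(bottom, mid, w_bottom, w_mid):
    """Per-feature mix of two cached vectors: out[j] = a[j]*bottom[j] + b[j]*mid[j]."""
    return [a * x + b * y for a, b, x, y in zip(w_bottom, w_mid, bottom, mid)]


def fused_top_layer(k_bottom, k_mid, v_bottom, v_mid, weights):
    """Build a top-layer (key, value) pair from bottom- and mid-layer caches."""
    key = fuse_features(k_bottom, k_mid, weights["k_bottom"], weights["k_mid"])
    value = fuse_features(v_bottom, v_mid, weights["v_bottom"], weights["v_mid"])
    return key, value


def lite_top_layer(k_mid, v_bottom):
    """Single-source variant: no mixing, so only one tensor is read per K/V."""
    return list(k_mid), list(v_bottom)


# Hypothetical fixed weights standing in for learned parameters.
weights = {
    "k_bottom": [0.5, 0.0], "k_mid": [0.5, 1.0],
    "v_bottom": [1.0, 0.8], "v_mid": [0.0, 0.2],
}
print(fused_top_layer([1.0, 2.0], [3.0, 4.0], [5.0, 5.0], [1.0, 0.0], weights))
print(lite_top_layer([3.0, 4.0], [5.0, 5.0]))
```

Because the top layer's entries are derived from caches that already exist, they do not need to be stored separately.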

5. Dimension-Reduction and Low-Rank Fusion

Low-rank projection approaches (Palu) compress the KV-cache along the hidden dimension by decomposing the key and value projection matrices via SVD (joint, group, or per-head) and caching only the down-projected latents per token. Keys and values are reconstructed on demand for attention scoring. Rank allocation is guided by Fisher information, and Walsh-Hadamard transforms mitigate quantization outliers. This method compresses the cache and speeds up end-to-end inference while keeping perplexity near baseline, especially when paired with moderate head grouping (Chang et al., 2024).
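The caching pattern can be shown with a small sketch. It assumes the projection matrix has already been factored as $W_k = A B$ with a small inner rank; the factors here are hand-picked so the product is exact, whereas factors from a truncated SVD would make the reconstruction approximate. The helper names and all numbers are hypothetical.

```python
def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n rows)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def mat_mat(A, B):
    return [vec_mat(row, B) for row in A]


# Hypothetical rank-2 factors of a 3x4 key projection, W_k = A @ B.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # down-projection, 3 -> 2
B = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # up-projection, 2 -> 4
W_k = mat_mat(A, B)

x = [1.0, 2.0, 3.0]     # hidden state of one token
latent = vec_mat(x, A)  # what gets cached: 2 numbers instead of 4
key = vec_mat(latent, B)  # reconstructed on demand at attention time

print(latent)
print(key, vec_mat(x, W_k))  # reconstruction vs. the full projection
```

The per-token cache shrinks from the output width of $W_k$ to the chosen rank, at the cost of one extra small matrix product when the key is needed.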

6. Fusion for Retrieval-Augmented Generation

Fusion strategies for RAG systems address the subtleties of cache sharing with retrieved document chunks.

Naive chunk-level caching misses all cross-chunk attention, producing a substantial drop in output quality.
Prefix–Scale–Recompute (PSR) [Editor's term] fuses prior methods: (i) absorbs chunk head sinks during prefill, (ii) rescales softmax temperature and attention to cached blocks, and (iii) performs selective recomputation of the highest-scoring tokens. This tripartite fusion closes most of the accuracy gap to full prefill at a fraction of the compute (Cestola et al., 3 Mar 2026).
FusionRAG optimizes offline chunk fusion (embedding cross-chunk context into each pre-cached chunk via top-$k$ neighbor fusion) and online, question-guided recomputation of the most critical tokens. With only a small share of tokens recomputed, it recovers near-oracle generation quality and reduces time to first token (TTFT) (Wang et al., 19 Jan 2026).
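Selective recomputation, common to both approaches, can be sketched as a budgeted top-scoring selection followed by a partial cache refresh. The scoring signal is left abstract because the source does not specify it here; `select_for_recompute`, `refresh_cache`, the `fraction` parameter, and the placeholder cache entries are all hypothetical.

```python
import math


def select_for_recompute(scores, fraction):
    """Indices of the highest-scoring tokens, limited to a fraction of the chunk."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    budget = math.ceil(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def refresh_cache(cached_kv, recompute_fn, indices):
    """Return a copy of the cache with only the selected positions recomputed."""
    fresh = list(cached_kv)
    for i in indices:
        fresh[i] = recompute_fn(i)
    return fresh


scores = [0.1, 0.9, 0.3, 0.8]              # hypothetical per-token importance
cached = ["kv0", "kv1", "kv2", "kv3"]      # stand-ins for pre-cached entries
chosen = select_for_recompute(scores, fraction=0.5)
print(chosen)
print(refresh_cache(cached, lambda i: f"kv{i}*", chosen))
```

All other positions keep their pre-computed entries, which is where the compute saving relative to full prefill comes from.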

7. Implementation, Integration, and Performance

A recurrent theme is the avoidance of kernel rewrites and preservation of standard dense tensor layouts. Fused blocks are integrated via indirection tables or pointer manipulation, keeping lookup and update costs low and fully compatible with typical GPU-accelerated attention kernels. Overheads from merging/fusion steps are amortized, typically remaining a small fraction of decode time for batch sizes in realistic server settings (Kampeas et al., 6 Jan 2026). Adaptive thresholding allows a dynamic tradeoff between memory and quality, and most fusion schemes are compatible with quantization or token-level sparsity.

Empirically, across methods and benchmarks, KV-Cache Fusion delivers cache compression of up to $4.38\times$ and throughput gains of roughly $2\times$ to $4\times$, with small accuracy or perplexity penalties. In RAG-specific tasks, quality recovery can approach full-attention performance at one-tenth of the compute cost (Kampeas et al., 6 Jan 2026, Yang et al., 2024, Wang et al., 2024, Lin et al., 3 Dec 2025, Cestola et al., 3 Mar 2026, Wang et al., 19 Jan 2026).

Representative KV-Cache Fusion techniques and properties

Technique	Fusion Scope	Typical Compression	Accuracy Impact
Joint block encoding	Blocks across requests/chunks	Up to $4.38\times$	Minimal accuracy loss
Adaptive merging	Tokens within sequence	Down to 35% of original cache	Negligible
Layer-wise sharing	Whole Transformer layers	Half of original cache or less	Minimal PPL degradation
Low-rank projection	Hidden dim (per head/group)	Set by retained rank	Near-baseline perplexity
RAG (FusionRAG/PSR)	Cross-chunk fusion, selective recompute	TTFT reduction	Small gap vs full prefill

References

"Joint Encoding of KV-Cache Blocks for Scalable LLM Serving" (Kampeas et al., 6 Jan 2026)
"KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing" (Yang et al., 2024)
"Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks" (Wang et al., 2024)
"Palu: Compressing KV-Cache with Low-Rank Projection" (Chang et al., 2024)
"Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers" (Lin et al., 3 Dec 2025)
"An experimental study of KV cache reuse strategies in chunk-level caching systems" (Cestola et al., 3 Mar 2026)
"From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation" (Wang et al., 19 Jan 2026)