
FusedKV: Cross-Layer Transformer Fusion

Updated 10 December 2025
  • FusedKV is a memory-efficient cross-layer key–value cache reconstruction mechanism that fuses bottom- and middle-layer caches using learnable gates to reduce memory by 50%.
  • It preserves relative positional encoding (RoPE) through a per-2D symmetry constraint, ensuring accurate attention without positional drift.
  • Integration with fused kernels enables efficient Transformer decoding with improved perplexity and task accuracy, balancing memory savings and compute overhead.

FusedKV is a memory-efficient cross-layer key–value (KV) cache reconstruction mechanism for Transformer decoders that performs learnable fusion of bottom- and middle-layer caches to halve memory requirements while maintaining or improving predictive accuracy. Unlike earlier cross-layer KV sharing approaches (such as YOCO and CLA), which replace the top layers' caches with those of shallower layers at some cost in accuracy, FusedKV reconstructs the upper layers' KV caches on the fly from the most informative source layers using a small set of learnable gates. This design preserves the relative positional encoding (RoPE) structure throughout and is compatible with hardware-efficient kernel fusion.

1. Core Formulation and Mathematical Principles

In a standard decoder with $L$ layers, FusedKV partitions layers at $n = L/2$ (for even $L$):

  • Storage layers $\mathcal{L}_S = \{1,\dots,n\}$: store and maintain native KV caches $K^1,\dots,K^n$ and $V^1,\dots,V^n$.
  • Reconstruction layers $\mathcal{L}_R = \{n+1,\dots,L\}$: do not store native caches but reconstruct them at each decoding step.

Let $K^j, V^j \in \mathbb{R}^{S\times d}$ denote the key and value caches of storage layer $j$, where $S$ is the prompt length and $d$ is the head dimension. For each reconstruction layer $i > n$:

$$\boxed{\,K^i = W_{i,1} \odot K^1 + W_{i,n} \odot K^n,\qquad V^i = U_{i,1} \odot V^1 + U_{i,n} \odot V^n\,}$$

where $W_{i,1}, W_{i,n}, U_{i,1}, U_{i,n} \in \mathbb{R}^{d}$ are learnable, broadcast, feature-wise gates (with per-2D symmetry for RoPE), and $\odot$ denotes element-wise multiplication over all $S\times d$ entries. Reconstruction operates entirely on already-rotated keys and values, with no additional key/value projections for the reconstruction layers. A symmetry constraint $W_{i,\ast}[2j] = W_{i,\ast}[2j+1]$ maintains RoPE compatibility (Lin et al., 3 Dec 2025).
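As a concrete illustration, the fusion rule above can be sketched in NumPy. This is a minimal sketch with toy shapes; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def reconstruct_kv(K1, Kn, V1, Vn, Wk1, Wkn, Wv1, Wvn):
    """Reconstruct one upper layer's KV cache from the two stored
    source caches via feature-wise learnable gates.

    K1, Kn, V1, Vn : (S, d) post-RoPE caches of layers 1 and n.
    Wk1, Wkn, Wv1, Wvn : (d,) gate vectors, broadcast over positions.
    """
    K_i = Wk1 * K1 + Wkn * Kn   # element-wise, broadcast over the S rows
    V_i = Wv1 * V1 + Wvn * Vn
    return K_i, V_i

# Toy shapes: prompt length S=4, head dim d=8.
rng = np.random.default_rng(0)
S, d = 4, 8
K1, Kn, V1, Vn = (rng.standard_normal((S, d)) for _ in range(4))
# Per-2D symmetric key gates: each of the d/2 free parameters is repeated twice.
Wk1 = np.repeat(rng.standard_normal(d // 2), 2)
Wkn = np.repeat(rng.standard_normal(d // 2), 2)
Wv1, Wvn = rng.standard_normal(d), rng.standard_normal(d)

K_i, V_i = reconstruct_kv(K1, Kn, V1, Vn, Wk1, Wkn, Wv1, Wvn)
```

Note that the gates are one vector per (layer, source) pair, so the parameter overhead is negligible compared to the halved cache.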

2. RoPE-Preserving Cross-Layer Fusion

Transformer decoders with RoPE encode each key $k \in \mathbb{R}^d$ at position $t$ as $\widetilde{k} = \mathbf{R}_t k$. FusedKV stores only the post-RoPE caches $\widetilde{K}^1, \widetilde{K}^n$ and fuses directly in this rotated space:

$$\widetilde{K}^i = W_{i,1} \odot \widetilde{K}^1 + W_{i,n} \odot \widetilde{K}^n$$

Crucially, the per-2D symmetry of the gates ensures that the attention logits $\widetilde{q}_m^{\top} \widetilde{K}^i_n$ for query position $m$ and key position $n$ remain functions only of $(m-n)$, preserving relative position and avoiding the positional drift that afflicts naive fusions. No recomputation or reapplication of RoPE is needed during reconstruction, further reducing compute overhead.
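The symmetry argument can be checked numerically: a gate whose entries are equal within each 2D pair commutes with the RoPE rotation, while an unconstrained gate does not. Below is a toy NumPy check; `rope` is a generic RoPE implementation written for this illustration, not the paper's code:

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Apply the RoPE rotation R_t to a (d,) vector: rotate each 2D pair
    (x[2j], x[2j+1]) by angle t * base**(-2j/d)."""
    d = x.shape[0]
    j = np.arange(d // 2)
    theta = t * base ** (-2.0 * j / d)
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

rng = np.random.default_rng(1)
d, t = 8, 5
k = rng.standard_normal(d)
W_sym = np.repeat(rng.standard_normal(d // 2), 2)   # per-2D symmetric gate
W_bad = rng.standard_normal(d)                      # unconstrained gate

# A symmetric gate commutes with the rotation: gating post-RoPE keys equals
# rotating pre-gated keys, so the relative-position structure survives.
lhs = W_sym * rope(k, t)
rhs = rope(W_sym * k, t)
```

Because the gate scales each 2D block by a single scalar, it commutes with the block rotation; an unconstrained gate mixes the sine and cosine terms and breaks the $(m-n)$ dependence.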

3. Decoding Architecture and Implementation

The FusedKV procedure is instantiated as follows:

  • For $i = 1,\dots,L$, each new token triggers a query projection at every layer.
  • Storage layers ($i \leq n$): perform the usual key and value projections and cache the results.
  • Reconstruction layers ($i > n$): reconstruct keys and values for all positions from the stored caches of layers 1 and $n$, applying the per-layer learned gates.
  • During attention, the reconstructed $K^i$ and $V^i$ are used exactly as in a standard decoder.
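The decoding steps above can be sketched as follows. This uses toy shapes and hypothetical gate names; a real implementation would operate per attention head and fuse these loops into kernels:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 4, 8                      # toy config: 4 layers, head dim 8
n = L // 2                       # storage / reconstruction split
cache_K = {j: np.empty((0, d)) for j in range(1, n + 1)}
cache_V = {j: np.empty((0, d)) for j in range(1, n + 1)}
# Hypothetical learned gates for each reconstruction layer i > n.
gates = {i: {g: np.repeat(rng.standard_normal(d // 2), 2)
             for g in ("Wk1", "Wkn", "Wv1", "Wvn")}
         for i in range(n + 1, L + 1)}

def decode_step(new_k, new_v):
    """One decoding step: storage layers (1..n) append their fresh
    post-RoPE key/value rows; reconstruction layers (n+1..L) rebuild
    their full caches from layers 1 and n via the learned gates."""
    for j in range(1, n + 1):
        cache_K[j] = np.vstack([cache_K[j], new_k[j][None]])
        cache_V[j] = np.vstack([cache_V[j], new_v[j][None]])
    recon = {}
    for i in range(n + 1, L + 1):
        g = gates[i]
        recon[i] = (g["Wk1"] * cache_K[1] + g["Wkn"] * cache_K[n],
                    g["Wv1"] * cache_V[1] + g["Wvn"] * cache_V[n])
    return recon                  # layer i's attention consumes recon[i]

for _ in range(3):                # decode three tokens
    new_k = {j: rng.standard_normal(d) for j in range(1, n + 1)}
    new_v = {j: rng.standard_normal(d) for j in range(1, n + 1)}
    recon = decode_step(new_k, new_v)
```

The key point is that layers $n{+}1,\dots,L$ never write to the cache; they only read the two stored source caches.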

The approach can be implemented efficiently with kernel fusion: for every token and reconstruction layer, a fused Triton or CUDA kernel reads $\widetilde{K}^1$ and $\widetilde{K}^n$, applies the two gates, sums, and writes the result in a single pass, maximizing shared-memory bandwidth and minimizing extra memory I/O. Gradients flow through the fusion gates at every step, and the gates are learned end-to-end with the rest of the model.

4. Complexity and Memory Trade-offs

The following table summarizes memory and I/O requirements:

| Method | Cache Memory | Cache I/O per Token |
|---|---|---|
| Vanilla (MHA) | $2NSH_k d$ | $2NSH_k d$ |
| CLA / YOCO | $\frac{1}{2}NSH_k d$ | $NSH_k d$ |
| GQA | $2NSH_k d$ | $2NSH_k d$ |
| FusedKV-Lite | $\frac{1}{2}NSH_k d$ | $NSH_k d$ |
| FusedKV | $\frac{1}{2}NSH_k d$ | $3NSH_k d$ |

Both FusedKV and FusedKV-Lite cut KV memory by exactly $50\%$ versus vanilla, storing caches only for the bottom half of the layers. FusedKV incurs extra cache I/O ($3NSH_k d$ versus vanilla's $2NSH_k d$), since every reconstructed key and value is read from two source caches; FusedKV-Lite matches CLA/YOCO in I/O and requires no fusion compute at inference. The extra traffic is usually offset by bottlenecks elsewhere and can be handled efficiently in fused-kernel implementations.
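For a sense of scale, the table's memory formulas can be evaluated for an illustrative configuration. The numbers below are hypothetical, not taken from the paper:

```python
# Cache-memory comparison from the table above, for an illustrative
# configuration: N=32 layers, S=4096 cached tokens, H_k=8 KV heads,
# d=128 head dim, fp16 (2 bytes/element), counting both K and V.
N, S, H_k, d, bytes_el = 32, 4096, 8, 128, 2

vanilla = 2 * N * S * H_k * d * bytes_el   # K and V at every layer
fusedkv = vanilla // 2                     # only layers 1..N/2 are stored

print(f"vanilla : {vanilla / 2**30:.2f} GiB")   # 0.50 GiB
print(f"FusedKV : {fusedkv / 2**30:.2f} GiB")   # 0.25 GiB
```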

5. Empirical Results and Ablation Studies

Across model sizes (332M to 4B parameters) and datasets (FineWeb-Edu, WikiText), FusedKV achieves the following:

  • KV memory reduced by exactly $50\%$.
  • Validation perplexity generally improves:

| Model Size | Vanilla | FusedKV | FusedKV-Lite |
|---|---|---|---|
| 332M | 22.85 | 22.35 | 22.78 |
| 650M | 18.47 | 18.09 | 18.55 |
| 1.5B | 13.67 | 13.33 | 13.45 |
| 4B | 9.18 | 8.94 | N/A |

  • Downstream five-shot accuracy (e.g., MMLU, HellaSwag, ARC) consistently meets or exceeds vanilla.
  • Ablations on FusedKV-Lite's source layers confirm the optimal asymmetry: values from layer 1 and keys from layer $n$; reversing or choosing intermediate sources degrades outcomes.
  • Adding learnable gates to FusedKV-Lite (“FusedKV-Lite-Learnable”) further increases accuracy over simple fixed re-use.

6. FusedKV-Lite: Lightweight Variant

FusedKV-Lite removes all fusion computation at inference. For each reconstruction layer $i > n$:

$$K^i = K^n, \qquad V^i = V^1$$

This reduces KV memory by $50\%$ and actually lowers cache I/O relative to vanilla Transformers (see the table above). In practice, FusedKV-Lite yields only a modest perplexity increase (e.g., $13.33 \rightarrow 13.45$ at 1.5B parameters) while preserving end-to-end throughput. It is best suited to I/O-bound deployments where minimal runtime compute is paramount.
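Because no gates are involved, FusedKV-Lite reconstruction can be a zero-copy alias in an implementation sketch (illustrative names, not the paper's code):

```python
import numpy as np

# FusedKV-Lite needs no arithmetic at all: a reconstruction layer's
# cache is simply an alias of a stored one (keys from layer n,
# values from layer 1), so "reconstruction" is a zero-copy view.
S, d = 16, 64
K_n = np.random.default_rng(3).standard_normal((S, d))
V_1 = np.random.default_rng(4).standard_normal((S, d))

def lite_kv(K_n, V_1):
    """Return the (K^i, V^i) pair for any reconstruction layer i > n."""
    return K_n, V_1            # no copy, no gates, no extra I/O

K_i, V_i = lite_kv(K_n, V_1)
```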

7. Practical Integration and Implementation Guidance

To integrate FusedKV into a Transformer decoder:

  • Introduce a “KV-reconstruction” hook before attention in every reconstruction layer.
  • Bypass the layer’s own native key/value projections.
  • At each inference step, fetch stored post-RoPE caches from layers 1 and nn.
  • Apply per-layer, broadcasted gates using a fused GEMM-style kernel.
  • Enforce the 2-D symmetry constraint on the fusion parameters to avoid RoPE corruption, e.g., by duplicating each entry of a $d/2$-vector or by directly hard-tying parameters.
  • Tune kernel block sizes to optimize memory access patterns and maximize GPU utilization.
  • Ensure gradient flows are correctly propagated through all attention paths to the gating vectors.
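Two simple ways to realize the symmetry constraint are sketched below (hypothetical parameterizations, using NumPy for illustration; a training implementation would do the same with framework tensors so gradients reach the free parameters):

```python
import numpy as np

d = 8
rng = np.random.default_rng(5)

# Option 1: store only d/2 free parameters and expand on the fly.
theta = rng.standard_normal(d // 2)     # the d/2 free parameters
W_expand = np.repeat(theta, 2)          # [t0, t0, t1, t1, ...]

# Option 2: keep a full d-vector but hard-tie odd entries to even ones.
W_raw = rng.standard_normal(d)
W_tied = np.repeat(W_raw[0::2], 2)      # odd entries overwritten by even
```

Option 1 is the cleaner choice for training, since the constraint holds by construction and gradients accumulate onto the $d/2$ free parameters automatically.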

Potential pitfalls include failing to maintain weight symmetry (breaking RoPE), introducing spurious gradient accumulation, and suboptimal kernel tiling leading to memory misalignment. Adhering to the recommended fused-kernel implementation with careful parameterization ensures correctness and maximal performance (Lin et al., 3 Dec 2025).


FusedKV represents a principled, general-purpose cross-layer KV fusion strategy for Transformer inference, offering precise RoPE preservation, 50% reduction in KV cache memory overhead, and, in most setups, improved perplexity and task accuracy with minimal impact on runtime throughput. Its lightweight FusedKV-Lite variant minimizes all runtime compute overheads while maintaining comparable predictive performance, making it particularly attractive for large-scale, I/O-bound inference scenarios.
