FusedKV-Lite: Memory-Efficient Transformer Decoding
- The paper introduces FusedKV-Lite, a novel Transformer decoder modification that halves KV cache memory while slightly improving validation loss and perplexity.
- It partitions layers into storage and reconstruction regimes, reusing middle and bottom layer caches to eliminate redundant computations and rotary re-application.
- Empirical results show a 50% reduction in cache memory, halved prefill latency, and unchanged I/O volume, making it well suited to long-context LLM inference.
FusedKV-Lite is a memory- and compute-efficient architectural modification for Transformer decoders that significantly reduces the memory footprint of key-value (KV) caches without sacrificing model quality or decoding throughput. By reconceptualizing cache storage and reuse across layers, FusedKV-Lite achieves a 50% reduction in cache memory, maintains I/O parity with vanilla multi-layer decoders, and in empirical evaluations even slightly improves validation loss and perplexity over standard approaches. This method is particularly relevant for LLMs operating at extended sequence lengths, where KV cache memory and latency present substantial scalability challenges (Lin et al., 3 Dec 2025).
1. Architectural Principles and Design
FusedKV-Lite introduces a cross-layer KV cache reuse scheme that partitions Transformer layers into two regimes: storage layers and reconstruction layers. Denote $L$ as the total number of decoder layers, $L/2$ as the middle layer index, and $\ell = 1$ as the bottom layer. The distinguishing features are:
- Storage Layers ($\ell \le L/2$): These layers compute and retain their own distinct KV caches after rotary positional embedding (RoPE).
- Reconstruction Layers ($\ell > L/2$): These layers cease to compute or store new KV caches. Instead:
  - The key cache at layer $\ell$ reuses $K_{L/2}$ from the middle layer $L/2$.
  - The value cache at layer $\ell$ reuses $V_1$ from the bottom layer $1$.
- RoPE Handling: Storing KV caches post-RoPE obviates the need for rotary re-application during reuse, avoiding computational and I/O overhead.
This layerwise fusion leverages empirical findings that, for top decoder layers, values are primarily influenced by the bottom layer, while keys aggregate information dominated by the middle layers (Lin et al., 3 Dec 2025).
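The layer-to-source assignment can be expressed as a simple mapping. The sketch below is illustrative only: the function name, 1-based indexing, and the exact split point are assumptions based on the description above, not the paper's API.

```python
def kv_source_layers(num_layers: int) -> dict:
    """Map each decoder layer to the layers whose (key, value) caches it reads."""
    middle = num_layers // 2   # topmost storage layer; source of reused keys
    bottom = 1                 # bottom layer; source of reused values
    mapping = {}
    for layer in range(1, num_layers + 1):
        if layer <= middle:
            mapping[layer] = (layer, layer)    # storage layer: its own K and V
        else:
            mapping[layer] = (middle, bottom)  # reconstruction layer: shared K and V
    return mapping

# Example: with 12 layers, layers 1-6 keep their own caches,
# while layers 7-12 read keys from layer 6 and values from layer 1.
print(kv_source_layers(12)[9])   # -> (6, 1)
```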
2. Mathematical Formulation
Let $S$ be the sequence length and $d$ the head dimension. In vanilla Transformer decoders, each layer $\ell \in \{1, \dots, L\}$ maintains its own caches

$$K_\ell, V_\ell \in \mathbb{R}^{S \times d},$$

for a total of $2LSd$ cached entries. Under FusedKV-Lite, for all $\ell > L/2$:

$$K_\ell = K_{L/2}, \qquad V_\ell = V_1.$$

More generally, a reconstruction function in FusedKV-Lite selects a source layer for each cache,

$$K_\ell \leftarrow K_{\sigma_K(\ell)}, \qquad V_\ell \leftarrow V_{\sigma_V(\ell)}, \qquad \text{with } \sigma_K(\ell) = L/2, \; \sigma_V(\ell) = 1 \text{ for } \ell > L/2.$$

This deterministic source selection is guided by measured information propagation characteristics within the Transformer stack (Lin et al., 3 Dec 2025).
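As a quick check of the memory accounting implied by these formulas, the following sketch compares cached entries under the two layouts; the dimension values are illustrative assumptions, and only the 2x ratio matters.

```python
# Cache-entry accounting for vanilla vs. FusedKV-Lite (illustrative sizes).
L, S, d = 32, 8192, 128                 # layers, sequence length, head dimension

vanilla_entries = 2 * L * S * d         # every layer stores its own K and V
lite_entries = 2 * (L // 2) * S * d     # only the L/2 storage layers keep caches

print(lite_entries / vanilla_entries)   # 0.5 -> 50% KV-cache memory reduction
```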
3. Decoding and Implementation Details
The decoding workflow accommodates FusedKV-Lite with minimal deviation from standard autoregressive loops:
- At prefill, initialize $K_\ell$ and $V_\ell$ as empty buffers for every storage layer $\ell \le L/2$.
- At each generation step $t$:
  - For all $\ell \le L/2$, compute and append fresh $k_t^{(\ell)}$, $v_t^{(\ell)}$ (post-RoPE) to the respective buffers.
  - For all $\ell > L/2$, retrieve keys from $K_{L/2}$ and values from $V_1$, skipping KV computation.
- Any subsequent layer operations (MLP, residuals, etc.) consume these shared KV cache entries.
Rotary reapplication is entirely eliminated for reconstruction layers. Each decoding step still loads exactly two cache buffers per attention layer (one key, one value), preserving the I/O scheme of vanilla Transformers (Lin et al., 3 Dec 2025).
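The per-layer cache access at a single decoding step can be sketched as follows. This is an illustrative reconstruction, assuming 0-based layer indices, a `[batch, seq, dim]` cache layout, and caller-supplied `k_proj`, `v_proj`, and `rope` callables; it is not the authors' reference code.

```python
import torch

def kv_for_layer(layer_idx, hidden, k_cache, v_cache, k_proj, v_proj, rope, num_layers):
    """Return the (K, V) tensors that layer `layer_idx` attends over at this step."""
    half = num_layers // 2
    if layer_idx < half:
        # Storage layer: compute fresh K/V, apply RoPE to keys, append to its own buffer.
        k_new = rope(k_proj(hidden))   # stored post-RoPE, so it is never re-rotated
        v_new = v_proj(hidden)
        k_cache[layer_idx] = torch.cat([k_cache[layer_idx], k_new], dim=1)
        v_cache[layer_idx] = torch.cat([v_cache[layer_idx], v_new], dim=1)
        return k_cache[layer_idx], v_cache[layer_idx]
    # Reconstruction layer: no KV projection, no RoPE, no new storage;
    # keys come from the middle layer, values from the bottom layer.
    return k_cache[half - 1], v_cache[0]
```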
4. Quantitative Performance Analysis
FusedKV-Lite delivers strong empirical results across multiple metrics, as summarized below:
| Metric | Vanilla | FusedKV-Lite | FusedKV (full) |
|---|---|---|---|
| KV Cache Memory | $2LSd$ | $LSd$ | $LSd$ (plus extras for fusion) |
| Valid Loss (1.5B) | 2.241 | 2.225 (↓0.016) | 2.221 (↓0.020) |
| WikiText PPL | 13.67 | 13.45 (↓0.22) | 13.33 (↓0.34) |
| I/O Volume | $2LSd$ | $2LSd$ (identical) | $3LSd$ (+33%) |
| TTFT (Prefill) | Baseline | 0.5× (↓50%) | 0.5× |
| TPOT | Baseline | Baseline | ~1.5× baseline (memory-bound) |
- Memory: FusedKV-Lite achieves a 50% reduction in KV-cache memory, storing only $2nSd = LSd$ entries, where $n = L/2$ is the number of storage layers, versus $2LSd$ for the standard approach.
- Validation Loss and Perplexity: Slightly better or comparable to baseline, with minimal or no accuracy trade-off.
- I/O Overhead: Maintains exactly the same I/O volume as the standard decoder.
- Latency and Throughput: Prefill latency is halved at long sequence lengths; FusedKV-Lite matches baseline decode throughput, whereas full FusedKV's larger I/O volume raises per-token latency by roughly 1.5× in memory-bound regimes (Lin et al., 3 Dec 2025), as the rough model below illustrates.
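A back-of-envelope model reproduces the I/O and TPOT rows of the table, assuming only that memory-bound per-token latency scales with the KV bytes read per step; the dimension and precision values are illustrative.

```python
# KV bytes read per decoding step under each scheme (fp16 caches assumed).
L, S, d, bytes_per_elem = 32, 8192, 128, 2

io_vanilla = 2 * L * S * d * bytes_per_elem   # read K and V at every layer
io_lite = io_vanilla                          # same reads, aliased to shared buffers
io_full = 3 * L * S * d * bytes_per_elem      # full FusedKV: 3LSd per the table (+33%)

print(io_lite / io_vanilla)   # 1.0 -> TPOT unchanged for FusedKV-Lite
print(io_full / io_vanilla)   # 1.5 -> ~1.5x per-token latency when memory-bound
```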
5. Comparison to Related Fusion Techniques and Frameworks
FusedKV-Lite is part of a family of cross-layer KV sharing strategies (notably YOCO, CLA) that attempt to suppress cache growth, but it markedly outperforms prior cross-layer methods in both memory efficiency and perplexity. Unlike within-layer approaches (e.g., GQA), FusedKV-Lite's simplicity allows for easy integration without additional learnable parameters or fusion logic.
Further, recent systems work, such as ClusterFusion (Luo et al., 26 Aug 2025), pursues a complementary direction, fusing QKV projection, attention, and output projection into a single on-chip kernel. ClusterFusion employs two collective primitives, ClusterGather and ClusterReduce, to keep all intermediate results on-chip, further reducing global DRAM traffic by 60–80% compared to standard decoding flows. However, FusedKV-Lite remains orthogonal as a cache management scheme, while ClusterFusion addresses kernel fusion and data movement.
A plausible implication is that integrating FusedKV-Lite with advanced operator fusion techniques could multiplicatively compound memory and latency reductions in high-throughput, long-context LLM inference (Luo et al., 26 Aug 2025).
6. Practical Trade-Offs and Deployment Recommendations
Key findings and recommendations for practitioners are:
- I/O Behavior: For I/O-bound settings, or when minimal engineering changes are desired, FusedKV-Lite should be preferred (reuse keys from layer $L/2$ and values from layer $1$).
- Memory vs. Accuracy: FusedKV-Lite offers halved cache memory with negligible or slightly improved perplexity. Full FusedKV using learnable fusion can reduce perplexity further at the cost of 33% more I/O.
- Empirical Basis: Ablations confirm that optimal reuse is achieved by taking values from the bottom layer ($\ell = 1$) and keys from the middle layer ($\ell = L/2$).
- Integration: The pseudocode and methodology permit drop-in adoption in existing Transformer inference stacks, with no alteration to on-chip I/O or computational characteristics in the common case (see the cache-view sketch following this list).
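To illustrate what drop-in adoption can look like, here is a hypothetical cache-view wrapper (class and method names are assumptions, not from the paper): storage layers own real buffers, while reconstruction layers are served views of the middle-layer key buffer and the bottom-layer value buffer, so an attention loop that already calls `cache.keys(layer)` / `cache.values(layer)` needs no change.

```python
class SharedKVCache:
    """Sketch of a FusedKV-Lite-style cache view (0-based layer indices)."""

    def __init__(self, num_layers: int):
        self.half = num_layers // 2
        # Only the lower-half storage layers allocate buffers (lists of tensors here).
        self.k = [[] for _ in range(self.half)]
        self.v = [[] for _ in range(self.half)]

    def _source(self, layer: int) -> tuple:
        if layer < self.half:
            return layer, layer      # storage layer: its own K and V buffers
        return self.half - 1, 0      # reconstruction layer: middle K, bottom V

    def keys(self, layer: int):
        return self.k[self._source(layer)[0]]

    def values(self, layer: int):
        return self.v[self._source(layer)[1]]

    def append(self, layer: int, k_new, v_new):
        # Only storage layers write; reconstruction layers never call this.
        assert layer < self.half
        self.k[layer].append(k_new)
        self.v[layer].append(v_new)
```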
7. Broader Significance and Research Outlook
FusedKV-Lite advances the state of memory-efficient Transformer inference, particularly in regimes where sequence lengths and model widths stress system resources. By highlighting the differentiated layerwise information contribution of keys and values, it proposes a provisioning strategy that aligns with empirical activation flow. The method also sets a useful baseline for future cache compression and multi-layer fusion schemes.
Recent operator-fused frameworks such as ClusterFusion demonstrate that further improvements are possible by extending the fusion scope beyond KV cache management to the level of computational kernels and cluster-scoped memory, yielding an additional end-to-end speedup of $1.4\times$ or more over per-operator fused baselines (Luo et al., 26 Aug 2025).
FusedKV-Lite thus marks a crucial step in the broader trajectory toward scalable LLM deployment—enabling large models to operate efficiently at long contexts without redundant cache or compute overhead, and providing a foundation for further system and architectural innovations.