YOCO: Efficient KV Caching for LLMs
- YOCO is a two-stage decoder architecture that reuses a single, global key-value cache, reducing memory usage by an order of magnitude compared to standard Transformers.
- It achieves significant speedups by using an efficient self-decoder for context encoding and a cross-decoder that leverages the precomputed cache for autoregressive generation.
- Extensions like YOCO++ and YOCO-U enhance performance through residual key-value composition and recursive depth scaling, optimizing the efficiency-quality tradeoff.
YOCO (You Only Cache Once) is a decoder-decoder architecture for LLMs that addresses the memory and computational bottlenecks associated with key-value (KV) caching in standard Transformer decoders. By restructuring the decoder into a two-stage pipeline—comprising a self-decoder for efficient context encoding, followed by a cross-decoder that reuses a single, global KV cache—YOCO achieves an order-of-magnitude reduction in memory footprint and a substantial improvement in prefill and generation throughput, without compromising language modeling performance or context length capabilities. Extensions such as YOCO++ and Universal YOCO (YOCO-U) further enhance this efficiency-quality tradeoff with residual KV composition and recursive depth scaling.
1. Architecture and Computational Flow
YOCO divides a standard $L$-layer decoder-only Transformer into two stacked modular components:
- Self-Decoder: The first $L/2$ layers, using an efficient self-attention scheme (e.g., sliding-window attention or gated retention), encode the full prompt context and produce the global key-value cache.
- Cross-Decoder: The remaining layers consume the tokens autoregressively, performing cross-attention into the global cache produced by the self-decoder.
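The model operates as follows. With $X^{0}$ denoting the input embeddings, each self-decoder layer $l \in [1, L/2]$ applies efficient self-attention (ESA) followed by a gated feed-forward block, both with pre-normalization and residual connections:

$$Y^{l} = \mathrm{ESA}(\mathrm{LN}(X^{l-1})) + X^{l-1}, \qquad X^{l} = \mathrm{SwiGLU}(\mathrm{LN}(Y^{l})) + Y^{l}$$

At the interface between the self-decoder and the cross-decoder, a single set of keys and values is constructed from the self-decoder output $X^{L/2}$:

$$\hat{K} = \mathrm{LN}(X^{L/2})\, W_K, \qquad \hat{V} = \mathrm{LN}(X^{L/2})\, W_V$$

For $l \in [L/2 + 1, L]$, each cross-decoder layer computes

$$\hat{Q}^{l} = \mathrm{LN}(X^{l-1})\, W_Q^{l}, \qquad Y^{l} = \mathrm{Attention}(\hat{Q}^{l}, \hat{K}, \hat{V}) + X^{l-1}, \qquad X^{l} = \mathrm{SwiGLU}(\mathrm{LN}(Y^{l})) + Y^{l}$$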
Causal masking is maintained throughout to preserve autoregressive dependencies.
The self-decoder employs either sliding-window attention or gated retention, both of which incur only $O(1)$ KV memory per layer with respect to sequence length, owing to their localized or recurrent state organization. The cross-decoder's cross-attention always targets the same precomputed global keys and values $(\hat{K}, \hat{V})$, so no additional per-layer cache is materialized during generation.
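The two-stage flow described above can be illustrated with a deliberately small sketch. It assumes single-head attention, identity projections, and no normalization or feed-forward blocks, and all function names are hypothetical; it is meant only to show where the single global cache is built and how every cross-decoder layer reuses it, not to reproduce the authors' implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def self_decoder_layer(xs, window):
    """Sliding-window causal self-attention with a residual connection."""
    out = []
    for t, x in enumerate(xs):
        lo = max(0, t - window + 1)
        ctx = xs[lo:t + 1]  # only the last `window` positions are visible
        y = attend(x, ctx, ctx)
        out.append([a + b for a, b in zip(x, y)])
    return out

def cross_decoder_layer(xs, global_k, global_v):
    """Causal cross-attention into the shared cache, plus a residual connection."""
    out = []
    for t, x in enumerate(xs):
        y = attend(x, global_k[:t + 1], global_v[:t + 1])
        out.append([a + b for a, b in zip(x, y)])
    return out

def yoco_forward(xs, num_layers=4, window=2):
    half = num_layers // 2
    for _ in range(half):
        xs = self_decoder_layer(xs, window)
    # The global cache is built exactly once, from the self-decoder output.
    # Identity projections are an assumption made for brevity.
    global_k = [list(x) for x in xs]
    global_v = [list(x) for x in xs]
    for _ in range(num_layers - half):
        xs = cross_decoder_layer(xs, global_k, global_v)
    return xs, (global_k, global_v)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, (cache_k, cache_v) = yoco_forward(tokens)
print(len(outputs), len(cache_k))
```

Because both stages only look at positions up to the current one, changing a later token leaves the outputs at earlier positions unchanged, which is the causal property the text refers to.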
2. Caching Mechanism and Memory Complexity
The conventional Transformer caches KV pairs for every layer and every input token: memory use is $O(LND)$ for prompt length $N$, depth $L$, and hidden dimension $D$. YOCO, by contrast, caches only:
- The single global cache at layer $L/2$: $O(ND)$;
- Small local buffers from efficient self-attention: $O(CD)$ per self-decoder layer (window size $C$), often negligible.
Total KV cache requirement for YOCO is $O((N + CL)D)$, representing a roughly $L$-fold memory reduction relative to standard practice when $N \gg CL$ (Sun et al., 2024; Sun et al., 1 Apr 2026).
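As a rough illustration of this accounting, the sketch below counts cached elements (keys plus values) for both designs. The function names and the example sizes are assumptions chosen for illustration, and the count ignores precision, attention heads, and implementation overheads.

```python
def transformer_kv_elements(n_tokens, n_layers, d_model):
    # Keys and values for every layer and every token.
    return 2 * n_layers * n_tokens * d_model

def yoco_kv_elements(n_tokens, n_layers, d_model, window):
    # One shared key/value cache over all tokens.
    global_cache = 2 * n_tokens * d_model
    # Local sliding-window buffers for the self-decoder half of the stack.
    local_buffers = 2 * (n_layers // 2) * min(window, n_tokens) * d_model
    return global_cache + local_buffers

# Illustrative values only, not taken from the cited papers.
N, L, D, C = 1_000_000, 32, 4096, 1024
ratio = transformer_kv_elements(N, L, D) / yoco_kv_elements(N, L, D, C)
print(f"reduction: {ratio:.1f}x")
```

With these example sizes the ratio comes out slightly below the layer count, since the local buffers add a small term on top of the shared cache.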
This single-cache property is preserved during generation: each new token appends only to the global cache, obviating per-layer synchronization. During decoding, cross-attention retrieves from this shared cache, with no need to recompute embeddings or store additional per-layer states.
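The difference in bookkeeping during generation can be sketched with two toy cache classes. Both classes and their method names are hypothetical, and the small local state kept by the self-decoder is left out for brevity.

```python
class SharedKVCache:
    """One key/value list read by every cross-decoder layer."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


class PerLayerKVCache:
    """Standard decoder bookkeeping: a separate cache for each layer."""

    def __init__(self, n_layers):
        self.layers = [SharedKVCache() for _ in range(n_layers)]

    def append(self, per_layer_kv):
        for cache, (key, value) in zip(self.layers, per_layer_kv):
            cache.append(key, value)

    def total_entries(self):
        return sum(len(cache) for cache in self.layers)


n_layers, n_new_tokens = 8, 5
shared = SharedKVCache()
per_layer = PerLayerKVCache(n_layers)
for step in range(n_new_tokens):
    kv = (f"k{step}", f"v{step}")          # placeholder key/value payloads
    shared.append(*kv)                     # one append per generated token
    per_layer.append([kv] * n_layers)      # one append per layer per token

print(len(shared), per_layer.total_entries())
```

The shared cache grows by one entry per generated token regardless of depth, while the per-layer layout grows by one entry per layer per token.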
3. Prefill Strategy and Inference Efficiency
YOCO’s bifurcated computation enables an early-exit “prefill” optimization:
- Classic Transformers perform $L$ layers of full self-attention during prefill, taking $O(LN^2D)$ time for sequence length $N$ and hidden dimension $D$.
- YOCO limits prefill to just the $L/2$ self-decoder layers, applying only efficient attention per token: cost is $O(LND)$, which is linear rather than quadratic in sequence length.
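The contrast between quadratic and linear prefill can be made concrete with a simple cost model. This is a sketch under stated assumptions: it counts attention work only, treats the sliding-window size as a constant, and uses hypothetical function names and sizes.

```python
def transformer_prefill_cost(n_tokens, n_layers, d_model):
    # Every layer attends over all pairs of positions.
    return n_layers * n_tokens ** 2 * d_model

def yoco_prefill_cost(n_tokens, n_layers, d_model, window):
    # Only the self-decoder half runs over the whole prompt, and each
    # token attends to at most `window` positions.
    return (n_layers // 2) * n_tokens * min(window, n_tokens) * d_model

L, D, C = 32, 4096, 1024  # illustrative values only
for n in (8_192, 16_384):
    t = transformer_prefill_cost(n, L, D)
    y = yoco_prefill_cost(n, L, D, C)
    print(n, t // y)
```

Doubling the prompt length quadruples the first cost but only doubles the second, so the ratio between them grows linearly with the prompt length.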
Empirical latency benchmarks on an H100-80GB GPU show prefill speedups ranging from about $2.8\times$ at the shortest reported context length (3.2 s vs 9.1 s) to about $38\times$ at the longest, million-token-scale context (10 s vs 380 s). After prefill, generation proceeds token by token, with the cross-decoder attending to the cached global KV just as a normal decoder attends to its own cache, so there is no discrepancy in output or generation semantics (Sun et al., 2024).
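The two latency pairs quoted above (3.2 s vs 9.1 s, and 10 s vs 380 s) translate into speedup factors by simple division. The snippet below only redoes that arithmetic; the dictionary labels are placeholders rather than reported context lengths.

```python
# (YOCO seconds, Transformer seconds), as quoted in the text.
latencies = {
    "shorter context": (3.2, 9.1),
    "longer context": (10.0, 380.0),
}

speedups = {name: baseline / yoco for name, (yoco, baseline) in latencies.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```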
4. Model Scaling, Long-Context Performance, and Benchmark Results
At increasing context lengths and model sizes, YOCO demonstrates:
- Memory reduction: At 1M tokens, a 3B-parameter model requires about 12.4 GB (YOCO) versus about 114 GB (Transformer), a reported reduction factor of roughly 9.4.
- Throughput gains: In long-context decoding, YOCO sustains 43.1 tokens/s, compared to 4.5 tokens/s for the baseline Transformer, a speedup of about $9.6\times$.
- Comparable or improved model quality: Across parameter counts (160M–13B) and training budgets, YOCO matches or slightly exceeds Llama-style Transformer validation loss.
Needle-in-a-haystack (NIAH) retrieval tasks validate YOCO’s long-context capabilities:
- Single-needle retrieval: accuracy remains high across needle depths at context lengths up to 1M tokens.
- Multi-needle retrieval at 128K input length: accuracy matches or exceeds that of alternative LLMs.
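For readers unfamiliar with the setup, a needle-in-a-haystack probe can be assembled roughly as follows. This is a generic sketch, not the evaluation harness used in the cited papers; the helper names, filler text, and scoring rule are assumptions.

```python
def build_niah_prompt(filler_sentence, needle, n_sentences, depth):
    """Place `needle` at a relative `depth` in [0, 1] inside repeated filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler_sentence] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences), position

def is_retrieved(model_answer, secret):
    # Simplest possible scoring: the secret string appears in the answer.
    return secret in model_answer

prompt, position = build_niah_prompt(
    filler_sentence="The sky is blue.",
    needle="The magic number is 7421.",
    n_sentences=10,
    depth=0.5,
)
print(position, is_retrieved("I believe it is 7421.", "7421"))
```

Sweeping `depth` and `n_sentences` yields the grid of needle depths and context lengths over which retrieval accuracy is typically reported.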
For language modeling across 1M-token prompts, per-token negative log-likelihood steadily improves with context length, demonstrating actual exploitation of the extended history (Sun et al., 2024).
5. Extensions: YOCO++, Universal YOCO (YOCO-U), and Efficiency-Capacity Tradeoffs
YOCO++
YOCO++ enhances the original YOCO by fusing each self-decoder layer's KV with that of the bottom layer using learnable weighted residuals, with a scaling factor typically set to 35 and the residual weights learned end-to-end. This composition raises expressivity while preserving all the computational and memory benefits of YOCO.
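The exact composition rule is not reproduced in this text, so the following is only one plausible reading of a weighted residual between a layer's KV and the bottom layer's KV. The function name, the additive form, and the way `weight` and `scale` enter are all assumptions made for illustration and should not be read as the formula from Wu et al.

```python
def fuse_kv(layer_kv, bottom_kv, weight, scale):
    """Return layer_kv + scale * weight * bottom_kv, element-wise.

    `weight` stands in for a learned residual coefficient and `scale` for a
    fixed scaling factor; both roles are assumptions of this sketch.
    """
    return [
        [a + scale * weight * b for a, b in zip(row_layer, row_bottom)]
        for row_layer, row_bottom in zip(layer_kv, bottom_kv)
    ]

layer_keys = [[1.0, 2.0], [3.0, 4.0]]
bottom_keys = [[10.0, 20.0], [30.0, 40.0]]
fused = fuse_kv(layer_keys, bottom_keys, weight=0.5, scale=0.1)
print(fused)
```

Setting `weight` to zero recovers the layer's own KV unchanged, which is one way to see why the residual path adds capacity without changing what has to be cached.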
Empirical evaluations (TinyLlama 1.1B):
- Inference throughput and prefill latency at various context lengths match those of YOCO (roughly 50% faster than a vanilla Transformer).
- YOCO++ attains the lowest training loss and highest average zero-shot accuracy (48.99%) compared to YOCO (47.98%) and FusedKV variants, and outperforms the standard Transformer (48.37%).
- Ablations show that both the residual connection and the scaling factor are necessary for best results (Wu et al., 15 Apr 2026).
Universal YOCO (YOCO-U)
Universal YOCO (YOCO-U) incorporates recursion in the self-decoder. Instead of stacking more layers, it iteratively applies the shallow self-decoder block several times with shared parameters, feeding each iteration's output back in as the input to the next. The global cache is constructed after the final iteration. This approach increases effective depth and representational capacity without increasing cache size, preserving efficiency even under test-time scaling.
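The recursion can be sketched as a loop that reuses one block. The toy block below (a causal running mean with a residual, on scalar token states) is a stand-in chosen for brevity, and all names are hypothetical; the point is only that the cache is built once after the loop, so its size does not depend on the number of iterations.

```python
def toy_block(xs):
    """Stand-in for a shallow self-decoder block: causal running mean plus residual."""
    out, running = [], 0.0
    for t, x in enumerate(xs):
        running += x
        out.append(x + running / (t + 1))
    return out

def recursive_self_decoder(xs, block, n_iterations):
    """Apply the same block (shared parameters) n_iterations times."""
    for _ in range(n_iterations):
        xs = block(xs)
    return xs

def build_global_cache(xs):
    # One key and one value per token, regardless of how many iterations ran.
    return {"keys": list(xs), "values": list(xs)}

states = [1.0, 2.0, 3.0]
for n_iterations in (1, 4):
    hidden = recursive_self_decoder(states, toy_block, n_iterations)
    cache = build_global_cache(hidden)
    print(n_iterations, len(cache["keys"]))
```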
Benchmarks indicate YOCO-U:
- Lowers validation loss over non-recursive YOCO at equal FLOPs and converges in fewer training tokens.
- Achieves higher downstream task performance (+4.45 to +5.30% absolute) and a 24.4% average improvement in math reasoning benchmarks.
- Matches RINS in generalization, at substantially lower cache memory (62MB for YOCO-U vs 1.28GB for RINS at 16K context).
- Maintains near-linear prefill cost and one-copy KV caching.
Ablation studies confirm greater returns from recursion in the shallow self-decoder than in deeper blocks or from simple width increases. Diminishing performance gains with increased recursion iterations suggest representational convergence after a small number of repeats (Sun et al., 1 Apr 2026).
6. Practical Considerations, Integration, and Trade-offs
To deploy YOCO or its variants:
- Integration: Implement or use an inference engine supporting the YOCO cache protocol. In the self-decoder, produce each layer’s K,V, optionally fusing with the bottom layer (YOCO++), and cache as prescribed.
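- Compression rate: Most evaluations use a 50% rate (caching KV for only half of the layers); lower cache rates require empirical performance validation.
- Scaling factor: For YOCO++, the scaling factor is a tunable hyperparameter; the default reported by Wu et al. (15 Apr 2026) is a reasonable starting point.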
- Quality-memory trade-off: YOCO++ at 50% compression matches or improves full-Transformer accuracy and loss, with half the memory and compute during prefill. Aggressive compression (much less than 50%) can degrade performance.
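The deployment knobs listed above could be gathered into a small configuration object along the following lines. This is a hypothetical sketch: the class, its field names, and its default values are assumptions for illustration and do not correspond to any particular inference engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class YocoConfig:
    num_layers: int = 24
    self_decoder_fraction: float = 0.5   # share of layers in the self-decoder
    window_size: int = 1024              # sliding-window length (assumed default)
    residual_kv: bool = False            # enable a YOCO++-style residual KV path

    def __post_init__(self):
        if self.num_layers < 2:
            raise ValueError("need at least one self-decoder and one cross-decoder layer")
        if not 0.0 < self.self_decoder_fraction < 1.0:
            raise ValueError("self_decoder_fraction must be strictly between 0 and 1")
        if self.window_size < 1:
            raise ValueError("window_size must be positive")

    @property
    def num_self_layers(self):
        return max(1, int(self.num_layers * self.self_decoder_fraction))

    @property
    def num_cross_layers(self):
        return self.num_layers - self.num_self_layers

config = YocoConfig(num_layers=24)
print(config.num_self_layers, config.num_cross_layers)
```

Validating the split up front makes it harder to configure a model with no cross-decoder layers, which would defeat the single-cache design.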
YOCO’s memory and compute scaling, namely $O((N + CL)D)$ for the KV cache and $O(LND)$ for prefill (prompt length $N$, depth $L$, hidden dimension $D$, window size $C$), with no extra decode overhead, enables context lengths and speeds not feasible with standard architectures. YOCO-U further decouples depth and memory, facilitating depth scaling without cache inflation.
7. Future Directions and Comparative Context
YOCO’s core principle—decoupling representational depth from cache growth by splitting attention into "efficient self-decoder" and "cross-decoder"—constitutes a paradigm shift for scalable autoregressive modeling. Empirical evidence supports its architectural and efficiency claims across pretraining, downstream tasks, and extreme-context retrieval (Sun et al., 2024, Wu et al., 15 Apr 2026, Sun et al., 1 Apr 2026).
Future research avenues include:
- Exploring adaptive/conditional recursion within the self-decoder.
- Incorporating advanced subquadratic efficient-attention mechanisms.
- Extending YOCO’s design to multimodal, encoder-decoder, or retrieval-augmented configurations.
By offering a method to increase context length and model depth while maintaining tractable hardware requirements, YOCO and its extensions present a robust foundation for next-generation LLM inference and serve as the basis for new variants in cross-layer KV compression and parameter-efficient scaling.