YOCO: Efficient KV Caching for LLMs

Updated 21 May 2026
  • YOCO is a two-stage decoder architecture that reuses a single, global key-value cache, reducing memory usage by an order of magnitude compared to standard Transformers.
  • It achieves significant speedups by using an efficient self-decoder for context encoding and a cross-decoder that leverages the precomputed cache for autoregressive generation.
  • Extensions like YOCO++ and YOCO-U enhance performance through residual key-value composition and recursive depth scaling, optimizing the efficiency-quality tradeoff.

YOCO (You Only Cache Once) is a decoder-decoder architecture for LLMs that addresses the memory and computational bottlenecks associated with key-value (KV) caching in standard Transformer decoders. By restructuring the decoder into a two-stage pipeline—comprising a self-decoder for efficient context encoding, followed by a cross-decoder that reuses a single, global KV cache—YOCO achieves an order-of-magnitude reduction in memory footprint and a substantial improvement in prefill and generation throughput, without compromising language modeling performance or context length capabilities. Extensions such as YOCO++ and Universal YOCO (YOCO-U) further enhance this efficiency-quality tradeoff with residual KV composition and recursive depth scaling.

1. Architecture and Computational Flow

YOCO divides a standard $L$-layer decoder-only Transformer into two stacked modular components:

  • Self-Decoder: The first $L/2$ layers, using an efficient self-attention scheme (e.g., sliding-window or gated retention), encode the full prompt context into a global key-value cache.
  • Cross-Decoder: The remaining $L/2$ layers consume the tokens autoregressively, performing cross-attention into the global cache produced by the self-decoder.

The model operates as follows:

$$X^{(0)} \in \mathbb{R}^{N\times d}, \quad X^{(l)} = \begin{cases} \mathrm{SelfDecoder}(X^{(l-1)}) & l=1,\ldots,L/2 \\ \mathrm{CrossDecoder}(X^{(l-1)}, \hat K, \hat V) & l=L/2+1,\ldots,L \end{cases}$$

At the interface between the self-decoder and the cross-decoder, a single set of keys and values is constructed:

$$M = X^{(L/2)}, \qquad \hat K = \mathrm{LN}(M) W_K, \qquad \hat V = \mathrm{LN}(M) W_V$$

For $l > L/2$, each cross-decoder layer computes

$$Q^{(l)} = \mathrm{LN}(X^{(l-1)}) W_Q^{(l)}, \qquad Y^{(l)} = \mathrm{Attention}(Q^{(l)}, \hat K, \hat V) + X^{(l-1)}$$

$$X^{(l)} = \mathrm{SwiGLU}(\mathrm{LN}(Y^{(l)})) + Y^{(l)}$$

Causal masking is maintained throughout to preserve autoregressive dependencies.

The self-decoder employs either sliding-window or gated retention attention. Both need only $\mathcal{O}(1)$ KV memory per layer with respect to sequence length, because their state is either local (a fixed window) or recurrent. Every cross-decoder layer attends to the same $(\hat K, \hat V)$ produced at the interface, so no additional per-layer cache is materialized during generation.
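The cross-decoder computation described in this section can be sketched in plain Python. This is a minimal single-head illustration under stated assumptions, not the reference implementation: the self-decoder output `m` is replaced by random values, LayerNorm has no learned parameters, and every helper name (`matmul`, `cross_decoder_layer`, `run_cross_decoder`, `rand_matrix`) and all toy dimensions are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) by a (k x m) matrix, both stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Row-wise LayerNorm without learned scale or bias (a simplification)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def causal_attention(q, k, v):
    """Softmax attention in which query i only sees keys and values 0..i."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total
                    for c in range(len(v[0]))])
    return out

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate, up = matmul(x, w_gate), matmul(x, w_up)
    hidden = [[g / (1.0 + math.exp(-g)) * u for g, u in zip(rg, ru)]
              for rg, ru in zip(gate, up)]
    return matmul(hidden, w_down)

def cross_decoder_layer(x, k_hat, v_hat, p):
    """Q = LN(X) W_Q; Y = Attention(Q, K_hat, V_hat) + X; X' = SwiGLU(LN(Y)) + Y."""
    q = matmul(layer_norm(x), p["w_q"])
    y = add(causal_attention(q, k_hat, v_hat), x)
    return add(swiglu(layer_norm(y), p["w_gate"], p["w_up"], p["w_down"]), y)

def rand_matrix(rng, rows, cols):
    """Small random matrix used as a stand-in for trained weights or activations."""
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

def run_cross_decoder(m, w_k, w_v, layers):
    """Build the global cache once from M, then reuse it in every cross-decoder layer."""
    k_hat = matmul(layer_norm(m), w_k)
    v_hat = matmul(layer_norm(m), w_v)
    x = m
    for p in layers:
        x = cross_decoder_layer(x, k_hat, v_hat, p)
    return x

rng = random.Random(0)
N, d, ffn, num_cross_layers = 5, 4, 8, 2   # toy sizes, chosen arbitrarily
m = rand_matrix(rng, N, d)                  # stand-in for the self-decoder output X^(L/2)
w_k, w_v = rand_matrix(rng, d, d), rand_matrix(rng, d, d)
layers = [{"w_q": rand_matrix(rng, d, d), "w_gate": rand_matrix(rng, d, ffn),
           "w_up": rand_matrix(rng, d, ffn), "w_down": rand_matrix(rng, ffn, d)}
          for _ in range(num_cross_layers)]
out = run_cross_decoder(m, w_k, w_v, layers)
```

Note that `k_hat` and `v_hat` are computed once in `run_cross_decoder` and passed unchanged to every layer; only the query projection `w_q` is layer-specific, which is the property that lets a single cache serve the whole cross-decoder.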

2. Caching Mechanism and Memory Complexity

The conventional Transformer caches KV pairs for every layer and every input token: memory use is $\mathcal{O}(NLd)$ for prompt length $N$, depth $L$, and hidden dimension $d$. YOCO, by contrast, caches only:

  • The single global cache at layer $L/2$: $\mathcal{O}(Nd)$;
  • Small local buffers from efficient self-attention: $\mathcal{O}(wLd)$ (window size $w$), often negligible.

Total KV cache requirement for YOCO is $\mathcal{O}((N + wL)d)$, representing an approximately $L$-fold memory reduction relative to standard practice when $N \gg wL$ (Sun et al., 2024; Sun et al., 1 Apr 2026).
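The difference between caching every layer and caching once can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a sliding-window self-decoder, full-width keys and values (no grouped-query attention), and 16-bit storage; the function names and the configuration values are hypothetical and not taken from the cited papers.

```python
def transformer_kv_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Keys and values for every token in every layer: 2 * N * L * d values."""
    return 2 * n_tokens * n_layers * d_model * bytes_per_value

def yoco_kv_bytes(n_tokens, n_layers, d_model, window, bytes_per_value=2):
    """One global K/V copy (2 * N * d values) plus windowed buffers
    in the L/2 self-decoder layers (2 * w * (L/2) * d values)."""
    global_cache = 2 * n_tokens * d_model * bytes_per_value
    local_buffers = 2 * window * (n_layers // 2) * d_model * bytes_per_value
    return global_cache + local_buffers

# Hypothetical configuration chosen only for illustration.
N, L, d, w = 1_000_000, 32, 4096, 1024
full = transformer_kv_bytes(N, L, d)
yoco = yoco_kv_bytes(N, L, d, w)
ratio = full / yoco
print(f"Transformer: {full / 2**30:.1f} GiB, YOCO: {yoco / 2**30:.1f} GiB, "
      f"ratio: {ratio:.1f}x")
```

In this model the ratio approaches the layer count as the prompt grows, because the windowed buffers stay fixed while the global cache grows with the prompt. End-to-end memory savings are smaller than this ratio, since model weights and activations are not counted here.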

This single-cache property is preserved during generation: each new token appends one entry to the global cache, so no per-layer caches need to be kept in sync. During decoding, cross-attention reads from this shared cache, with no need to recompute embeddings or store additional per-layer states.
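The bookkeeping during generation can be illustrated with a toy cache object. This is a sketch of the append-once, read-many pattern only: `GlobalKVCache` and `decode_step` are hypothetical names, and the token state stands in for the real self-decoder output and key/value projections.

```python
class GlobalKVCache:
    """Hypothetical container for the single (K_hat, V_hat) cache
    shared by all cross-decoder layers."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k_row, v_row):
        """Store the key and value rows of one newly processed token."""
        self.keys.append(k_row)
        self.values.append(v_row)

    def __len__(self):
        return len(self.keys)


def decode_step(token_state, cache, num_cross_layers, reads):
    """One generation step: append once, then let every layer read the same cache."""
    # Stand-in for the self-decoder plus the K/V projections at the interface.
    cache.append(list(token_state), list(token_state))
    for layer in range(num_cross_layers):
        # Each layer reads the shared cache; nothing layer-specific is stored.
        reads.append((layer, len(cache)))
    return token_state


cache, reads = GlobalKVCache(), []
for step in range(3):
    decode_step([float(step)], cache, num_cross_layers=4, reads=reads)

rows_in_yoco_cache = len(cache)            # one row per generated token
rows_in_per_layer_cache = len(cache) * 4   # what 4 layers with private caches would hold
```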

3. Prefill Strategy and Inference Efficiency

YOCO’s bifurcated computation enables an early-exit “prefill” optimization:

  • Classic Transformers perform $L$ layers of self-attention during prefill, with $\mathcal{O}(LN^2 d)$ time.
  • YOCO limits prefill to just the $L/2$ self-decoder layers, applying only efficient attention per token: cost is $\mathcal{O}(LNd)$, linear rather than quadratic in sequence length.

Empirical latency benchmarks (YOCO on an H100-80GB GPU) report prefill speedups ranging from about $2.8\times$ at contexts measured in thousands of tokens (3.2 s vs 9.1 s) to about $38\times$ at contexts measured in millions of tokens (10 s vs 380 s). Post-prefill, generation switches to the cross-decoder, which behaves identically to a normal decoder, so there is no discrepancy in output or generation semantics (Sun et al., 2024).
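A simple cost model shows why the prefill advantage widens with context length. The sketch below counts attention cost only, assuming full causal attention in every Transformer layer and sliding-window attention in the YOCO self-decoder; the function names, the window size, and the layer configuration are hypothetical. Because projections and feed-forward layers are ignored, the modelled ratios are far larger than measured wall-clock speedups and should be read as a trend, not a prediction.

```python
def transformer_prefill_cost(n_tokens, n_layers, d_model):
    """Attention cost of full causal self-attention in all L layers: ~ L * N^2 * d."""
    return n_layers * n_tokens ** 2 * d_model

def yoco_prefill_cost(n_tokens, n_layers, d_model, window):
    """Attention cost of windowed self-attention in the L/2 self-decoder
    layers: ~ (L/2) * N * w * d."""
    return (n_layers // 2) * n_tokens * window * d_model

def cost_ratio(n_tokens, n_layers=32, d_model=4096, window=1024):
    """Modelled Transformer-to-YOCO attention cost ratio (hypothetical defaults)."""
    return (transformer_prefill_cost(n_tokens, n_layers, d_model)
            / yoco_prefill_cost(n_tokens, n_layers, d_model, window))

for n in (32_768, 262_144, 1_048_576):
    print(f"N={n:>9,d}: modelled attention-cost ratio {cost_ratio(n):,.0f}x")
```

Under these assumptions the ratio simplifies to $2N/w$, so doubling the prompt length doubles the modelled advantage.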

4. Model Scaling, Long-Context Performance, and Benchmark Results

At increasing context lengths and model sizes, YOCO demonstrates:

  • Memory reduction: At million-token context length, a 3B-parameter model requires about 12.4 GB (YOCO) versus about 114 GB (Transformer), a reported factor of 9.4 reduction.
  • Throughput gains: In long-context decoding, YOCO sustains 43.1 tokens/s, compared to 4.5 tokens/s for the baseline Transformer, roughly a $9.6\times$ speedup.
  • Comparable or improved model quality: Across parameter counts (160M–13B) and training budgets, YOCO matches or slightly exceeds Llama-style Transformer validation loss.

Needle-in-a-haystack (NIAH) retrieval tasks validate YOCO’s long-context capabilities:

  • Single-needle accuracy is reported across depths up to 1M tokens.
  • Multi-needle retrieval at 128K input length: accuracy matches or exceeds alternative LLMs.
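For readers unfamiliar with the setup, a needle-in-a-haystack trial can be constructed as follows. This is an illustrative sketch, not the harness used in the cited evaluations: `build_niah_prompt` and `is_retrieved` are hypothetical helpers, and the filler text, needle, and substring-based scoring are assumptions.

```python
def build_niah_prompt(filler_sentences, needle, depth_fraction, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    position = round(depth_fraction * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\n" + question

def is_retrieved(model_answer, expected):
    """Score one trial: the expected string must appear in the model's answer."""
    return expected.lower() in model_answer.lower()

filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_niah_prompt(filler, needle, 0.5, "What is the secret code?")
```

Sweeping `depth_fraction` and the amount of filler produces the depth-by-length grid over which retrieval accuracy is usually reported.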

For language modeling across 1M-token prompts, per-token negative log-likelihood steadily improves with context length, demonstrating actual exploitation of the extended history (Sun et al., 2024).

5. Extensions: YOCO++, Universal YOCO (YOCO-U), and Efficiency-Capacity Tradeoffs

YOCO++

YOCO++ enhances the original YOCO by fusing each self-decoder layer's KV with that of the bottom layer using learnable weighted residuals. The composition uses a scaling factor, typically set to 35, and residual weights that are learned end-to-end. This raises expressivity while preserving all the computational and memory benefits of YOCO.

Empirical evaluations on TinyLlama 1.1B:

  • Inference throughput and prefill latency at various context lengths match those of YOCO (about 50% faster than vanilla Transformer).
  • YOCO++ attains the lowest training loss and highest average zero-shot accuracy (48.99%) compared to YOCO (47.98%) and FusedKV variants, and outperforms the standard Transformer (48.37%).
  • Ablations show both the residual connection and the scaling factor are necessary for best results (Wu et al., 15 Apr 2026).

Universal YOCO (YOCO-U)

Universal YOCO (YOCO-U) incorporates recursion in the self-decoder. Instead of stacking more layers, it applies the shallow self-decoder block repeatedly with shared parameters, and the global cache is constructed after the recursion finishes. This approach increases effective depth and representational capacity without increasing cache size, keeping inference efficient even under test-time scaling.
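The recursion pattern can be sketched with a toy block. This is an illustration of parameter sharing and cache construction after the loop, not the YOCO-U implementation: `recursive_self_decoder`, `build_global_cache`, and `CountingBlock` are hypothetical names, and the block's arithmetic is a placeholder for real self-decoder layers.

```python
def recursive_self_decoder(x, shared_block, num_iterations):
    """Apply the same block (same parameters) repeatedly instead of stacking new layers."""
    for _ in range(num_iterations):
        x = shared_block(x)
    return x

def build_global_cache(states):
    """Stand-in for the K/V projections: one cache row per token, built once."""
    return [tuple(row) for row in states]

class CountingBlock:
    """Toy shared-parameter block that records how often it is applied."""

    def __init__(self, scale):
        self.scale = scale   # the single shared "parameter"
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [[self.scale * v + 1.0 for v in row] for row in x]

tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # N = 3 toy token states
cache_sizes = {}
call_counts = {}
for iterations in (1, 2, 4):
    block = CountingBlock(scale=0.5)
    states = recursive_self_decoder(tokens, block, iterations)
    cache_sizes[iterations] = len(build_global_cache(states))
    call_counts[iterations] = block.calls
```

The cache holds one row per token regardless of how many times the block is applied, so extra recursion adds compute but no cache memory.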

Benchmarks indicate YOCO-U:

  • Lowers validation loss over non-recursive YOCO at equal FLOPs and converges in fewer training tokens.
  • Achieves higher downstream task performance (+4.45% to +5.30% absolute) and a 24.4% average improvement in math reasoning benchmarks.
  • Matches RINS in generalization, at substantially lower cache memory (62 MB for YOCO-U vs 1.28 GB for RINS at 16K context).
  • Maintains near-linear prefill cost and one-copy KV caching.

Ablation studies confirm greater returns from recursion in the shallow self-decoder than in deeper blocks or from simple width increases. Diminishing performance gains with increased recursion iterations suggest representational convergence after a small number of repeats (Sun et al., 1 Apr 2026).

6. Practical Considerations, Integration, and Trade-offs

To deploy YOCO or its variants:

  • Integration: Implement or use an inference engine supporting the YOCO cache protocol. In the self-decoder, produce each layer’s K,V, optionally fusing with the bottom layer (YOCO++), and cache as prescribed.
  • Compression rate: Most evaluations cache only 50% of layers; reduced cache rates require empirical performance validation.
  • Scaling factor: For YOCO++, the scaling factor is a tunable hyperparameter; 35 is the typical setting.
  • Quality-memory trade-off: YOCO++ at 50% compression matches or improves full-Transformer accuracy and loss, with half the memory and compute during prefill. Aggressive compression (much less than 50%) can degrade performance.

YOCO’s memory and compute scaling, with an $\mathcal{O}((N + wL)d)$ KV cache (window size $w$), $\mathcal{O}(LNd)$ prefill, and no extra decode overhead, enables context lengths and speeds not feasible with standard architectures. YOCO-U further decouples depth and memory, facilitating depth scaling without cache inflation.

7. Future Directions and Comparative Context

YOCO’s core principle—decoupling representational depth from cache growth by splitting attention into "efficient self-decoder" and "cross-decoder"—constitutes a paradigm shift for scalable autoregressive modeling. Empirical evidence supports its architectural and efficiency claims across pretraining, downstream tasks, and extreme-context retrieval (Sun et al., 2024, Wu et al., 15 Apr 2026, Sun et al., 1 Apr 2026).

Future research avenues include:

  • Exploring adaptive/conditional recursion within the self-decoder.
  • Incorporating advanced subquadratic efficient-attention mechanisms.
  • Extending YOCO’s design to multimodal, encoder-decoder, or retrieval-augmented configurations.

By offering a method to increase context length and model depth while maintaining tractable hardware requirements, YOCO and its extensions present a robust foundation for next-generation LLM inference and serve as the basis for new variants in cross-layer KV compression and parameter-efficient scaling.
