Lazy Decoder-Only Architecture

Updated 23 March 2026
  • Lazy decoder-only architectures are neural models that minimize encoder computations by reusing cached early-layer outputs across applications like LLMs, vision, ASR, and quantum decoding.
  • They employ methods such as KV-caching, cycle-based refilling, and dynamic layer skipping to significantly reduce FLOPs and latency with minimal impact on accuracy.
  • Practical implementations demonstrate up to 2× speedup and 94% FLOP reduction in tasks ranging from language modeling to recommendation systems and error correction.

A lazy decoder-only architecture denotes a class of neural model designs—most notably in transformers and deep learning, but also appearing in quantum decoding, recommendation, speech, and vision domains—that aggressively eliminates redundant encoder or early-layer computation by amortizing, caching, reusing, or bypassing the expensive stages of classical “encode–decode” or deep-stack forward passes. These architectures emphasize minimal encoder computation, heavy reliance on cached or pre-computed features, and focus compute or parameterization almost exclusively in the decoder or late-phase blocks. The result is often dramatic reduction in floating-point operations (FLOPs), hardware resource requirements, and latency, sometimes at trivial loss in benchmarked accuracy or generation quality. Recent work further distinguishes between static lazy execution (always skipping early stages) and more dynamic or sample-adaptive forms.

1. Canonical Designs and Motivational Principles

The essential motivation for lazy decoder-only architectures is the high cost of repeated encoder or deep-layer computation in tasks involving autoregressive generation, streaming, or sequence prediction. Empirical studies show that, e.g., in transformer-based LLMs, early layers extract generic context, middle layers refine task- or instance-specific representations, and late layers handle (token-wise) output construction. Since early and middle layers' outputs are highly reusable, traversing them afresh at each generation step is wasteful. This observation underlies the Direct Multi-Token Decoding (DMTD) paradigm and similar approaches in other domains (Luo et al., 13 Oct 2025, Zhou et al., 28 Aug 2025, Sun et al., 2024).

In quantum error correction, the "lazy decoder" is a local, hardware-efficient circuit that detects and fixes simple error syndromes in situ and only invokes full decoders for complex patterns. The same logic—frontloading lightweight processing, minimizing off-chip traffic and hardware—applies to hierarchical lazy decoder designs (Delfosse, 2020).
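
As an illustration of this fast-path/fallback split, a minimal sketch follows; the `local_rules` lookup table and the `full_decoder` callable are hypothetical placeholders for exposition, not the circuit described in the paper.

```python
def lazy_decode(syndrome, local_rules, full_decoder):
    """Fast path for quantum error correction: if the measured syndrome matches a
    pre-tabulated simple pattern (e.g. a single-qubit error), apply its stored local
    correction; otherwise hand the syndrome off to the expensive full decoder."""
    defects = frozenset(i for i, bit in enumerate(syndrome) if bit)
    if defects in local_rules:        # simple, locally correctable pattern
        return local_rules[defects]
    return full_decoder(syndrome)     # rare, complex pattern: invoke full decoding
```

In practice the lookup table might hold, for example, the syndrome signatures of all weight-one errors, so the full decoder is invoked only for the small fraction of rounds that produce more complicated syndromes.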

Across applications, lazy decoder-only architectures consistently pursue:

  • Reduction or amortization of encoder or early-layer compute.
  • Maximization of parameter and FLOP utilization in the decoder–output stages.
  • Exploitation of representation reuse via sophisticated key-value caching, cross-attention from compressed context, or selection of informative handcrafted features.

2. Typical Architectures and Inference Schemes

2.1 Direct Multi-Token Decoding (DMTD)

DMTD partitions transformer layers into early, middle, and late groups. For a cycle of τ tokens:

  • Compute the first token by passing context through all L layers.
  • For tokens 2…τ in the cycle, freeze early+middle-layer KV-cache and re-run only the late layers—substantial reduction in computation per token.
  • At the end of each cycle, refresh the early+middle KV-cache with the outputs of the full stack on the new context window.

No model parameters or auxiliary heads are introduced. All persistent state resides in two KV-caches, one each for early/middle and late layers (Luo et al., 13 Oct 2025).
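
A minimal sketch of this cycle schedule is given below; `model.run_all_layers` and `model.run_late_layers` are hypothetical wrappers around the full and late-only forward passes with explicit KV-caches, not the authors' implementation.

```python
def dmtd_generate(model, context_ids, num_tokens, tau=4):
    """Cycle-based lazy decoding sketch: one full forward pass per cycle refreshes
    the early/middle KV-cache; the remaining tau - 1 tokens reuse it and run only
    the late layers. `model` is a hypothetical wrapper exposing these two calls."""
    generated = []
    while len(generated) < num_tokens:
        # Cycle start: a full pass over the current context refreshes both caches.
        token, early_mid_cache, late_cache = model.run_all_layers(context_ids + generated)
        generated.append(token)
        # Rest of the cycle: late layers only, with the early/middle cache frozen.
        for _ in range(tau - 1):
            if len(generated) >= num_tokens:
                break
            token, late_cache = model.run_late_layers(
                generated[-1], early_mid_cache, late_cache)
            generated.append(token)
    return generated
```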

2.2 Lazy Cross-Attention Recommendation (OneRec-V2)

OneRec-V2 replaces a classic encoder–decoder stack with a Context Processor (amortized feature/key/value computation) and a decoder stack where each block receives fixed cross-attention keys/values, shared and reused globally. No encoder–decoder attention projections are trained or applied during generation. The design eliminates 94% of training/inference FLOPs for 1B-parameter models at context lengths up to 3000, enabling scaling to 8B parameters without prohibitive compute (Zhou et al., 28 Aug 2025).
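
The sketch below illustrates the lazy cross-attention pattern under simplifying assumptions: a Context Processor (not shown) produces keys and values once, and every decoder block reuses them with only a learned query projection on the target side; the exact block structure in OneRec-V2 may differ.

```python
import math
import torch
import torch.nn as nn

class LazyCrossAttnBlock(nn.Module):
    """Decoder block that attends to pre-computed, globally shared context
    keys/values; only the query side is projected per block, so the expensive
    context computation is amortized once (illustrative sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, ctx_k, ctx_v):
        # Cross-attention: queries from the target stream, K/V cached once upstream.
        q = self.q_proj(self.norm1(x))                                   # (B, T, D)
        attn = torch.softmax(q @ ctx_k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        x = x + self.out_proj(attn @ ctx_v)                              # (B, T, D)
        return x + self.ffn(self.norm2(x))

# Usage sketch: the Context Processor emits ctx_k, ctx_v once; every block reuses them.
# blocks = [LazyCrossAttnBlock(512) for _ in range(12)]
# for blk in blocks:
#     h = blk(h, ctx_k, ctx_v)
```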

2.3 YOCO: You Only Cache Once (Decoder-Decoder Stack)

YOCO splits transformer inference into:

  • An efficient "self-decoder" stack (L/2 layers) that computes a single global cache of KV pairs for the entire prompt, using memory-light attention like sliding window or gated retention.
  • A "cross-decoder" stack (L/2 layers) that, during each generation step, attends only to this pre-computed cache, never creating new per-layer, per-token KV pairs.

This scheme achieves O(N·D) memory and O(L·N·D) prefill cost, as opposed to O(L·N·D) and O(L·N²·D) for standard decoders, while delivering language-modeling quality comparable to standard decoder-only transformers under equivalent context (Sun et al., 2024).
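
A highly simplified sketch of the decoder-decoder flow appears below; `self_decoder`, `cross_decoder`, and `lm_head` are hypothetical modules standing in for the paper's components, and cache handling is reduced to list appends.

```python
def yoco_generate(self_decoder, cross_decoder, lm_head, prompt_ids, num_tokens):
    """Sketch: the self-decoder encodes the prompt once into a single global KV
    cache shared by all cross-decoder layers; each new token extends only that
    one cache instead of per-layer, per-token caches."""
    hidden, state, global_kv = self_decoder.prefill(prompt_ids)     # one cache for all layers
    out = [lm_head(cross_decoder(hidden[-1:], global_kv))]          # first generated token
    for _ in range(num_tokens - 1):
        hidden, state, new_kv = self_decoder.step(out[-1], state)   # constant-size recurrent state
        global_kv.append(new_kv)                                    # the only cache that grows
        out.append(lm_head(cross_decoder(hidden, global_kv)))
    return out
```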

2.4 Decoder-Only Vision and Speech

In LessNet for 3D registration, all learnable components reside in the decoder. The decoder's inputs are multi-scale, handcrafted pooling features computed from the input images, which eliminates the convolutional encoder typical of this paradigm (Jia et al., 2024).
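
A minimal sketch of such a parameter-free feature path is shown below; the pooling scales are illustrative assumptions rather than the configuration used in LessNet.

```python
import torch
import torch.nn.functional as F

def handcrafted_multiscale_features(volume, scales=(2, 4, 8)):
    """Parameter-free 'encoder': average-pool a 3D volume at several scales and
    flatten the results; these pooled features feed a learnable decoder directly."""
    # volume: (B, C, D, H, W)
    feats = [F.avg_pool3d(volume, kernel_size=s) for s in scales]
    return [f.flatten(start_dim=2) for f in feats]   # list of (B, C, N_s) tensors
```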

In streaming ASR, blockwise speech features—compressed through CTC heads and low-dimensional context vectors—are concatenated into a prompt for an autoregressive decoder-only stack. Training uses random-length prefix prompts for robustness to stream truncation, and empirical results match or surpass encoder–decoder baselines in both accuracy and real-time factor (Tsunoo et al., 2024).
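
The random-length prefix idea can be sketched as follows, assuming the speech-block features have already been CTC-compressed; the exact prompting recipe in the paper may differ.

```python
import torch

def random_prefix_prompt(speech_blocks, text_embeddings):
    """Training-time prompt construction: keep a random-length prefix of the
    compressed speech-block features so the decoder-only model learns to cope
    with arbitrarily truncated streams (illustrative sketch)."""
    n = speech_blocks.size(0)
    k = torch.randint(1, n + 1, (1,)).item()          # random prefix length in [1, n]
    return torch.cat([speech_blocks[:k], text_embeddings], dim=0)
```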

3. Quantitative Gains and Empirical Results

Lazy decoder-only architectures consistently report substantial compute and efficiency gains:

  • DMTD (LLMs, τ=4): ~2× decoding speedup; ≤4% aggregate accuracy drop (Luo et al., 13 Oct 2025).
  • OneRec-V2 (RecSys): 94% FLOP reduction and ~90% lower training cost; convergence loss nearly unchanged (Zhou et al., 28 Aug 2025).
  • YOCO (LMs, 3B param): ~9–80× memory reduction, prefill latency cut from ~180 s to under 6 s; quality on par with standard decoders (Sun et al., 2024).
  • Streaming ASR: ~2× real-time speedup; 8% relative WER reduction (Tsunoo et al., 2024).
  • LessNet (Vision): ~100× fewer parameters with no learned encoder; Dice matches or exceeds baselines (Jia et al., 2024).
  • Surface-code QEC: 50–1500× reduction in decoding hardware requirements; no loss in logical error rate at favorable physical error rates p (Delfosse, 2020).

In LLMs, DMTD delivers a 2× speedup with τ=4 on Qwen3-4B at ≤4% accuracy drop, and the gains grow with model scale and dataset size. YOCO shows orders-of-magnitude reductions in memory use and prefill latency, e.g., serving a 1M-token context with 9× less memory than a standard transformer.
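
A back-of-the-envelope estimate makes the 2× figure plausible: if the late layers that re-run at every step make up roughly a third of the stack (an assumed split, not a number from the paper), the average per-token cost with τ=4 is about half of a full forward pass.

```python
def dmtd_relative_compute(tau: int, late_fraction: float) -> float:
    """Average per-token compute in a DMTD cycle: one full pass plus (tau - 1)
    late-only passes, relative to running the full stack for every token."""
    return (1.0 + (tau - 1) * late_fraction) / tau

print(dmtd_relative_compute(tau=4, late_fraction=1 / 3))  # 0.5, i.e. roughly a 2x speedup
```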

OneRec-V2's lazy decoder allows 1B models to run with ~1/10th the GPU time versus prior encoder–decoder recommendation models, while maintaining or improving online recommendation KPIs.

4. Variants: Dynamic Layer Execution and Adaptive Computation

Not all lazy decoder designs are static or fixed; many variants incorporate dynamic or data-adaptive computation allocation:

  • Dynamic layer selection schemes employ layer skipping or early exiting to minimize per-token or per-sequence compute under accuracy constraints (Glavas et al., 2024). Empirical findings indicate that layer skipping with static or sequence-level allocation preserves performance even at 23.3% of the average computation, whereas token-level controllers struggle to extract reliable signals from the high-dimensional hidden states of large decoder-only LMs; per-sequence adaptation is therefore more effective (a minimal sketch follows this list).
  • In quantum decoding, the lazy decoder forms the fast path and hands off to a high-precision decoder only on ambiguous or complex syndromes (Delfosse, 2020).
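
The sketch below illustrates static, sequence-level layer skipping under a fixed compute budget; the evenly spaced layer choice and the 25% budget are illustrative assumptions, not the controllers studied in the cited work.

```python
import torch

def skip_layers_per_sequence(layers, hidden, keep_fraction=0.25):
    """Static, sequence-level layer skipping: select a fixed subset of decoder
    layers (here, evenly spaced) and run only those for the whole sequence.
    The 25% budget loosely mirrors the ~23% average-computation setting above."""
    n = len(layers)
    keep = max(1, int(round(n * keep_fraction)))
    idx = torch.linspace(0, n - 1, steps=keep).round().long().tolist()
    for i in idx:
        hidden = layers[i](hidden)   # only the selected layers are executed
    return hidden
```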

5. Engineering and Implementation Trade-offs

Lazy decoder-only architectures entail distinctive engineering patterns:

  • Extensive use of KV-caching and cache freezing strategies; in DMTD, separate caches are kept for early/middle ("amortized") and late layers, with cycle-based refilling (Luo et al., 13 Oct 2025).
  • Elimination of full encoder passes and repeated context recomputation through one-time, token-level K/V precomputation (YOCO, OneRec-V2).
  • Increased pipeline simplicity: no additional heads, routers, or verification routines required, supporting transparent integration into existing architectures.
  • Latency/throughput benefits are realized most strongly for large models, large context lengths, or where input encoding dominates cost.

A plausible implication is that lazy decoder-only architectures are especially impactful in scenarios where sequence length, parameter scale, or hardware constraints would otherwise impose prohibitive resource use.

6. Domain Extensions and Future Directions

The lazy decoder-only pattern has been instantiated in:

  • Quantum error correction decoders for surface codes, color codes, hyperbolic codes, and LDPC codes, with hardware-in-the-loop (Delfosse, 2020).
  • End-to-end recommendation, streaming ASR, and 3D image registration, leveraging bespoke context compression, prompt engineering, or feature pooling.
  • Long-context LMs operating at O(N) memory and O(N) prefill (Sun et al., 2024).

Possible extensions include:

  • Hierarchical or multi-stage cascades of lazy decoders of growing sophistication.
  • Refined dynamic allocation via sequence-level controllers informed by cost–performance trade-off surfaces.
  • ASIC or FPGA hardware implementations for low-latency, memory-efficient deployment.
  • Application to additional domains such as video, RL planning, or molecular sequence modeling where context reuse is high.

7. Limitations and Outlook

Lazy decoder-only methods entail trade-offs:

  • Minor but dataset/model-size-sensitive performance loss due to stale or amortized representations.
  • For DMTD and similar methods, the cycle length τ trades throughput against degradation; the largest speedups (τ ≈ 4–6) may not be acceptable for accuracy-critical applications (Luo et al., 13 Oct 2025).
  • In dynamic-layer execution, per-token controllers are typically outperformed by simpler static or sequence-level allocations (Glavas et al., 2024).

Nonetheless, for large-scale, resource-constrained, or latency-critical applications, lazy decoder-only architectures reconfigure the cost–benefit trade space across language, vision, recommendation, speech, and quantum error correction domains. Empirical results support their competitiveness with classical encoder–decoder or deep-stack models by leveraging representation reuse and eliminating ineffective computation.
