Deferred Decoding in Sequence Generation

Updated 17 June 2026
  • Deferred decoding is a strategy in sequence generation that defers expensive context encoding computations to improve efficiency in transformer and diffusion models.
  • It employs techniques like layer partitioning, block reuse, and dynamic confidence-based early exits to reduce per-token computational load.
  • Empirical approaches such as DMTD, DEED, and AdaDecode demonstrate significant speedups with minimal quality loss in high-throughput and long-sequence tasks.

Deferred decoding is a class of inference paradigms in sequence generation, particularly relevant for transformer-based LLMs and diffusion LLMs. Its defining characteristic is that a subset of model computations—the "expensive" or context encoding layers—are executed less frequently or only when needed, whereas other parts of the computation (often the "decoding" or output-generating layers) are performed repeatedly for each token or at dynamically determined points. This approach is motivated by empirical observations that in LLMs, early and middle layers primarily serve to encode context and perform high-level reasoning, while late layers are more directly tied to token-level prediction. Deferred decoding strategies exploit this by either amortizing or adapting context encoding, or by deferring certain computations based on model confidence or structural properties, yielding significant reductions in average per-token computation and overall inference latency without substantial loss of performance (Luo et al., 13 Oct 2025, Tang et al., 2023, Wei et al., 4 Jun 2025, Shu et al., 5 Jan 2026).

1. Fundamental Principles and Rationale

The core principle of deferred decoding is selective or adaptive evaluation along the depth and/or temporal axes of sequence generation models.

Layerwise Partitioning in Transformers

A standard transformer can be decomposed into $L$ layers, with layers $f_{1:m}$ designated as "encoding + thinking" and layers $f_{m+1:L}$ as "decoding" (Luo et al., 13 Oct 2025). Deferred decoding leverages this nonuniform functional specialization by:

  • Amortizing expensive early/middle computations: Compute $f_{1:m}$ once for a block of tokens, reusing the resulting hidden states for several output steps.
  • Frequent reapplication of lightweight late layers: Apply $f_{m+1:L}$ for each output token, using the previously computed context embedding.

Dynamic Confidence-Based Deferment

Several schemes further incorporate dynamic, per-token confidence-based decision rules. Whenever the model achieves high-confidence predictions at a shallow layer, further (deeper) computation for that token is deferred or completely omitted unless needed for output consistency.

Structural Motivation in Diffusion Models

In diffusion LLMs, deferred decoding counters the "Boundary-Induced Context Truncation" (BICT) problem, where block-based decoding forces early commitment of some tokens without sufficient future context. Deferring these commitments allows for better utilization of available context and reduces error propagation, especially near block boundaries (Shu et al., 5 Jan 2026).

2. Methodological Variants

Deferred decoding encompasses several concrete algorithmic instantiations, differing in model architecture (decoder-only, encoder-decoder, diffusion), policy for deferral, and verification mechanism.

Direct Multi-Token Decoding (DMTD)

DMTD (Luo et al., 13 Oct 2025) is prototypical for decoder-only transformers. The model is split at layer $m$:

  • Compute $h_{pre} = f_{1:m}(x_{<t})$ for prefix $x_{<t}$ once per block of $\tau$ tokens.
  • For each token $t$ in the block, generate logits by applying the late layers $f_{m+1:L}$ to $h_{pre}$ for the first token and recursively to the preceding late-layer output for subsequent tokens, updating only the late layers.
  • Early/middle KV-caches are "refilled" cyclically after each block.
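
The block-wise schedule above can be sketched with toy stand-ins for the two layer groups. Everything here is an assumption made for illustration: `dmtd_generate`, `toy_early`, and `toy_late` are hypothetical names, the "layers" are small integer functions rather than attention blocks, and KV-cache refilling is reduced to re-encoding the prefix once per block. It is a sketch of the scheduling idea, not the DMTD implementation.

```python
from typing import Callable, List, Tuple


def dmtd_generate(
    prompt: List[int],
    early_layers: Callable[[List[int]], int],
    late_layers: Callable[[int], Tuple[int, int]],
    num_tokens: int,
    tau: int,
) -> Tuple[List[int], int, int]:
    """Toy block-wise decoding: one early pass per block, one late pass per token.

    Returns (generated tokens, early-pass count, late-pass count).
    """
    tokens = list(prompt)
    early_calls = 0
    late_calls = 0
    while len(tokens) - len(prompt) < num_tokens:
        # Expensive step: encode the whole prefix once for this block.
        state = early_layers(tokens)
        early_calls += 1
        remaining = num_tokens - (len(tokens) - len(prompt))
        for _ in range(min(tau, remaining)):
            # Cheap step: the late layers map the current state to a token and
            # to a new state that is fed back in for the next token of the block.
            token, state = late_layers(state)
            late_calls += 1
            tokens.append(token)
    return tokens[len(prompt):], early_calls, late_calls


# Hypothetical stand-ins for f_{1:m} and f_{m+1:L}.
def toy_early(prefix: List[int]) -> int:
    return sum(prefix) % 97


def toy_late(state: int) -> Tuple[int, int]:
    token = (3 * state + 1) % 50
    return token, (state + token) % 97


out, n_early, n_late = dmtd_generate(
    [5, 7, 11], toy_early, toy_late, num_tokens=10, tau=4
)
print(out, n_early, n_late)
```

With 10 tokens and a block size of 4, the early stand-in runs 3 times (blocks of 4, 4, and 2 tokens) while the late stand-in runs once per token, which is the amortization the method relies on.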

Dynamic Early Exit Decoding (DEED)

DEED (Tang et al., 2023) for encoder-decoder architectures augments the decoder with prediction heads at every layer and adaptation modules for shallower exits. At each generation step and layer:

  • Compute the token-level prediction and confidence.
  • If the confidence exceeds the threshold $\tau$, exit early and avoid deeper computation for that step.
  • Deeper-layer features for prior steps are (re-)computed "just-in-time" as required for semantic alignment, with corresponding KV-cache management.
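
A minimal sketch of the per-step exit rule follows. It assumes that confidence is the maximum softmax probability at each exit, which the text above does not specify, and it takes precomputed per-layer logits instead of running a decoder. The names `softmax` and `early_exit_step` are hypothetical, and the adaptation modules and just-in-time recomputation are not modeled.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_step(
    per_layer_logits: Sequence[Sequence[float]], tau: float
) -> Tuple[int, int, float]:
    """Return (token, exit depth starting at 1, confidence) for one step.

    per_layer_logits[i] holds the logits from the prediction head attached
    to decoder layer i + 1. A real model would evaluate deeper layers only
    when shallower exits were not confident enough; the logits are
    precomputed here to keep the sketch short.
    """
    if not per_layer_logits:
        raise ValueError("per_layer_logits must not be empty")
    last_depth = len(per_layer_logits)
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        token = probs.index(confidence)
        # Exit as soon as the head is confident, or at the final layer.
        if confidence >= tau or depth == last_depth:
            return token, depth, confidence
    raise AssertionError("unreachable: the last layer always exits")


layers = [[0.1, 0.2, 0.0], [0.0, 3.0, 0.5], [0.0, 9.0, 0.0]]
for threshold in (0.3, 0.8, 0.99):
    token, depth, conf = early_exit_step(layers, threshold)
    print(threshold, token, depth, round(conf, 3))
```

Raising the threshold pushes the exit deeper for the same input: this toy example exits at depth 1, 2, and 3 for thresholds 0.3, 0.8, and 0.99 respectively.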

AdaDecode

AdaDecode (Wei et al., 4 Jun 2025) for LLMs attaches lightweight prediction heads to intermediate layers, enabling early exit: a token is emitted as soon as an intermediate head's confidence exceeds a preset threshold. The outstanding deeper-layer computations for such tokens are queued and executed in parallel with later tokens, so that compute overlaps. Final-layer verification then accepts or rejects each early token based on exact distributional agreement, yielding output parity with standard autoregressive decoding.

Deferred Commitment Decoding (DCD) in Diffusion Models

DCD (Shu et al., 5 Jan 2026) for DLMs replaces hard block-based token commitments with a sliding window and per-position top-1 confidence scoring. Tokens whose top-1 confidence clears a threshold are committed, while the others are deferred until sufficient left/right context is available, ensuring more robust bidirectional information flow and minimizing premature prediction.

3. Mathematical Formulation and Inference Algorithms

All effective deferred decoding systems rely on mathematical mechanisms to quantify when to defer computation, and on inference algorithms to schedule forward passes and manage (potentially partial) KV-caches.

In Transformer Architectures

DMTD Partitioning:

Given a model $f = f_{1:L}$, partition it as $f = f_{m+1:L} \circ f_{1:m}$, with hidden state after the early layers $h_{pre} = f_{1:m}(x_{<t})$.

  • Block size $\tau$.
  • Per block, perform a single pass of $f_{1:m}$ and multiple passes of $f_{m+1:L}$.
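
A rough way to see where the savings come from is to count sequential layer passes per generated token. The sketch below assumes every layer pass costs the same and ignores the extra work of refilling early-layer caches for a whole block, so it gives an idealized ceiling rather than a measured speedup. `layer_passes_per_token` and `ideal_speedup` are hypothetical helper names, and `late_layers` stands for the number of layers in $f_{m+1:L}$.

```python
def layer_passes_per_token(num_layers: int, late_layers: int, tau: int) -> float:
    """Average sequential layer passes per token under block-wise reuse.

    Each block of `tau` tokens pays for the early layers once and for the
    late layers once per token.
    """
    if tau < 1 or not 0 < late_layers <= num_layers:
        raise ValueError("need tau >= 1 and 0 < late_layers <= num_layers")
    early_layers = num_layers - late_layers
    return (early_layers + tau * late_layers) / tau


def ideal_speedup(num_layers: int, late_layers: int, tau: int) -> float:
    """Ratio of baseline passes per token (num_layers) to the block-wise count."""
    return num_layers / layer_passes_per_token(num_layers, late_layers, tau)


# Configuration quoted later in this article for Qwen3-4B, assuming that
# L_d denotes the number of late (decoding) layers.
print(layer_passes_per_token(36, 8, 4))
print(ideal_speedup(36, 8, 4))
```

For 36 layers, 8 late layers, and τ=4 this idealized count gives 15 passes per token instead of 36, a ceiling of 2.4×, which sits slightly above the 2.15× reported for that configuration.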

AdaDecode Confidence and Verification:

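For each token position $t$ and intermediate layer $\ell$:

  • Sample a candidate token from the distribution produced by the prediction head at layer $\ell$.
  • If the head's confidence in that token exceeds the threshold, emit the token and defer the deeper computation.
  • At the final layer, apply rejection sampling: the early-emitted token is accepted or rejected by comparing it against the final-layer distribution.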

Rollback occurs only for rare mismatches (on the order of 6% at the reported threshold setting).
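
The emit-early, verify-later pattern with rollback can be sketched as below. This is a simplification under stated assumptions: a greedy equality check against the final layer stands in for the rejection-sampling rule, the deferred deeper-layer work is run in a batch instead of in parallel with later tokens, and all names (`decode_with_deferred_verification`, `verify_pending`, `toy_early_head`, `toy_final_head`) are hypothetical.

```python
from typing import Callable, List, Tuple

EarlyHead = Callable[[List[int]], Tuple[int, float]]
FinalHead = Callable[[List[int]], int]


def verify_pending(tokens: List[int], pending: List[int], final_head: FinalHead) -> bool:
    """Check early-emitted positions in order; roll back at the first mismatch."""
    for pos in pending:
        expected = final_head(tokens[:pos])
        if tokens[pos] != expected:
            del tokens[pos:]         # discard the wrong token and everything after it
            tokens.append(expected)  # replace it with the final-layer token
            return True
    return False


def decode_with_deferred_verification(
    prompt: List[int], early_head: EarlyHead, final_head: FinalHead,
    num_tokens: int, gamma: float,
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target = len(prompt) + num_tokens
    pending: List[int] = []  # positions emitted early, not yet verified
    rollbacks = 0
    while True:
        if len(tokens) < target:
            guess, confidence = early_head(tokens)
            if confidence >= gamma:
                pending.append(len(tokens))
                tokens.append(guess)
                continue
        # The head was unsure or the target length was reached: verify now.
        if verify_pending(tokens, pending, final_head):
            rollbacks += 1
        pending.clear()
        if len(tokens) >= target:
            break
        tokens.append(final_head(tokens))  # full-depth step on a verified prefix
    return tokens[len(prompt):], rollbacks


def toy_final_head(prefix: List[int]) -> int:
    return (7 * sum(prefix) + len(prefix)) % 13


def toy_early_head(prefix: List[int]) -> Tuple[int, float]:
    token = toy_final_head(prefix)
    if len(prefix) % 5 == 0:
        return (token + 1) % 13, 0.95  # confident but wrong
    if len(prefix) % 3 == 0:
        return token, 0.40             # right but unsure
    return token, 0.97


out, rollbacks = decode_with_deferred_verification(
    [1, 2], toy_early_head, toy_final_head, num_tokens=12, gamma=0.9
)
print(out, rollbacks)
```

Because every kept token is either produced by the final head or checked against it on a verified prefix, the output matches plain greedy decoding with the final head; the early head only changes how much full-depth work sits on the critical path.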

DEED Just-in-Time:

Maintain per-layer buffers for KV and hidden states, and recompute missing deeper-layer features exactly when they are required because earlier steps exited at different depths.

In Diffusion LLMs

Deferred Commitment:

At each substep over the current sliding window, commit all token positions whose top-1 confidence clears the threshold. If no position qualifies, commit the single most confident position to ensure progress.
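
The commit rule can be sketched as follows. The window policy (starting at the leftmost uncommitted position), the scoring function, and the names `deferred_commit` and `toy_score` are assumptions for illustration; a real DLM would obtain tokens and confidences from a denoising forward pass.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Scorer = Callable[[Sequence[Optional[int]], int], Tuple[int, float]]


def deferred_commit(
    length: int, window: int, threshold: float, score: Scorer
) -> Tuple[List[int], int]:
    """Fill `length` positions; return (tokens, number of substeps)."""
    seq: List[Optional[int]] = [None] * length
    substeps = 0
    while any(tok is None for tok in seq):
        # Assumed window policy: start at the leftmost uncommitted position.
        start = seq.index(None)
        stop = min(start + window, length)
        candidates = [i for i in range(start, stop) if seq[i] is None]
        # Score every candidate against the sequence as it stands now.
        scored = {i: score(seq, i) for i in candidates}
        ready = [i for i in candidates if scored[i][1] >= threshold]
        if not ready:
            # Nothing is confident enough: commit the best one to make progress.
            ready = [max(candidates, key=lambda i: scored[i][1])]
        for i in ready:
            seq[i] = scored[i][0]
        substeps += 1
    return [tok for tok in seq if tok is not None], substeps


def toy_score(seq: Sequence[Optional[int]], i: int) -> Tuple[int, float]:
    """Even positions are easy; odd ones gain confidence from committed neighbors."""
    token = 100 + i
    if i % 2 == 0:
        return token, 0.9
    known = sum(
        1 for j in (i - 1, i + 1) if 0 <= j < len(seq) and seq[j] is not None
    )
    return token, 0.4 + 0.25 * known


tokens, substeps = deferred_commit(length=6, window=4, threshold=0.8, score=toy_score)
print(tokens, substeps)
```

In this toy run the odd positions are deferred until both neighbors are committed, and the last position, which never gains a right neighbor, is committed through the fallback. With a threshold of 0 every position in the window is committed at once, which corresponds to the hard block-based behavior that deferral is meant to relax.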

Window and KV-Cache Management:

Dual caches are maintained for the prefix and the suffix and are recomputed or refreshed only after a set number of commitments, which keeps full attention within the active window efficient.

4. Computational Performance and Practical Impact

Deferred decoding yields substantial reductions in inference latency and computational load with controllable loss (and sometimes even gains) in output quality.

| Method | Max Speedup | Output Quality Loss | Architectural Requirements |
|---|---|---|---|
| DMTD (Luo et al., 13 Oct 2025) | 2.15× (batch=1) | ≤3.7% (for τ≤4) | Model partitioning, no aux. params |
| AdaDecode (Wei et al., 4 Jun 2025) | 1.73× (largest LM) | ≤0.5% (parity) | Lightweight heads, final-layer verification |
| DEED (Tang et al., 2023) | 2.3× (at τ=0.99) | None or slight gain | Multi-exit training, shallow adapters |
| DCD (Shu et al., 5 Jan 2026) | Comparable/lower | +1.39% avg gain | DLMs, confidence-window scoring |

In DMTD using Qwen3-4B (36 layers, τ=4, L_d=8), up to 2.15× speedup was observed with only a 3.7% accuracy degradation (Luo et al., 13 Oct 2025). AdaDecode achieves 1.73× throughput with ≥99.5% string-match parity, enabled by high rates of early exit at intermediate layers and robust parallel scheduling (Wei et al., 4 Jun 2025). DEED reduced decoder latency by 39–56% across VL decoding tasks without loss, by aggressively exiting at shallow layers (Tang et al., 2023). DCD in DLMs reported +1.39% average accuracy improvement, attributed to improved confidence-based context utilization, without notable inference slowdown (Shu et al., 5 Jan 2026).
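
The figures above mix latency reductions and speedup factors, so it helps to convert between them: if latency falls by a fraction r, the speedup is 1/(1 - r). The helper name below is hypothetical.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0 <= r < 1) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


for r in (0.39, 0.56):
    print(f"{r:.0%} lower latency -> {speedup_from_latency_reduction(r):.2f}x")
```

A 39% reduction corresponds to about 1.64× and a 56% reduction to about 2.27×. The latter is close to the 2.3× maximum listed for DEED, although the source does not state that the two numbers come from the same measurement.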

5. Empirical Design Choices and Trade-Offs

Performance and accuracy depend sensitively on block/window size, confidence thresholds, and cache management.

  • Block size (DMTD): Short blocks (τ≤4) lead to minimal loss; longer blocks (τ≥6) degrade quality (~18%).
  • Confidence thresholds: High thresholds (e.g., τ=0.99 in DEED, with a similarly strict setting in AdaDecode) minimize errors but reduce speedup; moderate thresholds improve speed at a small cost in quality.
  • Cache scheduling: Closely tied to hardware; deferred layers and just-in-time cache filling maximize hardware occupancy, especially for high batch sizes (Wei et al., 4 Jun 2025).
  • Sliding window (DCD): The two window hyperparameters balance context richness against progress rate; values that are too small or too large harm efficiency or quality.

6. Relationship to Speculative and Early-Exit Decoding

Deferred decoding differs fundamentally from speculative decoding frameworks (e.g., Speculative Decoding, EAGLE, Medusa):

  • No auxiliary draft model: Deferred methods like DMTD, AdaDecode, and DEED require neither a secondary model nor additional decoding heads (with temporary linear projection heads as the only exception) (Luo et al., 13 Oct 2025, Wei et al., 4 Jun 2025).
  • No post-generation verification (except AdaDecode): DMTD and DEED require no post-hoc output validation; AdaDecode performs rejection at final layers but always preserves AR output (Wei et al., 4 Jun 2025).
  • No added parameters, no two-stage routines: DMTD introduces no new parameters; AdaDecode's lightweight heads are trained via KL divergence, leaving the main model untouched (Luo et al., 13 Oct 2025, Wei et al., 4 Jun 2025).

A plausible implication is that deferred decoding can realize similar or greater efficiency gains than speculative approaches, but with smaller memory and codebase footprint.

7. Broader Implications and Limitations

Deferred decoding generalizes across model families (autoregressive transformer LMs, encoder-decoder VL models, DLMs). It is particularly advantageous for long-sequence generation, high-throughput inference, and deployment on accelerator hardware with substantial parallel compute capacity. Limitations occur in semi-causal diffusion models where block structure is fixed and in cases of intrinsic prediction ambiguity that cannot be resolved even with deferred context exposure. Design of native architectures permitting dynamic context expansion and hybridization with AR/diffusion generation represent open research directions (Shu et al., 5 Jan 2026).
