Deferred Decoding in Sequence Generation

Updated 17 June 2026

Deferred decoding is a strategy in sequence generation that defers expensive context encoding computations to improve efficiency in transformer and diffusion models.
It employs techniques like layer partitioning, block reuse, and dynamic confidence-based early exits to reduce per-token computational load.
Empirical approaches such as DMTD, DEED, and AdaDecode demonstrate significant speedups with minimal quality loss in high-throughput and long-sequence tasks.

Deferred decoding is a class of inference paradigms in sequence generation, particularly relevant for transformer-based LLMs and diffusion LLMs. Its defining characteristic is that a subset of model computations (the "expensive" context-encoding layers) is executed less frequently or only when needed, whereas other parts of the computation (often the "decoding" or output-generating layers) run for every token or at dynamically determined points. This approach is motivated by empirical observations that, in LLMs, early and middle layers primarily encode context and perform high-level reasoning, while late layers are more directly tied to token-level prediction. Deferred decoding strategies exploit this by amortizing or adapting context encoding, or by deferring certain computations based on model confidence or structural properties. The result is a significant reduction in average per-token computation and overall inference latency without substantial loss of performance (Luo et al., 13 Oct 2025, Tang et al., 2023, Wei et al., 4 Jun 2025, Shu et al., 5 Jan 2026).

1. Fundamental Principles and Rationale

The core principle of deferred decoding is selective or adaptive evaluation along the depth and/or temporal axes of sequence generation models.

Layerwise Partitioning in Transformers

A standard transformer can be decomposed into $L$ layers, with layers $f_{1:m}$ designated as "encoding + thinking" and layers $f_{m+1:L}$ as "decoding" (Luo et al., 13 Oct 2025). Deferred decoding leverages this nonuniform functional specialization by:

Amortizing expensive early/middle computations: Compute $f_{1:m}$ once for a block of tokens, reusing the resulting hidden states for several output steps.
Frequent reapplication of lightweight late layers: Apply $f_{m+1:L}$ for each output token, using the previously computed context embedding.
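The amortization argument above can be made concrete by counting sequential layer passes. The sketch below is illustrative only: the function name and parameters are hypothetical, and the example values take the Qwen3-4B configuration reported later in this article (36 layers, block size 4), assuming that L_d=8 denotes the number of late "decoding" layers so that the split falls at layer 28.

```python
def layer_passes_per_token(num_layers: int, split: int, block: int) -> float:
    """Average number of sequential layer passes per generated token.

    Assumes layers 1..split run once per block of `block` tokens and
    layers split+1..num_layers run once per token. Cache refills and
    other overheads are ignored, so this is a count, not a latency model.
    """
    if not 0 <= split <= num_layers:
        raise ValueError("split must lie between 0 and num_layers")
    if block < 1:
        raise ValueError("block must be at least 1")
    late_layers = num_layers - split
    return (split + block * late_layers) / block


# A block of one token reduces to ordinary layer-by-layer decoding.
baseline = layer_passes_per_token(num_layers=36, split=28, block=1)
deferred = layer_passes_per_token(num_layers=36, split=28, block=4)
ratio = baseline / deferred
print(baseline, deferred, ratio)  # 36.0 15.0 2.4
```

Under these assumptions the layer-pass count drops from 36 to 15 per token, a ratio of 2.4. This is only a rough ceiling on the achievable gain, since it leaves out the cost of refilling the early-layer caches.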

Dynamic Confidence-Based Deferment

Several schemes further incorporate dynamic, per-token confidence-based decision rules. Whenever the model achieves high-confidence predictions at a shallow layer, further (deeper) computation for that token is deferred or completely omitted unless needed for output consistency.
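A minimal sketch of such a decision rule is shown below. It assumes that every layer exposes a next-token distribution and that confidence is measured as the top-1 probability; both are illustrative choices, and the function name is hypothetical rather than taken from any of the cited methods.

```python
from typing import Sequence, Tuple


def early_exit(layer_probs: Sequence[Sequence[float]],
               threshold: float) -> Tuple[int, int]:
    """Pick the shallowest layer whose top-1 probability meets the threshold.

    layer_probs[i] is the next-token distribution predicted at layer i,
    ordered from shallow to deep. Returns (exit_layer_index, token_id).
    The deepest layer is always accepted as a fallback.
    """
    if not layer_probs:
        raise ValueError("layer_probs must be non-empty")
    for depth, probs in enumerate(layer_probs):
        token = max(range(len(probs)), key=lambda i: probs[i])
        is_last = depth == len(layer_probs) - 1
        if probs[token] >= threshold or is_last:
            return depth, token
    raise AssertionError("unreachable: the last layer always returns")


probs_by_layer = [
    [0.40, 0.35, 0.25],  # shallow layer: not confident
    [0.10, 0.85, 0.05],  # middle layer: confident in token 1
    [0.05, 0.90, 0.05],  # final layer
]
print(early_exit(probs_by_layer, threshold=0.80))  # (1, 1)
print(early_exit(probs_by_layer, threshold=0.95))  # (2, 1)
```

With the lower threshold the step exits at the middle layer and the final layer is skipped; with the stricter threshold no layer qualifies, so the full depth is used.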

Structural Motivation in Diffusion Models

In diffusion LLMs, deferred decoding counters the "Boundary-Induced Context Truncation" (BICT) problem, where block-based decoding forces early commitment of some tokens without sufficient future context. Deferring these commitments allows for better utilization of available context and reduces error propagation, especially near block boundaries (Shu et al., 5 Jan 2026).

2. Methodological Variants

Deferred decoding encompasses several concrete algorithmic instantiations, differing in model architecture (decoder-only, encoder-decoder, diffusion), policy for deferral, and verification mechanism.

Direct Multi-Token Decoding (DMTD)

DMTD (Luo et al., 13 Oct 2025) is prototypical for decoder-only transformers. The model is split at layer $m$:

Compute $h_{pre}=f_{1:m}(x_{<t})$ for prefix $x_{<t}$ once per block of $\tau$ tokens.
For each token $t$ in the block, generate logits with the late layers only: $f_{m+1:L}$ is applied to $h_{pre}$ for the first token and then reapplied recursively for each subsequent token.
Early/middle KV-caches are "refilled" cyclically after each block.
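The block structure of these steps can be sketched with toy stand-ins for the two halves of the model. Everything named here (`blockwise_decode`, `toy_early`, `toy_late`) is hypothetical, and the way the late layers carry state from one token to the next is an assumption made for illustration; no KV-cache handling is modelled.

```python
from typing import Callable, Dict, List, Tuple


def blockwise_decode(prefix: List[int], n_new: int, block: int,
                     early: Callable[[List[int]], int],
                     late: Callable[[int], Tuple[int, int]],
                     ) -> Tuple[List[int], Dict[str, int]]:
    """Generate n_new tokens, calling `early` once per block of `block`
    tokens and `late` once per token. Returns the tokens and call counts."""
    tokens = list(prefix)
    calls = {"early": 0, "late": 0}
    target = len(prefix) + n_new
    while len(tokens) < target:
        state = early(tokens)          # expensive context pass, once per block
        calls["early"] += 1
        for _ in range(min(block, target - len(tokens))):
            token, state = late(state)  # cheap pass, once per token
            calls["late"] += 1
            tokens.append(token)
    return tokens, calls


def toy_early(tokens: List[int]) -> int:
    """Stand-in for the early/middle layers: summarises the context."""
    return sum(tokens) % 97


def toy_late(state: int) -> Tuple[int, int]:
    """Stand-in for the late layers: emits a token and an updated state."""
    token = (state * 31 + 7) % 50
    return token, (state + token) % 97


output, calls = blockwise_decode([3, 5], n_new=8, block=4,
                                 early=toy_early, late=toy_late)
print(len(output), calls)  # 10 {'early': 2, 'late': 8}
```

Eight new tokens with a block size of four need two early passes and eight late passes, which is the amortization the method relies on.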

Dynamic Early Exit Decoding (DEED)

DEED (Tang et al., 2023) for encoder-decoder architectures augments the decoder with prediction heads at every layer and adaptation modules for shallower exits. At each generation step and layer:

Compute the token-level prediction and confidence.
If the confidence exceeds threshold $\tau$, exit early and avoid deeper computation for that step.
Deeper-layer features for prior steps are (re-)computed "just-in-time" as required for semantic alignment, with corresponding KV-cache management.

AdaDecode

AdaDecode (Wei et al., 4 Jun 2025) for LLMs attaches lightweight prediction heads to intermediate layers, enabling early exit: a token is emitted as soon as its confidence exceeds a threshold. The outstanding deeper-layer computations for such tokens are queued and executed in parallel with later tokens, so the compute overlaps. Final-layer verification then accepts or rejects each early token based on exact distributional agreement, yielding output parity with standard autoregressive decoding.

Deferred Commitment Decoding (DCD) in Diffusion Models

DCD (Shu et al., 5 Jan 2026) for DLMs replaces hard block-based token commitments with a sliding window and per-position top-1 confidence scoring. Tokens whose confidence meets the threshold are committed, while the others are deferred until sufficient left/right context is available, ensuring more robust bidirectional information flow and minimizing premature prediction.

3. Mathematical Formulation and Inference Algorithms

All effective deferred decoding systems rely on mathematical mechanisms to quantify when to defer computation, and on inference algorithms to schedule forward passes and manage (potentially partial) KV-caches.

In Transformer Architectures

DMTD Partitioning:

Given a model $f_{1:L}$ with $L$ layers, partition it as $f_{m+1:L} \circ f_{1:m}$, with hidden state after the early layers $h_{pre}=f_{1:m}(x_{<t})$.

Block size $\tau$.
Per block, perform a single $f_{1:m}$ pass and multiple $f_{m+1:L}$ passes.

AdaDecode Confidence and Verification:

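For each token and each intermediate layer equipped with a prediction head:

Sample a candidate token from that layer's head.
If its confidence exceeds the threshold, emit the token and defer the deeper computation.
At the final layer, apply rejection sampling: accept the early-emitted token when it agrees with the final-layer distribution, and reject it otherwise.

Rollback occurs only for rare mismatches (reported at roughly the 6% level for the threshold used).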
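The emit, defer, verify and roll-back cycle can be sketched as follows. This is a simplified illustration, not the AdaDecode implementation: it uses a greedy agreement check instead of sampling, it runs the deferred verification sequentially rather than in parallel, and all names (`decode_with_deferred_verification`, `toy_early_head`, `toy_full_model`, `max_pending`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def decode_with_deferred_verification(
        prefix: List[int], n_new: int, threshold: float,
        early_head: Callable[[List[int]], Tuple[int, float]],
        full_model: Callable[[List[int]], int],
        max_pending: int = 3) -> Tuple[List[int], Dict[str, int]]:
    """Emit confident early guesses, verify them later, roll back on mismatch."""
    if max_pending < 1:
        raise ValueError("max_pending must be at least 1")
    tokens = list(prefix)
    target = len(prefix) + n_new
    pending: List[int] = []   # positions emitted early, not yet verified
    stats = {"early_exits": 0, "rollbacks": 0}
    while True:
        if len(tokens) < target and len(pending) < max_pending:
            guess, confidence = early_head(tokens)
            if confidence >= threshold:
                pending.append(len(tokens))
                tokens.append(guess)          # tentative token
                stats["early_exits"] += 1
            else:
                tokens.append(full_model(tokens))
            continue
        if not pending:
            break
        # Deferred full-depth computation for the queued positions.
        for index in pending:
            verified = full_model(tokens[:index])
            if verified != tokens[index]:
                tokens[index] = verified
                del tokens[index + 1:]        # discard everything built on it
                stats["rollbacks"] += 1
                break
        pending.clear()
    return tokens, stats


def toy_full_model(context: List[int]) -> int:
    return (sum(context) * 7 + len(context)) % 11


def toy_early_head(context: List[int]) -> Tuple[int, float]:
    """Toy head: usually agrees with the full model, wrong at every 5th position."""
    guess = toy_full_model(context)
    if len(context) % 5 == 0:
        guess = (guess + 1) % 11
    confidence = 0.5 if len(context) % 3 == 0 else 0.9
    return guess, confidence


def greedy_reference(prefix: List[int], n_new: int) -> List[int]:
    tokens = list(prefix)
    for _ in range(n_new):
        tokens.append(toy_full_model(tokens))
    return tokens


output, stats = decode_with_deferred_verification(
    [3, 5], 10, 0.8, toy_early_head, toy_full_model)
print(output == greedy_reference([3, 5], 10), stats["rollbacks"])  # True 2
```

Because every early token is eventually checked against the full-depth prediction, the final sequence matches plain greedy decoding with the full model; the two deliberately wrong guesses in the toy head each trigger one rollback.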

DEED Just-in-Time:

Maintain buffers for KV and hidden states per layer, recompute missing deeper-layer features exactly when required due to heterogeneous exit depths.
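One way to read this bookkeeping is sketched below. It assumes that a step running up to layer k needs layer-k features for every earlier step, so any earlier step that exited below k must be backfilled at that moment. The function name and the tuple layout are hypothetical, and the actual DEED cache management may differ.

```python
from typing import List, Tuple


def just_in_time_backfill(exit_depths: List[int]) -> List[Tuple[int, int, int]]:
    """Simulate which deeper-layer features must be recomputed and when.

    exit_depths[t] is the (1-based) layer at which step t exits.
    Returns (current_step, earlier_step, layer) triples, one for each
    layer of an earlier step that has to be filled in on demand.
    """
    deepest_cached: List[int] = []   # deepest layer cached for each past step
    backfills: List[Tuple[int, int, int]] = []
    for step, depth in enumerate(exit_depths):
        for earlier in range(step):
            # Layers above the cached depth, up to the current depth, are missing.
            for layer in range(deepest_cached[earlier] + 1, depth + 1):
                backfills.append((step, earlier, layer))
            deepest_cached[earlier] = max(deepest_cached[earlier], depth)
        deepest_cached.append(depth)
    return backfills


events = just_in_time_backfill([2, 4, 1, 3])
print(events)
# [(1, 0, 3), (1, 0, 4), (3, 2, 2), (3, 2, 3)]
```

In this example, step 1 runs deeper than step 0 did, so layers 3 and 4 of step 0 are filled in on demand; step 2 exits at layer 1 and triggers nothing; step 3 then needs layers 2 and 3 of step 2.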

In Diffusion LLMs

Deferred Commitment:

At each substep over the sliding window, commit all token positions whose top-1 confidence meets the threshold. If none qualifies, commit the single most confident position to ensure progress.
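A toy version of this commitment loop is sketched below. The window is taken to start at the leftmost uncommitted position, and the confidence function is a made-up stand-in that rewards committed neighbours; both are assumptions for illustration, as are all names and numeric values.

```python
from typing import Callable, List, Set

BASE = [0.9, 0.2, 0.85, 0.3, 0.4, 0.55]  # illustrative per-position scores


def toy_confidence(pos: int, committed: Set[int]) -> float:
    """Made-up confidence: base score plus a bonus per committed neighbour."""
    neighbours = sum(1 for q in (pos - 1, pos + 1) if q in committed)
    return min(1.0, BASE[pos] + 0.3 * neighbours)


def deferred_commit(confidence: Callable[[int, Set[int]], float],
                    n_positions: int, window: int,
                    threshold: float) -> List[List[int]]:
    """Return the positions committed at each substep."""
    committed: Set[int] = set()
    schedule: List[List[int]] = []
    while len(committed) < n_positions:
        pending = [p for p in range(n_positions) if p not in committed]
        left = pending[0]
        active = [p for p in pending if p < left + window]
        scores = {p: confidence(p, committed) for p in active}
        chosen = [p for p in active if scores[p] >= threshold]
        if not chosen:
            # Nothing is confident enough: commit the best one to make progress.
            chosen = [max(active, key=lambda p: scores[p])]
        committed.update(chosen)
        schedule.append(chosen)
    return schedule


schedule = deferred_commit(toy_confidence, n_positions=6, window=4,
                           threshold=0.75)
print(schedule)  # [[0, 2], [1], [3], [4], [5]]
```

Positions 0 and 2 are committed immediately, position 1 follows once both of its neighbours are fixed, and positions 3 and 4 are committed through the fallback rule because nothing in the window reaches the threshold at those substeps.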

Window and KV-Cache Management:

Dual caches are maintained for the prefix and the suffix and are recomputed or refreshed only after a set number of commitments, ensuring efficient full attention within the active window.

4. Computational Performance and Practical Impact

Deferred decoding yields substantial reductions in inference latency and computational load with controllable loss (and sometimes even gains) in output quality.

Method	Max Speedup	Output Quality Change	Architectural Requirements
DMTD (Luo et al., 13 Oct 2025)	2.15× (batch=1)	≤3.7% loss (for τ≤4)	Model partitioning, no aux. params
AdaDecode (Wei et al., 4 Jun 2025)	1.73× (largest LM)	≤0.5% string mismatch (parity)	Lightweight heads, final-layer verification
DEED (Tang et al., 2023)	2.3× (at τ=0.99)	None or slight gain	Multi-exit training, shallow adapters
DCD (Shu et al., 5 Jan 2026)	Comparable/lower	+1.39% avg gain	DLMs, confidence-window scoring

In DMTD using Qwen3-4B (36 layers, τ=4, L_d=8), up to 2.15× speedup was observed with only a 3.7% accuracy degradation (Luo et al., 13 Oct 2025). AdaDecode achieves 1.73× throughput with ≥99.5% string-match parity, enabled by high rates of early exit at intermediate layers and robust parallel scheduling (Wei et al., 4 Jun 2025). DEED reduced decoder latency by 39–56% across VL decoding tasks without loss, by aggressively exiting at shallow layers (Tang et al., 2023). DCD in DLMs reported +1.39% average accuracy improvement, attributed to improved confidence-based context utilization, without notable inference slowdown (Shu et al., 5 Jan 2026).
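Latency reductions and speedup factors are two views of the same quantity, which makes the figures above easier to compare. The helper below is a small illustrative conversion (the function name is hypothetical); it applies only to the component whose latency is being measured, here the decoder.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.39 for 39%) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


low = speedup_from_latency_reduction(0.39)
high = speedup_from_latency_reduction(0.56)
print(round(low, 2), round(high, 2))  # 1.64 2.27
```

A 39–56% reduction in decoder latency therefore corresponds to roughly 1.64× to 2.27×, which is in line with the 2.3× maximum listed for DEED.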

5. Empirical Design Choices and Trade-Offs

Performance and accuracy depend sensitively on block/window size, confidence thresholds, and cache management.

Block size (DMTD): Short blocks ($\tau \leq 4$) lead to minimal loss, while longer blocks ($\tau \geq 6$) degrade quality (~18%).
Confidence thresholds: High thresholds (e.g., the setting used in AdaDecode, or τ=0.99 in DEED) minimize errors but reduce speedup; moderate thresholds improve speed at a small cost in accuracy.
Cache scheduling: Closely tied to hardware; deferred layers and just-in-time cache filling maximize hardware occupancy, especially for high batch sizes (Wei et al., 4 Jun 2025).
Sliding window (DCD): The window hyperparameters must balance context richness against progress rate; values that are too small or too large harm efficiency or quality.
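The threshold trade-off can be explored with a simple sweep over logged predictions. The sketch below is purely illustrative: the records are made-up values, not measurements from any of the cited papers, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

# Made-up (confidence, agrees_with_full_model) pairs for shallow predictions.
RECORDS = [
    (0.99, True), (0.97, True), (0.95, True), (0.92, False), (0.90, True),
    (0.85, True), (0.80, False), (0.70, True), (0.60, False), (0.50, False),
]


def sweep(records: Sequence[Tuple[float, bool]],
          thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold return (threshold, exit_rate, error_rate_among_exits)."""
    rows = []
    for threshold in thresholds:
        exited = [agrees for conf, agrees in records if conf >= threshold]
        exit_rate = len(exited) / len(records)
        error_rate = exited.count(False) / len(exited) if exited else 0.0
        rows.append((threshold, exit_rate, error_rate))
    return rows


rows = sweep(RECORDS, [0.6, 0.9, 0.99])
for threshold, exit_rate, error_rate in rows:
    print(f"threshold={threshold:.2f} exit_rate={exit_rate:.2f} "
          f"error_rate={error_rate:.2f}")
```

On this synthetic data, raising the threshold from 0.6 to 0.99 lowers the share of early exits from 90% to 10% while the error rate among exited tokens falls from about 33% to zero, which is the qualitative pattern described above.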

6. Relationship to Speculative and Early-Exit Decoding

Deferred decoding differs fundamentally from speculative decoding frameworks (e.g., Speculative Decoding, EAGLE, Medusa):

No auxiliary draft model: Deferred methods like DMTD, AdaDecode, and DEED require neither a secondary model nor additional decoding heads (with temporary linear projection heads as the only exception) (Luo et al., 13 Oct 2025, Wei et al., 4 Jun 2025).
No post-generation verification (except AdaDecode): DMTD and DEED require no post-hoc output validation; AdaDecode performs rejection at the final layer but always preserves the autoregressive (AR) output (Wei et al., 4 Jun 2025).
No added parameters, no two-stage routines: No new parameters are introduced in DMTD; AdaDecode's lightweight heads are trained via KL divergence, leaving the main model untouched (Luo et al., 13 Oct 2025, Wei et al., 4 Jun 2025).

A plausible implication is that deferred decoding can realize efficiency gains similar to or greater than those of speculative approaches, with a smaller memory and codebase footprint.

7. Broader Implications and Limitations

Deferred decoding generalizes across model families (autoregressive transformer LMs, encoder-decoder VL models, DLMs). It is particularly advantageous for long-sequence generation, high-throughput inference, and deployment on accelerator hardware with substantial parallel compute capacity. Limitations occur in semi-causal diffusion models where block structure is fixed and in cases of intrinsic prediction ambiguity that cannot be resolved even with deferred context exposure. Design of native architectures permitting dynamic context expansion and hybridization with AR/diffusion generation represent open research directions (Shu et al., 5 Jan 2026).

References

Luo et al. (2025). Direct Multi-Token Decoding.

Tang et al. (2023). DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models.

Wei et al. (2025). AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism.

Shu et al. (2026). Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows.
