Foreseeing Decoding Method (FDM)

Updated 22 June 2026
  • FDM is a decoding strategy for LLDMs that integrates local token confidence with global impact to optimize sequence generation.
  • It employs a search-based approach to evaluate downstream token effects, mitigating error propagation compared to local-only heuristics.
  • FDM–A adapts decoding phases to balance full search and acceleration, achieving near-FDM accuracy with a reported 3–5× speed-up in throughput.

The Foreseeing Decoding Method (FDM) is a decoding strategy for Large Language Diffusion Models (LLDMs) that integrates both local and global information about token selection to optimize sequence generation. LLDMs, which iteratively denoise a fully masked sequence and allow for non-autoregressive, parallelized inference, exhibit high sensitivity to the "decoding order"—the sequence in which masked positions are filled. FDM addresses critical limitations of existing heuristics by employing a search-based approach that considers the downstream impact of token choices, thereby offering improved accuracy and efficiency compared to traditional local-only methods (Mo et al., 3 Dec 2025).

1. Decoding-Order Sensitivity in Large Language Diffusion Models

LLDMs operate by iteratively filling in masked positions within a sequence, deviating fundamentally from left-to-right autoregressive paradigms. Their core advantage—parallelizable generation—introduces a challenging non-convex denoising path: predictions at one step are contingent upon all prior token choices, amplifying the consequences of early errors. Thus, the specific order in which positions are sampled ("decoding order") has outsized influence on the eventual output, with potentially divergent generations emerging from different sampling paths.

Heuristic decoding strategies prevalent in LLDMs—such as selecting tokens by highest local probability, margin, or lowest entropy—employ only local confidence at each step. While these outperform random ordering, they inherently disregard the longer-term implications of token selection, making them vulnerable to error propagation and suboptimal global outcomes (Mo et al., 3 Dec 2025).
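
The three local heuristics named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the toy distributions are all assumptions made for the example.

```python
import math

def local_confidences(probs):
    """Return (top probability, margin, negative entropy) for one position."""
    ordered = sorted(probs, reverse=True)
    top_prob = ordered[0]
    margin = ordered[0] - ordered[1]          # gap between the two best tokens
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return top_prob, margin, -entropy         # higher is better for all three

def pick_position(dists, score_index):
    """Pick the masked position with the highest chosen local score."""
    return max(dists, key=lambda pos: local_confidences(dists[pos])[score_index])

# Hypothetical predicted distributions for two masked positions (3-token vocabulary).
dists = {
    0: [0.50, 0.45, 0.05],
    1: [0.48, 0.26, 0.26],
}

by_probability = pick_position(dists, 0)
by_margin = pick_position(dists, 1)
by_entropy = pick_position(dists, 2)
print(by_probability, by_margin, by_entropy)
```

On this toy input the probability and entropy rules choose position 0 while the margin rule chooses position 1, which shows that local rules can disagree with each other even before any downstream effect is considered.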

2. Mathematical Foundations of FDM

FDM formalizes decoding as a search process over possible token selections, explicitly maximizing both immediate token likelihood and downstream sequence quality. Given a prompt $q$, vocabulary $M$, and diffusion length $T$, the optimal decoding objective is:

$$x_T^* = \arg\max_{x_T} p_{\text{data}}(x_0) \prod_{t=1}^T p_{\text{data}}(x_t \mid q, x_{0:t-1})$$

This can be decomposed to make explicit each token's influence:

$$x_T^* = \arg\max_{x_t} p_{\text{data}}(x_t \mid q, x_{0:t-1}) \times p_{\text{data}}(x_{t+1:T} \mid q, x_{0:t})$$

At inference, the data distribution $p_{\text{data}}$ is replaced with the trained model $p_\theta$, giving the FDM policy:

$$\pi_F(x_t \mid q, x_{t-1}) = \arg\max_{x_t} p_\theta(x_t \mid q, x_{t-1}) \cdot p_\theta(x_T \mid q, x_t)$$

With log-transformations:

$$\begin{aligned}
C_\text{local}(x_t) &= \log p_\theta(x_t \mid q, x_{t-1}) \\
C_\text{global}(x_t) &= \log p_\theta(x_T \mid q, x_t) \\
S(x_t) &= C_\text{local}(x_t) + C_\text{global}(x_t)
\end{aligned}$$
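
A small numeric sketch of the combined score may help. The candidate labels and probabilities below are made up for illustration, and `fdm_score` is a hypothetical helper name; only the formula (sum of the local and global log-probabilities) comes from the text.

```python
import math

def fdm_score(p_local, p_global):
    """S = C_local + C_global, i.e. the log of the product of both probabilities."""
    c_local = math.log(p_local)
    c_global = math.log(p_global)
    return c_local + c_global

# Hypothetical candidates: (local probability, probability of the final sequence given the choice).
candidates = {"A": (0.60, 0.10), "B": (0.40, 0.30)}

best_local = max(candidates, key=lambda name: candidates[name][0])
best_fdm = max(candidates, key=lambda name: fdm_score(*candidates[name]))
print(best_local, best_fdm)
```

Candidate A wins on local probability alone, but B has the larger product (0.12 vs. 0.06) and therefore the larger combined score, so the two rules select different tokens.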

FDM is theoretically guaranteed to achieve a KL divergence from the true data distribution that is no larger than that of any purely local policy $\pi_H$, with the gap given by a sum of conditional mutual information terms (Mo et al., 3 Dec 2025).

3. Foreseeing Decoding Algorithmic Procedure

The FDM procedure maintains a partially masked sequence at each step. Its workflow can be summarized as follows:

  1. Candidate generation & pruning: Compute the local confidence $C_\text{local}$ and select the top-ranked candidate tokens. Tokens with local confidence below a pruning threshold are pruned.
  2. Top–K narrowing: Rank the remaining candidates by $C_\text{local}$, keeping the top-$K$ as the candidate set.
  3. Foreseeing selection:
    • If the candidate set is empty, select by $C_\text{local}$ only.
    • Else, for each candidate $x_t$ in the set, compute $C_\text{global}(x_t)$ via forward passes and choose the token maximizing $S(x_t)$.
  4. Update: Fill the selected $x_t$ into the sequence; continue until all masks are filled.

Within a decoding step, the forward passes for computing $C_\text{global}$ can be run in parallel, and threshold pruning alongside top-$K$ narrowing reduces the computational burden. The number of extra forward passes per decoding step is proportional to the search width $K$.
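
The workflow above can be sketched as a single decoding step. This is a toy illustration under stated assumptions, not the reference implementation: `fdm_step`, the `(position, token, local_prob)` tuple layout, the default values of `k` and `threshold`, and the `toy_global` lookup that stands in for the extra forward passes are all hypothetical.

```python
import math

def fdm_step(candidates, global_logprob, k=2, threshold=0.1):
    """Choose one (position, token) pair from (position, token, local_prob) tuples."""
    # 1. Prune candidates whose local confidence is below the threshold.
    kept = [c for c in candidates if c[2] >= threshold]
    if not kept:
        # 3a. Empty candidate set: fall back to local confidence only.
        return max(candidates, key=lambda c: c[2])[:2]
    # 2. Top-K narrowing by local confidence.
    top_k = sorted(kept, key=lambda c: c[2], reverse=True)[:k]

    # 3b. Foreseeing selection: maximize S = C_local + C_global.
    def score(c):
        pos, tok, p_local = c
        return math.log(p_local) + global_logprob(pos, tok)

    pos, tok, _ = max(top_k, key=score)
    return pos, tok

# Toy stand-in for the model: a fixed table of global log-probabilities.
future = {(0, "the"): math.log(0.1), (1, "cat"): math.log(0.4)}

def toy_global(pos, tok):
    return future[(pos, tok)]

cands = [(0, "the", 0.6), (1, "cat", 0.5), (2, "zzz", 0.05)]
chosen = fdm_step(cands, toy_global)
fallback = fdm_step(cands, toy_global, threshold=0.9)
print(chosen, fallback)
```

In a real decoder the selected token would then be written into the partially masked sequence and the step repeated until no masks remain; the `global_logprob` calls for the top-`k` candidates are independent and could be batched.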

4. FDM with Acceleration (FDM–A)

FDM–A introduces an adaptive, phase-aware acceleration scheme based on empirical observations of the "consistency ratio", the rate of agreement between local-only and local+global selections. Early in decoding, this consistency is low (≈50%), warranting full FDM search; later, it exceeds 90%, justifying local-only acceleration. FDM–A operates in three regimes determined by two confidence thresholds that act as phase boundaries:

  1. Exploration phase: If no position meets the confidence threshold, apply full FDM with search width $K$.
  2. Balance phase: If tokens exist with confidence above a threshold (one set) or within the range between the two thresholds (a second set), decode the selected tokens in parallel using FDM with a phase-specific search width and pruning threshold.
  3. Acceleration phase: When many confident tokens are present, decode up to the maximum number of parallel tokens using purely local selection.

The hyperparameters (search width, local pruning threshold, the two phase boundaries, and the maximum number of parallel tokens) are set based on resource constraints and domain validation. FDM–A implements these phases to minimize global search overhead while maintaining performance.
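
One way to picture the phase switch is the sketch below. It is only an illustration of a three-regime rule driven by two phase boundaries: the names `eta_low`, `eta_high`, and `many`, their default values, and the exact conditions are assumptions for this example and should not be read as the paper's definitions.

```python
def choose_phase(confidences, eta_low=0.5, eta_high=0.9, many=2):
    """Pick a decoding regime from the local confidences of the remaining positions."""
    high = [c for c in confidences if c >= eta_high]            # confident set
    mid = [c for c in confidences if eta_low <= c < eta_high]   # intermediate set
    if not high and not mid:
        return "exploration"    # nothing confident yet: run full FDM search
    if len(high) >= many:
        return "acceleration"   # many confident tokens: local-only parallel decoding
    return "balance"            # some confident tokens: parallel decoding with FDM

early = choose_phase([0.20, 0.30, 0.10])
middle = choose_phase([0.95, 0.60, 0.10])
late = choose_phase([0.95, 0.97, 0.30])
print(early, middle, late)
```

The design intent mirrors the consistency-ratio observation: search is spent where local and global choices tend to disagree, and skipped where they mostly agree.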

5. Computational Complexity and Implementation

Each FDM step involves a number of probability queries per token that scales with the vocabulary size, plus a number of extra forward passes for global scoring that is proportional to the search width $K$. Threshold pruning and top–$K$ narrowing restrict the computational load, and all of the global-scoring forward passes can be parallelized on GPU hardware. The method is architecture-agnostic (no model changes are required) and can exploit KV-caching or block attention for repeated context encoding. Key hyperparameters include the search width $K$, the local pruning threshold, the two FDM–A phase boundaries, and the maximum number of parallel tokens.

Step | Main operation | Complexity
Token candidate | Local probabilities + threshold pruning | Scales with vocabulary size
Global scoring | $K$ forward passes (can be parallelized) | Proportional to $K$
Context update | Sequence fill, no extra memory/disk requirements | Per token
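
As a rough cost model, the sketch below counts model calls. It assumes, purely for illustration, that each decoding step needs one base forward pass for the local probabilities plus `k` foreseeing passes, and that the `k` passes can be batched into one round; the function names and the step count of 64 are hypothetical.

```python
def forward_passes(num_steps, k, base_passes=1):
    """Total model calls if every step runs the base pass plus k foreseeing passes."""
    return num_steps * (base_passes + k)

def sequential_rounds(num_steps, k):
    """Rounds of model calls when the k foreseeing passes are batched together."""
    rounds_per_step = 1 + (1 if k > 0 else 0)
    return num_steps * rounds_per_step

local_only = forward_passes(64, 0)
with_search = forward_passes(64, 2)
batched = sequential_rounds(64, 2)
print(local_only, with_search, batched)
```

Under these assumptions the total work grows linearly in `k`, while the number of sequential rounds at most doubles, which is why batching the foreseeing passes matters for throughput.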

6. Experimental Evaluation

FDM and FDM–A were benchmarked on GSM8K (mathematical reasoning), HumanEval (code generation), Countdown (arithmetic), and ARC (commonsense question answering), using LLDM variants LLaDA-8B-Instruct, LLaDA-1.5, LLaDA-MoE-7B, and MMaDA-8B. Metrics included accuracy (% correct answers) and throughput (tokens per second, TPS).

Key results:

  • On ARC with LLaDA-8B, FDM achieved 86.00% accuracy vs. 82.55% for the best heuristic (margin-based), with TPS 7.72 vs. 10.85.
  • Increasing the search width $K$ to 4 yielded further absolute accuracy gains at a cost of approximately 40% TPS reduction.
  • FDM–A delivered near-FDM accuracy with a 3–5× speed-up: on ARC, accuracy 86.30% (vs. 86.00% for FDM), TPS 38.20 (vs. 7.72).
  • Consistent performance and efficiency trends were observed across all four benchmarks and model architectures (Mo et al., 3 Dec 2025).
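
The trade-offs implied by the ARC figures quoted above can be checked with simple arithmetic. The numbers are taken from the results listed here; the variable names are illustrative only.

```python
# ARC results with LLaDA-8B as quoted in the text.
heuristic_acc, fdm_acc, fdma_acc = 82.55, 86.00, 86.30
heuristic_tps, fdm_tps, fdma_tps = 10.85, 7.72, 38.20

acc_gain = fdm_acc - heuristic_acc                 # accuracy points gained by FDM
tps_drop = 1 - fdm_tps / heuristic_tps             # fraction of throughput lost by FDM
speedup_vs_fdm = fdma_tps / fdm_tps                # FDM-A relative to FDM
speedup_vs_heuristic = fdma_tps / heuristic_tps    # FDM-A relative to the heuristic

print(f"{acc_gain:.2f} points, {tps_drop:.1%} drop, "
      f"{speedup_vs_fdm:.2f}x, {speedup_vs_heuristic:.2f}x")
```

On these figures FDM gains 3.45 accuracy points for roughly a 29% throughput drop, while FDM–A is about 4.95× faster than FDM and about 3.5× faster than the margin heuristic.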

7. Contributions, Limitations, and Future Directions

FDM introduces the global confidence term $C_\text{global}$ to directly quantify the long-term impact of token selection, moving beyond the locality of prior heuristics. Its search-based approach is theoretically guaranteed to yield a smaller KL divergence from the data distribution than purely local methods, with an improvement proportional to the sum of conditional mutual informations. FDM–A further enables adaptive acceleration by restricting full search to critical early steps, realizing a favorable efficiency–performance trade-off.

Limitations include increased inference cost due to multiple forward passes per decoding step and the necessity of empirical hyperparameter tuning (search width, local pruning threshold, the two phase boundaries, and the maximum number of parallel tokens) per model or domain. Potential extensions involve learning or adaptively predicting global scores to obviate extra forward passes, integration with advanced diffusion samplers or dynamic step-sizing, and application to conditional or multimodal diffusion (e.g., text-to-image) (Mo et al., 3 Dec 2025).
