Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

Published 12 Apr 2026 in cs.CL and cs.AI | (2604.10567v1)

Abstract: Diffusion-based LLMs (dLLMs) have emerged as a promising alternative to autoregressive LLMs, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.


Authors (5)

Summary

The paper demonstrates that early denoising decisions critically shape output trajectories through proximity bias, leading to premature EOS predictions.
It introduces a lightweight planner and EOS temperature annealing to strategically guide initial token unmasking and improve model accuracy.
Empirical results reveal improvements in tasks like GSM8K and Sudoku, with accuracy gains up to +10.2% in constrained computational settings.

Early Decisions Matter in Non-Autoregressive Diffusion LLMs

Background and Motivation

Diffusion-based LLMs (dLLMs) offer fundamental advantages over traditional autoregressive LLMs, notably parallel token generation and bidirectional context utilization. These properties theoretically promise higher inference efficiency and richer sequence modeling. However, prior work has demonstrated that fully non-autoregressive (NAR) decoding frequently fails to deliver coherent outputs, particularly in reasoning and planning scenarios, leading practitioners to revert to semi-autoregressive approaches that impose sequential constraints and undermine the principal benefits of dLLMs.

This paper rigorously interrogates the failure modes intrinsic to confidence-driven NAR decoding, focusing on the temporal dynamics of inference in dLLMs. The study elucidates the "proximity bias," a tendency for the denoising order to concentrate on spatially adjacent tokens, and characterizes how early unmasking decisions disproportionately anchor the entire trajectory, often resulting in premature termination via End-of-Sequence (EOS) predictions.

Analysis of Proximity Bias and Trajectory Dynamics

Proximity Bias in NAR Decoding

Empirical analysis across mathematical reasoning and planning tasks evidences that increased diffusion timesteps do not monotonically improve generation quality in the NAR regime. Instead, excessive steps exacerbate spatial error propagation, with valid content tokens often being occluded by premature EOS tokens predicted at later sequence positions. The proximity bias manifests as sequential unmasking of spatial neighbors of previously unmasked tokens, effectively collapsing NAR decoding into a reverse-autoregressive process. Under high initial uncertainty, structural tokens such as EOS gain disproportionate model confidence, leading to irreversible trajectory anchoring.
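One way to make the proximity bias described above measurable is to record the order in which positions are unmasked and check how often each newly unmasked position is an immediate neighbor of an already unmasked one. The sketch below is illustrative only: the function names and the "fraction of distance-1 steps" score are assumptions made for this example, not the paper's metric, and it assumes one position is unmasked per step.

```python
def nearest_unmasked_distances(unmask_order):
    """For every position after the first, return its distance to the
    nearest position that had already been unmasked."""
    distances = []
    seen = []
    for pos in unmask_order:
        if seen:
            distances.append(min(abs(pos - s) for s in seen))
        seen.append(pos)
    return distances


def proximity_score(unmask_order):
    """Fraction of steps that unmask an immediate neighbor (distance 1).

    Hypothetical summary statistic: 1.0 means every step extended an
    existing unmasked span by exactly one position.
    """
    d = nearest_unmasked_distances(unmask_order)
    return sum(1 for x in d if x == 1) / len(d) if d else 0.0


# A reverse-autoregressive order: start at the end and walk left.
reverse_ar = [7, 6, 5, 4, 3, 2, 1, 0]
# An order that jumps around the sequence before filling gaps.
scattered = [0, 4, 7, 2, 5, 1, 6, 3]

print(proximity_score(reverse_ar))            # 1.0
print(nearest_unmasked_distances(scattered))  # [4, 3, 2, 1, 1, 1, 1]
```

Under this toy score, a trajectory that has collapsed into reverse-autoregressive decoding scores 1.0, while an order that spreads its early choices across the sequence scores lower.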

Temporal Asymmetry and Initial Decision Criticality

The study reveals a sharp temporal asymmetry: the viability of the final output is largely determined by token and position selections made in the very first denoising step. Once proximity bias begins to propagate from a poor initial choice (e.g., EOS at the sequence end), subsequent confidence-based unmasking rapidly constrains the generation window, severely limiting the capacity for logic or content extension. Experiments show that diversity introduced during the initial step (randomized position sampling) is far more effective at mitigating this collapse than stochasticity applied uniformly across timesteps.
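The anchoring effect of the first step can be illustrated with a toy confidence-based decoder. In the sketch below, `toy_confidence` is a hypothetical stand-in for a dLLM's per-position confidence, constructed by hand to exhibit proximity bias; it is not derived from any real model, and `decode_order` is an assumed helper name. The only point is that changing the first unmasking position changes the whole order that greedy selection follows.

```python
def toy_confidence(unmasked, length):
    """Hypothetical per-position confidence with built-in proximity bias.

    Before anything is unmasked, confidence peaks at the final position
    (a stand-in for an early EOS prediction). Afterwards, confidence
    decays with distance to the nearest unmasked position.
    """
    if not unmasked:
        return [i / length for i in range(length)]
    return [1.0 / (1 + min(abs(i - u) for u in unmasked)) for i in range(length)]


def decode_order(length, confidence_fn, first_position=None):
    """Unmask one position per step, greedily by confidence.

    If first_position is given, it overrides the greedy choice at step 0
    only; every later step is ordinary confidence-based selection.
    """
    unmasked = []
    for step in range(length):
        masked = [i for i in range(length) if i not in unmasked]
        if step == 0 and first_position is not None:
            choice = first_position
        else:
            conf = confidence_fn(unmasked, length)
            # Ties resolve to the lowest index because max keeps the first maximum.
            choice = max(masked, key=lambda i: conf[i])
        unmasked.append(choice)
    return unmasked


print(decode_order(6, toy_confidence))                    # [5, 4, 3, 2, 1, 0]
print(decode_order(6, toy_confidence, first_position=2))  # [2, 1, 0, 3, 4, 5]
```

With the greedy first step, the toy decoder starts at the end and walks backward; overriding only the first position yields a different trajectory even though every later step uses the same rule.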

Minimal-Intervention Solution: Initial Trajectory Shaping

Lightweight Planner and EOS Temperature Annealing

To target the temporal asymmetry observed, the paper introduces two computationally minimal interventions:

Lightweight Planner: A 5M-parameter transformer encoder predicts optimal denoising positions for the first step, using only candidate position hidden states rather than full sequence context. The planner is trained offline using greedy decoding trajectories, maximizing task-level reward (correctness).
EOS Temperature Annealing: A dynamic temperature scaling suppresses the EOS token's dominance during high-uncertainty early steps, which mitigates premature generation window closure. The EOS logit is annealed linearly, preserving natural stopping behavior in later steps.

The planner acts only at the first denoising step, and EOS annealing is concentrated on the early, high-uncertainty steps, so both remain compatible with downstream confidence-based greedy decoding and add minimal computational overhead.
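The EOS annealing idea can be sketched as a temperature applied to the EOS logit alone, decaying linearly to 1 so that later steps are left untouched. The schedule endpoints, the starting value `tau_start=2.0`, and the choice to divide the logit by the temperature are assumptions made for this illustration; the summary states only that the annealing is linear. Note that dividing by a temperature above 1 lowers a logit only when that logit is positive.

```python
import math


def eos_temperature(step, total_steps, tau_start=2.0):
    """Linearly interpolate from tau_start at step 0 to 1.0 at the last step.

    tau_start is a hypothetical hyperparameter chosen for this example.
    """
    frac = step / max(total_steps - 1, 1)
    return tau_start + (1.0 - tau_start) * frac


def anneal_eos_logit(logits, eos_id, step, total_steps, tau_start=2.0):
    """Return a copy of logits with only the EOS entry divided by the
    current temperature (assumes a positive EOS logit, see lead-in)."""
    tau = eos_temperature(step, total_steps, tau_start)
    out = list(logits)
    out[eos_id] = out[eos_id] / tau
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 4.0]  # toy vocabulary of three tokens; index 2 plays the role of EOS
early = softmax(anneal_eos_logit(logits, eos_id=2, step=0, total_steps=5))
late = softmax(anneal_eos_logit(logits, eos_id=2, step=4, total_steps=5))
print(round(early[2], 3), round(late[2], 3))
```

In this toy case the EOS probability is reduced at step 0 and equals the unmodified softmax at the final step, which matches the stated goal of suppressing early EOS dominance while preserving natural stopping behavior later.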

Numerical Results

Significant accuracy improvements are reported on reasoning-intensive tasks across both low- and high-compute regimes. On GSM8K under constrained computation, the combination of planner and EOS annealing elevates accuracy from 44.8% (greedy baseline) to 47.6%, with task-specific gains up to +10.2%. The planner outperforms both random initialization and temperature-based token sampling, demonstrating robustness even in structurally constrained tasks like Sudoku, where randomness would otherwise severely degrade performance.

Ablation studies show that the effect of the planner generalizes to larger timestep budgets (T = 64, 128), consistently outperforming baseline and stochastic approaches. The candidate pool size (P) analysis identifies P = 32 as the optimal trade-off point. Applications to other models (Dream 7B Instruct) and tasks suggest that the approach generalizes beyond the primary setup.
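The role of the candidate pool size P can be sketched as a two-stage selection: restrict attention to P candidate positions, then let a scorer pick among them using only those positions' hidden states. How the paper forms the pool is not stated in this summary, so ranking by model confidence is an assumption here, and `score_fn` is a hypothetical stand-in for the trained planner (which the summary describes as a small transformer encoder, not the toy function used below).

```python
def candidate_pool(confidences, pool_size):
    """Indices of the pool_size most confident positions (assumed pool rule)."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return order[:pool_size]


def plan_first_position(confidences, hidden_states, score_fn, pool_size=32):
    """Pick the first unmasking position among the pooled candidates.

    score_fn sees only the hidden state of each candidate position,
    mirroring the description that the planner uses candidate position
    hidden states rather than full sequence context.
    """
    candidates = candidate_pool(confidences, pool_size)
    return max(candidates, key=lambda i: score_fn(hidden_states[i]))


# Toy inputs: four positions with two-dimensional hidden states.
confidences = [0.1, 0.9, 0.4, 0.8]
hidden_states = [[1, 0], [0, 1], [5, 2], [3, 0]]


def toy_score(h):
    """Hypothetical scorer: just reads the first hidden dimension."""
    return h[0]


print(plan_first_position(confidences, hidden_states, toy_score, pool_size=2))  # 3
print(plan_first_position(confidences, hidden_states, toy_score, pool_size=4))  # 2
```

A small pool keeps the scorer cheap but can exclude the position it would have preferred (position 2 in the toy data), which is the kind of trade-off a pool-size ablation probes.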

Implications and Future Directions

The findings underscore that early trajectory decisions in NAR dLLM decoding are critical, introducing an irreversible path dependency driven by proximity bias. The deployed lightweight planner and EOS annealing strategies successfully mitigate structural collapse without compromising inference speed or parallelism.

Practically, this enables fully non-autoregressive dLLMs to achieve competitive performance in reasoning and planning tasks, unlocking the long-awaited parallel decoding potential. The modular interventions are orthogonal to existing sampling heuristics, suggesting broad applicability in industrial low-latency deployment scenarios and multi-step instruction-following frameworks.

However, questions remain regarding open-ended generation (e.g., long-form text, creative writing), where structural priors are less pronounced and proximity bias dynamics may differ. Systematic integration of richer planner architectures, trajectory rollouts, and LLM-based evaluation may further refine trajectory shaping and output quality.

Conclusion

This work delivers a rigorous spatiotemporal dissection of NAR dLLM decoding, identifying proximity bias and early decision anchoring as primary obstacles. Through minimal yet effective interventions (a lightweight planner and targeted EOS annealing), the paper achieves substantial quality improvements while maintaining efficiency. These results advance practical parallel NAR decoding and highlight new avenues for optimization in diffusion-based sequence generation (2604.10567).
