
Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models

Published 10 May 2026 in cs.LG | (2605.09303v1)

Abstract: Diffusion LLMs (DLMs) offer a structural alternative to autoregressive generation: denoising can update tokens in arbitrary orders or in parallel rather than along a fixed left-to-right chain. In practice, fast DLM decoding remains strongly order-sensitive and often drifts toward autoregressive-like trajectories. We trace this tension to compatibility. At each reverse-time step, a DLM provides local denoising conditionals over the unresolved tokens. Arbitrary-order denoising becomes well defined when these local conditionals compose into order-invariant pseudo-joints. We formalize this view by defining order-induced pseudo-joints and a local denoising circulation: the log-ratio between the two pseudo-joints obtained by swapping a pair of unresolved positions. This circulation is zero under compatible conditionals, and global order gaps decompose into sums of local circulations along adjacent swaps. We further separate incompatibility-driven path dependence from conditional-dependence error in parallel updates and from order-specific estimation error. The resulting framework provides inference-only diagnostics for testing when DLM decoding is genuinely order-free.

Authors (1)

Summary

  • The paper introduces a pseudo-joint framework using local curl to quantify order sensitivity in Diffusion Language Models.
  • It decomposes parallel decoding failures into pseudo-joint incompatibility, conditional total correlation, and order-specific estimation errors.
  • The study proposes actionable regularization and scheduling strategies to mitigate order collapse in non-autoregressive text generation.

Path-Dependent Denoising and Order Collapse in Diffusion LLMs

Overview and Motivation

The work "Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models" (2605.09303) presents a formal analysis of order sensitivity and path dependence in diffusion LLMs (DLMs). Unlike autoregressive (AR) LLMs, which enforce a left-to-right dependency during generation, DLMs permit arbitrary-order or parallel decoding by iteratively denoising corrupted text. In practice, however, fast DLM decoding remains strongly order-sensitive and often reverts toward AR-like patterns even though the model interface does not restrict the order.

This paper addresses the structural underpinnings of this phenomenon, moving beyond task-level and data-level explanations to a framework rooted in the compatibility of local denoising conditionals and the way they compose. The central objects of study are "order-induced pseudo-joints" and the "local circulation" (curl), which quantifies how incompatibility among local conditionals leads to path-dependent generative trajectories in DLMs.

Compatibility, Pseudo-Joints, and Curl

At every step, a DLM provides local conditional probabilities for substituting unresolved tokens. For a set of unresolved positions $B$ and a permutation (decoding order) $\pi$ of $B$, the sequential product of these conditionals, taken in a specific order, defines a "pseudo-joint" $Q_{\theta,t}^{\pi}(x_B \mid x_S)$. Invariance of $Q_{\theta,t}^{\pi}$ across different orders $\pi$ signals order-free semantics; variability indicates structural incompatibility of the denoiser's local conditionals.

The paper introduces a formal diagnostic for such incompatibility: the "local curl" $C_{ij}^{a,b}(x_S, t)$, defined as the log-ratio of pseudo-joints across swapped decoding orders for pairs of positions $(i, j)$ and token assignments $(a, b)$. Curl encapsulates non-conservativity in the field of predictive updates: nonzero curl implies that the sequence of updates (the generative path) matters, thereby exposing the intrinsic path dependence of the model.
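To make the definition concrete, the following is a minimal sketch of the curl for one pair of positions, assuming the pseudo-joint for each order is the product of a first-step conditional and a second-step conditional. The function names, argument layout, and the toy probability tables are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pseudo_joints(p_i, p_j, p_j_given_i, p_i_given_j):
    """Return the two order-induced pseudo-joints, both indexed [a, b].

    p_i[a]            : p(x_i = a | x_S)
    p_j[b]            : p(x_j = b | x_S)
    p_j_given_i[a, b] : p(x_j = b | x_S, x_i = a)
    p_i_given_j[b, a] : p(x_i = a | x_S, x_j = b)
    """
    q_ij = p_i[:, None] * p_j_given_i        # resolve i first, then j
    q_ji = (p_j[:, None] * p_i_given_j).T    # resolve j first, then i
    return q_ij, q_ji

def local_curl(p_i, p_j, p_j_given_i, p_i_given_j):
    """Log-ratio of the two pseudo-joints for every token pair (a, b)."""
    q_ij, q_ji = pseudo_joints(p_i, p_j, p_j_given_i, p_i_given_j)
    return np.log(q_ij) - np.log(q_ji)

# Hypothetical conditionals that do not come from any single joint.
p_i = np.array([0.7, 0.3])
p_j = np.array([0.5, 0.5])
p_j_given_i = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_i_given_j = np.array([[0.6, 0.4],
                        [0.8, 0.2]])

q_ij, q_ji = pseudo_joints(p_i, p_j, p_j_given_i, p_i_given_j)
curl = local_curl(p_i, p_j, p_j_given_i, p_i_given_j)
```

Both pseudo-joints are valid distributions over $(a, b)$, yet they disagree entry by entry, so every entry of `curl` is nonzero for these tables.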

Key theoretical results include:

  • Theorem 1: The local curl is exactly the log-density ratio between the two orderings of a pair, and its expectation, taken under the pseudo-joint in the numerator, is the KL divergence between the two order-induced pseudo-joints.
  • Proposition: Global order consistency (true order invariance) on a block is equivalent to vanishing curl (curl-freeness) on every reachable local "square" in the block; this is a discrete analog of conservative fields in vector calculus.
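In symbols, these statements can be sketched as follows. The superscripts $(i,j)$ and $(j,i)$ for the two pair orders and the intermediate orders $\pi_k$ are notation assumed here for illustration; the summary does not reproduce the paper's exact notation.

```latex
\begin{align*}
C_{ij}^{a,b}(x_S, t)
  &= \log \frac{Q_{\theta,t}^{(i,j)}(x_i = a,\, x_j = b \mid x_S)}
               {Q_{\theta,t}^{(j,i)}(x_i = a,\, x_j = b \mid x_S)}, \\
\mathbb{E}_{(a,b) \sim Q_{\theta,t}^{(i,j)}}\!\left[ C_{ij}^{a,b}(x_S, t) \right]
  &= \mathrm{KL}\!\left( Q_{\theta,t}^{(i,j)} \,\middle\|\, Q_{\theta,t}^{(j,i)} \right) \;\ge\; 0, \\
\log \frac{Q_{\theta,t}^{\pi}(x_B \mid x_S)}{Q_{\theta,t}^{\pi'}(x_B \mid x_S)}
  &= \sum_{k=1}^{m} \log \frac{Q_{\theta,t}^{\pi_{k-1}}(x_B \mid x_S)}{Q_{\theta,t}^{\pi_{k}}(x_B \mid x_S)},
  \qquad \pi_0 = \pi,\ \pi_m = \pi'.
\end{align*}
```

The last line is a telescoping identity. If each $\pi_k$ differs from $\pi_{k-1}$ by one adjacent swap, and the conditionals depend only on which tokens are already resolved, the factors outside the swapped pair cancel, so each summand reduces to a local curl evaluated with the shared prefix added to the context.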

This diagnostic framework distinguishes order-dependence arising from incompatibility (nonzero curl) from issues introduced by other factors such as within-block conditional dependence, providing a clear separation of failure modes in DLM decoding.

Decomposition of Parallel Decoding Failure

A pivotal contribution of the paper is the decomposition of parallel decoding failures into three mechanistically distinct sources:

  1. Pseudo-joint incompatibility (curl): Nonzero curl quantifies path dependence due to structural incompatibility among local conditionals.
  2. Conditional total correlation (TC): Even with compatible conditionals (zero curl), if the unresolved tokens remain strongly dependent given the context, parallel independent updates incur a "TC penalty": the KL divergence between the conditional joint and the product of its marginals, equivalently the gap between the summed marginal entropies and the joint entropy (see Theorem 4).
  3. Order-specific estimation error: The model may have varying estimation errors depending on the context order induced by the scheduler; the path that minimizes cumulative conditional estimation error can become preferred (see Theorem 5).
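The second source can be illustrated numerically. The sketch below computes total correlation for a toy joint that stands in for the distribution of a block given a fixed context; the helper names and the two example joints are assumptions made for illustration.

```python
from functools import reduce
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def marginals(joint):
    """One marginal per axis of the joint table."""
    joint = np.asarray(joint, dtype=float)
    axes = range(joint.ndim)
    return [joint.sum(axis=tuple(k for k in axes if k != axis)) for axis in axes]

def total_correlation(joint):
    """Sum of marginal entropies minus the joint entropy."""
    return sum(entropy(m) for m in marginals(joint)) - entropy(joint)

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# Two tokens that must agree: a parallel update samples them independently.
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])
parallel = reduce(np.multiply.outer, marginals(dependent))

tc_dependent = total_correlation(dependent)
kl_penalty = kl(dependent, parallel)

# Two tokens that are already independent given the context.
independent = np.outer([0.7, 0.3], [0.6, 0.4])
tc_independent = total_correlation(independent)
```

In the dependent case the product of marginals places half of its mass on mismatched token pairs that the joint never produces, and `tc_dependent` equals `kl_penalty`. In the independent case the penalty vanishes, so parallel updates lose nothing.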

Of particular note, the paper proves (Theorem 3) that in the Bayes-optimal limit under uniform masking, DLMs should exhibit zero curl regardless of the inherent directionality of natural language. Thus, nonzero curl in practice is attributed to limitations in data coverage, model capacity, or optimization—not to language itself.
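The intuition can be sketched with the chain rule, under the simplifying assumption that the denoiser's conditionals are exactly those of a single joint $P$ over the unresolved tokens given $x_S$. This is an illustration of why compatibility forces zero curl, not a reproduction of the paper's proof.

```latex
\begin{align*}
Q^{(i,j)}(a, b \mid x_S)
  &= P(x_i = a \mid x_S)\, P(x_j = b \mid x_S, x_i = a)
   = P(x_i = a,\, x_j = b \mid x_S), \\
Q^{(j,i)}(a, b \mid x_S)
  &= P(x_j = b \mid x_S)\, P(x_i = a \mid x_S, x_j = b)
   = P(x_i = a,\, x_j = b \mid x_S), \\
C_{ij}^{a,b}(x_S, t)
  &= \log \frac{Q^{(i,j)}(a, b \mid x_S)}{Q^{(j,i)}(a, b \mid x_S)} = 0 .
\end{align*}
```

Both factorizations recover the same joint regardless of which position is resolved first, which is why any directionality in the data itself does not show up as curl.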

Empirical and Diagnostic Implications

The theoretical framework operationalizes several diagnostics and validation protocols:

  • Empirical measurement of curl via direct computation of pairwise log-ratios of pseudo-joints at fixed reverse-time steps. This serves as a path-independence diagnostic for order-free generation.
  • TC-proxies assist in evaluating whether observed parallelism failure is explained by conditional dependence rather than incompatibility.
  • Order-specific loss profiling: Quantifies to what extent different decoding orderings induce lower model estimation error, rationalizing the empirical drift toward AR-like schedules.
  • Commutator-based diagnostics: Examine if decoder-specific update operators transform model-level order discrepancies into concrete predictive divergence.

The proposed validation protocols—ranging from synthetic probing to regression of parallel degradation against the three identified factors—are intended to separate causal influences, benchmarking pseudo-joint and TC effects directly.
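As a rough illustration of the first diagnostic, the sketch below measures one curl value with three forward passes of a denoiser. The interface is hypothetical: `denoiser(tokens)` is assumed to return an array of per-position token probabilities, and `MASK` is an assumed id for unresolved positions. The toy denoiser is deliberately AR-like: position 1 conditions on position 0, but position 0 ignores position 1.

```python
import numpy as np

MASK = -1  # hypothetical id marking an unresolved position

def measure_curl(denoiser, tokens, i, j, a, b):
    """Curl for positions (i, j) and tokens (a, b) from three forward passes."""
    tokens = np.asarray(tokens)
    base = denoiser(tokens)                  # pass 1: first-step conditionals given x_S
    filled_i = tokens.copy()
    filled_i[i] = a
    filled_j = tokens.copy()
    filled_j[j] = b
    p_j_after_i = denoiser(filled_i)[j, b]   # pass 2: x_j after x_i is resolved
    p_i_after_j = denoiser(filled_j)[i, a]   # pass 3: x_i after x_j is resolved
    log_q_ij = np.log(base[i, a]) + np.log(p_j_after_i)
    log_q_ji = np.log(base[j, b]) + np.log(p_i_after_j)
    return float(log_q_ij - log_q_ji)

# Toy joint over two binary tokens (illustrative numbers only).
P = np.array([[0.4, 0.1],
              [0.2, 0.3]])

def toy_denoiser(tokens):
    """Returns probs[position, token] for a length-2 sequence."""
    probs = np.empty((2, 2))
    probs[0] = P.sum(axis=1)                 # position 0 never uses right context
    if tokens[0] == MASK:
        probs[1] = P.sum(axis=0)             # marginal of position 1
    else:
        row = P[tokens[0]]
        probs[1] = row / row.sum()           # exact conditional given position 0
    return probs

curl_00 = measure_curl(toy_denoiser, [MASK, MASK], i=0, j=1, a=0, b=0)
```

Because the left-to-right order reproduces the joint while the reverse order does not, the measured curl is nonzero. Averaging such values over sampled pairs and token assignments would give a block-level estimate, at a cost that grows with the number of pairs examined.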

Implications for Model Design and Training

On the practical front, the authors outline several actionable regularization and design avenues:

  • Elementary-circulation regularization: Directly penalizes high curl on candidate (parallel) blocks during training, targeting better compatibility of local denoisers for parallel update regimes.
  • Commutator-aware and TC-aware scheduling: In decoding, block selection can be optimized by jointly considering confidence, curl, and within-block conditional dependence, rather than relying solely on token-wise uncertainty or confidence.
  • Parallel trajectory supervision: Augmenting training with multiple equivalent reasoning paths or permuted supervision trajectories reduces order bias and pseudo-joint mismatch for intended parallel operations.
  • Potential-based DLMs: The long-term direction of learning compatible global potentials (joint models from which local denoisers are derived) is discussed, highlighting normalization and scalability challenges in the discrete, high-dimensional setting.
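The scheduling idea in the list above can be sketched as a greedy block selector. Everything here is a hypothetical illustration rather than the authors' procedure: the scoring rule, the weights `lam_curl` and `lam_tc`, and the example matrices are assumptions, and the pairwise curl and dependence estimates are taken as given.

```python
import numpy as np

def select_block(confidence, curl, dependence, max_size,
                 lam_curl=1.0, lam_tc=1.0, min_gain=0.0):
    """Greedily pick positions to update in parallel.

    confidence[k]    : per-position confidence score
    curl[k, l]       : estimated |curl| between positions k and l
    dependence[k, l] : pairwise proxy for within-block dependence
    """
    block = [int(np.argmax(confidence))]
    while len(block) < max_size:
        best, best_gain = None, min_gain
        for k in range(len(confidence)):
            if k in block:
                continue
            # Penalize interactions with positions already in the block.
            penalty = (lam_curl * curl[k, block].sum()
                       + lam_tc * dependence[k, block].sum())
            gain = confidence[k] - penalty
            if gain > best_gain:
                best, best_gain = k, gain
        if best is None:
            break  # no remaining position is worth adding
        block.append(best)
    return sorted(block)

confidence = np.array([0.9, 0.85, 0.6, 0.8])
curl = np.zeros((4, 4))
curl[0, 1] = curl[1, 0] = 0.9          # positions 0 and 1 are order-sensitive
dependence = np.zeros((4, 4))
dependence[0, 3] = dependence[3, 0] = 0.1
dependence[2, 3] = dependence[3, 2] = 0.7

block = select_block(confidence, curl, dependence, max_size=3)
confidence_only = sorted(np.argsort(confidence)[-3:].tolist())
```

With these numbers, a confidence-only rule would update positions 0, 1, and 3 together, whereas the penalized rule keeps position 1 out of the block because of its high curl with position 0 and stops at two positions.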

Theoretical and Future Developments

The paper’s findings have significant implications for both the theory and practice of non-autoregressive generative models. The compatibility framework defines what it means, in exact structural terms, for a DLM to support truly arbitrary-order or fully parallel generation—thereby clarifying goals for training and modeling approaches.

From a theoretical standpoint, the formalization specifies necessary and sufficient conditions for order-independent semantics in DLMs and identifies order-induced path dependence as a measurable, nontrivial phenomenon. There are also immediate consequences for evaluation, as validated curl and TC measurements can more faithfully identify genuine parallelization bottlenecks than heuristic proxies such as entropy or mask confidence.

The work suggests several future research directions:

  • Large-scale empirical measurement of pseudo-joint gaps and their predictive power for parallel degradation.
  • Designing more efficient approximations or scalable diagnostics for curl and conditional dependency measures in large DLMs.
  • Development of training objectives and architectures that directly encourage compatibility and manage within-block dependence.

Conclusion

This paper provides a rigorous, mechanism-centric account of order sensitivity and the collapse toward AR-like trajectories in diffusion LLMs. By formalizing arbitrary-order denoising as a pseudo-joint invariance problem, introducing curl as a path-dependence diagnostic, and classifying failure modes, the work establishes a unified perspective on the challenges of parallel and order-free text generation with DLMs. It frames both diagnosis and potential remedies, and raises new questions on the structure, training, and evaluation of non-autoregressive generative LLMs (2605.09303).
