Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

Published 18 May 2026 in stat.ML and cs.LG | (2605.18204v1)

Abstract: Discrete diffusion models are a powerful class of generative models with strong performance across many domains. For efficiency, however, discrete diffusion typically parameterizes the generative (reverse) process with factorized distributions, which makes it difficult for the model to learn the target process in a small number of steps and necessitates a long, computationally expensive sampling procedure. To reduce the gap between the target and model distributions and enable few-step generation, we propose Forward-Learned Discrete Diffusion (FLDD), which introduces discrete diffusion with a learnable forward (noising) process. Rather than fixing a Markovian forward chain, we adopt a non-Markovian formulation with learnable marginal and posterior distributions. This allows the generative process to remain factorized while matching the target defined by the noising process. We train all parameters end-to-end under the standard variational objective. Experiments on various benchmarks show that, for a given number of sampling steps, our approach produces higher-quality samples than conventional discrete diffusion models using the same reverse parameterization.

Authors (4)

Summary

The paper introduces FLDD, a novel framework that learns an adaptive forward process to align with a fixed reverse denoising model for efficient few-step generative modeling.
It employs a learnable, non-Markovian forward process with coordinate-wise categorical distributions and a Maximum Coupling strategy to optimize variational bounds.
Experiments on text, molecular, and image data demonstrate that FLDD maintains high sample quality and speed even when using significantly reduced denoising steps.

Forward-Learned Discrete Diffusion: Summary and Analysis

Introduction and Motivation

Discrete diffusion models provide a compelling alternative to autoregressive generative approaches by supporting parallel sampling across all coordinates of the data. However, a core limitation is the prohibitive number of denoising steps required for high-fidelity samples, largely because the reverse process is parameterized with factorized distributions, which cannot capture dependencies between coordinates within a single step. The resulting mismatch between the target and model distributions becomes most severe when the number of reverse (sampling) steps is reduced, degrading sample quality and restricting the benefits of parallelism.
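The mismatch described above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: it assumes a toy joint distribution over two binary coordinates and uses the fact that the product of marginals is the KL-optimal factorized approximation, so the remaining KL divergence (the mutual information) is an error that no factorized model can remove in one step. The names `kl` and `factorized_gap` are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for arrays of equal shape; entries with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def factorized_gap(joint):
    """KL between a 2-D joint table and the product of its marginals.

    The product of marginals minimizes KL(joint || q) over factorized q, so this
    value is the smallest error a factorized model can reach on this joint.
    """
    joint = np.asarray(joint, dtype=float)
    p_first = joint.sum(axis=1, keepdims=True)   # marginal of coordinate 1
    p_second = joint.sum(axis=0, keepdims=True)  # marginal of coordinate 2
    return kl(joint, p_first * p_second)

# Toy targets (assumptions for illustration): two perfectly correlated bits
# versus two independent bits.
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])
independent = np.array([[0.25, 0.25], [0.25, 0.25]])

gap_correlated = factorized_gap(correlated)    # equals log 2
gap_independent = factorized_gap(independent)  # equals 0
```

In this toy setting the factorized family pays a cost of log 2 nats on the correlated target and nothing on the independent one, which is the kind of gap a learned forward process is meant to shrink by choosing intermediate targets that are closer to factorized.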

The "Forward-Learned Discrete Diffusion" (FLDD) framework directly addresses this by introducing a data-adaptive, learnable forward (noising) process. Rather than adhering to the conventional Markovian, fixed-form forward transition (such as uniform or absorbing/masking noise), FLDD parameterizes both the forward marginals and the forward posteriors, allowing a non-Markovian, data-dependent schedule of information destruction. The generative process remains unchanged (i.e., the standard factorized reverse model), but the target distribution for the reverse process is now induced by the learned forward path. This compensates for the reverse model's limited expressivity and enables efficient, few-step generative modeling.

Methodology

Learnable Forward Process

FLDD generalizes the forward process by directly modeling $q_\psi(z_t \mid x)$ and $q_\psi(z_s \mid z_t, x)$ as learnable distributions. Crucially, this allows the marginalized target $q_\psi(z_s \mid z_t)$ (which defines the optimal reverse denoising target) to be explicitly shaped to match the factorized family the generator can represent. The forward marginals are implemented as coordinate-wise categorical distributions, each conditioned on the full input $x$. For posterior construction (to enable the KL computations in the variational lower bound), a Maximum Coupling strategy is adopted, which efficiently redistributes categorical mass between steps.
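To illustrate how a posterior can be derived from two categorical marginals, here is a minimal sketch of the textbook maximal coupling for a single coordinate. It is an assumption that the paper's Maximum Coupling construction takes exactly this form; the function names and the example marginals are hypothetical.

```python
import numpy as np

def maximal_coupling(p, q):
    """Joint table J[i, j] with row marginal p and column marginal q that maximizes P(i == j)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    overlap = np.minimum(p, q)        # mass that can stay on the same category
    joint = np.diag(overlap)
    residual = 1.0 - overlap.sum()    # total variation distance between p and q
    if residual > 1e-12:
        # p - overlap and q - overlap have disjoint supports, so this outer
        # product only adds off-diagonal mass.
        joint += np.outer(p - overlap, q - overlap) / residual
    return joint

def posterior_given_later(p_s, p_t):
    """Column-normalized coupling: entry [i, j] is q(z_s = i | z_t = j)."""
    joint = maximal_coupling(p_s, p_t)
    col = joint.sum(axis=0, keepdims=True)
    return np.divide(joint, col, out=np.zeros_like(joint), where=col > 0)

# Hypothetical marginals for one coordinate at an earlier step s and a later step t.
p_s = np.array([0.7, 0.2, 0.1])
p_t = np.array([0.4, 0.3, 0.3])
joint = maximal_coupling(p_s, p_t)
posterior = posterior_given_later(p_s, p_t)
```

Under this construction the joint reproduces both marginals exactly, and the posterior columns give a closed-form categorical distribution against which a factorized reverse model can be compared with a per-coordinate KL term.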

The forward process is optimized end-to-end alongside the reverse model under the standard evidence lower bound (ELBO). Optimization employs a two-phase strategy: initial warm-up with a continuous relaxation (Concrete distribution with annealed temperature) followed by unbiased gradient estimation using REINFORCE, accommodating the non-differentiable sampling of discrete variables.
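The two gradient strategies mentioned above can be sketched on a single categorical variable. This is an illustrative sketch rather than the paper's training code: the reward vector stands in for a per-sample objective term, the constant `baseline` argument is an added assumption, and `exact_grad` exists only so the estimator can be checked against the closed form.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def concrete_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noise = rng.gumbel(size=logits.shape)
    return softmax((logits + noise) / temperature)

def reinforce_grad(logits, reward, num_samples, rng, baseline=0.0):
    """Score-function estimate of d/d(logits) of E_{z ~ Cat(softmax(logits))}[reward[z]].

    Uses grad log p(z) = onehot(z) - probs. A constant baseline leaves the
    estimate unbiased and can reduce its variance.
    """
    probs = softmax(logits)
    z = rng.choice(len(probs), size=num_samples, p=probs)
    score = np.eye(len(probs))[z] - probs            # shape (num_samples, K)
    return ((reward[z] - baseline)[:, None] * score).mean(axis=0)

def exact_grad(logits, reward):
    """Closed-form gradient of the same expectation, for comparison."""
    probs = softmax(logits)
    return probs * (reward - probs @ reward)

# Toy values (assumptions for illustration only).
rng = np.random.default_rng(0)
logits = np.array([0.5, -1.0, 2.0])
reward = np.array([1.0, 3.0, -2.0])
estimate = reinforce_grad(logits, reward, num_samples=200_000, rng=rng)
relaxed = concrete_sample(logits, temperature=0.5, rng=rng)
```

The relaxed sample is differentiable with respect to the logits but biased relative to true discrete sampling, which is one plausible reason to use it only as a warm-up before switching to the unbiased but noisier score-function estimator.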

Reverse (Generative) Process

FLDD retains the standard factorized categorical parameterization for $p_\theta(z_s \mid z_t)$ and samples all coordinates in parallel, which is key to efficient generation. The innovation is that the forward process, trained to align with this fixed reverse parameterization, yields a target distribution that the reverse model can match even for small $T$. No changes are made to the inference pipeline, keeping computational costs on par with standard discrete diffusion.
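A factorized reverse step of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions: the uniform prior over $z_T$, the `denoiser(z, step)` interface, and the toy denoiser are hypothetical stand-ins, not the paper's model.

```python
import numpy as np

def sample_factorized(probs, rng):
    """Draw every coordinate independently from its own categorical row (shape D x K)."""
    cdf = np.cumsum(probs, axis=1)
    u = rng.uniform(size=(probs.shape[0], 1))
    # Inverse-CDF sampling; the clip guards against cumulative sums slightly below 1.
    return (u >= cdf).sum(axis=1).clip(max=probs.shape[1] - 1)

def generate(denoiser, num_steps, dim, num_classes, rng):
    """Ancestral sampling with a factorized reverse model: one denoiser call per step."""
    z = rng.integers(num_classes, size=dim)     # assumed uniform prior over z_T
    for step in range(num_steps, 0, -1):
        probs = denoiser(z, step)               # (dim, num_classes), rows sum to 1
        z = sample_factorized(probs, rng)       # all coordinates updated in parallel
    return z

# Hypothetical denoiser that ignores its input and always predicts a fixed pattern.
target = np.array([2, 0, 1, 1])

def toy_denoiser(z, step):
    return np.eye(3)[target]

rng = np.random.default_rng(0)
sample = generate(toy_denoiser, num_steps=3, dim=4, num_classes=3, rng=rng)
```

The cost of generation is one denoiser call per step regardless of the data dimension, which is why reducing the number of steps translates directly into faster sampling.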

Experimental Results

Toy Examples

FLDD reproduces non-trivial correlated discrete distributions with a minimal number of generation steps. For a mixture of Gaussians or a discrete random walk (as constructed in the motivating examples), the learned forward process crafts intermediate distributions such that the factorized reverse generator can match the data in as few as two steps, demonstrating that appropriate noising schedules can enable tractable, expressive generation.

Real-World Data

Text Generation (ROCStories)

FLDD is evaluated on the ROCStories text dataset, demonstrating:

At $T = 100$ steps: Comparable MAUVE, PPL, and diversity to strong discrete diffusion baselines.
At $T = 10$ steps: Only a marginal quality drop, whereas conventional discrete diffusion collapses in sample realism at this step count.

Molecular Generation (QM9, ZINC250k)

FLDD achieves validity, uniqueness, and Fréchet ChemNet Distance (FCD) on par with or better than state-of-the-art discrete diffusion and flow-based generative models when $T = 100$.
Only minor degradation is observed when reducing to $T = 10$, showing robustness of the learned forward process in enabling high-speed, parallel generative modeling.

Masked Diffusion (Images, MNIST)

When applied in a masking setting (where the forward process learns to mask tokens/pixels based on their conditional independence), FLDD yields more plausible image samples than baseline Masked Diffusion Models, particularly under limited generative steps. The learned masking schedule prioritizes less-correlated (i.e., easier to denoise) pixels, improving denoising fidelity compared to naive uniform masking.

Theoretical and Practical Implications

FLDD's central claim is that aligning the forward noising process with the inductive biases of the factorized reverse family bridges the expressivity gap that hampers few-step discrete diffusion. Strong results at low step counts, obtained without modifying the reverse process, provide an avenue towards high-throughput, scalable discrete generation in domains where autoregressive (AR) or fixed-schedule diffusion models are impractical.

Theoretically, this framework relaxes the historical design requirement that the forward process be simple (e.g., fixed masking), showing that end-to-end learning and non-Markovianity can yield better variational bounds and sample quality. Importantly, the latent structure of the forward process plays an underappreciated role in enabling or disabling efficient denoising for discrete data.

Practically, FLDD adds overhead only at training time (due to the forward network parameterization and REINFORCE variance); generation is unaffected, preserving latency and hardware utilization. This holds immediate promise for text, graph, and molecule generation pipelines currently bottlenecked by long sampling chains.

Limitations and Future Directions

The introduction of a learnable forward network doubles the parameter count during training, though the cost of generation remains constant.
Reliance on REINFORCE produces high gradient variance; future work should investigate lower-variance discrete gradient estimators or hybrid continuous-discrete relaxations.
The choice of forward/posterior parameterization is not unique: richer structures beyond coordinate-wise Maximum Coupling (e.g., partially AR/noisy OT) may further enhance expressivity and sample efficiency.

Open directions include combining FLDD with distillation, hybrid AR-diffusion samplers, or application to longer sequences and richer modalities (e.g., high-resolution images, protein design).

Conclusion

FLDD provides a principled framework for learning adaptive noising schedules tailored to the fixed reverse (denoising) parameterizations common in discrete diffusion models. By aligning forward and reverse processes in distributional space, high-fidelity and efficient discrete generative modeling becomes achievable in significantly fewer steps, bridging a gap previously thought intrinsic to discrete domains. This work suggests the forward process is a potent, under-explored lever for further generative modeling advances. Future research on forward process parameterization and optimization strategies will be central to advancing diffusion models in discrete structured domains.
