The Diffusion Duality (2506.10892v1)
Abstract: Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion LLMs by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo
Summary
- The paper introduces Duo, a discrete diffusion framework that transfers Gaussian diffusion techniques to improve training and sampling for Uniform-state models.
- It employs a curriculum learning strategy with a tempered softmax and controlled time sampling to reduce variance and achieve a 2x training speedup.
- The proposed Discrete Consistency Distillation allows few-step generation, significantly accelerating sampling while preserving competitive sample quality.
This paper, "The Diffusion Duality" (2506.10892), introduces Duo, a framework for discrete diffusion models that leverages a newly discovered theoretical connection to continuous Gaussian diffusion. The core insight is that Uniform-state discrete diffusion processes can be seen as emerging from an underlying Gaussian diffusion process via the argmax operator. This "Diffusion Duality" allows the transfer of powerful techniques from the well-studied field of Gaussian diffusion to improve both the training and sampling efficiency of discrete diffusion models, specifically Uniform-state Diffusion Models (USDMs).
The paper addresses the limitations of existing discrete diffusion models, which have historically underperformed autoregressive models and Masked Diffusion Models (MDMs) in text generation, particularly in few-step generation regimes. While USDMs possess a desirable self-correcting property during sampling, their training and inference techniques have been relatively primitive compared to Gaussian diffusion.
The Diffusion Duality Explained:
The paper establishes a formal connection between a Gaussian diffusion process on real-valued vectors $\tilde{\mathbf{y}}_t \in \mathbb{R}^K$ and a Uniform-state discrete diffusion process on one-hot vectors $\mathbf{y}_t \in \mathcal{V} = \{\mathbf{v} \in \{0,1\}^K : \sum_i v_i = 1\}$. This connection is mediated by the argmax operator $A(\mathbf{z}) = \mathbf{e}_{\arg\max_i z_i}$, which maps a continuous vector to the one-hot vector corresponding to its largest entry.
The paper proves that if $\tilde{\mathbf{y}}_t$ follows a Gaussian diffusion process with marginals $\tilde{q}_t(\tilde{\mathbf{y}}_t \mid \mathbf{x}) = \mathcal{N}(\tilde{\alpha}_t \mathbf{x}, (1 - \tilde{\alpha}_t^2) I_K)$, then applying the argmax operator, $\mathbf{y}_t = A(\tilde{\mathbf{y}}_t)$, yields a latent $\mathbf{y}_t$ that follows a Uniform-state discrete diffusion process. The marginal distribution of the discrete latent is the Categorical distribution $q_t(\mathbf{y}_t \mid \mathbf{x}) = \mathrm{Cat}(\mathbf{y}_t; \alpha_t \mathbf{x} + (1 - \alpha_t)\mathbf{1}/K)$, where the discrete diffusion parameter $\alpha_t$ is a function of the Gaussian parameter $\tilde{\alpha}_t$ (Equation 5). This function, $T : [0,1] \to [0,1]$, is called the Diffusion Transformation operator. The time evolution of the discrete marginal distribution also follows the characteristic linear ODE of a Uniform-state discrete diffusion process (Equation 7).
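To make the correspondence concrete, here is a minimal sketch (PyTorch, with illustrative names; nothing here is taken from the released codebase): it corrupts a one-hot token with the Gaussian forward marginal, applies the argmax projection, and recovers a Monte Carlo estimate of $\alpha_t = T(\tilde{\alpha}_t)$ by inverting the Categorical marginal above, since $P(\mathbf{y}_t = \mathbf{x}) = \alpha_t + (1 - \alpha_t)/K$.

```python
import torch

K = 64                    # vocabulary size (illustrative)
alpha_tilde = 0.85        # Gaussian signal level at some time t
n_samples = 200_000

# Clean one-hot token, fixed to index 0 for the demo.
x = torch.zeros(n_samples, K)
x[:, 0] = 1.0

# Gaussian forward marginal: N(alpha_tilde * x, (1 - alpha_tilde^2) I_K).
y_tilde = alpha_tilde * x + (1 - alpha_tilde ** 2) ** 0.5 * torch.randn(n_samples, K)

# Argmax projection A(y_tilde): the discrete latent.
y = y_tilde.argmax(dim=-1)

# Under the duality, y ~ Cat(alpha_t * x + (1 - alpha_t)/K), so
# P(y = clean index) = alpha_t + (1 - alpha_t)/K; invert to estimate alpha_t = T(alpha_tilde).
p_correct = (y == 0).float().mean().item()
alpha_t_est = (K * p_correct - 1) / (K - 1)
print(f"Monte Carlo estimate of alpha_t = T({alpha_tilde}) ~ {alpha_t_est:.4f}")
```

In practice $T$ would be tabulated or computed once per noise schedule rather than re-estimated per batch, which is one way to read the "pre-computing the Diffusion Transformation operator" point in the implementation notes below.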
A key theoretical finding is that the Evidence Lower Bound (ELBO) for the Uniform-state discrete diffusion process is a tighter lower bound on the data log-likelihood than the ELBO for the underlying Gaussian diffusion process (Theorem 3.1). This suggests that training models to denoise in the discrete space is theoretically preferable for likelihood maximization.
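Schematically (the notation here is a simplified restatement, not the paper's), the ordering asserted by Theorem 3.1 is

```latex
\log p_\theta(\mathbf{x})
  \;\geq\; \mathrm{ELBO}_{\mathrm{discrete}}(\mathbf{x})
  \;\geq\; \mathrm{ELBO}_{\mathrm{Gaussian}}(\mathbf{x}),
```

so maximizing the discrete bound optimizes a quantity closer to the true log-likelihood than maximizing the Gaussian one.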
For sampling, Duo uses ancestral sampling from the learned approximate reverse process, similar to other USDMs. The paper also proposes a Greedy-Tail sampler, which performs greedy decoding during the final denoising step to potentially improve sample quality and reduce entropy, analogous to nucleus sampling in AR models.
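A simplified view of this sampling loop is sketched below, assuming a `denoiser` that maps a noisy token sequence and a time to per-position clean-token logits, and a `reverse_step` callable implementing the Uniform-state reverse posterior; both interfaces are assumptions for illustration, not the released API.

```python
import torch

@torch.no_grad()
def sample_with_greedy_tail(denoiser, reverse_step, x_t, timesteps):
    """Ancestral-style sampling loop with a greedy final step.

    x_t: (batch, seq_len) token IDs drawn from the uniform prior over the vocabulary.
    timesteps: decreasing sequence of diffusion times, from t=1 toward t=0.
    """
    for i, t in enumerate(timesteps):
        logits = denoiser(x_t, t)               # (batch, seq_len, vocab) clean-token logits
        if i == len(timesteps) - 1:
            x_t = logits.argmax(dim=-1)         # Greedy-Tail: decode the final step greedily
        else:
            x_t = reverse_step(logits, x_t, t)  # one ancestral step from the reverse posterior
    return x_t
```

The only change relative to plain ancestral sampling is the argmax on the last step, which is what the Greedy-Tail sampler adds.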
Practical Applications:
The Diffusion Duality enables two main practical applications:
- Faster Training using Curriculum Learning:
- Problem: Training discrete diffusion models, especially USDMs, can be slow and suffer from high variance, particularly when recovering the clean signal from highly noisy latents (small $\alpha_t$).
- Solution: The paper proposes a curriculum learning strategy that leverages the connection to Gaussian latents. The discrete NELBO objective (Equation 8) is re-expressed in terms of expectations over the continuous Gaussian latents $\tilde{\mathbf{y}}_t$ (Equation 9).
- Implementation: Instead of feeding the discrete argmax output $A(\tilde{\mathbf{y}}_t)$ directly into the denoising model, the paper uses a tempered softmax approximation $\mathrm{softmax}(\tilde{\mathbf{y}}_t/\tau)$, where $\tau > 0$ is the temperature (Equation 10). The denoising model is designed to accept both continuous (tempered softmax) and discrete (one-hot) inputs. During training, a curriculum is applied by annealing the temperature $\tau$ over time (starting with $\tau = 0.001$ and reducing it to $\tau = 0$) and restricting the sampled diffusion times $t$ to a sub-interval (e.g., $\tilde{\alpha}_t \in [0.85, 1]$ for $t \in [0.03, 0.15]$) where the Gaussian latents retain more signal relative to the discrete latents (see the training-step sketch after this list).
- Benefits: This approach introduces a controlled bias but significantly reduces training variance (Figure 1, Table 4), leading to faster convergence (an empirical 2x speedup) and improved likelihood compared to prior USDMs (Table 1). The improved Rao-Blackwellized NELBO formulation (Equation 15) also contributes to variance reduction and memory efficiency.
- Discrete Consistency Distillation (DCD):
- Problem: Achieving fast, few-step generation (e.g., ≤10 steps) is challenging for discrete diffusion models because they lack the deterministic Probability Flow ODEs (PF-ODEs) used for fast sampling and distillation in Gaussian diffusion models.
- Solution: DCD leverages the PF-ODE of the underlying Gaussian diffusion process. Although the discrete denoiser cannot parameterize a Gaussian PF-ODE directly, the paper constructs "Deterministic Discrete Trajectories (DDT)" by taking the deterministic trajectory of the optimal Gaussian PF-ODE (which maps Gaussian noise to the clean data) and projecting it to the discrete space using the argmax operator (Equation 13).
- Implementation: A student denoising model is trained to match the output of a teacher model (initially the pre-trained base model) along these DDT trajectories (Algorithm 1). The loss minimizes the KL divergence between the discrete probability distributions predicted by the teacher and student models for the clean data (Equation 14). Distillation proceeds in rounds, with the step size δ increased in each round (see the distillation sketch after this list).
- Benefits: DCD enables few-step generation, accelerating sampling by two orders of magnitude (from 1024 steps down to 8 steps) with minimal degradation in sample quality (Figure 3, Figure 4). In the low-NFE regime (T≤32), the distilled Duo model significantly outperforms distilled MDMs, likely due to the inherent self-correcting property of USDMs. Using the previous round's denoising model weights directly as the teacher, rather than EMA weights, is found to be more effective (Figure 5, Table 5).
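Returning to the curriculum-learning item above, here is a minimal sketch of one training step under stated assumptions: `alpha_tilde_fn` maps times to the Gaussian signal level, the denoiser accepts probability-vector inputs, and plain cross-entropy on the clean tokens stands in for the full Rao-Blackwellized NELBO weighting of Equation 15, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def curriculum_training_step(denoiser, x_ids, alpha_tilde_fn, tau, t_low, t_high, vocab_size):
    """One illustrative training step of the curriculum phase."""
    batch = x_ids.shape[0]

    # Curriculum part 1: restrict diffusion times to a sub-interval with more signal.
    t = torch.empty(batch, device=x_ids.device).uniform_(t_low, t_high)
    alpha_tilde = alpha_tilde_fn(t).view(batch, 1, 1)

    # Gaussian latent: y_tilde ~ N(alpha_tilde * x, (1 - alpha_tilde^2) I).
    x_onehot = F.one_hot(x_ids, vocab_size).float()
    y_tilde = alpha_tilde * x_onehot + (1 - alpha_tilde ** 2).sqrt() * torch.randn_like(x_onehot)

    # Curriculum part 2: tempered softmax relaxes the argmax; tau -> 0 recovers the one-hot input.
    if tau > 0:
        model_input = F.softmax(y_tilde / tau, dim=-1)
    else:
        model_input = F.one_hot(y_tilde.argmax(dim=-1), vocab_size).float()

    # Cross-entropy on clean tokens as a stand-in for the full NELBO weighting.
    logits = denoiser(model_input, t)
    return F.cross_entropy(logits.reshape(-1, vocab_size), x_ids.reshape(-1))
```

Annealing `tau` to zero and letting the time interval cover its full range recovers standard discrete-input training once the curriculum phase ends.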
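For Discrete Consistency Distillation, the following is a compact sketch of one update along a Deterministic Discrete Trajectory. It assumes the property described above, that the optimal Gaussian PF-ODE deterministically links a noise draw to the clean sequence, so the trajectory can be traced by reusing one noise tensor across times; `teacher` and `student` are assumed to return clean-token logits. This is a schematic of Algorithm 1, not the released implementation.

```python
import torch
import torch.nn.functional as F

def dcd_update(student, teacher, x_onehot, alpha_tilde_fn, t, delta):
    """One illustrative DCD update along a Deterministic Discrete Trajectory (DDT).

    x_onehot: (batch, seq_len, vocab) clean one-hot tokens.
    t: diffusion times, tensor of shape (batch,); delta: distillation step size.
    """
    eps = torch.randn_like(x_onehot)   # a single noise draw shared along the whole trajectory

    def ddt_point(time):
        # Continuous point on the deterministic trajectory, then argmax projection to tokens.
        a = alpha_tilde_fn(time).view(-1, 1, 1)
        y_tilde = a * x_onehot + (1 - a ** 2).sqrt() * eps
        return F.one_hot(y_tilde.argmax(dim=-1), x_onehot.shape[-1]).float()

    y_noisy = ddt_point(t)                  # student sees the noisier point
    y_less_noisy = ddt_point(t - delta)     # teacher sees the point delta earlier on the trajectory

    with torch.no_grad():
        p_teacher = teacher(y_less_noisy, t - delta).softmax(dim=-1)
    log_p_student = student(y_noisy, t).log_softmax(dim=-1)

    # KL(teacher || student) over predicted clean-token distributions (cf. Equation 14).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

Across rounds, `delta` grows and the previous round's student weights become the new teacher, in line with the observation that raw (non-EMA) teacher weights work better.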
Experimental Results:
Experiments on the LM1B and OpenWebText (OWT) language modeling benchmarks, using a 170M-parameter Diffusion Transformer, demonstrate Duo's effectiveness.
- Likelihood: Duo achieves state-of-the-art perplexity among Uniform-state and Gaussian diffusion models on LM1B and OWT (Table 1). It significantly narrows the performance gap with leading Absorbing State MDMs. On zero-shot evaluation across 7 datasets, Duo outperforms previous USDMs and Gaussian models, and notably surpasses an autoregressive transformer on 3 benchmarks (Table 2).
- Training Speed: The curriculum learning strategy provides a measured 2x speedup in convergence compared to training USDMs without the curriculum.
- Sampling Speed & Quality: The base Duo model provides better generative sample quality (lower Gen PPL) than all previous diffusion models across various sampling steps (Figure 3, Table 6). DCD successfully distills Duo, allowing generation in as few as 8 steps while maintaining competitive sample quality, outperforming distilled MDLM at low NFEs (Figure 4, Table 7). The Greedy-Tail sampler further improves Gen PPL for distilled models.
Implementation Considerations:
Implementing Duo involves:
- Designing a denoising model architecture capable of processing both continuous (soft) and discrete (hard) token representations as input (see the embedding sketch after this list).
- Pre-computing or efficiently computing the Diffusion Transformation operator $T$ mapping $\tilde{\alpha}_t$ to $\alpha_t$.
- Implementing the modified NELBO objective (Equation 15).
- Implementing the curriculum learning strategy by controlling the temperature τ and the time sampling interval [β,γ].
- Implementing the DCD algorithm, including the generation of DDT trajectories and the KL divergence loss for distillation.
- Using bfloat16 precision for the main model computation, potentially with float64 for sensitive parts like loss computation in Gaussian diffusion methods, as noted for baseline comparisons.
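For the first point in the list above (accepting both soft and hard inputs), one common pattern, shown here purely as an illustrative sketch rather than the paper's architecture, is a single embedding matrix used either as a lookup table or as a linear map applied to the probability vector:

```python
import torch
import torch.nn as nn

class SoftOrHardEmbedding(nn.Module):
    """Embedding layer that accepts either hard token IDs or a per-position probability
    vector (e.g., a tempered softmax of the Gaussian latent). Soft inputs are embedded as
    a probability-weighted average of the embedding rows, so the same weights serve both
    the curriculum (soft) and final (hard) training phases."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.dtype == torch.long:              # hard path: (batch, seq_len) token IDs
            return self.embedding(x)
        return x @ self.embedding.weight       # soft path: (batch, seq_len, vocab) probabilities
```

With a layer like this, the tempered-softmax latents of the curriculum phase and the hard inputs of the final phase share the same parameters.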
The paper provides code and model checkpoints on a project page, facilitating practical application and further research. The findings suggest that the Diffusion Duality offers a promising path to improve discrete generative models by borrowing techniques from their continuous counterparts.