
Authentic Discrete Diffusion Model (ADD)

Updated 21 December 2025
  • The model introduces direct diffusion in one-hot discrete space, unifying discriminative and generative tasks through a timestep-conditioned cross-entropy loss.
  • It leverages Markovian forward processes and CTMC jump processes to ensure exact time reversal and exponentially fast forward convergence, which yields efficient sampling.
  • Empirical results demonstrate that ADD achieves high performance in image classification and text generation, offering superior parameter and sample efficiency over pseudo-discrete models.

The Authentic Discrete Diffusion Model (ADD) is a class of generative stochastic models that implements diffusion directly in the discrete (typically one-hot or categorical) space, fundamentally departing from earlier masked modeling and embedding-based "pseudo-discrete diffusion" approaches. ADD provides a principled probabilistic framework for both discriminative (classification) and generative (e.g., text, symbolic, or code generation) tasks by operating on the sparse simplex geometry intrinsic to categorical data. Its training and inference procedures, loss function, and theoretical guarantees are defined in direct analogy to continuous-state diffusion but are tailored to discrete state spaces and Markovian noise processes (Li et al., 1 Oct 2025, Sun et al., 2022, Chen et al., 12 Feb 2024).

1. Motivation and Conceptual Foundation

ADD arises in response to limitations of conventional "pseudo" discrete diffusion models (PDD), where the discrete nature of data is often either ignored (by Gaussian noising in a continuous latent space) or addressed only superficially (via token masking and imputation). PDD approaches—such as repeated masked token recovery (Gemini Diffusion, Dream 7B, DINOISER, Fast-dLLM) or Gaussian diffusion on embeddings (DiffuSeq)—lack the following properties:

  • A true, stochastic forward noising process that respects the discrete geometry
  • Markovian transitions as in classical diffusion
  • A loss function properly reflecting the exclusivity of discrete categories.

ADD instead:

  • Directly diffuses in the one-hot or "λ-hot" representation on the corners of the simplex, preserving mutual exclusivity
  • Unifies discriminative and generative learning by allowing the same model backbone to address classification and sequence generation without bridging through a continuous intermediate representation
  • Employs a loss function (timestep-conditioned cross-entropy) that naturally supervises the reconstruction of discrete states at all levels of corruption (Li et al., 1 Oct 2025).

2. Mathematical Framework

2.1 State Space and Input Representation

  • For discrete classification or generation tasks with $K$ categories, ground-truth data is encoded as a one-hot vector $\mathbf{y}_0 \in \{0,1\}^K$ with $\sum_k y_0^{(k)} = 1$.
  • Optionally, "λ-hot" relaxations $\mathbf{y}_0 \in [0,1]^K$ with $\sum_k y_0^{(k)} = 1$ are used for smoother noise injection.
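
As a minimal illustration in NumPy, the two encodings can be constructed as follows; the specific λ-hot construction shown here (mass λ on the true class, the remainder spread uniformly) is an assumption for the sketch, not a definition taken from the papers.

```python
import numpy as np

def one_hot(label: int, K: int) -> np.ndarray:
    """Standard one-hot encoding: a vertex of the K-simplex."""
    y0 = np.zeros(K)
    y0[label] = 1.0
    return y0

def lambda_hot(label: int, K: int, lam: float = 0.9) -> np.ndarray:
    """Assumed 'λ-hot' relaxation: probability mass lam on the true class,
    the remainder spread uniformly so that entries still sum to 1."""
    y0 = np.full(K, (1.0 - lam) / (K - 1))
    y0[label] = lam
    return y0

y = lambda_hot(label=2, K=5)      # e.g. [0.025, 0.025, 0.9, 0.025, 0.025]
assert np.isclose(y.sum(), 1.0)   # stays on the probability simplex
```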

2.2 Forward Diffusion Processes

Two principal formulations exist within the ADD paradigm:

(A) Gaussian One-hot Space Diffusion (Li et al., 1 Oct 2025):

  • At each step $t$, the forward process is:

$$q(\mathbf{y}_t \mid \mathbf{y}_0) = \mathcal{N}\!\left(\mathbf{y}_t;\ \sqrt{\bar\alpha_t}\,\mathbf{y}_0,\ (1-\bar\alpha_t)\,\mathbf{I}_K\right)$$

Noise is injected progressively while trajectories remain concentrated near the corners of the simplex.
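
A minimal NumPy sketch of sampling from this closed-form forward marginal follows; the cosine noise schedule is an assumed choice for illustration, not necessarily the schedule used in the paper.

```python
import numpy as np

def cosine_alpha_bar(t: float) -> float:
    """Cumulative noise level abar_t for t in [0, 1] (assumed cosine
    schedule for illustration; the paper's schedule may differ)."""
    return float(np.cos(0.5 * np.pi * t) ** 2)

def forward_noise(y0: np.ndarray, t: float, rng=np.random.default_rng()):
    """Sample y_t ~ N(sqrt(abar_t) * y0, (1 - abar_t) * I_K), the closed-form
    forward marginal q(y_t | y_0) above."""
    abar = cosine_alpha_bar(t)
    return np.sqrt(abar) * y0 + np.sqrt(1.0 - abar) * rng.standard_normal(y0.shape)

y0 = np.eye(5)[2]                 # one-hot vector for class 2 of K = 5
yt = forward_noise(y0, t=0.3)     # stays near the simplex corner for small t
```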

(B) Continuous-Time Markov Chain (CTMC) Jump Process (Sun et al., 2022, Chen et al., 12 Feb 2024):

  • The forward noising is a CTMC on the categorical space $\mathcal{X}$, with forward generator $Q(t)$:

$$\frac{d}{dt}\, p_t(x) = \sum_{y \neq x} p_t(y)\, Q_{y,x}(t) - p_t(x) \sum_{y \neq x} Q_{x,y}(t)$$

  • For the Boolean hypercube $\{0,1\}^d$, the canonical "independent-flip" generator sets $Q_{x,\,x+e_i} = 1$, where $e_i$ is the unit vector flipping bit $i$.
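
Because the flips are independent with unit rate, each coordinate is a two-state chain that has flipped by time $t$ with probability $(1 - e^{-2t})/2$, so the forward process can be sampled exactly without simulating individual jumps. A minimal NumPy sketch:

```python
import numpy as np

def forward_hypercube(x0: np.ndarray, t: float, rng=np.random.default_rng()):
    """Exact forward sample of the independent-flip CTMC on {0,1}^d.
    Each bit is a two-state chain with unit flip rate, hence has flipped
    by time t with probability (1 - exp(-2 t)) / 2."""
    p_flip = 0.5 * (1.0 - np.exp(-2.0 * t))
    flips = rng.random(x0.shape) < p_flip
    return np.where(flips, 1 - x0, x0)

x0 = np.array([1, 0, 1, 1, 0])
xt = forward_hypercube(x0, t=1.5)   # approaches uniform on {0,1}^5 as t grows
```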

2.3 Reverse (Denoising) Processes

  • The reverse-time process is again a Markov process, with generator:

$$Q^{\leftarrow}_{x,y}(t) = Q_{y,x}(t)\, \frac{p_y(t)}{p_x(t)}$$

where $p_x(t)$ is the marginal probability of state $x$ at time $t$ (Chen et al., 12 Feb 2024).

  • In parameterized models, the density ratios (discrete "scores") $c_{x,y}(t) = p_y(t)/p_x(t)$ are estimated by a neural network $s_{x,y}(t)$.
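
The sketch below assembles the reverse-time rates from a forward generator and a ratio estimator; the toy generator and the dummy `score_fn` are hypothetical stand-ins for a trained $s_{x,y}(t)$, shown only to make the formula concrete.

```python
import numpy as np

def reverse_rates(x: int, t: float, Q: np.ndarray, score_fn):
    """Reverse-time jump rates out of state x at time t:
    Q_rev_{x,y}(t) = Q_{y,x}(t) * p_y(t)/p_x(t), with the true ratio replaced
    by a learned estimate score_fn(x, y, t).  Q is the S x S forward generator
    with Q[i, j] the rate of jumping from state i to state j."""
    S = Q.shape[0]
    rates = np.array([Q[y, x] * score_fn(x, y, t) if y != x else 0.0
                      for y in range(S)])
    rates[x] = -rates.sum()          # diagonal entry so the row sums to zero
    return rates

# Hypothetical stand-ins: a 3-state uniform forward generator and a dummy ratio model.
Q = np.full((3, 3), 1.0); np.fill_diagonal(Q, -2.0)
dummy_score = lambda x, y, t: 1.0    # a trained network s_{x,y}(t) would go here
print(reverse_rates(x=0, t=0.5, Q=Q, score_fn=dummy_score))
```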

2.4 Training Objectives

(A) Timestep-Conditioned Cross-Entropy Loss (Li et al., 1 Oct 2025):

  • The loss supervises prediction of the original one-hot target at every noise level:

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{y}_0,\, t}\left[ -\bar\alpha_t \sum_{k=1}^{K} y_0^{(k)} \log p_\theta\!\left(y_0^{(k)} \mid \mathbf{y}_t, c\right) \right]$$

  • Decay weighting with $\bar\alpha_t$ prevents loss of the conditioning signal under heavy noise.
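
A minimal NumPy sketch of this loss for a batch of noisy inputs; the logits are assumed to come from the denoising network given $\mathbf{y}_t$, the timestep, and the conditioning $c$.

```python
import numpy as np

def timestep_weighted_ce(logits: np.ndarray, y0: np.ndarray, abar_t: np.ndarray) -> float:
    """Timestep-conditioned cross-entropy over a batch:
    mean_b [ abar_t[b] * ( -sum_k y0[b, k] * log softmax(logits[b])[k] ) ].
    logits: (B, K) network outputs, y0: (B, K) one-hot targets, abar_t: (B,) weights."""
    z = logits - logits.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    ce = -(y0 * log_probs).sum(axis=1)                             # per-example cross-entropy
    return float((abar_t * ce).mean())                             # decay-weighted average

# Toy usage with random stand-ins for the network outputs.
rng = np.random.default_rng(0)
loss = timestep_weighted_ce(rng.standard_normal((4, 5)), np.eye(5)[[0, 2, 1, 4]], rng.random(4))
```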

(B) Denoising Score Entropy / Ratio-Matching (Sun et al., 2022, Chen et al., 12 Feb 2024):

  • For each coordinate:

$$L(\theta) = \int_0^T \mathbb{E}_{x_t \sim q_t}\left[ -\sum_{d=1}^{D} \log p_t\!\left(X^d = x_t^d \mid x_t^{\neg d};\, \theta\right) \right] dt$$

  • Proposition 3.4 in (Sun et al., 2022) shows this is an unbiased estimator for the score-matching objective.
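
A single Monte Carlo term of this objective, sketched in NumPy under the assumption of binary coordinates; `model_cond_logits` is a hypothetical network returning per-coordinate conditional logits, standing in for the parameterization used in the papers.

```python
import numpy as np

def ratio_matching_term(model_cond_logits, x_t: np.ndarray, t: float, T: float) -> float:
    """One Monte Carlo sample of the coordinate-wise objective:
    T * E_{t ~ U[0,T], x_t ~ q_t}[ -sum_d log p_t(X^d = x_t^d | rest) ].
    model_cond_logits(x_t, t) -> (D, 2) logits for each binary coordinate
    conditioned on all the other coordinates (hypothetical interface)."""
    logits = model_cond_logits(x_t, t)                              # (D, 2)
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))    # (D, 2) log-softmax
    nll = -log_probs[np.arange(x_t.shape[0]), x_t].sum()            # -sum_d log p(x_t^d | rest)
    return float(T * nll)                                           # importance weight for t ~ U[0, T]

# Toy usage with a random stand-in network over D = 5 binary coordinates.
rng = np.random.default_rng(1)
dummy_net = lambda x, t: rng.standard_normal((x.shape[0], 2))
term = ratio_matching_term(dummy_net, x_t=np.array([1, 0, 1, 1, 0]), t=0.7, T=1.0)
```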

3. Algorithmic and Architectural Details

3.1 Model Components

  • Encoder: Typically a Vision Transformer (ViT) encoder for image or multimodal data (Li et al., 1 Oct 2025).
  • Discrete Denoising Network: Transformer or masked modeling neural architecture receives noisy one-hot vectors, timestep, and conditioning.
  • Time Embedding: Sinusoidal or learned embeddings encode the diffusion timestep.
  • Conditioning Pathways: Cross-attention or modulation injects auxiliary information (e.g., class token, image patch mean).

3.2 Sampling and Implementation

  • Forward Sampling (Gaussian variant):
    • Gaussian noising of one-hot vectors using recursively computed $\sqrt{\alpha_t}$ and $\sqrt{1-\alpha_t}$ terms.
  • CTMC Uniformization Sampler (Chen et al., 12 Feb 2024):
    • Poisson events on the time interval, each triggering a potential state transition using the embedded Markov chain kernel.
    • No discretization bias; exact simulation of the continuous-time chain (a minimal sketch follows this list).
  • Reverse Sampling:
    • At each reverse step, logits are computed, discretized (typically via argmax and one-hot), and noise is re-injected for the next step.
    • In CTMC-based models, analytic samplers reparameterize the conditionals as mixtures over initial states and transition kernels for zero discretization error.
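
The following NumPy sketch illustrates the uniformization principle for a time-homogeneous generator; the papers' reverse samplers handle time-varying rates, so this is a simplified illustration under that assumption.

```python
import numpy as np

def uniformization_sample(x0: int, Q: np.ndarray, t: float,
                          rng=np.random.default_rng()):
    """Discretization-free simulation of a time-homogeneous CTMC with generator Q
    over [0, t]: draw a Poisson number of candidate events at rate lam >= the
    maximum exit rate, and at each event step with the embedded kernel
    P = I + Q / lam (self-loops absorb the slack, so no bias is introduced)."""
    lam = float(np.max(-np.diag(Q)))             # uniformization rate
    P = np.eye(Q.shape[0]) + Q / lam             # embedded transition kernel
    x = x0
    for _ in range(rng.poisson(lam * t)):        # number of candidate jumps
        x = rng.choice(Q.shape[0], p=P[x])       # possible self-loop = no jump
    return int(x)

# Toy usage: 3-state uniform-flip chain run to time t = 2.0.
Q = np.full((3, 3), 1.0); np.fill_diagonal(Q, -2.0)
print(uniformization_sample(x0=0, Q=Q, t=2.0))
```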

3.3 Optimization Protocols

  • AdamW optimizer with large batch sizes and weight decay (Li et al., 1 Oct 2025).
  • The diffusion process typically uses $T=1000$ timesteps, though sampling converges rapidly ($\sim$10–20 steps) in practice.
  • Auxiliary techniques include classifier-free guidance (dropping the conditioning part stochastically to allow guided sampling at inference), layer norm, and gradient clipping for stability.
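
Classifier-free guidance at inference combines the conditional and unconditional predictions; a minimal sketch follows, where the guidance weight `w` is an assumed hyperparameter rather than a value reported in the paper.

```python
import numpy as np

def cfg_logits(logits_cond: np.ndarray, logits_uncond: np.ndarray, w: float = 1.5) -> np.ndarray:
    """Classifier-free guidance: run the denoiser with and without conditioning
    and extrapolate toward the conditional prediction by guidance weight w."""
    return (1.0 + w) * logits_cond - w * logits_uncond

guided = cfg_logits(np.array([2.0, 0.1, -1.0]), np.array([0.5, 0.4, 0.2]))
pred = int(np.argmax(guided))   # argmax -> one-hot discretization for the next reverse step
```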

4. Theoretical Properties and Convergence

ADD's definition as a discrete diffusion guarantees several important theoretical properties:

  • Exact Time Reversal: CTMC-based forward processes admit an exact reverse-time generator, ensuring the model can theoretically recover the initial data distribution if the ratio estimators are perfect (Sun et al., 2022, Chen et al., 12 Feb 2024).
  • Convergence Rates: For hypercube forward processes, exponential convergence to the uniform measure is achieved, with KL and TV error guarantees:
    • $\mathrm{KL}[p(T) \,\|\, \gamma] \leq e^{-T} \cdot \mathrm{KL}[p(0) \,\|\, \gamma] = O(d\, e^{-T})$ for $d$-dimensional data; a worked instance follows this list.
  • Complexity: Uniformization-based samplers yield $O(d \log(d/\varepsilon))$ computational cost to reach $O(\varepsilon)$ error in KL, substantially improving over the $O(\varepsilon^{-2})$ cost of continuous SDE samplers (Chen et al., 12 Feb 2024).
  • Score-Matching Consistency: Matching all coordinate-wise conditional marginals ensures full consistency with the true score and guarantees model convergence in the infinite capacity limit.
  • Unbiased Estimation: ADD’s ratio-matching objectives provide unbiased estimates of relevant functionals, and analytic samplers eliminate discretization bias.
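
As a worked instance of the convergence bound (using the standard fact that the KL divergence of any distribution on $\{0,1\}^d$ from the uniform measure $\gamma$ is at most $d \log 2$):

$$\mathrm{KL}[p(T)\,\|\,\gamma] \;\leq\; e^{-T}\, d\log 2 \;\leq\; \varepsilon \quad \text{whenever} \quad T \;\geq\; \log\frac{d\log 2}{\varepsilon} = O\!\left(\log\frac{d}{\varepsilon}\right).$$

Since the independent-flip generator has uniformization rate proportional to $d$, the expected number of jump events over $[0, T]$ is $O(d \log(d/\varepsilon))$, matching the complexity figure quoted above.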

5. Empirical Performance and Ablations

5.1 Image Classification (ImageNet-1K)

  • ADD applied to ViT-Base (111M parameters) achieves up to 83.0% Top-1 accuracy, surpassing standard ViT-Base (87M, 82.3%) and remaining competitive with ViT-Large (305M, 82.6%) (Li et al., 1 Oct 2025).
  • High sample efficiency: only $\sim$20 denoising steps are needed for convergence.

5.2 Text Generation (COCO Captions)

  • On COCO, ADD-generated captions reach a CLIP score of 0.25, dramatically exceeding masked-token PDD baselines (0.18) and approaching ground-truth (0.30).
  • ADD captions are grammatical and semantically faithful; PDD outputs often show poor syntax and low relevance.

5.3 Ablation Studies

| Variant | Top-1 (%) |
|---|---|
| ADD w/ regressive MSE loss | 0.13 |
| ADD w/ timestep-conditioned CE (full model) | 82.7 |
| – no classifier-free guidance | 82.36 |
| – no timestep weighting ($\bar\alpha_t$) | 82.96 |
| – softmax sampling instead of argmax → one-hot | 82.35 |

Interpretation: The cross-entropy loss is essential for performance; classifier-free guidance, timestep weighting, and argmax → one-hot discretization produce incremental but tangible gains (Li et al., 1 Oct 2025).

6. Extensions, Strengths, Limitations, and Future Directions

6.1 Strengths

  • Geometry preservation: Maintains simplex structure essential for categorical data.
  • Unified discriminative–generative backbone: Accommodates both classification and generation.
  • Sampling efficiency: Requires few reverse steps for high-fidelity outputs.
  • Parameter efficiency: Outperforms larger models with fewer parameters.

6.2 Limitations

  • Non-differentiable argmax → one-hot: Current approaches avoid backpropagation through discretization, restricting direct training of the full reverse process.
  • Scalability: Very large vocabularies and long sequences pose computational challenges; further innovations are necessary for LLM-scale applications.

6.3 Future Directions

  • Rigorous analysis of discrete diffusion convergence and entropy dynamics
  • Scaling ADD to LLMs as an alternative to autoregressive methods
  • Extending ADD-type CTMCs to multimodal or structured symbolic domains (e.g., program synthesis, scene graphs)

A plausible implication is that ongoing theoretical advances in score estimation and reverse-process learning will address remaining challenges in differentiability and scalability. Uniformization and efficient sparsification present promising avenues for pushing the limits of categorical generative modeling in high dimensions (Chen et al., 12 Feb 2024).
