Denoising Score Entropy (DSE) in Discrete Diffusion
- Denoising Score Entropy (DSE) is a score-matching loss that bridges discrete CTMC diffusion with exact log-likelihood estimation, eliminating variational looseness.
- It employs a time-dependent score network to estimate ratios of CTMC marginals, yielding tight, unbiased likelihood estimates with significant variance reduction.
- The I-MDSE framework enables principled conditional likelihood estimation and stable training for discrete data in language and sequence modeling.
Denoising Score Entropy (DSE) is a score-matching loss introduced within an information-theoretic framework for discrete-state continuous-time Markov chain (CTMC) diffusion models. DSE establishes a principled connection between diffusion-based generative modeling for discrete data and exact log-likelihood estimation, forming the basis of the Information-Minimum Denoising Score Entropy (I-MDSE) relation. DSE unifies the training of score networks and likelihood evaluation, providing tight, unbiased estimators without variational looseness. This development parallels the role of the I-MMSE identity in Gaussian diffusion for continuous data and extends to a variety of conditional and masked likelihood tasks relevant in modern language and sequence modeling (Jeon et al., 28 Oct 2025).
1. Formal Definition and Loss Structure
Let $x_0$ denote a clean datum, with forward CTMC marginals $p_{t|0}(\cdot \mid x_0)$ under rate matrix $Q_t$. A time-dependent score network $s_\theta(x_t, t)_y$ aims to estimate the ratio $p_t(y)/p_t(x_t)$ for $y \neq x_t$. The pointwise (denoising) DSE loss is defined as

$$\ell_{\mathrm{DSE}}(x_t, t; x_0) \;=\; \sum_{y \neq x_t} Q_t(y, x_t)\left[\, s_\theta(x_t, t)_y \;-\; \frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x_t \mid x_0)}\,\log s_\theta(x_t, t)_y \;+\; K\!\left(\frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x_t \mid x_0)}\right)\right],$$

where $K(a) = a \log a - a$. In expectation over $x_0$ and $x_t \sim p_{t|0}(\cdot \mid x_0)$, the loss is uniquely minimized by the true score $s_\theta(x_t, t)_y = p_t(y)/p_t(x_t)$. DSE thus generalizes the continuous-state MSE from Gaussian diffusion to the categorical, logarithmic-loss context, providing a natural, stable loss for discrete score matching.
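In code, the pointwise loss can be sketched as follows (a minimal sketch: `q_row`, `s`, and `r` are hypothetical array names holding the rate weights, the model scores, and the target ratios over $y \neq x_t$):

```python
import numpy as np

def dse_pointwise(q_row, s, r):
    """Pointwise denoising score entropy at a state x_t.

    q_row : forward rates Q_t(y, x_t), one entry per y != x_t
    s     : model scores s_theta(x_t)_y (positive)
    r     : target ratios p_{t|0}(y | x_0) / p_{t|0}(x_t | x_0) (positive)
    """
    K = r * np.log(r) - r                       # normalizer K(a) = a*log(a) - a
    return float(np.sum(q_row * (s - r * np.log(s) + K)))
```

By construction each summand vanishes at `s == r` and is strictly positive otherwise, which is what makes the loss a proper score-matching objective.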
2. Information-Theoretic Foundations: The I-MDSE Relation
Minimum DSE at time $t$ is defined as

$$\mathrm{mDSE}(t) \;=\; \min_{s}\; \mathbb{E}_{x_0,\, x_t \sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_t, t; x_0)\big],$$

where the minimum over score functions is attained at the true marginal score $s^\star(x_t)_y = p_t(y)/p_t(x_t)$, with the conditional version

$$\mathrm{mDSE}(t \mid x_0) \;=\; \mathbb{E}_{x_t \sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_t, t; x_0)\big]\big|_{s_\theta = s^\star}, \qquad \mathrm{mDSE}(t) = \mathbb{E}_{x_0}\big[\mathrm{mDSE}(t \mid x_0)\big].$$

The core identity (I-MDSE, Theorem 3.1) is

$$-\frac{d}{dt}\, D_{\mathrm{KL}}\!\big(p_{t|0}(\cdot \mid x_0)\,\big\|\,p_t\big) \;=\; \mathrm{mDSE}(t \mid x_0).$$

Averaging over $x_0 \sim p_0$ yields

$$-\frac{d}{dt}\, I(x_0;\, x_t) \;=\; \mathrm{mDSE}(t),$$

since $I(x_0; x_t) = \mathbb{E}_{x_0}\big[D_{\mathrm{KL}}\big(p_{t|0}(\cdot \mid x_0)\,\|\,p_t\big)\big]$.
This identity, proven via the path-space KL chain rule and Dynkin’s formula, asserts that the instantaneous rate of mutual information decay between data and diffused states is precisely the minimum DSE.
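The relation can be checked numerically on a toy chain. The sketch below (all names illustrative; the rate matrix is chosen symmetric so that $e^{tQ}$ has a clean eigendecomposition) compares a finite-difference derivative of $I(x_0; x_t)$ against the minimum-DSE expression evaluated at the optimal (marginal-ratio) score:

```python
import numpy as np

# Finite-difference check of -dI(x0; x_t)/dt = mDSE(t) on a 3-state CTMC.
Q = np.array([[-1.5, 1.0, 0.5],
              [ 1.0, -1.8, 0.8],
              [ 0.5, 0.8, -1.3]])   # symmetric rate matrix, rows sum to 0
p0 = np.array([0.6, 0.3, 0.1])      # data distribution
w, V = np.linalg.eigh(Q)            # Q symmetric => expm(tQ) = V e^{tw} V^T

def marginals(t):
    P = V @ np.diag(np.exp(t * w)) @ V.T   # P[a, x] = p_{t|0}(x | a)
    return P, p0 @ P                       # conditional and marginal laws

def mutual_info(t):
    P, pt = marginals(t)
    return float(sum(p0[a] * np.sum(P[a] * np.log(P[a] / pt))
                     for a in range(3)))

def mdse(t):
    P, pt = marginals(t)
    total = 0.0
    for a in range(3):                     # clean datum x0 = a
        for x in range(3):                 # noisy state x_t = x
            for y in range(3):
                if y == x:
                    continue
                r = P[a, y] / P[a, x]      # denoising (conditional) ratio
                b = pt[y] / pt[x]          # optimal score: marginal ratio
                # DSE summand at the optimum: b - r*log(b) + K(r)
                total += p0[a] * P[a, x] * Q[y, x] * (
                    b - r + r * np.log(r / b))
    return total
```

A central difference of `mutual_info` around any fixed `t > 0` agrees with `mdse(t)` to within finite-difference error, which is a direct numerical instance of the stated decay identity.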
3. Log-Likelihood Decomposition and Exactness
Integrating the I-MDSE relation over time leads to the following time-integral decomposition (Theorem 3.2). For any $T > 0$,

$$-\log p_0(x_0) \;=\; \int_0^T \mathrm{mDSE}(t \mid x_0)\, dt \;+\; D_{\mathrm{KL}}\!\big(p_{T|0}(\cdot \mid x_0)\,\big\|\,p_T\big),$$

and as $T \to \infty$, with the forward CTMC converging to its stationary distribution, the residual KL term vanishes:

$$-\log p_0(x_0) \;=\; \int_0^\infty \mathrm{mDSE}(t \mid x_0)\, dt.$$
This is an equality, not a variational bound. The area under the minimum DSE curve exactly recovers the negative log-likelihood of the initial data. Earlier score-matching approaches supplied only upper bounds; DSE provides tight, unbiased, and exact log-likelihood estimation in the discrete regime.
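On the same kind of toy chain, this exactness can be observed directly: numerically integrating the conditional minimum DSE over time reproduces $-\log p_0(x_0)$. The sketch below (illustrative names; symmetric rate matrix for an easy matrix exponential) uses a log-spaced grid because the integrand grows like $\log(1/t)$ near $t = 0$:

```python
import numpy as np

# Check that the area under the conditional minimum-DSE curve recovers
# -log p0(x0) on a 3-state CTMC with a symmetric rate matrix.
Q = np.array([[-1.5, 1.0, 0.5],
              [ 1.0, -1.8, 0.8],
              [ 0.5, 0.8, -1.3]])
p0 = np.array([0.6, 0.3, 0.1])
w, V = np.linalg.eigh(Q)                 # expm(tQ) = V e^{tw} V^T

def mdse_cond(t, x0):
    P = V @ np.diag(np.exp(t * w)) @ V.T     # P[a, x] = p_{t|0}(x | a)
    pt = p0 @ P
    val = 0.0
    for x in range(3):
        for y in range(3):
            if y == x:
                continue
            a = P[x0, y] / P[x0, x]          # denoising ratio
            b = pt[y] / pt[x]                # marginal (optimal-score) ratio
            val += P[x0, x] * Q[y, x] * (b - a + a * np.log(a / b))
    return val

x0 = 2
ts = np.geomspace(1e-6, 12.0, 2000)          # log-spaced quadrature grid
vals = np.array([mdse_cond(t, x0) for t in ts])
area = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))
```

The computed `area` matches `-np.log(p0[x0])` to quadrature accuracy, with no variational slack to account for.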
4. Properties as a Score and Likelihood Estimator
DSE is characterized by the following properties:
- Tightness: The integral of the minimum DSE precisely characterizes $-\log p_0(x_0)$, holding with equality rather than as a bound, and incurring no bias beyond model error when a learned score $s_\theta$ stands in for the true score $s^\star$.
- Variance: The decomposition removes the “variational gap” found in prior methods, yielding lower Monte Carlo variance, especially in its time-free formulation. Empirically, time-free estimators exhibit up to an order-of-magnitude lower variance compared to time-integral approaches.
- Comparative Stability: For discrete CTMCs, DSE functions as the canonical, logarithmic-loss analogue of the MSE in Gaussian diffusion, yielding stable training and principled likelihood estimation, in contrast to unstable objectives such as plain squared error on score ratios.
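One way to see the stability contrast concretely (a sketch with illustrative names; the large target ratio mimics states where true score ratios are enormous, e.g. near an absorbing mask state): the per-coordinate DSE gradient depends only on the multiplicative error between score and target, while a squared-error gradient scales with the absolute magnitude of the ratio.

```python
import numpy as np

def dse_term(s, a):
    # Per-coordinate DSE summand: s - a*log(s) + K(a), K(a) = a*log(a) - a.
    return s - a * np.log(s) + a * np.log(a) - a

def dse_grad(s, a):
    # d/ds of the DSE summand: a function of the multiplicative error a/s only.
    return 1.0 - a / s

def mse_grad(s, a):
    # d/ds of the squared error (s - a)^2: scales with the ratio magnitude.
    return 2.0 * (s - a)

a = 1e4            # very large true ratio
s = a / 2          # model off by a factor of two
```

Here `dse_grad(s, a)` is `-1.0` whether the ratio is 10 or 10,000, while `mse_grad(s, a)` is `-1e4`, illustrating why logarithmic-loss score matching trains stably across the extreme ratio scales that arise in discrete diffusion.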
5. Practical Extensions and Applications
The I-MDSE framework enables several algorithmic constructions:
- Time-Free Likelihood Estimation: For sequences of length $L$ under masked (absorbing) diffusion, the exact identity

$$-\log p_\theta(x_0) \;=\; \sum_{n=1}^{L} \mathbb{E}_{|M| = n}\Big[\frac{1}{n} \sum_{i \in M} -\log p_\theta\big(x_0^i \,\big|\, x_0^{M^c}\big)\Big],$$

with $M$ a uniformly random set of $n$ masked positions, removes the time variable entirely. With $H_L$ the $L$th harmonic number entering as normalizer and the mask size sampled via a Beta-weighted scheme, this provides a single-shot Monte Carlo estimator of log-likelihood with lower sample variance than direct time integration.
- Conditional and Prompt-Response Likelihoods: For disjoint index sets $A$ (target) and $B$ (context),

$$-\log p_\theta\big(x_0^A \,\big|\, x_0^B\big) \;=\; \sum_{n=1}^{|A|} \mathbb{E}_{M \subseteq A,\, |M| = n}\Big[\frac{1}{n} \sum_{i \in M} -\log p_\theta\big(x_0^i \,\big|\, x_0^{(A \setminus M) \cup B}\big)\Big],$$

where masking is restricted to $A$ while the context $B$ stays unmasked. This identity directly enables prompt-to-response or conditioned sequence likelihoods, as in language modeling.
- Coupled Ratio Estimation: For two sequences $x_0, x_0'$ of equal length,

$$\log \frac{p_\theta(x_0)}{p_\theta(x_0')} \;=\; \sum_{n=1}^{L} \mathbb{E}_{|M| = n}\Big[\frac{1}{n} \sum_{i \in M} \log \frac{p_\theta\big(x_0^i \mid x_0^{M^c}\big)}{p_\theta\big(x_0'^{\,i} \mid x_0'^{\,M^c}\big)}\Big].$$

Sharing the sampled masks $M$ between $x_0$ and $x_0'$ significantly reduces the variance of log-likelihood ratio estimates compared to independent estimators.
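These mask-based identities can be verified exactly on a tiny example. The sketch below (a hypothetical binary Markov chain of length 3; all names illustrative) enumerates every mask and confirms that the time-free decomposition reproduces $-\log p(x)$ exactly; running the same loop on two sequences with identical masks $M$ yields the coupled ratio estimator.

```python
import itertools
import numpy as np

# Exact check of the time-free (masked) likelihood identity
#   -log p(x) = sum_{n=1}^{L} E_{|M|=n} [ (1/n) sum_{i in M} -log p(x_i | x_{M^c}) ]
# on a small binary Markov chain where every conditional is enumerable.
L = 3
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])      # hypothetical transition matrix
init = np.array([0.5, 0.5])

def joint(seq):
    p = init[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[a, b]
    return p

def cond(x, i, keep):
    # p(x_i | x_j for j in keep), by marginalizing the joint over masked sites.
    num = den = 0.0
    for z in itertools.product(range(2), repeat=L):
        if all(z[j] == x[j] for j in keep):
            den += joint(z)
            if z[i] == x[i]:
                num += joint(z)
    return num / den

x = (0, 1, 1)
nll = 0.0
for n in range(1, L + 1):
    masks = list(itertools.combinations(range(L), n))
    inner = 0.0
    for M in masks:                           # uniform over size-n masks
        keep = [j for j in range(L) if j not in M]
        inner += (1.0 / n) * sum(-np.log(cond(x, i, keep)) for i in M)
    nll += inner / len(masks)
```

The enumerated `nll` equals `-np.log(joint(x))` to machine precision; a Monte Carlo version samples one mask per draw, and evaluating it on two sequences with the same `M` cancels shared randomness in the ratio.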
6. Empirical Performance and Experimental Results
Table: Empirical Findings for DSE-Based Estimators
| Task/Domain | Estimation Benchmark | Key Finding |
|---|---|---|
| DNA sequence (length 8) | Ground-truth log-probabilities | Time-free estimator matches ground truth |
| Markov chain subsequence (length 32) | Conditional log-likelihoods | Exact match to Markov chain likelihood |
| HellaSwag, ARC-hard, PIQA | Large-scale NLL estimation | Variance falls by a factor of at least $4\times$ (Table 1) |
| BeaverTails (ratio estimator) | Likelihood ratio estimation | Substantially lower variance with shared masks (Fig. 2(b)) |
In all large-scale benchmarks, approximately $100$ Monte Carlo samples per sequence suffice for high-precision, low-variance estimation, owing to the efficiency of the time-free and coupled approaches. In out-of-distribution detection (text8 vs. GPT-4 continuations), DSE-based negative log-likelihood reliably separates OOD and true samples. For model auditing (e.g., LLaDA 8B on WikiText vs. LLaMA-3.1), DSE reveals statistically significant NLL differences, suggesting model influence and aiding attribution studies (Jeon et al., 28 Oct 2025).
7. Significance and Theoretical Implications
Denoising Score Entropy provides the first exact, information-theoretic bridge between discrete diffusion score learning and log-likelihood evaluation. It eliminates the variational looseness of prior score-based objectives, giving rise to estimators that are not only unbiased but practical in terms of sample efficiency and variance reduction. This directly enables principled evaluation of large generative LLMs, conditional inference in prompt-response tasks, and robust uncertainty estimates in discrete generative modeling. A plausible implication is that DSE-based approaches will influence future work on likelihood-based evaluation and calibration of discrete generative architectures.