Denoising Score Entropy (DSE) in Discrete Diffusion
- Denoising Score Entropy (DSE) is a score-matching loss that bridges discrete CTMC diffusion with exact log-likelihood estimation, eliminating variational looseness.
- It employs a time-dependent score network to estimate ratios of CTMC marginals, yielding tight, unbiased likelihood estimates with significant variance reduction.
- The I-MDSE framework enables principled conditional likelihood estimation and stable training for discrete data in language and sequence modeling.
Denoising Score Entropy (DSE) is a score-matching loss introduced within an information-theoretic framework for discrete-state continuous-time Markov chain (CTMC) diffusion models. DSE establishes a principled connection between diffusion-based generative modeling for discrete data and exact log-likelihood estimation, forming the basis of the Information-Minimum Denoising Score Entropy (I-MDSE) relation. DSE unifies the training of score networks and likelihood evaluation, providing tight, unbiased estimators without variational looseness. This development parallels the role of the I-MMSE identity in Gaussian diffusion for continuous data and extends to a variety of conditional and masked likelihood tasks relevant in modern language and sequence modeling (Jeon et al., 28 Oct 2025).
1. Formal Definition and Loss Structure
Let $x_0$ denote a clean datum, with forward CTMC marginals $p_{t|0}(\cdot \mid x_0)$ under rate matrix $Q_t$. A time-dependent score network $s_\theta(x_t, t)_y$ aims to estimate the ratio $p_t(y)/p_t(x_t)$ for $y \neq x_t$. The pointwise (denoising) DSE loss is defined as

$$\ell_{\mathrm{DSE}}(x_t, t; x_0) \;=\; \sum_{y \neq x_t} Q_t(y, x_t)\left[\, s_\theta(x_t, t)_y \;-\; \frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x_t \mid x_0)}\,\log s_\theta(x_t, t)_y \;+\; K\!\left(\frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x_t \mid x_0)}\right)\right],$$

where $K(a) = a \log a - a$. In expectation over $x_0$ and $x_t \sim p_{t|0}(\cdot \mid x_0)$, the loss is uniquely minimized by the true score $s_\theta(x_t, t)_y = p_t(y)/p_t(x_t)$. DSE thus generalizes the continuous-state MSE from Gaussian diffusion to the categorical, logarithmic-loss context, providing a natural, stable loss for discrete score matching.
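In code, the pointwise loss can be sketched as follows (a minimal sketch: `q_row`, `s`, and `r` are hypothetical array names holding the rate weights, the model scores, and the target ratios over $y \neq x_t$):

```python
import numpy as np

def dse_pointwise(q_row, s, r):
    """Pointwise denoising score entropy at a state x_t.

    q_row : forward rates Q_t(y, x_t), one entry per y != x_t
    s     : model scores s_theta(x_t)_y (positive)
    r     : target ratios p_{t|0}(y | x_0) / p_{t|0}(x_t | x_0) (positive)
    """
    K = r * np.log(r) - r                       # normalizer K(a) = a*log(a) - a
    return float(np.sum(q_row * (s - r * np.log(s) + K)))
```

By construction each summand vanishes at `s == r` and is strictly positive otherwise, which is what makes the loss a proper score-matching objective.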
2. Information-Theoretic Foundations: The I-MDSE Relation
Minimum DSE at time $t$ is defined as

$$\mathrm{mDSE}(t) \;=\; \min_{s}\; \mathbb{E}_{x_0,\, x_t \sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_t, t; x_0)\big],$$

where the minimum over score functions is attained at the true marginal score $s^\star(x_t)_y = p_t(y)/p_t(x_t)$, with the conditional version

$$\mathrm{mDSE}(t \mid x_0) \;=\; \mathbb{E}_{x_t \sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_t, t; x_0)\big]\big|_{s_\theta = s^\star}, \qquad \mathrm{mDSE}(t) = \mathbb{E}_{x_0}\big[\mathrm{mDSE}(t \mid x_0)\big].$$

The core identity (I-MDSE, Theorem 3.1) is

$$-\frac{d}{dt}\, D_{\mathrm{KL}}\!\big(p_{t|0}(\cdot \mid x_0)\,\big\|\,p_t\big) \;=\; \mathrm{mDSE}(t \mid x_0).$$

Averaging over $x_0 \sim p_0$ yields

$$-\frac{d}{dt}\, I(x_0;\, x_t) \;=\; \mathrm{mDSE}(t),$$

since $I(x_0; x_t) = \mathbb{E}_{x_0}\big[D_{\mathrm{KL}}\big(p_{t|0}(\cdot \mid x_0)\,\|\,p_t\big)\big]$.
This identity, proven via the path-space KL chain rule and Dynkin’s formula, asserts that the instantaneous rate of mutual information decay between data and diffused states is precisely the minimum DSE.
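The relation can be checked numerically on a toy chain. The sketch below (all names illustrative; the rate matrix is chosen symmetric so that $e^{tQ}$ has a clean eigendecomposition) compares a finite-difference derivative of $I(x_0; x_t)$ against the minimum-DSE expression evaluated at the optimal (marginal-ratio) score:

```python
import numpy as np

# Finite-difference check of -dI(x0; x_t)/dt = mDSE(t) on a 3-state CTMC.
Q = np.array([[-1.5, 1.0, 0.5],
              [ 1.0, -1.8, 0.8],
              [ 0.5, 0.8, -1.3]])   # symmetric rate matrix, rows sum to 0
p0 = np.array([0.6, 0.3, 0.1])      # data distribution
w, V = np.linalg.eigh(Q)            # Q symmetric => expm(tQ) = V e^{tw} V^T

def marginals(t):
    P = V @ np.diag(np.exp(t * w)) @ V.T   # P[a, x] = p_{t|0}(x | a)
    return P, p0 @ P                       # conditional and marginal laws

def mutual_info(t):
    P, pt = marginals(t)
    return float(sum(p0[a] * np.sum(P[a] * np.log(P[a] / pt))
                     for a in range(3)))

def mdse(t):
    P, pt = marginals(t)
    total = 0.0
    for a in range(3):                     # clean datum x0 = a
        for x in range(3):                 # noisy state x_t = x
            for y in range(3):
                if y == x:
                    continue
                r = P[a, y] / P[a, x]      # denoising (conditional) ratio
                b = pt[y] / pt[x]          # optimal score: marginal ratio
                # DSE summand at the optimum: b - r*log(b) + K(r)
                total += p0[a] * P[a, x] * Q[y, x] * (
                    b - r + r * np.log(r / b))
    return total
```

A central difference of `mutual_info` around any fixed `t > 0` agrees with `mdse(t)` to within finite-difference error, which is a direct numerical instance of the stated decay identity.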
3. Log-Likelihood Decomposition and Exactness
Integrating the I-MDSE relation over time leads to the following time-integral decomposition (Theorem 3.2). For any $T > 0$,

$$-\log p_0(x_0) \;=\; \int_0^T \mathrm{mDSE}(t \mid x_0)\, dt \;+\; D_{\mathrm{KL}}\!\big(p_{T|0}(\cdot \mid x_0)\,\big\|\,p_T\big),$$

and as $T \to \infty$, with the forward CTMC converging to its stationary distribution, the residual KL term vanishes:

$$-\log p_0(x_0) \;=\; \int_0^\infty \mathrm{mDSE}(t \mid x_0)\, dt.$$
This is an equality, not a variational bound. The area under the minimum DSE curve exactly recovers the negative log-likelihood of the initial data. Earlier score-matching approaches supplied only upper bounds; DSE provides tight, unbiased, and exact log-likelihood estimation in the discrete regime.
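On the same kind of toy chain, this exactness can be observed directly: numerically integrating the conditional minimum DSE over time reproduces $-\log p_0(x_0)$. The sketch below (illustrative names; symmetric rate matrix for an easy matrix exponential) uses a log-spaced grid because the integrand grows like $\log(1/t)$ near $t = 0$:

```python
import numpy as np

# Check that the area under the conditional minimum-DSE curve recovers
# -log p0(x0) on a 3-state CTMC with a symmetric rate matrix.
Q = np.array([[-1.5, 1.0, 0.5],
              [ 1.0, -1.8, 0.8],
              [ 0.5, 0.8, -1.3]])
p0 = np.array([0.6, 0.3, 0.1])
w, V = np.linalg.eigh(Q)                 # expm(tQ) = V e^{tw} V^T

def mdse_cond(t, x0):
    P = V @ np.diag(np.exp(t * w)) @ V.T     # P[a, x] = p_{t|0}(x | a)
    pt = p0 @ P
    val = 0.0
    for x in range(3):
        for y in range(3):
            if y == x:
                continue
            a = P[x0, y] / P[x0, x]          # denoising ratio
            b = pt[y] / pt[x]                # marginal (optimal-score) ratio
            val += P[x0, x] * Q[y, x] * (b - a + a * np.log(a / b))
    return val

x0 = 2
ts = np.geomspace(1e-6, 12.0, 2000)          # log-spaced quadrature grid
vals = np.array([mdse_cond(t, x0) for t in ts])
area = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))
```

The computed `area` matches `-np.log(p0[x0])` to quadrature accuracy, with no variational slack to account for.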
4. Properties as a Score and Likelihood Estimator
DSE is characterized by the following properties:
- Tightness: The integral of the minimum DSE precisely characterizes $-\log p_0(x_0)$, holding with equality rather than as a bound, and incurring no bias beyond model error when a learned score $s_\theta$ stands in for the true score $s^\star$.
- Variance: The decomposition removes the “variational gap” found in prior methods, yielding lower Monte Carlo variance, especially in its time-free formulation. Empirically, time-free estimators exhibit up to an order-of-magnitude lower variance compared to time-integral approaches.
- Comparative Stability: For discrete CTMCs, DSE functions as the canonical, logarithmic-loss analogue of the MSE in Gaussian diffusion, yielding stable training and principled likelihood estimation, in contrast to unstable objectives such as plain squared error on score ratios.
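One way to see the stability contrast concretely (a sketch with illustrative names; the large target ratio mimics states where true score ratios are enormous, e.g. near an absorbing mask state): the per-coordinate DSE gradient depends only on the multiplicative error between score and target, while a squared-error gradient scales with the absolute magnitude of the ratio.

```python
import numpy as np

def dse_term(s, a):
    # Per-coordinate DSE summand: s - a*log(s) + K(a), K(a) = a*log(a) - a.
    return s - a * np.log(s) + a * np.log(a) - a

def dse_grad(s, a):
    # d/ds of the DSE summand: a function of the multiplicative error a/s only.
    return 1.0 - a / s

def mse_grad(s, a):
    # d/ds of the squared error (s - a)^2: scales with the ratio magnitude.
    return 2.0 * (s - a)

a = 1e4            # very large true ratio
s = a / 2          # model off by a factor of two
```

Here `dse_grad(s, a)` is `-1.0` whether the ratio is 10 or 10,000, while `mse_grad(s, a)` is `-1e4`, illustrating why logarithmic-loss score matching trains stably across the extreme ratio scales that arise in discrete diffusion.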
5. Practical Extensions and Applications
The I-MDSE framework enables several algorithmic constructions:
- Time-Free Likelihood Estimation: For sequences of length $L$ under masked (absorbing) diffusion, the exact identity

$$-\log p_\theta(x_0) \;=\; \sum_{n=1}^{L} \mathbb{E}_{|M| = n}\Big[\frac{1}{n} \sum_{i \in M} -\log p_\theta\big(x_0^i \,\big|\, x_0^{M^c}\big)\Big],$$

with $M$ a uniformly random set of $n$ masked positions, removes the time variable entirely. With $H_L$ the $L$th harmonic number entering as normalizer and the mask size sampled via a Beta-weighted scheme, this provides a single-shot Monte Carlo estimator of log-likelihood with lower sample variance than direct time integration.
- Conditional and Prompt-Response Likelihoods: For disjoint index sets $A$ (target) and $B$ (context),

$$-\log p_\theta\big(x_0^A \,\big|\, x_0^B\big) \;=\; \sum_{n=1}^{|A|} \mathbb{E}_{M \subseteq A,\, |M| = n}\Big[\frac{1}{n} \sum_{i \in M} -\log p_\theta\big(x_0^i \,\big|\, x_0^{(A \setminus M) \cup B}\big)\Big],$$

where masking is restricted to $A$ while the context $B$ stays unmasked. This identity directly enables prompt-to-response or conditioned sequence likelihoods, as in language modeling.
- Coupled Ratio Estimation: For two sequences $x_0, x_0'$ of equal length,

$$\log \frac{p_\theta(x_0)}{p_\theta(x_0')} \;=\; \sum_{n=1}^{L} \mathbb{E}_{|M| = n}\Big[\frac{1}{n} \sum_{i \in M} \log \frac{p_\theta\big(x_0^i \mid x_0^{M^c}\big)}{p_\theta\big(x_0'^{\,i} \mid x_0'^{\,M^c}\big)}\Big].$$

Sharing the sampled masks $M$ between $x_0$ and $x_0'$ significantly reduces the variance of log-likelihood ratio estimates compared to independent estimators.
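These mask-based identities can be verified exactly on a tiny example. The sketch below (a hypothetical binary Markov chain of length 3; all names illustrative) enumerates every mask and confirms that the time-free decomposition reproduces $-\log p(x)$ exactly; running the same loop on two sequences with identical masks $M$ yields the coupled ratio estimator.

```python
import itertools
import numpy as np

# Exact check of the time-free (masked) likelihood identity
#   -log p(x) = sum_{n=1}^{L} E_{|M|=n} [ (1/n) sum_{i in M} -log p(x_i | x_{M^c}) ]
# on a small binary Markov chain where every conditional is enumerable.
L = 3
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])      # hypothetical transition matrix
init = np.array([0.5, 0.5])

def joint(seq):
    p = init[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= trans[a, b]
    return p

def cond(x, i, keep):
    # p(x_i | x_j for j in keep), by marginalizing the joint over masked sites.
    num = den = 0.0
    for z in itertools.product(range(2), repeat=L):
        if all(z[j] == x[j] for j in keep):
            den += joint(z)
            if z[i] == x[i]:
                num += joint(z)
    return num / den

x = (0, 1, 1)
nll = 0.0
for n in range(1, L + 1):
    masks = list(itertools.combinations(range(L), n))
    inner = 0.0
    for M in masks:                           # uniform over size-n masks
        keep = [j for j in range(L) if j not in M]
        inner += (1.0 / n) * sum(-np.log(cond(x, i, keep)) for i in M)
    nll += inner / len(masks)
```

The enumerated `nll` equals `-np.log(joint(x))` to machine precision; a Monte Carlo version samples one mask per draw, and evaluating it on two sequences with the same `M` cancels shared randomness in the ratio.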
6. Empirical Performance and Experimental Results
Table: Empirical Findings for DSE-Based Estimators
| Task/Domain | Estimation Benchmark | Key Finding |
|---|---|---|
| DNA sequence (length 8) | Ground-truth log-probabilities | Time-free estimator matches ground truth |
| Markov chain subsequence (length 32) | Conditional log-likelihoods | Exact match to Markov chain likelihood |
| HellaSwag, ARC-hard, PIQA | Large-scale NLL estimation | Variance falls by a factor of at least $4\times$ (Table 1) |
| BeaverTails (ratio estimator) | Likelihood ratio estimation | Substantially lower variance with shared masks (Fig. 2(b)) |
In all large-scale benchmarks, approximately $100$ Monte Carlo samples per sequence suffice for high-precision, low-variance estimation, owing to the efficiency of the time-free and coupled approaches. In out-of-distribution detection (text8 vs. GPT-4 continuations), DSE-based negative log-likelihood reliably separates OOD and true samples. For model auditing (e.g., LLaDA 8B on WikiText vs. LLaMA-3.1), DSE reveals statistically significant NLL differences, suggesting model influence and aiding attribution studies (Jeon et al., 28 Oct 2025).
7. Significance and Theoretical Implications
Denoising Score Entropy provides the first exact, information-theoretic bridge between discrete diffusion score learning and log-likelihood evaluation. It eliminates the variational looseness of prior score-based objectives, giving rise to estimators that are not only unbiased but practical in terms of sample efficiency and variance reduction. This directly enables principled evaluation of large generative LLMs, conditional inference in prompt-response tasks, and robust uncertainty estimates in discrete generative modeling. A plausible implication is that DSE-based approaches will influence future work on likelihood-based evaluation and calibration of discrete generative architectures.