
Denoising Score Entropy (DSE) in Discrete Diffusion

Updated 21 January 2026
  • Denoising Score Entropy (DSE) is a score-matching loss that bridges discrete CTMC diffusion with exact log-likelihood estimation, eliminating variational looseness.
  • It employs a time-dependent score network to estimate ratios of CTMC marginals, yielding tight, unbiased likelihood estimates with significant variance reduction.
  • The I-MDSE framework enables principled conditional likelihood estimation and stable training for discrete data in language and sequence modeling.

Denoising Score Entropy (DSE) is a score-matching loss introduced within an information-theoretic framework for discrete-state continuous-time Markov chain (CTMC) diffusion models. DSE establishes a principled connection between diffusion-based generative modeling for discrete data and exact log-likelihood estimation, forming the basis of the Information-Minimum Denoising Score Entropy (I-MDSE) relation. DSE unifies the training of score networks and likelihood evaluation, providing tight, unbiased estimators without variational looseness. This development parallels the role of the I-MMSE identity in Gaussian diffusion for continuous data and extends to a variety of conditional and masked likelihood tasks relevant in modern language and sequence modeling (Jeon et al., 28 Oct 2025).

1. Formal Definition and Loss Structure

Let $x_0 \sim p_0$ denote a clean datum, with forward CTMC marginals $\{p_{t|0}(x_t \mid x_0)\}_{t\ge 0}$ under rate matrix $Q_t$. A time-dependent score network $s_t:\mathcal X\to\mathbb R^N$ aims to estimate the ratio $s_t(x)_y \approx p_t(y)/p_t(x)$ for $y \neq x$. The pointwise DSE loss is defined as

$$\ell_{\mathrm{DSE}}(x_0, x, t, s_t) = \sum_{y\neq x} Q_t(x, y)\Big[s_t(x)_y - \frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x \mid x_0)}\log s_t(x)_y + K\Big(\frac{p_{t|0}(y \mid x_0)}{p_{t|0}(x \mid x_0)}\Big)\Big]$$

where $K(a) = a(\log a - 1)$. In expectation over the data and the noisy state, the loss is uniquely minimized by the true score $s_t^\star(x)_y = p_t(y)/p_t(x)$. DSE thus generalizes the continuous-state MSE from Gaussian diffusion to the categorical, logarithmic-loss setting, providing a natural, stable loss for discrete score matching.
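
This minimization property can be checked numerically. The sketch below uses an assumed toy posterior over clean states (the weights and ratios are illustrative, not from the paper) and shows that the expected value of $s - r\log s + K(r)$ over conditional ratios $r$ is minimized at the marginal ratio $\mathbb E[r] = p_t(y)/p_t(x)$, with nonnegative minimum:

```python
import numpy as np

def K(a):
    # K(a) = a*(log a - 1): the constant term making the loss vanish at a perfect score
    return a * (np.log(a) - 1.0)

def dse_pointwise(s, r, q=1.0):
    # one neighbour's contribution: Q(x,y) * [ s - r*log(s) + K(r) ]
    return q * (s - r * np.log(s) + K(r))

# assumed toy posterior over clean states x0 given x_t = x (three states)
w = np.array([0.5, 0.3, 0.2])   # p_{0|t}(x0 | x)
r = np.array([0.1, 1.5, 4.0])   # conditional ratios p_{t|0}(y|x0) / p_{t|0}(x|x0)

# scan candidate score values and take the minimizer of the expected loss
grid = np.linspace(0.01, 5.0, 20001)
expected = (w * dse_pointwise(grid[:, None], r)).sum(axis=1)
s_hat = grid[np.argmin(expected)]

# the expected loss is minimized at the marginal ratio E[r] = p_t(y)/p_t(x)
print(s_hat, float(np.sum(w * r)))   # both ≈ 1.30
```

The Bregman form $s - r\log s$ is what makes the random conditional targets average out to the marginal score, mirroring how MSE regression targets average to the posterior mean in Gaussian diffusion.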

2. Information-Theoretic Foundations: The I-MDSE Relation

Minimum DSE at time $t$ is defined as

$$\mathrm{mdse}(t) = \min_{s_t} \mathbb E_{x_0\sim p_0,\; x_t\sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_0, x_t, t, s_t)\big] = \mathbb E\big[\ell_{\mathrm{DSE}}(x_0, x_t, t, s_t^\star)\big]$$

with the conditional version

$$\mathrm{mdse}(x_0, t) = \mathbb E_{x_t\sim p_{t|0}(\cdot \mid x_0)}\big[\ell_{\mathrm{DSE}}(x_0, x_t, t, s_t^\star)\big]$$

The core identity (I-MDSE, Theorem 3.1) is

$$\frac{d}{dt} D_{\mathrm{KL}}\big(p_{t|0}(\cdot \mid x_0) \,\big\|\, p_t\big) = -\mathrm{mdse}(x_0, t)$$

Averaging over $x_0 \sim p_0$ yields

$$\frac{d}{dt} I(x_0; x_t) = -\mathrm{mdse}(t)$$

This identity, proven via the path-space KL chain rule and Dynkin’s formula, asserts that the instantaneous rate of mutual information decay between data and diffused states is precisely the minimum DSE.
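The identity can be checked numerically on a small chain. The sketch below uses an assumed toy (a two-state CTMC with unit flip rate, not an example from the paper) and compares a central finite difference of $D_{\mathrm{KL}}(p_{t|0}(\cdot\mid x_0)\,\|\,p_t)$ against $-\mathrm{mdse}(x_0, t)$:

```python
import numpy as np

A = 0.3   # assumed data law p0 = (A, 1-A) on states {0, 1}; condition on x0 = 0

def marginals(t):
    # uniform flip-rate-1 two-state CTMC: P_t(x, x) = (1 + e^{-2t}) / 2
    e = np.exp(-2.0 * t)
    q = np.array([(1 + e) / 2, (1 - e) / 2])      # p_{t|0}(. | x0 = 0)
    p = np.array([(1 + e * (2 * A - 1)) / 2,
                  (1 - e * (2 * A - 1)) / 2])     # marginal p_t
    return q, p

def kl(t):
    q, p = marginals(t)
    return float(np.sum(q * np.log(q / p)))

def mdse(t):
    # E_{x_t ~ p_{t|0}} of the pointwise DSE loss evaluated at the true score
    q, p = marginals(t)
    out = 0.0
    for x in (0, 1):
        y = 1 - x                                 # the single neighbour; Q(x, y) = 1
        s, r = p[y] / p[x], q[y] / q[x]
        out += q[x] * (s - r * np.log(s) + r * (np.log(r) - 1.0))
    return out

t, h = 0.7, 1e-5
lhs = (kl(t + h) - kl(t - h)) / (2 * h)           # d/dt KL(p_{t|0} || p_t)
print(lhs, -mdse(t))                              # the two sides should agree
```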

3. Log-Likelihood Decomposition and Exactness

Integrating the I-MDSE relation over time leads to the following time-integral decomposition (Theorem 3.2):

$$-\log p_0(x_0) = \int_0^T \mathrm{mdse}(x_0, t)\,dt + D_{\mathrm{KL}}\big(p_{T|0}(\cdot \mid x_0) \,\big\|\, p_T\big)$$

For $T\to\infty$ and $p_T\to\pi$,

$$-\log p_0(x_0) = \int_0^\infty \mathrm{mdse}(x_0, t)\,dt$$

This is an equality, not a variational bound. The area under the minimum DSE curve exactly recovers the negative log-likelihood of the initial data. Earlier score-matching approaches supplied only upper bounds; DSE provides tight, unbiased, and exact log-likelihood estimation in the discrete regime.
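The exactness claim can be verified by quadrature on a toy chain: integrating $\mathrm{mdse}(x_0, t)$ over $t$ recovers $-\log p_0(x_0)$. The sketch below assumes a two-state CTMC with unit flip rate (an illustrative setup, not from the paper):

```python
import numpy as np

A = 0.3   # assumed data law p0 = (A, 1-A) on {0, 1}; target: -log p0(0) = -log A

def mdse(t):
    # minimum DSE for the uniform flip-rate-1 two-state CTMC, conditioned on x0 = 0
    e = np.exp(-2.0 * t)
    q = [(1 + e) / 2, (1 - e) / 2]                               # p_{t|0}(. | x0 = 0)
    p = [(1 + e * (2 * A - 1)) / 2, (1 - e * (2 * A - 1)) / 2]   # marginal p_t
    out = 0.0
    for x in (0, 1):
        y = 1 - x                                 # single neighbour, Q(x, y) = 1
        s, r = p[y] / p[x], q[y] / q[x]
        out += q[x] * (s - r * np.log(s) + r * (np.log(r) - 1.0))
    return out

# trapezoidal quadrature on a log-spaced grid: the integrand has a mild -log t
# singularity at 0 and decays exponentially, so [1e-8, 30] captures the area
ts = np.logspace(-8, np.log10(30.0), 4000)
vals = np.array([mdse(t) for t in ts])
area = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))
print(area, -np.log(A))                           # should match closely
```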

4. Properties as a Score and Likelihood Estimator

DSE is characterized by the following properties:

  • Tightness: The integral of the minimum DSE precisely characterizes $-\log p_0$, with $\mathrm{mdse}(t)\ge 0$ and no bias beyond model error if $s_t^\theta \approx s_t^\star$.
  • Variance: The decomposition removes the “variational gap” found in prior methods, yielding lower Monte Carlo variance, especially in its time-free formulation. Empirically, time-free estimators exhibit up to an order-of-magnitude lower variance compared to time-integral approaches.
  • Comparative Stability: For discrete CTMC, DSE functions as the canonical, logarithmic-loss analogue to the MSE in Gaussian diffusion, yielding stable training and principled likelihood estimation, in contrast to unstable objectives such as plain $\ell^2$ on score ratios.

5. Practical Extensions and Applications

The I-MDSE framework enables several algorithmic constructions:

  • Time-Free Likelihood Estimation: For sequences of length $L$, the estimator

$$-\log p_0(x_0) = H_L\,\mathbb E_I\Big[\sum_{i\notin I} \log \frac{1}{p_0(x_0^i \mid x_0^I)}\Big]$$

with $H_L$ the $L$th harmonic number, and $I$ sampled via a Beta-weighted scheme, provides a single-shot Monte Carlo estimator of log-likelihood with lower sample variance than direct time integration.
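
For intuition, the identity can be checked by exact enumeration on a tiny joint distribution. The sketch below uses one concrete mask law consistent with the $H_L$ normalization (keep a set $I$ of size $k$ with probability proportional to $1/(L-k)$, then a uniform subset of that size); whether this matches the paper's Beta-weighted sampler is an assumption, but the equality holds exactly for this size law:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
L = 3
states = list(itertools.product([0, 1], repeat=L))
probs = rng.random(len(states))
probs /= probs.sum()
p0 = dict(zip(states, probs))            # arbitrary joint distribution over {0,1}^3

def cond(x, i, idx):
    # p0(x^i | x^idx), obtained by marginalising the joint table
    num = sum(p for s, p in p0.items()
              if s[i] == x[i] and all(s[j] == x[j] for j in idx))
    den = sum(p for s, p in p0.items() if all(s[j] == x[j] for j in idx))
    return num / den

x0 = (1, 0, 1)
H_L = sum(1.0 / n for n in range(1, L + 1))

# exact expectation over masks: size-k kept set I with prob 1/(H_L*(L-k)),
# k = 0..L-1, then a uniform subset of that size (assumed size law)
total = 0.0
for k in range(L):
    pk = 1.0 / (H_L * (L - k))
    inner = np.mean([sum(-math.log(cond(x0, i, I))
                         for i in range(L) if i not in I)
                     for I in itertools.combinations(range(L), k)])
    total += pk * inner

print(H_L * total, -math.log(p0[x0]))    # exact equality up to float error
```

The equality follows from averaging the chain-rule decomposition of $-\log p_0(x_0)$ over uniformly random variable orderings; the $1/(L-k)$ size weighting makes the per-prefix-size terms sum with unit weight.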

  • Conditional and Prompt-Response Likelihoods: For disjoint index sets $I_1$ (target) and $I_2$ (context):

$$-\log p_0(x_0^{I_1} \mid x_0^{I_2}) = H_{|I_1|}\,\mathbb E_J\Big[\sum_{i\in I_1\setminus J} \log \frac{1}{p_0(x_0^i \mid x_0^{J\cup I_2})}\Big]$$

This identity directly enables prompt-to-response or conditioned sequence likelihoods, as in language modeling.
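
The same enumeration-style check applies here: the conditional identity is the time-free decomposition applied to $p_0(\cdot \mid x_0^{I_2})$. The sketch below assumes an illustrative $1/(|I_1|-j)$ size law for $|J|$ (a stand-in for the paper's Beta-weighted sampler) and verifies the equality exactly on a tiny joint table:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)
L = 3
states = list(itertools.product([0, 1], repeat=L))
probs = rng.random(len(states))
probs /= probs.sum()
p0 = dict(zip(states, probs))            # arbitrary joint distribution over {0,1}^3

def marg(x, idx):
    # probability that the coordinates in idx equal x's values there
    return sum(p for s, p in p0.items() if all(s[j] == x[j] for j in idx))

def cond(x, i, idx):
    # p0(x^i | x^idx)
    return marg(x, tuple(idx) + (i,)) / marg(x, idx)

x0 = (1, 0, 1)
I1, I2 = (0, 1), (2,)                    # target / context index sets
n1 = len(I1)
H = sum(1.0 / n for n in range(1, n1 + 1))

# exact expectation over J ⊆ I1: keep size j with prob 1/(H*(n1-j)) (assumed law)
total = 0.0
for j in range(n1):
    pj = 1.0 / (H * (n1 - j))
    inner = np.mean([sum(-math.log(cond(x0, i, set(J) | set(I2)))
                         for i in I1 if i not in J)
                     for J in itertools.combinations(I1, j)])
    total += pj * inner

target = -math.log(marg(x0, I1 + I2) / marg(x0, I2))   # -log p0(x0^{I1} | x0^{I2})
print(H * total, target)                               # exact equality
```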

  • Coupled Ratio Estimation: For two sequences $x, y \in \mathcal X^L$,

$$\log \frac{p_0(y)}{p_0(x)} = H_L\,\mathbb E_I\Big[\sum_{i\notin I} \big(\log p_0(y^i \mid y^I) - \log p_0(x^i \mid x^I)\big)\Big]$$

Shared mask coupling between xx and yy significantly reduces the variance of log-likelihood ratio estimates compared to independent estimators.
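
A quick simulation illustrates the variance claim. The sketch below assumes a toy factorized model (independent Bernoulli tokens, so conditionals reduce to marginals) and the same illustrative $1/(L-k)$ mask-size law used above; it compares the empirical variance of shared-mask versus independent-mask ratio estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
L = 16
# assumed toy model: independent Bernoulli tokens, so p0(z^i | z^I) = p0(z^i)
theta = rng.uniform(0.2, 0.8, size=L)
x = (rng.random(L) < theta).astype(int)
y = (rng.random(L) < theta).astype(int)

def logp(z):
    # per-coordinate log-marginals under the factorized toy model
    return np.where(z == 1, np.log(theta), np.log(1 - theta))

true_ratio = float(logp(y).sum() - logp(x).sum())

H_L = sum(1.0 / n for n in range(1, L + 1))
sizes = np.arange(L)
size_p = 1.0 / (H_L * (L - sizes))       # assumed mask-size law matching H_L

def sample_complement():
    # boolean mask of the summed-over indices i ∉ I
    k = rng.choice(sizes, p=size_p)
    keep = np.ones(L, dtype=bool)
    keep[rng.choice(L, size=k, replace=False)] = False
    return keep

def estimates(coupled, n=4000):
    out = np.empty(n)
    for j in range(n):
        m1 = sample_complement()
        m2 = m1 if coupled else sample_complement()   # shared vs independent masks
        out[j] = H_L * (logp(y)[m1].sum() - logp(x)[m2].sum())
    return out

coup, indep = estimates(True), estimates(False)
print(coup.mean(), indep.mean(), true_ratio)   # both unbiased
print(coup.var(), indep.var())                 # shared masks: much lower variance
```

Intuitively, the randomness of the kept-set size is the dominant noise source, and sharing the mask makes that noise cancel between the two sequences.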

6. Empirical Performance and Experimental Results

Table: Empirical Findings for DSE-Based Estimators

| Task/Domain | Estimation Benchmark | Result |
| --- | --- | --- |
| DNA sequence (length 8) | Ground-truth log-probabilities | Time-free estimator matches ground truth |
| Markov chain subsequence (length 32) | Conditional log-likelihoods | Exact match to Markov chain likelihood |
| HellaSwag, ARC-hard, PIQA | Large-scale NLL estimation | Variance falls 4–6× (Table 1) |
| BeaverTails (ratio estimator) | Likelihood ratio estimation | 7× lower variance (Fig. 2(b)) |

In all large-scale benchmarks, approximately 100 Monte Carlo samples per sequence suffice for high-precision, low-variance estimation, owing to the efficiency of the time-free and coupled approaches. In out-of-distribution detection (text8 vs. GPT-4 continuations), DSE-based negative log-likelihood reliably separates OOD and true samples. For model auditing (e.g., LLaDA 8B on WikiText vs. LLaMA-3.1), DSE reveals statistically significant NLL differences, suggesting model influence and aiding attribution studies (Jeon et al., 28 Oct 2025).

7. Significance and Theoretical Implications

Denoising Score Entropy provides the first exact, information-theoretic bridge between discrete diffusion score learning and log-likelihood evaluation. It eliminates the variational looseness of prior score-based objectives, giving rise to estimators that are not only unbiased but practical in terms of sample efficiency and variance reduction. This directly enables principled evaluation of large generative LLMs, conditional inference in prompt-response tasks, and robust uncertainty estimates in discrete generative modeling. A plausible implication is that DSE-based approaches will influence future work on likelihood-based evaluation and calibration of discrete generative architectures.
