Information-Minimum Denoising Score Entropy

Updated 8 February 2026
  • I-MDSE is a theoretical framework that precisely relates the instantaneous decay of mutual information to the minimum DSE loss in discrete diffusion models.
  • It establishes an exact likelihood decomposition, analogous to the I-MMSE identity, by integrating the DSE loss to recover negative log-likelihood without variational gaps.
  • The framework enables practical likelihood estimation in score-based generative modeling through continuous-time Markov processes and is validated with empirical studies.

The Information-Minimum Denoising Score Entropy (I-MDSE) is a fundamental theoretical relation in discrete diffusion models, establishing an exact connection between the rate of mutual information decay in a continuous-time Markov process and the minimum achievable value of the Denoising Score Entropy (DSE) loss function. Analogous to the I-MMSE identity for Gaussian diffusion, I-MDSE enables a tight, time-integral decomposition of the negative log-likelihood in discrete data spaces. This framework provides both rigorous justification and practical tools for likelihood estimation in discrete score-based generative modeling, without loose variational approximations (Jeon et al., 28 Oct 2025).

1. Discrete Diffusion, DSE Loss, and Score Estimation

The I-MDSE framework is formulated within the discrete diffusion paradigm where the data space $\mathcal{X}$ is a finite set, with possible extension to sequences as $\mathcal{X}^L$. The forward process is a continuous-time Markov chain (CTMC) governed by a transition rate matrix $Q_t = \sigma(t) Q$, where $\sigma(t)$ is a nonnegative, smooth rate schedule. The process evolves the initial data distribution $p_0 = p_{\text{data}}$ towards a stationary distribution $\pi$ as $t\to\infty$.
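This setup can be made concrete with a small worked example. The sketch below (an illustrative assumption, not a construction from the text) uses the uniform-rate CTMC $Q = \frac{1}{K}\mathbf{1}\mathbf{1}^\top - I$ with constant schedule $\sigma(t) = 1$, whose transition kernel $e^{tQ}$ is available in closed form:

```python
import math

K = 4  # alphabet size (e.g., a DNA-like alphabet); illustrative choice

def p_t_given_0(y, x0, t):
    """Transition kernel of the uniform CTMC Q = (1/K)*11^T - I with sigma(t) = 1:
    exp(tQ) = e^{-t} I + (1 - e^{-t}) (1/K) 11^T."""
    return math.exp(-t) * (y == x0) + (1.0 - math.exp(-t)) / K

def p_t(y, t, p0):
    """Marginal at time t: a mixture of p0 and the uniform stationary distribution."""
    return math.exp(-t) * p0[y] + (1.0 - math.exp(-t)) / K

p0 = [0.5, 0.2, 0.2, 0.1]                      # illustrative data distribution
row = [p_t_given_0(y, 0, 0.5) for y in range(K)]
print(sum(row))                                # each kernel row sums to 1
print([p_t(y, 50.0, p0) for y in range(K)])    # large t: approaches uniform (0.25 each)
```

Any CTMC with a tractable forward law would serve; the uniform chain is chosen only because its kernel is analytic.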

In this context, a score network $s_t^\theta(x)_y$ is trained to approximate the marginal ratio $p_t(y)/p_t(x)$ for $y\neq x$. The central training loss is the Denoising Score Entropy (DSE) loss, defined pointwise for clean $x_0$ and noisy $x_t = x$ as:

$$\ell_{DSE}(x_0, x, t; s_t) = \sum_{y\neq x} Q_t(x, y) \left[ s_t(x)_y - \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} \log s_t(x)_y + K\!\left(\frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)}\right) \right]$$

where $K(a) = a(\log a - 1)$. By first-order optimality, the expected loss is minimized at the true marginal ratio $s_t^*(x)_y = p_t(y)/p_t(x)$; the resulting minimum value is in general nonzero. The pointwise minimum DSE is:

$$\operatorname{mdse}(x_0, t) = \mathbb{E}_{x\sim p_{t|0}(\cdot|x_0)}\left[\ell_{DSE}(x_0, x, t; s_t^*)\right]$$

and the marginal variant is

$$\operatorname{mdse}(t) = \mathbb{E}_{x_0\sim p_0}\left[\operatorname{mdse}(x_0, t)\right].$$
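As a sketch of these definitions, the snippet below evaluates $\ell_{DSE}$ and $\operatorname{mdse}$ on a uniform-rate CTMC over $K=4$ states with an illustrative data distribution `p0` (both the chain and `p0` are assumptions for the example, not taken from the text):

```python
import math

K = 4
p0 = [0.5, 0.2, 0.2, 0.1]
q = lambda y, x0, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}(y|x0)
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K          # marginal p_t(y)

def dse_loss(x0, x, t, s):
    """Pointwise DSE loss; s(x, y, t) is an estimate of the score p_t(y)/p_t(x).
    Off-diagonal rates Q_t(x, y) = 1/K for the uniform chain with sigma(t) = 1."""
    total = 0.0
    for y in range(K):
        if y == x:
            continue
        r = q(y, x0, t) / q(x, x0, t)   # conditional ratio p_{t|0}(y|x0)/p_{t|0}(x|x0)
        total += (1.0 / K) * (s(x, y, t) - r * math.log(s(x, y, t))
                              + r * (math.log(r) - 1.0))
    return total

def mdse(x0, t):
    """Pointwise minimum DSE: expectation over x ~ p_{t|0} at the optimal score."""
    s_star = lambda x, y, t: m(y, t) / m(x, t)
    return sum(q(x, x0, t) * dse_loss(x0, x, t, s_star) for x in range(K))

def expected_loss(t, s):
    """DSE loss averaged over x0 ~ p0 and x ~ p_{t|0}; minimized by the true score."""
    return sum(p0[x0] * sum(q(x, x0, t) * dse_loss(x0, x, t, s) for x in range(K))
               for x0 in range(K))

s_star = lambda x, y, t: m(y, t) / m(x, t)
s_bad = lambda x, y, t: 1.3 * s_star(x, y, t)  # a deliberately perturbed score
print(mdse(0, 0.8))                            # nonnegative
print(expected_loss(0.8, s_star) < expected_loss(0.8, s_bad))  # True
```

The comparison at the end illustrates first-order optimality: perturbing the true ratio strictly increases the expected loss.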

2. The I-MDSE Theorem: Information Decay and Score Entropy

The central result, the I-MDSE theorem, expresses the instantaneous rate of mutual information decay as the negative of the minimum denoising score entropy:

  • Pointwise: For every $t\geq 0$ and $x_0\in\mathcal{X}$,

$$\frac{d}{dt} \left[\mathrm{KL}\big(p_{t|0}(\cdot \mid x_0) \,\Vert\, p_t\big)\right] = -\operatorname{mdse}(x_0, t)$$

  • Marginal: Taking the expectation over $x_0 \sim p_0$ yields

$$\frac{d}{dt} I(x_0 ; x_t) = -\operatorname{mdse}(t)$$

These identities show that the minimum DSE loss captures the exact, instantaneous loss of information about $x_0$ under the forward diffusion at time $t$. Since $\operatorname{mdse}(t)\geq 0$, the mutual information $I(x_0; x_t)$ is monotonically decreasing in $t$, as expected for diffusion processes.
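The marginal identity can be checked numerically. Under the same illustrative uniform-CTMC assumptions as above (not specified by the text), the snippet compares a central finite difference of $I(x_0; x_t)$ against $-\operatorname{mdse}(t)$:

```python
import math

K, p0 = 4, [0.5, 0.2, 0.2, 0.1]   # illustrative alphabet and data distribution
q = lambda y, x0, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K          # p_t

def mutual_info(t):
    """I(x0; xt) = E_{x0~p0}[ KL(p_{t|0}(.|x0) || p_t) ]."""
    return sum(p0[x0] * sum(q(x, x0, t) * math.log(q(x, x0, t) / m(x, t))
                            for x in range(K))
               for x0 in range(K))

def mdse(t):
    """Marginal minimum DSE at the true score s* = p_t(y)/p_t(x)."""
    total = 0.0
    for x0 in range(K):
        for x in range(K):
            for y in range(K):
                if y == x:
                    continue
                r = q(y, x0, t) / q(x, x0, t)
                s = m(y, t) / m(x, t)
                total += p0[x0] * q(x, x0, t) * (1.0 / K) * (
                    s - r * math.log(s) + r * (math.log(r) - 1.0))
    return total

t, h = 0.7, 1e-4
fd = (mutual_info(t + h) - mutual_info(t - h)) / (2 * h)  # central-difference dI/dt
print(fd, -mdse(t))   # the two values agree: dI/dt = -mdse(t)
```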

3. Derivation and Mathematical Foundations

The derivation proceeds by analyzing the time derivative of $\mathrm{KL}(p_{t|0}(\cdot|x_0) \,\Vert\, p_t)$ for a CTMC using the chain rule for path-measure KL and Dynkin's formula:

$$\frac{d}{dt} \mathrm{KL}\big(p_{t|0}(\cdot|x_0) \,\Vert\, p_t\big) = \mathbb{E}_{x \sim p_{t|0}(\cdot \mid x_0)} \left[ \sum_{y \neq x} Q_t(x, y) \left( \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} \log \frac{p_t(y)\, p_{t|0}(x|x_0)}{p_t(x)\, p_{t|0}(y|x_0)} + \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} - \frac{p_t(y)}{p_t(x)} \right) \right]$$

Algebraic manipulation shows that the expectation on the right-hand side is exactly the negative of the DSE loss evaluated at the true ratio $s_t^*$, i.e., it equals $-\operatorname{mdse}(x_0, t)$; rearranging yields the I-MDSE identity.
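A quick numerical sanity check of this step, again under the illustrative uniform-CTMC assumptions: the expectation above (term by term, $-\ell_{DSE}$ at $s_t^*$) should match a finite difference of the pointwise KL:

```python
import math

K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}(y|x0)
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K      # marginal p_t(y)

def kl(t):
    """KL(p_{t|0}(.|x0) || p_t)."""
    return sum(q(x, t) * math.log(q(x, t) / m(x, t)) for x in range(K))

def dynkin_rhs(t):
    """The expectation from the derivation; term by term it is -l_DSE at s*."""
    total = 0.0
    for x in range(K):
        for y in range(K):
            if y == x:
                continue
            r = q(y, t) / q(x, t)   # conditional ratio
            s = m(y, t) / m(x, t)   # true marginal-ratio score s*
            total += q(x, t) * (1.0 / K) * (r * math.log(s / r) + r - s)
    return total

t, h = 0.5, 1e-4
fd = (kl(t + h) - kl(t - h)) / (2 * h)   # central difference of the KL
print(fd, dynkin_rhs(t))                 # the two values agree
```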

4. Log-Likelihood Decomposition and Tightness

Integrating the I-MDSE identity from $t=0$ to some $T$, and using that the lower limit gives $\mathrm{KL}(p_{0|0}(\cdot|x_0) \,\Vert\, p_0) = \mathrm{KL}(\delta_{x_0} \,\Vert\, p_0) = -\log p_0(x_0)$, yields:

$$-\log p_0(x_0) = \int_0^T \operatorname{mdse}(x_0, t)\, dt + \mathrm{KL}\big(p_{T|0}(\cdot|x_0) \,\Vert\, p_T\big)$$

As $T\to\infty$ and $p_T \to \pi$, provided $\pi$ is independent of $x_0$, the terminal KL term vanishes or is computable. The decomposition is then

$$-\log p_0(x_0) = \int_0^\infty \operatorname{mdse}(x_0, t)\, dt$$

This result is exact: unlike typical variational bounds on the negative log-likelihood, no looseness remains when the optimal $s_t^*$ is used. In applied settings, a neural network $s_t^\theta$ approximates $s_t^*$, introducing only estimator error, not a variational gap.
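Under the same toy-model assumptions used above, the decomposition can be verified by numerically integrating $\operatorname{mdse}(x_0, t)$ over time and comparing against $-\log p_0(x_0)$:

```python
import math

K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K

def mdse(t):
    """Pointwise minimum DSE at the true score s* = p_t(y)/p_t(x)."""
    total = 0.0
    for x in range(K):
        for y in range(K):
            if y == x:
                continue
            r = q(y, t) / q(x, t)
            s = m(y, t) / m(x, t)
            total += q(x, t) * (1.0 / K) * (s - r * math.log(s)
                                            + r * (math.log(r) - 1.0))
    return total

# Trapezoidal rule on a log-spaced grid: mdse ~ log(1/t) near t = 0 (integrable),
# and it decays exponentially for large t, so truncating to [1e-6, ~31.6] is benign.
ts = [10 ** (-6 + 7.5 * i / 4000) for i in range(4001)]
integral = sum(0.5 * (mdse(ts[i]) + mdse(ts[i + 1])) * (ts[i + 1] - ts[i])
               for i in range(4000))
print(integral, -math.log(p0[x0]))   # both approximately 0.693
```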

5. Practical Estimation and Empirical Deployment

Estimation of $-\log p_0(x_0)$ in practice proceeds as follows:

  1. Time discretization: Choose a grid $0 = t_0 < t_1 < \dots < t_M = T$ (or extending toward $\infty$).
  2. Sampling: For each $t_k$, sample $x_{t_k} \sim p_{t_k|0}(\cdot|x_0)$ using the CTMC's known forward law (direct sampling, thinning, etc.).
  3. Loss evaluation: Compute $\ell_{DSE}(x_0, x_{t_k}, t_k; s_{t_k}^\theta)$.
  4. Integration: Form the Riemann sum $\sum_k \ell_{DSE}(x_0, x_{t_k}, t_k; s_{t_k}^\theta)\,(t_{k+1} - t_k)$. As $M\to\infty$ and the grid refines, this converges to the exact negative log-likelihood integral.
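The four steps can be sketched end to end. In this toy sketch the exact marginal ratio stands in for a trained network $s_t^\theta$ (so only Monte Carlo and discretization error remain); the CTMC, grid, and sample sizes are illustrative choices, not prescribed by the text:

```python
import math, random

random.seed(0)
K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # forward law p_{t|0}
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K      # marginal p_t

def score(y, x, t):
    """Stand-in for a trained network s_t^theta: here, the exact ratio p_t(y)/p_t(x)."""
    return m(y, t) / m(x, t)

def dse(x, t):
    """Pointwise DSE loss at noisy state x, evaluated with the stand-in score."""
    total = 0.0
    for y in range(K):
        if y == x:
            continue
        r = q(y, t) / q(x, t)
        total += (1.0 / K) * (score(y, x, t) - r * math.log(score(y, x, t))
                              + r * (math.log(r) - 1.0))
    return total

ts = [10 ** (-3 + 4.3 * i / 200) for i in range(201)]  # step 1: log grid, 1e-3 .. ~20
nll = 0.0
for i in range(200):
    t = ts[i]
    xs = random.choices(range(K), weights=[q(y, t) for y in range(K)], k=200)  # step 2
    avg = sum(dse(x, t) for x in xs) / len(xs)                                 # step 3
    nll += avg * (ts[i + 1] - ts[i])                                           # step 4
print(nll)   # close to -log p0(x0) = 0.693, up to MC and discretization error
```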

Empirical evaluations, including on synthetic DNA-alphabet CTMCs, demonstrate that this estimator closely matches ground-truth likelihoods. The score network trained via DSE simultaneously learns the required integrand for likelihood estimation (Jeon et al., 28 Oct 2025).

6. Underlying Assumptions and Theoretical Significance

The I-MDSE identity, as well as its log-likelihood decomposition, rely on several conditions:

  • The forward diffusion is a CTMC with smooth, time-dependent rates.
  • The DSE loss $\ell_{DSE}$ is minimized in expectation exactly at the true marginal ratio $s_t^*(x)_y = p_t(y)/p_t(x)$.
  • The forward process approaches a stationary distribution $\pi$ as $t\to\infty$, with $\pi$ independent of $x_0$, ensuring the terminal KL term vanishes or is trivial.

Provided these hold, I-MDSE is an equality, not a bound. This contrasts with standard variational inference, which incurs additional slack. The result thus theoretically validates score-matching for discrete diffusion as a tight likelihood estimator.

7. Extensions and Broader Context

I-MDSE is part of a broader information-theoretic framework for discrete diffusion, which also includes the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation for masked processes and conditional likelihoods. These developments allow time-free likelihood estimation, likelihood-ratio estimation via Monte Carlo coupling, and principled treatment of prompt-response or in-context prediction tasks (Jeon et al., 28 Oct 2025). The I-MDSE result is the discrete diffusion analogue of the classical I-MMSE identity in Gaussian settings, thus generalizing information-theoretic score-based learning beyond the continuous domain.
