Information-Minimum Denoising Score Entropy

Updated 8 February 2026
  • I-MDSE is a theoretical framework that precisely relates the instantaneous decay of mutual information to the minimum DSE loss in discrete diffusion models.
  • It establishes an exact likelihood decomposition, analogous to the I-MMSE identity, by integrating the DSE loss to recover negative log-likelihood without variational gaps.
  • The framework enables practical likelihood estimation in score-based generative modeling through continuous-time Markov processes and is validated with empirical studies.

The Information-Minimum Denoising Score Entropy (I-MDSE) is a fundamental theoretical relation in discrete diffusion models, establishing an exact connection between the rate of mutual information decay in a continuous-time Markov process and the minimum achievable value of the Denoising Score Entropy (DSE) loss function. Analogous to the I-MMSE identity for Gaussian diffusion, I-MDSE enables a tight, time-integral decomposition of the negative log-likelihood in discrete data spaces. This framework provides both rigorous justification and practical tools for likelihood estimation in discrete score-based generative modeling, without loose variational approximations (Jeon et al., 28 Oct 2025).

1. Discrete Diffusion, DSE Loss, and Score Estimation

The I-MDSE framework is formulated within the discrete diffusion paradigm where the data space $\mathcal{X}$ is a finite set, with possible extension to sequences as $\mathcal{X}^L$. The forward process is a continuous-time Markov chain (CTMC) governed by a transition rate matrix $Q_t = \sigma(t) Q$, where $\sigma(t)$ is a nonnegative, smooth rate schedule. The process evolves the initial data distribution $p_0 = p_{\text{data}}$ towards a stationary distribution $\pi$ as $t\to\infty$.
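This setup can be made concrete with a small worked example. The sketch below (an illustrative assumption, not a construction from the text) uses the uniform-rate CTMC $Q = \frac{1}{K}\mathbf{1}\mathbf{1}^\top - I$ with constant schedule $\sigma(t) = 1$, whose transition kernel $e^{tQ}$ is available in closed form:

```python
import math

K = 4  # alphabet size (e.g., a DNA-like alphabet); illustrative choice

def p_t_given_0(y, x0, t):
    """Transition kernel of the uniform CTMC Q = (1/K)*11^T - I with sigma(t) = 1:
    exp(tQ) = e^{-t} I + (1 - e^{-t}) (1/K) 11^T."""
    return math.exp(-t) * (y == x0) + (1.0 - math.exp(-t)) / K

def p_t(y, t, p0):
    """Marginal at time t: a mixture of p0 and the uniform stationary distribution."""
    return math.exp(-t) * p0[y] + (1.0 - math.exp(-t)) / K

p0 = [0.5, 0.2, 0.2, 0.1]                      # illustrative data distribution
row = [p_t_given_0(y, 0, 0.5) for y in range(K)]
print(sum(row))                                # each kernel row sums to 1
print([p_t(y, 50.0, p0) for y in range(K)])    # large t: approaches uniform (0.25 each)
```

Any CTMC with a tractable forward law would serve; the uniform chain is chosen only because its kernel is analytic.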

In this context, a score network $s_t^\theta(x)_y$ is trained to approximate the marginal ratio $p_t(y)/p_t(x)$ for $y\neq x$. The central training loss is the Denoising Score Entropy (DSE) loss, defined pointwise for clean $x_0$ and noisy $x_t = x$ as:

$$\ell_{DSE}(x_0, x, t; s_t) = \sum_{y\neq x} Q_t(x, y) \left[ s_t(x)_y - \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} \log s_t(x)_y + K\!\left(\frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)}\right) \right]$$

where $K(a) = a(\log a - 1)$. By first-order optimality, the expected loss is minimized at the true marginal ratio $s_t^*(x)_y = p_t(y)/p_t(x)$; the resulting minimum value is in general nonzero. The pointwise minimum DSE is:

$$\operatorname{mdse}(x_0, t) = \mathbb{E}_{x\sim p_{t|0}(\cdot|x_0)}\left[\ell_{DSE}(x_0, x, t; s_t^*)\right]$$

and the marginal variant is

$$\operatorname{mdse}(t) = \mathbb{E}_{x_0\sim p_0}\left[\operatorname{mdse}(x_0, t)\right].$$
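As a sketch of these definitions, the snippet below evaluates $\ell_{DSE}$ and $\operatorname{mdse}$ on a uniform-rate CTMC over $K=4$ states with an illustrative data distribution `p0` (both the chain and `p0` are assumptions for the example, not taken from the text):

```python
import math

K = 4
p0 = [0.5, 0.2, 0.2, 0.1]
q = lambda y, x0, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}(y|x0)
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K          # marginal p_t(y)

def dse_loss(x0, x, t, s):
    """Pointwise DSE loss; s(x, y, t) is an estimate of the score p_t(y)/p_t(x).
    Off-diagonal rates Q_t(x, y) = 1/K for the uniform chain with sigma(t) = 1."""
    total = 0.0
    for y in range(K):
        if y == x:
            continue
        r = q(y, x0, t) / q(x, x0, t)   # conditional ratio p_{t|0}(y|x0)/p_{t|0}(x|x0)
        total += (1.0 / K) * (s(x, y, t) - r * math.log(s(x, y, t))
                              + r * (math.log(r) - 1.0))
    return total

def mdse(x0, t):
    """Pointwise minimum DSE: expectation over x ~ p_{t|0} at the optimal score."""
    s_star = lambda x, y, t: m(y, t) / m(x, t)
    return sum(q(x, x0, t) * dse_loss(x0, x, t, s_star) for x in range(K))

def expected_loss(t, s):
    """DSE loss averaged over x0 ~ p0 and x ~ p_{t|0}; minimized by the true score."""
    return sum(p0[x0] * sum(q(x, x0, t) * dse_loss(x0, x, t, s) for x in range(K))
               for x0 in range(K))

s_star = lambda x, y, t: m(y, t) / m(x, t)
s_bad = lambda x, y, t: 1.3 * s_star(x, y, t)  # a deliberately perturbed score
print(mdse(0, 0.8))                            # nonnegative
print(expected_loss(0.8, s_star) < expected_loss(0.8, s_bad))  # True
```

The comparison at the end illustrates first-order optimality: perturbing the true ratio strictly increases the expected loss.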

2. The I-MDSE Theorem: Information Decay and Score Entropy

The central result, the I-MDSE theorem, expresses the instantaneous rate of mutual information decay as the negative of the minimum denoising score entropy:

  • Pointwise: For every $t\geq 0$ and $x_0\in\mathcal{X}$,

$$\frac{d}{dt} \left[\mathrm{KL}\big(p_{t|0}(\cdot \mid x_0) \,\Vert\, p_t\big)\right] = -\operatorname{mdse}(x_0, t)$$

  • Marginal: Taking the expectation over $x_0 \sim p_0$ yields

$$\frac{d}{dt} I(x_0 ; x_t) = -\operatorname{mdse}(t)$$

These identities show that the minimum DSE loss captures the exact, instantaneous loss of information about $x_0$ under the forward diffusion at time $t$. Since $\operatorname{mdse}(t)\geq 0$, the mutual information $I(x_0; x_t)$ is monotonically decreasing in $t$, as expected for diffusion processes.
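The marginal identity can be checked numerically. Under the same illustrative uniform-CTMC assumptions as above (not specified by the text), the snippet compares a central finite difference of $I(x_0; x_t)$ against $-\operatorname{mdse}(t)$:

```python
import math

K, p0 = 4, [0.5, 0.2, 0.2, 0.1]   # illustrative alphabet and data distribution
q = lambda y, x0, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K          # p_t

def mutual_info(t):
    """I(x0; xt) = E_{x0~p0}[ KL(p_{t|0}(.|x0) || p_t) ]."""
    return sum(p0[x0] * sum(q(x, x0, t) * math.log(q(x, x0, t) / m(x, t))
                            for x in range(K))
               for x0 in range(K))

def mdse(t):
    """Marginal minimum DSE at the true score s* = p_t(y)/p_t(x)."""
    total = 0.0
    for x0 in range(K):
        for x in range(K):
            for y in range(K):
                if y == x:
                    continue
                r = q(y, x0, t) / q(x, x0, t)
                s = m(y, t) / m(x, t)
                total += p0[x0] * q(x, x0, t) * (1.0 / K) * (
                    s - r * math.log(s) + r * (math.log(r) - 1.0))
    return total

t, h = 0.7, 1e-4
fd = (mutual_info(t + h) - mutual_info(t - h)) / (2 * h)  # central-difference dI/dt
print(fd, -mdse(t))   # the two values agree: dI/dt = -mdse(t)
```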

3. Derivation and Mathematical Foundations

The derivation proceeds by analyzing the time derivative of $\mathrm{KL}(p_{t|0}(\cdot|x_0) \,\Vert\, p_t)$ for a CTMC using the chain rule for path-measure KL and Dynkin's formula:

$$\frac{d}{dt} \mathrm{KL}\big(p_{t|0}(\cdot|x_0) \,\Vert\, p_t\big) = \mathbb{E}_{x \sim p_{t|0}(\cdot \mid x_0)} \left[ \sum_{y \neq x} Q_t(x, y) \left( \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} \log \frac{p_t(y)\, p_{t|0}(x|x_0)}{p_t(x)\, p_{t|0}(y|x_0)} + \frac{p_{t|0}(y|x_0)}{p_{t|0}(x|x_0)} - \frac{p_t(y)}{p_t(x)} \right) \right]$$

Algebraic manipulation shows that the expectation on the right-hand side is exactly the negative of the DSE loss evaluated at the true ratio $s_t^*$, i.e., it equals $-\operatorname{mdse}(x_0, t)$; rearranging yields the I-MDSE identity.
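A quick numerical sanity check of this step, again under the illustrative uniform-CTMC assumptions: the expectation above (term by term, $-\ell_{DSE}$ at $s_t^*$) should match a finite difference of the pointwise KL:

```python
import math

K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # p_{t|0}(y|x0)
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K      # marginal p_t(y)

def kl(t):
    """KL(p_{t|0}(.|x0) || p_t)."""
    return sum(q(x, t) * math.log(q(x, t) / m(x, t)) for x in range(K))

def dynkin_rhs(t):
    """The expectation from the derivation; term by term it is -l_DSE at s*."""
    total = 0.0
    for x in range(K):
        for y in range(K):
            if y == x:
                continue
            r = q(y, t) / q(x, t)   # conditional ratio
            s = m(y, t) / m(x, t)   # true marginal-ratio score s*
            total += q(x, t) * (1.0 / K) * (r * math.log(s / r) + r - s)
    return total

t, h = 0.5, 1e-4
fd = (kl(t + h) - kl(t - h)) / (2 * h)   # central difference of the KL
print(fd, dynkin_rhs(t))                 # the two values agree
```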

4. Log-Likelihood Decomposition and Tightness

Integrating the I-MDSE identity from $t=0$ to some $T$, and using that the lower limit gives $\mathrm{KL}(p_{0|0}(\cdot|x_0) \,\Vert\, p_0) = \mathrm{KL}(\delta_{x_0} \,\Vert\, p_0) = -\log p_0(x_0)$, yields:

$$-\log p_0(x_0) = \int_0^T \operatorname{mdse}(x_0, t)\, dt + \mathrm{KL}\big(p_{T|0}(\cdot|x_0) \,\Vert\, p_T\big)$$

As $T\to\infty$ and $p_T \to \pi$, provided $\pi$ is independent of $x_0$, the terminal KL term vanishes or is computable. The decomposition is then

$$-\log p_0(x_0) = \int_0^\infty \operatorname{mdse}(x_0, t)\, dt$$

This result is exact: unlike typical variational bounds on the negative log-likelihood, no looseness remains when the optimal $s_t^*$ is used. In applied settings, a neural network $s_t^\theta$ approximates $s_t^*$, introducing only estimator error, not a variational gap.
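Under the same toy-model assumptions used above, the decomposition can be verified by numerically integrating $\operatorname{mdse}(x_0, t)$ over time and comparing against $-\log p_0(x_0)$:

```python
import math

K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K

def mdse(t):
    """Pointwise minimum DSE at the true score s* = p_t(y)/p_t(x)."""
    total = 0.0
    for x in range(K):
        for y in range(K):
            if y == x:
                continue
            r = q(y, t) / q(x, t)
            s = m(y, t) / m(x, t)
            total += q(x, t) * (1.0 / K) * (s - r * math.log(s)
                                            + r * (math.log(r) - 1.0))
    return total

# Trapezoidal rule on a log-spaced grid: mdse ~ log(1/t) near t = 0 (integrable),
# and it decays exponentially for large t, so truncating to [1e-6, ~31.6] is benign.
ts = [10 ** (-6 + 7.5 * i / 4000) for i in range(4001)]
integral = sum(0.5 * (mdse(ts[i]) + mdse(ts[i + 1])) * (ts[i + 1] - ts[i])
               for i in range(4000))
print(integral, -math.log(p0[x0]))   # both approximately 0.693
```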

5. Practical Estimation and Empirical Deployment

Estimation of $-\log p_0(x_0)$ in practice proceeds as follows:

  1. Time discretization: Choose a grid $0 = t_0 < t_1 < \dots < t_M = T$ (or extending toward $\infty$).
  2. Sampling: For each $t_k$, sample $x_{t_k} \sim p_{t_k|0}(\cdot|x_0)$ using the CTMC's known forward law (direct sampling, thinning, etc.).
  3. Loss evaluation: Compute $\ell_{DSE}(x_0, x_{t_k}, t_k; s_{t_k}^\theta)$.
  4. Integration: Form the Riemann sum $\sum_k \ell_{DSE}(x_0, x_{t_k}, t_k; s_{t_k}^\theta)\,(t_{k+1} - t_k)$. As $M\to\infty$ and the grid refines, this converges to the exact negative log-likelihood integral.
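The four steps can be sketched end to end. In this toy sketch the exact marginal ratio stands in for a trained network $s_t^\theta$ (so only Monte Carlo and discretization error remain); the CTMC, grid, and sample sizes are illustrative choices, not prescribed by the text:

```python
import math, random

random.seed(0)
K, p0, x0 = 4, [0.5, 0.2, 0.2, 0.1], 0   # illustrative toy model
q = lambda y, t: math.exp(-t) * (y == x0) + (1 - math.exp(-t)) / K  # forward law p_{t|0}
m = lambda y, t: math.exp(-t) * p0[y] + (1 - math.exp(-t)) / K      # marginal p_t

def score(y, x, t):
    """Stand-in for a trained network s_t^theta: here, the exact ratio p_t(y)/p_t(x)."""
    return m(y, t) / m(x, t)

def dse(x, t):
    """Pointwise DSE loss at noisy state x, evaluated with the stand-in score."""
    total = 0.0
    for y in range(K):
        if y == x:
            continue
        r = q(y, t) / q(x, t)
        total += (1.0 / K) * (score(y, x, t) - r * math.log(score(y, x, t))
                              + r * (math.log(r) - 1.0))
    return total

ts = [10 ** (-3 + 4.3 * i / 200) for i in range(201)]  # step 1: log grid, 1e-3 .. ~20
nll = 0.0
for i in range(200):
    t = ts[i]
    xs = random.choices(range(K), weights=[q(y, t) for y in range(K)], k=200)  # step 2
    avg = sum(dse(x, t) for x in xs) / len(xs)                                 # step 3
    nll += avg * (ts[i + 1] - ts[i])                                           # step 4
print(nll)   # close to -log p0(x0) = 0.693, up to MC and discretization error
```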

Empirical evaluations, including on synthetic DNA-alphabet CTMCs, demonstrate that this estimator closely matches ground-truth likelihoods. The score network trained via DSE simultaneously learns the required integrand for likelihood estimation (Jeon et al., 28 Oct 2025).

6. Underlying Assumptions and Theoretical Significance

The I-MDSE identity, as well as its log-likelihood decomposition, rely on several conditions:

  • The forward diffusion is a CTMC with smooth, time-dependent rates.
  • The DSE loss $\ell_{DSE}$ is minimized in expectation exactly at the true marginal ratio $s_t^*(x)_y = p_t(y)/p_t(x)$.
  • The forward process approaches a stationary distribution $\pi$ as $t\to\infty$, with $\pi$ independent of $x_0$, ensuring the terminal KL term vanishes or is trivial.

Provided these hold, I-MDSE is an equality, not a bound. This contrasts with standard variational inference, which incurs additional slack. The result thus theoretically validates score-matching for discrete diffusion as a tight likelihood estimator.

7. Extensions and Broader Context

I-MDSE is part of a broader information-theoretic framework for discrete diffusion, which also includes the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation for masked processes and conditional likelihoods. These developments allow time-free likelihood estimation, likelihood-ratio estimation via Monte Carlo coupling, and principled treatment of prompt-response or in-context prediction tasks (Jeon et al., 28 Oct 2025). The I-MDSE result is the discrete diffusion analogue of the classical I-MMSE identity in Gaussian settings, thus generalizing information-theoretic score-based learning beyond the continuous domain.
