Denoising Entropy: Methods & Applications

Updated 4 July 2026

Denoising entropy is a family of entropy-based measures used to characterize, control, and evaluate various denoising and reverse-diffusion processes.
Different formulations apply entropy at distinct stages, including internal control signals, reverse transition analysis, representation priors, and diagnostic evaluation maps.
Entropy metrics are used to guide denoising policies, improve sampling efficiency, and assess restoration quality.

Denoising entropy denotes a family of entropy-based quantities used to characterize, control, or evaluate denoising and reverse-diffusion processes. In current research, it does not name a single canonical observable. Instead, it refers to several distinct constructions: attention entropy inside a denoising trajectory, predictive entropy of a noise-aware classifier, conditional entropy of reverse transitions, wavelet-domain entropy of detail coefficients, entropy-coded latent complexity in compression-based restoration, and entropy-derived diagnostics such as directional anisotropy or entropy maps (Li et al., 6 Feb 2026, Li et al., 2022, Li et al., 30 Sep 2025, Rhee et al., 18 Jun 2026, Nguyen et al., 12 Feb 2026, Gabarda et al., 2011). This plurality is not merely terminological. Different formulations place entropy at different loci of the denoising pipeline: as an internal state variable, an optimization target, a prior on representational complexity, or a no-reference quality indicator.

1. Conceptual scope and principal formulations

The literature uses denoising entropy in at least four technically distinct senses. Some works measure entropy on model-internal distributions during iterative denoising, such as cross-attention over prompt tokens or classifier posteriors over classes. Others treat denoising as conditional-entropy reduction between adjacent reverse-time states. A third line places entropy in the representation itself, typically through wavelet-domain statistics or entropy-coded latent variables. A fourth uses entropy-derived maps or directional entropy to assess denoising quality after reconstruction (Li et al., 6 Feb 2026, Li et al., 30 Sep 2025, Rhee et al., 18 Jun 2026, Gabarda et al., 2011).

Entropy object	Operational role	Representative papers
Attention or predictive entropy during denoising	Online control of guidance, rollout allocation, or keyframe selection	(Li et al., 6 Feb 2026, Li et al., 2022, Chen et al., 29 Jun 2026)
Conditional entropy or KL between adjacent states	Reverse-process analysis and sampler design	(Li et al., 30 Sep 2025, Kim et al., 2024, Zhang et al., 1 Jan 2026)
Wavelet or latent-code entropy	Regularization and low-complexity priors	(Rhee et al., 18 Jun 2026, Nguyen et al., 12 Feb 2026)
Entropy-derived evaluation maps	No-reference assessment of residual noise and blur	(Gabarda et al., 2011, Boriskov et al., 15 Nov 2025)

This heterogeneity implies that “entropy” in denoising is best treated as a structural descriptor of uncertainty or complexity rather than as a fixed formula. In some papers it is an explicit Shannon- or KL-type quantity; in others it is a proxy derived from eigenspectra, code lengths, or local irregularity. This suggests that the unifying theme is not the specific entropy functional, but the use of entropy-like quantities to separate structured signal from nuisance randomness.

2. Entropy as an internal control signal in iterative denoising

A prominent recent use of denoising entropy is to monitor the internal state of a denoiser while sampling from it or fine-tuning it with reinforcement learning. In "AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models" (Li et al., 6 Feb 2026), attention entropy is computed from cross-attention maps between image features and text tokens. For timestep $t$, the paper defines the local signal

$\mathrm{Entropy}(t) = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Entropy}_t[q_i],$

where each $\mathrm{Entropy}_t[q_i]$ is the Shannon entropy of the normalized attention distribution over text tokens for a fixed image feature. The paper then defines a policy-relative quantity

$\Delta\mathrm{Entropy} = \frac{1}{T}\sum_{t=1}^{T}\left|\mathrm{Entropy}_\theta(t)-\mathrm{Entropy}_{\mathrm{base}}(t)\right|,$

used as a sample-level proxy for learning value. The two signals are operationally separated: absolute $\mathrm{Entropy}(t)$ identifies critical denoising moments, while $\Delta\mathrm{Entropy}$ measures deviation from the base policy and allocates more rollout budget to prompts with larger policy-attention shifts. The reported peak distribution is U-shaped or bimodal, with one cluster at very early steps and another at late steps, so valuable branching moments are not uniformly distributed across the trajectory. In all experiments, local exploration uses the top-$K$ entropy peaks with $K=4$, while global allocation uses $r_{\text{low}}=8$ and $r_{\text{high}}=16$ after a 20-step warmup. On text-to-image alignment, the paper reports that entropy-guided branching improves Reward Std, LPIPS MPD, and TCE over fixed schedules, raises the HPS-v2.1 score of BranchGRPO on FLUX.1-dev, and yields 2× faster convergence on DanceGRPO and 5× faster convergence on DiffusionNFT, at the cost of an 11.1% per-step time increase and about 1 GB of additional VRAM (Li et al., 6 Feb 2026).
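As a rough illustration of the two signals, the sketch below computes a per-timestep attention entropy, the policy-relative gap, and the top-$K$ timesteps in plain Python. It is not the AEGPO implementation: the function names are hypothetical, attention rows are assumed to be already normalized over text tokens, and "peaks" are taken simply as the $K$ largest values of the entropy curve.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one normalized distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_entropy(attn_rows):
    """Mean entropy over the image-feature queries at one timestep.

    Each row is assumed to be a normalized attention distribution
    over text tokens for a single image feature.
    """
    return sum(shannon_entropy(row) for row in attn_rows) / len(attn_rows)


def delta_entropy(policy_curve, base_curve):
    """Mean absolute gap between two per-timestep entropy curves."""
    gaps = [abs(p - b) for p, b in zip(policy_curve, base_curve)]
    return sum(gaps) / len(gaps)


def top_k_steps(curve, k=4):
    """Indices of the k largest entropy values, returned in timestep order."""
    ranked = sorted(range(len(curve)), key=lambda t: curve[t], reverse=True)
    return sorted(ranked[:k])
```

A curve with large values at both ends, as in the U-shaped distribution described above, would return indices clustered at the start and end of the trajectory.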

In classifier-guided diffusion, entropy appears as predictive uncertainty rather than attention dispersion. "Entropy-driven Sampling and Training Scheme for Conditional Diffusion Generation" (Li et al., 2022) defines

$H\big(p_\phi(y \mid x_t, t)\big) = -\sum_{y} p_\phi(y \mid x_t, t)\,\log p_\phi(y \mid x_t, t),$

the entropy of the noise-aware classifier’s class distribution at denoising step $t$. The paper argues that classifier guidance often vanishes early because the classifier becomes overconfident before the image is fully denoised. Entropy is therefore used as a denoising-time indicator of semantic uncertainty. Sampling replaces a fixed guidance scale with an entropy-dependent scale, so that low predictive entropy increases the classifier-guidance magnitude. Training adds an entropy regularization term to the loss. On ImageNet1000, the paper reports FID improvements for both CADM-G and UADM-G under DDPM 250-step sampling (Li et al., 2022).
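The predictive-entropy signal can be sketched as follows. The entropy function is the standard Shannon entropy of a softmax posterior. The scaling rule `entropy_aware_scale` is a hypothetical stand-in, chosen only so that lower entropy gives a larger scale, and is not claimed to be the paper's formula. All names and the `base_scale` and `eps` parameters are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (nats) of the classifier's class posterior."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_aware_scale(logits, base_scale=1.0, eps=1e-8):
    """Hypothetical rule: ratio of the maximum entropy log(C) to the
    current entropy, so an overconfident classifier gets a larger scale."""
    max_entropy = math.log(len(logits))
    return base_scale * max_entropy / (predictive_entropy(logits) + eps)
```

With a uniform posterior the scale stays close to `base_scale`; as the posterior sharpens, the scale grows, which is the qualitative behaviour the paragraph above describes.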

A related compute-allocation use appears in "EcoVideo: Entropy-Orchestrated Video Generation Paradigm in Cloud-Edge Dynamics" (Chen et al., 29 Jun 2026). There, early self-attention entropy is computed over an initial fraction of the denoising steps, aggregated to frame-level scores by mean pooling over tokens, and stabilized by an exponential moving average (EMA). High-entropy frames are treated as information-dense keyframes that deserve cloud-side denoising, while low-entropy frames are reconstructed on the edge by interpolation. The keyframe selection rule thus turns entropy into a frame-wise denoising budget allocator. The paper reports that removing entropy-based keyframe selection lowers the VBench score, while the full method achieves a 1.84× end-to-end speedup on Wan2.1 and up to 2.9× in low-bandwidth, compute-limited settings (Chen et al., 29 Jun 2026).
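The pooling, smoothing, and selection steps can be sketched in a few lines. This is an illustration under stated assumptions rather than the EcoVideo rule: the function names, the `decay` value, and the choice of a fixed top-`budget` selection are all hypothetical.

```python
def frame_scores(token_entropy):
    """Mean-pool per-token entropies into one score per frame.

    token_entropy[f] is the list of token entropies for frame f.
    """
    return [sum(tokens) / len(tokens) for tokens in token_entropy]


def ema_update(previous, current, decay=0.9):
    """Exponential moving average of frame scores across denoising steps."""
    if previous is None:
        return list(current)
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]


def select_keyframes(scores, budget):
    """Hypothetical rule: keep the `budget` highest-entropy frames."""
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:budget])
```

In use, `ema_update` would be called once per early denoising step, starting from `previous=None`, and `select_keyframes` would be applied to the final smoothed scores.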

3. Entropy of reverse transitions and denoising difficulty

Another major formulation treats denoising as the progressive reduction of uncertainty in reverse transitions. "EVODiff: Entropy-aware Variance Optimized Diffusion Inference" (Li et al., 30 Sep 2025) formalizes this through the conditional entropy of a reverse transition between adjacent states. Under the paper’s Gaussian approximation, minimizing conditional variance directly reduces conditional entropy. This gives an information-theoretic interpretation of diffusion inference: successful reverse denoising should shrink the conditional spread of plausible predecessor states. The paper further states that data prediction parameterization reduces reconstruction errors more effectively than noise prediction and, under independence assumptions, also reduces conditional entropy. EVODiff then optimizes stepwise variance-balancing coefficients in a multistep data-prediction sampler. On CIFAR-10, it improves FID at 10 NFE relative to DPM-Solver++; on ImageNet-256, it reaches comparable high-quality sampling at 15 NFE where DPM-Solver++ needs 20 NFE (Li et al., 30 Sep 2025).
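The link between conditional variance and conditional entropy is easy to check in the isotropic Gaussian case, where the differential entropy of $\mathcal{N}(\mu, \sigma^2 I_d)$ is $\frac{d}{2}\log(2\pi e\,\sigma^2)$. The sketch below is a generic illustration of that standard formula, not EVODiff's sampler; the function name is an assumption.

```python
import math


def gaussian_entropy(variance, dim=1):
    """Differential entropy (nats) of an isotropic Gaussian N(mu, variance * I)."""
    return 0.5 * dim * math.log(2.0 * math.pi * math.e * variance)
```

Halving the variance of a $d$-dimensional Gaussian transition lowers its entropy by $\frac{d}{2}\log 2$ nats, independent of the mean, which is why variance reduction and entropy reduction coincide under a Gaussian approximation.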

A complementary analysis appears in "Denoising Task Difficulty-based Curriculum for Training Diffusion Models" (Kim et al., 2024). That paper studies the KL divergence between consecutive forward-process marginals as a distribution-level measure of denoising difficulty. The empirical finding is that this relative entropy decreases as $t$ increases, so under the paper’s timestep convention smaller $t$ corresponds to harder denoising tasks. This aligns with slower convergence at small $t$ and motivates an easy-to-hard curriculum over timestep clusters. The reported gains include FID improvements on FFHQ and ImageNet (Kim et al., 2024).

In reinforcement learning for flow models, "E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models" (Zhang et al., 1 Jan 2026) defines denoising entropy as the differential entropy of a Gaussian reverse SDE transition. The paper argues that high-entropy steps enable more efficient and effective exploration, while low-entropy steps produce nearly indistinguishable rollouts. Consecutive low-entropy steps are therefore merged into one larger stochastic step, and group-normalized advantages are computed only within samples sharing the same consolidated SDE step. The strongest ablation compares training on the first 8 denoising steps with training on the second 8 steps and on all 16 steps; the paper reads the resulting HPS scores as support for the claim that effective learning is concentrated in high-entropy denoising stages (Zhang et al., 1 Jan 2026).
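The consolidation of low-entropy steps can be pictured with a small grouping routine. The sketch assumes a per-step entropy list and a fixed threshold, both hypothetical; the paper's actual merging criterion and the way a merged step is simulated are not reproduced here.

```python
def merge_low_entropy_steps(entropies, threshold):
    """Group step indices for a consolidated schedule.

    Each step at or above the threshold stays in its own group; each run of
    consecutive steps below the threshold collapses into a single group.
    """
    groups, run = [], []
    for step, value in enumerate(entropies):
        if value < threshold:
            run.append(step)
        else:
            if run:
                groups.append(run)
                run = []
            groups.append([step])
    if run:
        groups.append(run)
    return groups
```

For example, with entropies `[3.0, 2.5, 0.4, 0.3, 0.2, 1.5, 0.1]` and a threshold of `1.0`, steps 2 to 4 form one merged group while every other step remains separate.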

Taken together, these works indicate a common interpretation: entropy is a measure of how much uncertainty or exploration remains in a reverse step, and high-entropy stages are disproportionately important for both solver design and policy optimization.

4. Entropy as a denoising prior in representation space

A different tradition places entropy not on model-internal trajectories but on the representation being denoised. "TIDY: Thermal Infrared Image Denoising via Wavelet Domain Entropy and Directional Stripe Index" (Rhee et al., 18 Jun 2026) moves denoising into the wavelet domain and defines a Wavelet Entropy (WE). This entropy is computed over the distribution of wavelet-magnitude mass across the three directional detail subbands at each scale, not over pixel intensities. The paper’s motivation is that pixel-domain entropy conflates noise randomness with natural image intensity variations, whereas wavelet detail coefficients attenuate structural content and make entropy more selective for stochastic thermal noise. Because stripe-like fixed-pattern noise is not adequately captured by entropy, TIDY adds a separate Wavelet Directional Stripe Index (WDSI) and trains with a combined, weighted loss. The paper reports improved PSNR/SSIM on IRE, while the full DWT + FiLM + WE + WDSI model gives the best SCaN-TIR result. The final model runs at about 34 Hz (Rhee et al., 18 Jun 2026).
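One plausible reading of "entropy over wavelet-magnitude mass across the three detail subbands" is sketched below for a single Haar level. This is an assumption-laden illustration, not TIDY's definition: the Haar filter, the use of absolute values, the single scale, and the function names are all choices made here for brevity.

```python
import math


def haar_detail_subbands(image):
    """One-level 2D Haar detail coefficients of an even-sized 2D list.

    Returns three subbands built from each 2x2 block: row differences,
    column differences, and diagonal differences.
    """
    row_diff, col_diff, diag_diff = [], [], []
    for r in range(0, len(image), 2):
        out_r, out_c, out_d = [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            out_r.append((a + b - d - e) / 2.0)  # top row minus bottom row
            out_c.append((a - b + d - e) / 2.0)  # left column minus right column
            out_d.append((a - b - d + e) / 2.0)  # diagonal contrast
        row_diff.append(out_r)
        col_diff.append(out_c)
        diag_diff.append(out_d)
    return row_diff, col_diff, diag_diff


def wavelet_entropy(image, eps=1e-12):
    """Shannon entropy of the magnitude mass split across the three subbands."""
    masses = [sum(abs(v) for row in band for v in row)
              for band in haar_detail_subbands(image)]
    total = sum(masses) + eps
    probs = [m / total for m in masses]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

In this sketch, an image made of pure horizontal stripes puts all of its detail mass in one subband and scores an entropy near zero, whereas detail spread evenly over the three subbands scores close to $\log 3$.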

"Perception-based Image Denoising via Generative Compression" (Nguyen et al., 12 Feb 2026) makes entropy the core denoising prior through entropy-coded latent representations. A lossy code induces a codebook, and under additive Gaussian noise the compression-based ML denoiser selects the codeword closest to the noisy observation in Euclidean distance. The paper interprets this as denoising by projection onto a compressible signal class. In the conditional WGAN-based instantiation, a latent code cost enters the training objective, and a corresponding objective is given for the diffusion-based instantiation. Here low code length is the denoising prior: structured image content is compressible, while nuisance noise is high-complexity and costly to represent. The paper also establishes a non-asymptotic AWGN bound that holds with high probability and makes the denoising error depend explicitly on compression distortion, coding rate, and noise strength (Nguyen et al., 12 Feb 2026).
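The projection view can be made concrete with a toy codebook. Under additive white Gaussian noise, maximizing the likelihood over a finite set of candidate signals is equivalent to picking the candidate with the smallest squared Euclidean distance to the observation. The sketch below shows only that step; the function name and the tiny explicit codebook are hypothetical, and a learned compressor would not enumerate codewords in this way.

```python
def nearest_codeword(noisy, codebook):
    """Return the codeword with the smallest squared Euclidean distance to `noisy`."""
    def sq_dist(codeword):
        return sum((y - v) ** 2 for y, v in zip(noisy, codeword))
    return min(codebook, key=sq_dist)
```

For the observation `[0.9, 0.2, 1.1]` and the codebook `[[0, 0, 0], [1, 1, 1], [1, 0, 1]]`, the squared distances are 2.06, 0.66, and 0.06, so the denoised output is `[1, 0, 1]`.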

A cross-modal extension appears in "TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding" (Sobhani et al., 6 Jun 2026). There, denoising-trained encoder outputs are filtered to retain salient context vectors and then compressed with LZMA; at 20% Kizuki retention the reported compression ratio is 5.39×, while denoising training sharply improves BLEU, BERTScore, and perplexity relative to the no-denoising variant. This suggests that low-entropy bottlenecks can function as denoising priors beyond image restoration, although the paper does not formalize a general rate–distortion theorem for text (Sobhani et al., 6 Jun 2026).

5. Entropy for denoising assessment and diagnostics

"Image denoising assessment using anisotropic stack filtering" (Gabarda et al., 2011) defines a local directional Rényi entropy on thresholded binary stack levels and then measures anisotropy as the variation of directional entropy across orientations. The central claim is that meaningful image structure is anisotropic, while random noise is more isotropic. Therefore more noise implies less anisotropy, and better denoising implies larger anisotropy. On a real SAR image, the reported anisotropy values for the noisy input and the Kuan filter bracket those of the Frost, SRAD, and relaxed median filters. The metric is thus proposed as a no-reference indicator of denoising quality (Gabarda et al., 2011).

"Recursive Threshold Median Filter and Autoencoder for Salt-and-Pepper Denoising: SSIM analysis of Images and Entropy Maps" (Boriskov et al., 15 Nov 2025) introduces an entropy-domain complement to image-domain SSIM. Entropy maps are computed with 2D Sample Entropy in sliding windows, and denoising quality is assessed by SSIM between restored and clean entropy maps, denoted SSIMMap. The paper’s key claim is that SSIMMap is more sensitive to blur and local intensity transitions than SSIMImg. This is especially clear in the low-resolution mushroom-edge example under salt-and-pepper noise, where moving from a single recursive median filter to the 2MF scheme is reflected differently in SSIMImg and SSIMMap. On the Lena image, the MFs-AE scheme gives the best reported SSIMImg and SSIMMap values. In this line of work, entropy is not the denoiser; it is the measurement domain in which blur, edge loss, and over-smoothing become easier to quantify (Boriskov et al., 15 Nov 2025).

These assessment-oriented papers make a different but important point. They imply that image-domain fidelity and entropy-domain fidelity are not equivalent. A restoration can look acceptable under grayscale SSIM while failing to preserve the local irregularity patterns that encode edges, textures, or anisotropic structure.

6. Terminological divergences, misconceptions, and theoretical extensions

One recurrent misconception is that every entropy-themed denoising paper directly optimizes Shannon entropy. That is not the case. "Noise Reversal by Entropy Quantum Computing" (Huang et al., 12 Feb 2025) uses “entropy quantum computing” to denote an open quantum/photonic optimization paradigm, but the paper explicitly does not derive a Shannon entropy, von Neumann entropy, KL divergence, maximum-entropy estimator, or explicit entropy functional for denoising. Its denoising formulation is instead a constrained combinatorial optimization over noise allocations, with a spatial-correlation cost on the residual (Huang et al., 12 Feb 2025).

A second ambiguity concerns cross-entropy losses. "On denoising autoencoders trained to minimise binary cross-entropy" (Creswell et al., 2017) is not a paper about Shannon entropy of the data distribution, but about BCE as a reconstruction objective. Its main theorem shows that under additive Gaussian corruption the optimal BCE-trained denoising autoencoder satisfies the same small-noise asymptotic relation as the MSE-trained case,

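$r^{*}(x) - x = \sigma^{2}\,\nabla_{x}\log p(x) + o(\sigma^{2}) \quad \text{as } \sigma \to 0,$

where $r^{*}$ denotes the optimal reconstruction function, $\sigma^{2}$ the corruption variance, and $p$ the data density, so reconstruction minus input still points toward higher-density regions of data space (Creswell et al., 2017). Relatedly, "AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation" (Lim et al., 2020) uses denoising to estimate the score $\nabla_{x}\log q_\theta(x)$, which is then inserted into the pathwise entropy-gradient identity

$\nabla_{\theta} H(q_\theta) = -\,\mathbb{E}_{z}\!\left[\left(\nabla_{x}\log q_\theta(x)\big|_{x=g_\theta(z)}\right)^{\top}\nabla_{\theta}\, g_\theta(z)\right],$

where $x = g_\theta(z)$ is a reparameterized sample from $q_\theta$. There, denoising is a route to entropy-gradient estimation, not a direct entropy-minimization denoiser (Lim et al., 2020).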
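The small-noise relation for denoising autoencoders can be checked in closed form for one-dimensional Gaussian data. The sketch uses the MSE-optimal denoiser (the posterior mean), not a BCE-trained network, and its function names and parameter values are illustrative assumptions. For clean data distributed as $\mathcal{N}(0, s^2)$ and noise variance $\sigma^2$, the residual equals $\sigma^2$ times the score of the noisy marginal exactly, and approaches $\sigma^2$ times the data score as $\sigma \to 0$.

```python
def optimal_denoiser(x, data_var, noise_var):
    """Posterior mean E[clean | noisy = x] for clean ~ N(0, data_var)
    and noisy = clean + N(0, noise_var)."""
    return x * data_var / (data_var + noise_var)


def gaussian_score(x, var):
    """Gradient of log N(x; 0, var) with respect to x."""
    return -x / var
```

For a positive input the residual `optimal_denoiser(x, ...) - x` is negative, meaning the reconstruction moves toward the mode at zero, which is the higher-density direction.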

At the most theoretical end, "A Free Probabilistic Framework for Denoising Diffusion Models: Entropy, Transport, and Reverse Processes" (Das, 26 Oct 2025) lifts the denoising-entropy relation into free probability. The forward free Ornstein–Uhlenbeck process increases Voiculescu free entropy according to the free de Bruijn identity, while the reverse-time free SDE is driven by the conjugate variable. This replaces Gaussian noising by semicircular noising, Shannon entropy by free entropy, and the classical score by the conjugate variable (Das, 26 Oct 2025).

The breadth of these formulations suggests that denoising entropy is best understood as a family of entropy-mediated priors, control signals, and diagnostics rather than as a single invariant quantity. What unifies the field is the repeated use of entropy-like measures to discriminate structured, recoverable signal from randomness, ambiguity, or redundant computation. What remains variable is where that distinction is imposed: on trajectories, transitions, latents, wavelet subbands, or evaluation maps.