Denoising-Based Generation Strategy

Updated 30 June 2025
  • Denoising-based generation strategy is a paradigm that transforms random noise into structured outputs through iterative, learned denoising operations.
  • It employs methods like score-based modeling and infusion training to guide samples toward the true data manifold via progressive refinement.
  • This strategy has proven effective across modalities such as images and text, offering enhanced sample quality and stability compared to traditional GANs.

Denoising-based generation strategy refers to a family of approaches in generative modeling where the production of high-quality data samples is accomplished through a process of iterative or learned denoising. In such frameworks, generative models learn to map samples from initial random noise—typically drawn from a tractable distribution—progressively toward samples on or near the data manifold, effectively "denoising" noise into structured, realistic outputs. This paradigm is foundational to several major advancements in deep generative modeling, including diffusion models, certain Markov chain Monte Carlo methods, denoising autoencoders, and a variety of score-based models. Denoising-based generation has found considerable success across modalities—including images, text, 3D geometry, and more—and is central to contemporary state-of-the-art systems for sample generation, restoration, and data-driven synthesis.

1. Theoretical Foundations and Methodological Principles

Denoising-based generation strategies arise from the observation that many high-dimensional data distributions can be effectively traversed or explored by repeatedly mitigating the effects of injected noise. The central methodological principle is to transform an initial sample (often unstructured noise) into a data-like sample using a learned denoising map or a sequence of such maps.

Key theoretical underpinnings include:

  • Markov chains in data space, implemented via learned transition operators (1703.06975).
  • Score-based modeling, where the gradient of the log-density (the score function) is approximated and used to iteratively shift samples toward regions of higher data likelihood (a generic sketch of this iterative update follows this list).
  • Progressive refinement—each denoising operation brings the sample distribution closer to that of the data, often formalized in terms of projections or optimization dynamics.
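
To illustrate the score-based bullet above, the following is a minimal sketch of unadjusted Langevin dynamics, assuming a score estimate `score_fn(x)` approximating the gradient of the log-density is available. This is a generic illustration of score-guided iterative refinement, not the infusion-training method discussed below; names and step sizes are placeholders.

```python
import numpy as np

def langevin_sample(score_fn, dim, steps=1000, eps=1e-3, rng=None):
    """Unadjusted Langevin dynamics: nudge a noise sample along a learned score estimate."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(dim)                                  # start from tractable noise
    for _ in range(steps):
        noise = rng.standard_normal(dim)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * noise    # score step plus injected noise
    return x
```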

A representative formulation, as seen in "Learning to Generate Samples from Noise through Infusion Training" (1703.06975), involves learning a stochastic transition operator $p^{(t)}(\mathbf{z}^{(t)} \mid \mathbf{z}^{(t-1)})$ so that, when applied to an initial noise sample $\mathbf{z}^{(0)} \sim p^{(0)}$, repeated application yields a final sample $\mathbf{z}^{(T)}$ matching the data distribution.
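
To make the generation loop concrete, the following is a minimal sketch (not the paper's reference code) of how such a chain is unrolled at sampling time. Here `transition` is a hypothetical callable standing in for the trained operator, returning the mean and standard deviation of a diagonal Gaussian $p^{(t)}(\cdot \mid \mathbf{z}^{(t-1)})$.

```python
import numpy as np

def sample_chain(transition, dim, T=15, rng=None):
    """Draw one sample by repeatedly applying the learned denoising operator."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(dim)                       # z^(0) ~ p^(0): tractable noise prior
    for t in range(1, T + 1):
        mu, sigma = transition(z, t)                   # parameters of p^(t)( . | z^(t-1))
        z = mu + sigma * rng.standard_normal(dim)      # one stochastic denoising step
    return z                                           # z^(T), intended to lie near the data manifold
```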

2. Markov Chain Transition Operators and Progressive Denoising

In the Markovian denoising-based generation framework, the generative process is defined by the sequence

$$\mathbf{z}^{(0)} \sim p^{(0)}(\mathbf{z}^{(0)}), \qquad \mathbf{z}^{(t)} \sim p^{(t)}(\mathbf{z}^{(t)} \mid \mathbf{z}^{(t-1)}), \quad t = 1, \ldots, T,$$

where each $p^{(t)}$ is typically a simple factorial distribution (e.g., a diagonal Gaussian).

The innovation in infusion training (1703.06975) is a training procedure in which, instead of training the denoising operator solely against synthetic corruptions, the Markov chain is "infused" at each step with partial information from the target data point. The training distribution at each chain step becomes

$$q_i^{(t)}(\tilde{z}_i^{(t)} \mid \tilde{\mathbf{z}}^{(t-1)}, \mathbf{x}) = (1 - \alpha^{(t)})\, p_i^{(t)}(\tilde{z}_i^{(t)} \mid \tilde{\mathbf{z}}^{(t-1)}) + \alpha^{(t)}\, \delta_{\mathbf{x}_i}(\tilde{z}_i^{(t)}),$$

with $\alpha^{(t)}$ the "infusion rate." This "cheating" enables the transition operator to learn to denoise from partially informed states and is central to making the Markov chain learnable.
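
As a rough illustration (function and variable names are hypothetical, not the paper's code), the infusion step amounts to a per-coordinate mixture: with probability $\alpha^{(t)}$ a coordinate is copied from the target $\mathbf{x}$, otherwise the model's proposed coordinate is kept.

```python
import numpy as np

def infuse(z_model, x_target, alpha_t, rng=None):
    """Coordinate-wise mixture q^(t): keep the model proposal, or copy the target with prob alpha_t."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x_target.shape) < alpha_t        # True -> take the data coordinate (delta term)
    return np.where(mask, x_target, z_model)           # otherwise keep the coordinate sampled from p^(t)
```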

Importantly, during generation (sampling), no target is available, and only the learned transition operator $p^{(t)}$ is applied repeatedly to pure noise.

3. Learning and Likelihood Estimation

The training objective for the transition operator is to maximize the log-probability of the data point given the partially infused representation at each step:

$$\theta^{(t)} \leftarrow \theta^{(t)} + \eta^{(t)} \frac{\partial}{\partial \theta^{(t)}} \log p^{(t)}(\mathbf{x} \mid \tilde{\mathbf{z}}^{(t-1)}; \theta^{(t)})$$

This can be interpreted as progressive denoising: given a state that is part noise, part data, predict an even cleaner version.
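
A minimal per-step training sketch follows, assuming PyTorch and a hypothetical network `net_t` that maps the infused state to the mean and log-variance of a diagonal Gaussian over $\mathbf{x}$. It performs gradient ascent on the denoising log-likelihood by minimizing its negative; this is an illustrative sketch, not the authors' reference implementation.

```python
import torch

def infusion_train_step(net_t, optimizer, z_prev, x):
    """One update of theta^(t): maximize log p^(t)(x | z~^(t-1)) by minimizing its negative."""
    mean, log_var = net_t(z_prev)                               # parameters of p^(t)(x | z~^(t-1))
    dist = torch.distributions.Normal(mean, torch.exp(0.5 * log_var))
    loss = -dist.log_prob(x).sum(dim=-1).mean()                 # negative log-likelihood over coordinates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```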

For evaluation, the exact likelihood of a data point is intractable but can be estimated via:

  • Importance sampling:

$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \left[ \frac{p(\tilde{\mathbf{z}}, \mathbf{x})}{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \right]$$

  • Variational lower bound:

$$\log p(\mathbf{x}) \geq \mathbb{E}_{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \left[ \log p(\tilde{\mathbf{z}}, \mathbf{x}) - \log q(\tilde{\mathbf{z}} \mid \mathbf{x}) \right]$$

These estimates reveal that the method is quantitatively competitive or superior to contemporary GANs and prior diffusion models (1703.06975).
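
In code, the importance-sampling estimate reduces to a numerically stable log-mean-exp over per-chain log-weights. The sketch below assumes a hypothetical helper `sample_posterior_chain(x)` that draws one chain from $q(\tilde{\mathbf{z}} \mid \mathbf{x})$ and returns the two log-densities needed for the weight.

```python
import numpy as np

def importance_sampled_log_likelihood(sample_posterior_chain, x, K=100):
    """Estimate log p(x) ~= log( (1/K) * sum_k p(z~_k, x) / q(z~_k | x) )."""
    log_w = []
    for _ in range(K):
        log_p_joint, log_q = sample_posterior_chain(x)   # one chain z~_k drawn from q( . | x)
        log_w.append(log_p_joint - log_q)                # log importance weight
    log_w = np.array(log_w)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))        # numerically stable log-mean-exp
```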

4. Experimental Performance and Empirical Properties

Empirical results (1703.06975) demonstrate the efficacy of denoising-based generation approaches:

  • Datasets: MNIST, TFD, CIFAR-10, and CelebA.
  • Qualitative sample quality: The Markov chain produces images that transition from unstructured noise to data-like samples with high diversity and sharpness (see Figures 1, 2, and 5). The progression is visually interpretable as a sequence of denoising operations.
  • Inpainting: The learned operator can perform structured completion, e.g., filling the missing bottom half of a face given the top half (Figure 7).
  • Quantitative metrics: On MNIST, infusion training achieves Parzen window log-likelihood estimates of $312 \pm 1.7$ nats (vs. $225 \pm 2$ for GANs and $220 \pm 1.9$ for Sohl-Dickstein-style diffusion) and an importance-sampling estimate of $1836.27 \pm 0.551$, surpassing GAN and VAE baselines; the Parzen-window metric itself is sketched after the table below. The Inception Score on CIFAR-10 (4.62) also improves over unsupervised GANs (4.36).

The model is trained with a single network (no alternating adversary), providing greater stability compared to GANs, and requires only a modest number of denoising steps ($T \sim 15$) compared to the thousands used in score-based diffusion.

Model | Parzen estimate (nats, MNIST)
GAN | $225 \pm 2$
Diffusion (Sohl-Dickstein) | $220 \pm 1.9$
Infusion training (1703.06975) | $\mathbf{312 \pm 1.7}$
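
For reference, the Parzen-window metric reported above can be sketched as a Gaussian kernel density fit on generated samples and evaluated on held-out test points. The bandwidth `sigma` would normally be selected on a validation split; the default here is only a placeholder.

```python
import numpy as np

def parzen_log_likelihood(generated, test, sigma=0.2):
    """Mean log-density of `test` under a Gaussian KDE centered on `generated` samples."""
    n, d = generated.shape
    # Pairwise squared distances between test points and generated centers.
    diff = test[:, None, :] - generated[None, :, :]
    sq = np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2)
    # log p(x) = logsumexp_i(-sq_i) - log n - (d/2) log(2 pi sigma^2), per test point.
    m = (-sq).max(axis=1, keepdims=True)
    log_kde = m.squeeze(1) + np.log(np.exp(-sq - m).sum(axis=1))
    log_p = log_kde - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return log_p.mean()
```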

5. Strengths, Limitations, and Comparisons

Advantages

  • Stability: Only one network to train; no adversarial dynamics.
  • Efficiency: Rapid convergence in a small number of Markov steps.
  • Direct data-space mapping: Progressive denoising operates in the observed data space rather than a latent code (no VAE-style bottleneck).
  • Quality and variety: Outperforms GANs and some diffusion models on metrics for both sample quality and diversity.

Limitations

  • Heuristic training target: The infusion objective is not derived from a rigorous maximum-likelihood principle, though variational and importance-sampling bounds can still be used for evaluation.
  • No explicit latent space: All generation is performed in data space, possibly reducing interpretability/manipulation.
  • Likelihood intractability: Requires stochastic estimation rather than exact computation.
  • Hyperparameter sensitivity: Quality is sensitive to the infusion rate $\alpha^{(t)}$, its schedule, and the chain length.
  • Training trade-offs: Too few steps or extreme infusion rates may degrade performance.

These strengths and limitations should be considered in context with other generative schemes, such as GANs (which require careful adversarial balancing) or Sohl-Dickstein diffusion (which is slow to converge but theoretically rigorous).

6. Extensions and Generalizations

The denoising-based generation paradigm has since informed the development of several classes of models:

  • Score-based diffusion models, which extend the iterative denoising notion with formal continuous-time stochastic differential equations and explicit score matching.
  • Denoising autoencoders for language and structured generation, leveraging corruption schemes tailored to new modalities (1804.07899, 1908.08206).
  • Denoising as projection onto data manifolds, with theoretical links to optimization landscapes and Gaussian projection.
  • Task-specific extensions, enabling applications to inpainting, data completion, and modalities beyond vision, as the Markovian denoising process can be customized to incorporate structured constraints or side-information.

7. Key Mathematical Formulations

The denoising-based generation strategy is formally characterized by several foundational equations:

$$q_i^{(t)}(\tilde{z}_i^{(t)} \mid \tilde{\mathbf{z}}^{(t-1)}, \mathbf{x}) = (1 - \alpha^{(t)})\, p_i^{(t)}(\tilde{z}_i^{(t)} \mid \tilde{\mathbf{z}}^{(t-1)}) + \alpha^{(t)}\, \delta_{\mathbf{x}_i}(\tilde{z}_i^{(t)})$$

$$\theta^{(t)} \leftarrow \theta^{(t)} + \eta^{(t)} \frac{\partial}{\partial \theta^{(t)}} \log p^{(t)}(\mathbf{x} \mid \tilde{\mathbf{z}}^{(t-1)}; \theta^{(t)})$$

$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \left[ \frac{p(\tilde{\mathbf{z}}, \mathbf{x})}{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \right]$$

$$\log p(\mathbf{x}) \geq \mathbb{E}_{q(\tilde{\mathbf{z}} \mid \mathbf{x})} \left[ \log p(\tilde{\mathbf{z}}, \mathbf{x}) - \log q(\tilde{\mathbf{z}} \mid \mathbf{x}) \right]$$

These govern the sampling dynamics, training updates, and likelihood estimation.


Denoising-based generation provides a robust, theoretically motivated, and practically effective framework for unsupervised deep generative modeling. By recasting sample generation as progressive denoising under a learnable transition dynamic, it enables competitive synthesis of complex data, strong diversity, and stability advantages—attributes that have influenced a wide spectrum of later generative modeling research.