Generative & Diffusion Models

Updated 26 January 2026
  • Generative and diffusion models are probabilistic frameworks that synthesize new data by reversing a gradual noise process through a learnable neural network.
  • They use stochastic differential equations and Markov chains to model forward noising and reverse denoising, achieving state-of-the-art results in image, audio, and graph generation.
  • By employing training objectives such as ELBO and denoising score matching, these models balance sample fidelity and diversity, enabling robust conditional generation.

Generative and diffusion models are a class of probabilistic models that learn to synthesize new data samples by modeling the process of gradually transforming structured data into noise and then reversing this process to generate realistic data. Drawing mathematical inspiration from non-equilibrium thermodynamics and stochastic differential equations, diffusion models have become a central paradigm in modern deep generative modeling, especially after the introduction of Denoising Diffusion Probabilistic Models (DDPMs). These models feature a forward process—typically, a Markov chain of progressively noised samples—and a reverse process, parameterized by a neural network, that aims to invert this trajectory step by step, restoring structure from noise. This reversible framework underpins state-of-the-art results in image, audio, video, and graph generation, providing both theoretical tractability and practical generative performance (Torre, 2023).

1. Mathematical Foundations and Process Dynamics

The core principle of diffusion generative models is the construction of a stochastic process that degrades data into noise over multiple steps, together with a learnable reverse process that maps noise back to data. In the discrete-time case, the forward (noising) process is defined as a Markov chain

$$q(x_t \mid x_{t-1}) = \mathcal N\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr), \qquad t = 1, \dots, T,$$

where $\beta_t \in (0,1)$ specifies a variance schedule, typically linear or cosine in practice (Torre, 2023). As $t \to T$, $x_T \approx \mathcal N(0, I)$. The corresponding continuous-time limit is a stochastic differential equation (SDE)

$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dw_t,$$

where $w_t$ is a standard Wiener process (Torre, 2023).
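
The closed-form marginal $q(x_t \mid x_0) = \mathcal N\bigl(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\bigr)$, with $\bar\alpha_t = \prod_{i=1}^{t}(1-\beta_i)$, makes the forward process trivial to simulate in one shot. Below is a minimal NumPy sketch of this noising step; the schedule endpoints, $T$, and the toy data are illustrative assumptions, not values prescribed by the cited papers.

```python
import numpy as np

# Linear variance schedule beta_t and cumulative product \bar{alpha}_t (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

x0 = np.ones((8, 8))          # toy "image"
xT, _ = q_sample(x0, T - 1)   # nearly indistinguishable from N(0, I) noise
```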

The reverse process is also formulated as an SDE:

$$dx_t = \bigl[f(t, x_t) - g^2(t)\,\nabla_{x_t}\log q_t(x_t)\bigr]\,dt + g(t)\,d\bar w_t,$$

with the crucial "score" term $\nabla_{x_t}\log q_t(x_t)$ representing the gradient of the log-density of $x_t$ under the forward process (Torre, 2023, Ding et al., 2024). In practice, this score is approximated by a neural network trained via denoising score matching.

The connection to the Fokker–Planck equation and the Ornstein–Uhlenbeck process formalizes diffusion models as PDE-driven generative mechanisms, with the reversibility guaranteed under mild regularity conditions (Cao et al., 28 Jan 2025).
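
As a concrete instance, for the variance-preserving forward SDE above (drift $-\tfrac{1}{2}\beta(t)\,x$, diffusion coefficient $\sqrt{\beta(t)}$), the marginal density $p_t$ evolves according to the standard Fokker–Planck (Kolmogorov forward) equation

$$\frac{\partial p_t(x)}{\partial t} = \nabla_x \cdot \Bigl(\tfrac{1}{2}\beta(t)\,x\,p_t(x)\Bigr) + \tfrac{1}{2}\beta(t)\,\Delta_x p_t(x),$$

whose stationary solution (for constant $\beta$) is the standard Gaussian, consistent with $x_T \approx \mathcal N(0, I)$.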

2. Training Objectives and Loss Functions

The canonical training objective for diffusion models is derived either from a variational lower bound (ELBO) or through denoising score matching.

ELBO View: The evidence lower bound decomposes into per-step KL divergences between the true forward-conditioned posterior and the model:

$$\log p_\theta(x_0) \geq -\sum_{t=1}^{T} \mathbb E_{q(x_0, x_t)}\bigl[\mathrm{KL}\bigl(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\bigr)\bigr].$$

For Gaussian transitions, this reduces to a mean squared error (MSE) objective on the noise:

$$\mathcal L_{\mathrm{simple}} = \mathbb E_{t \sim \mathrm{Unif}[1,T],\, x_0,\, \epsilon \sim \mathcal N(0, I)}\,\bigl\|\epsilon - \epsilon_\theta(x_t, t)\bigr\|^2,$$

with $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ and $\bar\alpha_t = \prod_{i=1}^{t}(1-\beta_i)$ (Torre, 2023, Ding et al., 2024, Zhen et al., 2024).
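
To make $\mathcal L_{\mathrm{simple}}$ concrete, the following is a minimal PyTorch training-loop sketch. The tiny MLP, 2-D toy data, and crude scalar time conditioning are assumptions chosen only to keep the snippet self-contained; practical models use a U-Net over images with a proper timestep embedding.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# Toy noise-prediction network: input is (x_t, t/T), output is predicted epsilon.
eps_model = nn.Sequential(nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.randn(64, 2)                                # stand-in for real data
    t = torch.randint(0, T, (64,))
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].unsqueeze(-1)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps        # closed-form forward sample
    t_in = (t.float() / T).unsqueeze(-1)                   # crude time conditioning
    loss = ((eps - eps_model(torch.cat([xt, t_in], dim=-1))) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```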

Score Matching: In the SDE framework, training is equivalent to approximating the score $\nabla_{x_t}\log q_t(x_t)$ with a neural network $s_\theta(x_t, t)$ via denoising score matching (Ding et al., 2024, Yeğin et al., 2024). The learned score parameterizes the reverse SDE for generative sampling.
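
The two views are tightly linked: for the Gaussian forward kernel, the conditional score is an affine function of the injected noise, so an $\epsilon$-prediction network implicitly defines a score model via the standard identity

$$\nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}, \qquad s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}.$$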

3. Sampling and Generation Algorithms

The generation process (reverse diffusion) is performed via stepwise ancestral sampling or via solvers for the reverse SDE/ODE. For each $t = T, \dots, 1$ (a minimal code sketch of this loop follows the list):

  1. Predict $\epsilon_\theta(x_t, t)$.
  2. Compute $\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$.
  3. Sample $z \sim \mathcal N(0, I)$ (if $t > 1$), then set $x_{t-1} = \mu_t + \sqrt{\beta_t}\,z$ (Torre, 2023, Ding et al., 2024).
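
The three steps above correspond directly to the loop below, a minimal sketch that reuses the illustrative `eps_model`, `betas`, and `alpha_bars` from the training snippet (a toy, not a production sampler).

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, n=16, dim=2, T=1000):
    alphas = 1.0 - betas
    x = torch.randn(n, dim)                                  # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        t_in = torch.full((n, 1), t / T)
        eps_hat = eps_model(torch.cat([x, t_in], dim=-1))
        # mu_t = (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)
        mu = (x - (betas[t] / (1 - alpha_bars[t]).sqrt()) * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mu + betas[t].sqrt() * z                         # x_{t-1} = mu_t + sqrt(beta_t) z
    return x

samples = ddpm_sample(eps_model)
```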

Key hyperparameters:

  • $T$: step count (commonly $\sim 1000$); controls the quality–speed trade-off.
  • $\beta_t$: variance schedule (linear or cosine).
  • Network: U-Net with residual blocks and attention layers; the timestep $t$ is embedded via sinusoidal positional encoding (sketched below).
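
As referenced in the last bullet, a sinusoidal timestep embedding in the Transformer positional-encoding style is typically fed to the network alongside $x_t$. The sketch below is one common formulation; the embedding dimension and the $10000$ base are conventional assumptions rather than values fixed by the cited works.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Map integer timesteps t of shape (batch,) to a (batch, dim) sinusoidal embedding."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float().unsqueeze(-1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```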

Deterministic/integrator-based samplers (DDIM, DPM-Solver) can accelerate inference by skipping steps or solving the reverse ODE for a smaller set of noise levels (Ding et al., 2024, Gallon et al., 2024).
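
For instance, the deterministic DDIM update ($\eta = 0$) first forms an estimate $\hat x_0$ from the current noise prediction and then re-noises it directly to an earlier level $s < t$, which is what permits large step skips. A hedged sketch, reusing the toy model and schedule from the snippets above, follows.

```python
import torch

def ddim_step(model, x_t, t: int, s: int, T: int = 1000):
    """One deterministic DDIM update from timestep t to an earlier timestep s < t."""
    t_in = torch.full((x_t.shape[0], 1), t / T)
    eps_hat = model(torch.cat([x_t, t_in], dim=-1))
    x0_hat = (x_t - (1 - alpha_bars[t]).sqrt() * eps_hat) / alpha_bars[t].sqrt()
    return alpha_bars[s].sqrt() * x0_hat + (1 - alpha_bars[s]).sqrt() * eps_hat

# Example: sample with ~50 evenly spaced timesteps instead of all 1000.
with torch.no_grad():
    x = torch.randn(16, 2)
    ts = list(range(T - 1, 0, -20)) + [0]
    for t, s in zip(ts[:-1], ts[1:]):
        x = ddim_step(eps_model, x, t, s)
```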

4. Model Variants and Extensions

Key Families:

  • DDPM: Discrete-time, Gaussian noise, $\epsilon$-prediction.
  • Score-based Models (SGM): Direct score prediction for reverse-time SDE (continuous-time).
  • SDE/ODE sampling: Direct numerical SDE/ODE solvers (“probability flow ODE”) for more efficient or deterministic generation (Yeğin et al., 2024, Gallon et al., 2024).
  • DDIM: Deterministic, fewer-step non-Markovian sampler (Gallon et al., 2024).
  • Conditional Diffusion: Side information embedded as conditional inputs (class labels, text, spatial context) (Gallon et al., 2024); a minimal conditioning sketch follows this list.
  • Latent Diffusion (LDM): Diffusion in a learned low-dimensional latent space (typically VAE-encoded) for scalability.
  • Discrete Diffusion: For graph or categorical data, forward diffusion and reverse denoising operate on transition kernels over discrete state spaces (Wesego, 22 Jan 2025, Liu et al., 2023).
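
For the conditional case flagged above, the simplest recipe is to embed the side information and feed it to the noise predictor together with $x_t$ and $t$. The sketch below uses a class label for concreteness; all names and sizes are illustrative assumptions (text conditioning in large models typically uses cross-attention instead).

```python
import torch
import torch.nn as nn

class ConditionalEpsModel(nn.Module):
    """Toy epsilon-prediction network conditioned on a class label."""
    def __init__(self, data_dim=2, num_classes=10, emb_dim=16, hidden=128):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(data_dim + 1 + emb_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x_t, t_norm, y):
        # x_t: (B, data_dim), t_norm: (B, 1) normalized timestep, y: (B,) class labels
        return self.net(torch.cat([x_t, t_norm, self.label_emb(y)], dim=-1))

# Training uses the same L_simple loss with eps_theta(x_t, t, y); sampling fixes y.
```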

Architectural Strategies:

  • U-Net backbone with attention (at high- and mid-resolution layers).
  • Time and/or conditional embeddings via positional encodings or cross-attention (see the residual-block sketch after this list).
  • Discrete diffusion employs GNNs or discrete Markov kernels in adjacency space for graphs.
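
A typical way these time or conditional embeddings enter the backbone is inside the residual blocks, as sketched below; the block structure and layer sizes are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class TimeConditionedResBlock(nn.Module):
    """Residual conv block that adds a projected (time/conditional) embedding to its features."""
    def __init__(self, channels: int = 64, emb_dim: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)
        self.act = nn.SiLU()

    def forward(self, x, emb):
        h = self.act(self.conv1(x))
        h = h + self.emb_proj(emb)[:, :, None, None]   # broadcast over height and width
        return x + self.conv2(self.act(h))             # residual connection
```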

Post-Training Modifications:

  • Distillation/Fast Sampling: Teacher–student schemes (progressive or consistency distillation) accelerate the reverse process to as few as 1–20 steps (Ding et al., 2024); a simplified distillation sketch follows this list.
  • Reward Fine-Tuning: Integration of reward-guided objective terms (gradient or gradient-free) to maximize downstream utility functions in sampled outputs (Ding et al., 2024).
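
As a rough illustration of the distillation idea above, the sketch below trains a student so that one of its DDIM jumps matches two consecutive teacher DDIM steps; it reuses the toy `eps_model` and `ddim_step` from earlier snippets and deliberately simplifies away details of the published progressive/consistency-distillation recipes.

```python
import copy
import torch

teacher = eps_model                      # frozen teacher (toy model from above)
student = copy.deepcopy(eps_model)       # student to be fine-tuned
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for it in range(100):
    t = int(torch.randint(2, T, (1,)))
    x_t = torch.randn(64, 2)                         # stand-in for a noised data batch
    with torch.no_grad():                            # two teacher steps: t -> t-1 -> t-2
        target = ddim_step(teacher, ddim_step(teacher, x_t, t, t - 1), t - 1, t - 2)
    pred = ddim_step(student, x_t, t, t - 2)         # one student step: t -> t-2
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```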

5. Generalization, Theoretical Guarantees, and Limitations

In the analytical limit (perfect score and infinite capacity), diffusion models do not generalize beyond the support of the training distribution, since the support of the exact reverse process remains within the data manifold (Cao et al., 28 Jan 2025). In practice, the neural network approximation injects regularization and bias that enable mixing and the creation of new (interpolated) samples (Yi et al., 2023).

Recent formalizations use mutual information between generated outputs and the training dataset as a statistical measure of generalization. Excessively precise models (analytical optima) "memorize" the training set; neural models generalize due to approximation error and optimization bias (Yi et al., 2023). Alternative loss functions can trade off sample diversity (FID) and degree of memorization.

Sampling error, reverse-SDE stability, and expressive power have been partially quantified: under mild regularity conditions, diffusion models are minimax optimal for smooth densities, and finite-step algorithms approach the continuous-time SDE as $\max_t \beta_t \to 0$ (Yeğin et al., 2024).

6. Domains and Applications

Diffusion-based generative models achieve state-of-the-art results in:

  • Image synthesis and editing: 256x256, 512x512 image generation, text-conditioned synthesis (Stable Diffusion, DALL-E 2).
  • Audio and speech: Waveform and spectrogram generation, speech synthesis.
  • Molecular and protein modeling: Molecular graph generation (drug design), protein backbone and structure synthesis; E(3)-equivariant GNNs for structural invariance (Liu et al., 2023).
  • Graph Learning: Discrete diffusion autoencoders enhance graph representation learning and enable principled discrete-sample generation (Wesego, 22 Jan 2025).
  • Sequential Recommendations: Representing item embeddings as distributions via diffusion improves the modeling of user preferences in recommender systems (Zolghadr et al., 2024).
  • Multi-modal generation: Unified frameworks handle joint generation and reconstruction of images, labels, representations, and inpainting masks via shared-latent, multi-modal diffusion (Chen et al., 2024).
  • Wireless communications, security, and semantic channels: Diffusion models enable robust semantic-level denoising, channel modeling, semantic communications, and cross-layer security for 6G networks (Fan et al., 22 Jul 2025).
  • Continual learning: Generative distillation of the reverse process enables continual learning without catastrophic forgetting, a regime intractable via standard GAN/autoencoder replay (Masip et al., 2023).

7. Comparative Analysis and Future Directions

Relative to other generative paradigms:

  • GANs produce sharp samples but risk mode collapse and unstable adversarial training.
  • VAEs yield tractable likelihoods and fast sampling but often produce blurrier outputs.
  • Normalizing flows guarantee invertibility and exact likelihoods but are constrained by transformation families.

Diffusion models combine training stability, likelihood-based variational objectives, straightforward conditioning, and strong empirical fidelity/diversity, but at the cost of high computational complexity and inference latency (due to long sampling trajectories). Fast sampling (e.g., DDIM, DPM-Solver), lightweight architectures, and model compression are active areas of development (Ding et al., 2024, Fan et al., 22 Jul 2025). Open theoretical areas include sharper ELBO bounds, understanding generalization dynamics, scaling beyond Gaussian priors to richer domains, and advancing manifold/discrete-data SDEs.

In summary, generative and diffusion models constitute a mathematically principled, empirically validated methodology for probabilistic data synthesis, unifying tools from stochastic processes, variational inference, and deep learning for diverse domains and modalities (Torre, 2023, Ding et al., 2024, Gallon et al., 2024, Cao et al., 28 Jan 2025, Yeğin et al., 2024, Yi et al., 2023, Wesego, 22 Jan 2025, Fan et al., 22 Jul 2025, Chen et al., 2024, Masip et al., 2023).
