Nested Diffusion Models
- Nested Diffusion Models are generative architectures that embed multiple diffusion processes in a hierarchical framework to capture multi-scale structure and conditional dependencies.
- They improve sample quality and computational efficiency by decomposing complex distributions into semantically meaningful subspaces for image, text, and tabular data.
- Key instantiations include hierarchical latent priors for image synthesis, anytime generation schemes, and diffusion-nested autoregressive synthesis, yielding significant performance gains.
Nested diffusion models constitute a class of generative architectures in which diffusion processes are composed or embedded within a broader structural hierarchy or outer modeling loop, producing significant gains in sample fidelity, flexibility, or computational efficiency. Distinct forms of nesting have been introduced for vision, text, and tabular data, sharing the unifying concept of leveraging one or more inner diffusion models inside a multi-level generative system. These architectures transcend flat diffusion approaches by explicitly decomposing complex distributions into semantically or structurally meaningful subspaces, often with each diffusion sub-model operating at a different level of abstraction, resolution, or conditional context. Key instantiations include the hierarchical latent priors framework for image synthesis (Zhang et al., 8 Dec 2024), the computationally efficient anytime image generation algorithm (Elata et al., 2023), and the diffusion-nested autoregressive synthesis for tabular data (Zhang et al., 28 Oct 2024).
1. Motivation and Principle of Nested Diffusion
Traditional diffusion models parameterize the distribution of a data point by simulating a single forward noising process and then learning to reverse this process through iterative denoising steps. While effective, this approach has intrinsic limitations in capturing multi-scale structure, producing high-quality partial outputs during sampling, and flexibly supporting complex conditional dependencies. Nested diffusion models address these issues by introducing one or several forms of nesting:
- Hierarchical semantic decomposition: Each diffusion model in the hierarchy is responsible for a different level of abstraction, so that higher-level processes generate coarse semantic structure while lower levels add detail (Zhang et al., 8 Dec 2024).
- Anytime generation and computational flexibility: Instead of a linear chain of denoising, outer sampler steps invoke inner diffusion solvers, with the result that valid, high-fidelity data samples are available at any intermediate stage (Elata et al., 2023).
- Conditional, feature-wise modeling in tabular data: Diffusion processes are embedded as conditional sub-models, e.g., inside autoregressive architectures, to model continuous columns exactly within a permutation-invariant transformer (Zhang et al., 28 Oct 2024).
This nested composition confers modeling and algorithmic advantages, including improved sample quality, semantic control, computational amortization, and support for arbitrary conditioning or imputation tasks.
2. Hierarchical Latent Priors for Image Generation
The framework of "Nested Diffusion Models Using Hierarchical Latent Priors" systematically decomposes image synthesis into levels of abstraction. For levels $\ell = 0$ (the image) to $\ell = L$ (most abstract), the generative process samples latent variables $z_L, z_{L-1}, \dots, z_0$ in a top-down manner. Each level $\ell$ employs a standalone diffusion model to sample $z_\ell$ conditioned on all coarser-level latents $z_{>\ell}$:

$$p_\theta(z_{0:L}) = p_\theta(z_L)\,\prod_{\ell=0}^{L-1} p_\theta\big(z_\ell \mid z_{>\ell}\big), \qquad z_0 \equiv x.$$
Target latents are extracted from real images via a frozen, pretrained visual encoder, patchified at different scales for semantic granularity, then compressed with SVD and perturbed by level-dependent Gaussian noise (Zhang et al., 8 Dec 2024).
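A minimal sketch of this target-extraction pipeline follows; it assumes a frozen encoder that returns token features of shape (batch, tokens, dim), and the pooling sizes, SVD rank, and noise scales are illustrative placeholders rather than the paper's exact settings:

```python
import torch

@torch.no_grad()
def extract_level_latents(images, encoder, pool_sizes, svd_rank, noise_scales):
    """Illustrative extraction of per-level latent targets: encode with a frozen
    encoder, pool tokens to the level's granularity, compress via SVD, then add
    level-dependent Gaussian noise to prevent trivial autoencoding."""
    feats = encoder(images)                                   # (B, N, D) frozen features
    latents = []
    for p, sigma in zip(pool_sizes, noise_scales):            # coarse -> fine levels
        B, N, D = feats.shape
        pooled = feats.reshape(B, N // p, p, D).mean(dim=2)   # (B, N/p, D); assumes p divides N
        # Low-rank compression keeping the top singular directions per example.
        U, S, _ = torch.linalg.svd(pooled, full_matrices=False)
        z = U[..., :svd_rank] * S[..., None, :svd_rank]       # (B, N/p, svd_rank)
        z = z + sigma * torch.randn_like(z)                   # level-dependent perturbation
        latents.append(z)
    return latents                                            # conditioning targets per level
```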
The forward (noising) process at each level is a Markov chain:

$$q\big(z_\ell^{1:T} \mid z_\ell^{0}\big) = \prod_{t=1}^{T} \mathcal{N}\big(z_\ell^{t};\ \sqrt{1-\beta_t}\, z_\ell^{t-1},\ \beta_t I\big).$$
Each reverse process is parameterized as:

$$p_\theta\big(z_\ell^{t-1} \mid z_\ell^{t}, z_{>\ell}\big) = \mathcal{N}\big(z_\ell^{t-1};\ \mu_\theta(z_\ell^{t}, t, z_{>\ell}),\ \sigma_t^2 I\big).$$
Training is performed by maximizing the hierarchical evidence lower bound (ELBO), which decomposes into a sum of per-level diffusion ELBOs, each conditioned on the coarser latents:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q}\!\left[\sum_{\ell=0}^{L} \log \frac{p_\theta\big(z_\ell^{0:T} \mid z_{>\ell}\big)}{q\big(z_\ell^{1:T} \mid z_\ell^{0}\big)}\right].$$
In practice, a simplified noise-prediction loss, $\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{\ell,\, t,\, \epsilon}\big[\lVert \epsilon - \epsilon_\theta(z_\ell^{t}, t, z_{>\ell}) \rVert^2\big]$, is used for training. Each diffusion UNet at level $\ell$ is conditioned on the clean outputs of all higher levels (non-Markovian coupling). Sampling proceeds top-down, recursively denoising each latent given its parent latents, as sketched below.
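The following is a minimal sketch of the top-down sampling recursion, assuming each level's model exposes a hypothetical `denoise_step(z, t, context)` method (the name and interface are illustrative, not the paper's API):

```python
import torch

@torch.no_grad()
def sample_hierarchy(level_models, level_shapes, num_steps):
    """Top-down nested sampling: each level's latent is denoised from pure noise,
    conditioned on the clean latents already generated at all coarser levels."""
    coarser = []                                    # clean latents, most abstract first
    for model, shape in zip(level_models, level_shapes):
        z = torch.randn(shape)                      # start the level from Gaussian noise
        for t in reversed(range(num_steps)):
            # Non-Markovian coupling: the denoiser sees every coarser-level latent.
            z = model.denoise_step(z, t, context=coarser)
        coarser.append(z)                           # freeze this level as conditioning
    return coarser[-1]                              # finest level (image-space output)
```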
This architecture yields significant improvements in generation metrics (e.g., an ~80% FID reduction on ImageNet-1K as hierarchy levels are added) with minimal computational overhead, since the higher levels operate on low-dimensional latents. The quality of the semantic encoder and the injection of noise at each level are crucial to avoid trivial autoencoding and to maximize diversity and fidelity (Zhang et al., 8 Dec 2024).
3. Anytime Nested Diffusion Sampling
The algorithmic scheme of "Nested Diffusion Processes for Anytime Image Generation" defines a compositional structure in which each outer reverse-diffusion step is "expanded" into a full inner diffusion chain. The key property is that after every inner diffusion, the result is a plausible sample on the learned data manifold, allowing the generation to be interrupted and decoded at arbitrary stages (Elata et al., 2023).
Let $N_{\text{outer}}$ and $N_{\text{inner}}$ denote the number of outer and inner steps, respectively; the total number of neural function evaluations (NFEs) is $N_{\text{outer}} \cdot N_{\text{inner}}$. Tuning the ratio $N_{\text{outer}} / N_{\text{inner}}$ allows users to trade off intermediate preview latency against per-preview quality.
Pseudocode for the sampling algorithm is as follows (in stylized format):
```
for t in outer_steps:                                   # outer reverse-diffusion schedule
    x_prime = x_t
    for tau in inner_steps:                             # full inner diffusion chain
        x_0_prime = sample_inner_diffusion(x_prime)
        ...
    x_{t-next} = sample_outer_conditional(x_0_prime, x_t)   # x_0_prime doubles as a preview
```
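Expanded into a runnable form, a minimal sketch of this anytime loop might look as follows; it assumes a generic DDPM-style noise predictor `eps_model(x, t)` and a 1-D tensor `alphas_cumprod` of cumulative alphas, and the re-noising used for the outer conditional step is a simplified stand-in rather than the paper's exact transition kernel:

```python
import torch

@torch.no_grad()
def nested_anytime_sample(eps_model, shape, outer_ts, inner_schedules, alphas_cumprod,
                          preview_fn=None):
    """Anytime nested sampling sketch: each outer step obtains its clean estimate
    x_0' from a full inner DDIM-style chain, so every outer iteration yields a
    plausible sample that can be previewed or returned early."""
    x_t = torch.randn(shape)
    for i, t in enumerate(outer_ts):                      # descending outer timesteps
        # Inner diffusion: denoise from level t toward 0 along inner_schedules[i].
        x = x_t
        for tau, tau_next in zip(inner_schedules[i][:-1], inner_schedules[i][1:]):
            a, a_next = alphas_cumprod[tau], alphas_cumprod[tau_next]
            eps = eps_model(x, tau)
            x0_hat = (x - (1.0 - a).sqrt() * eps) / a.sqrt()
            x = a_next.sqrt() * x0_hat + (1.0 - a_next).sqrt() * eps    # DDIM update
        x_0_prime = x                                     # plausible preview on the data manifold
        if preview_fn is not None:
            preview_fn(x_0_prime, i)                      # human-in-the-loop / early stopping
        if i + 1 < len(outer_ts):
            # Outer conditional step (simplified): re-noise the preview to the next outer level.
            a_next = alphas_cumprod[outer_ts[i + 1]]
            x_t = a_next.sqrt() * x_0_prime + (1.0 - a_next).sqrt() * torch.randn_like(x_0_prime)
        else:
            x_t = x_0_prime
    return x_t
```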
Empirically, at intermediate budgets (e.g., 20–80% of full NFEs), nested diffusion models produce previews with much lower FID than standard diffusion models stopped early, while final-sample FID remains comparable. The framework generalizes to text-to-image and inverse problems, supporting human-in-the-loop/interactive sampling and efficient early stopping (Elata et al., 2023).
4. Nested Diffusion in Autoregressive Tabular Synthesis
"Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data" introduces a hybrid architecture (TabDAR) where, for each continuous column in a table, a diffusion model is nested within an outer permutation-invariant autoregressive transformer (Zhang et al., 28 Oct 2024).
During training, randomly masked subsets of columns simulate arbitrary orderings, with a masked transformer providing a context embedding $\mathbf{c}_i$ for each column $i$. For continuous columns, the conditional distribution $p_\theta(x_i \mid \mathbf{c}_i)$ is parameterized by a compact conditional diffusion model:
- Forward process: $q\big(x_i^t \mid x_i^0\big) = \mathcal{N}\big(x_i^t;\ x_i^0,\ \sigma^2(t)\, I\big)$, i.e., $\mathrm{d}x_i = \sqrt{\tfrac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}w$.
Typically, the VE (variance-exploding) SDE is used.
- Reverse process (score-based): $\mathrm{d}x_i = -\tfrac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}\, s_\theta\big(x_i^t, t, \mathbf{c}_i\big)\,\mathrm{d}t + \sqrt{\tfrac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}\bar{w}$, where $s_\theta$ approximates the conditional score $\nabla_{x_i} \log p_t(x_i \mid \mathbf{c}_i)$.
At each autoregressive step, the corresponding continuous column is sampled via the learned reverse diffusion conditioned on the context vector $\mathbf{c}_i$, as sketched below. This nesting provides exact continuous modeling without discretization, supports arbitrary conditional sampling and missing-value imputation, and outperforms both pure autoregressive and pure diffusion-based approaches on tabular fidelity metrics (18–45% improvement on marginal K–S, joint correlation error, C2ST, and JSD across ten datasets) (Zhang et al., 28 Oct 2024).
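The control flow below is a minimal sketch of this nested sampling for a single row; the interfaces (`context_embedding`, `reverse_step`, `sigma_max`, per-column categorical heads) are assumed for illustration and are not TabDAR's actual API:

```python
import torch

@torch.no_grad()
def sample_row(transformer, cont_diffusions, cat_heads, column_order, n_cols):
    """Diffusion-nested autoregressive sampling of one table row: the outer
    transformer supplies a context embedding per column; continuous columns are
    drawn by an inner conditional (score-based) reverse diffusion."""
    row = [None] * n_cols                              # values filled in autoregressively
    for col in column_order:                           # any permutation of the columns
        # Masked transformer: embeds already-sampled columns, masks the rest,
        # and returns a context vector for the target column.
        c = transformer.context_embedding(row, target_col=col)
        if col in cont_diffusions:                     # continuous column
            diff = cont_diffusions[col]
            x = diff.sigma_max * torch.randn(1)        # VE prior: large-variance Gaussian
            for t in reversed(range(diff.num_steps)):
                # Inner reverse diffusion, conditioned on the context vector c.
                x = diff.reverse_step(x, t, cond=c)
            row[col] = x.item()
        else:                                          # categorical column
            logits = cat_heads[col](c)
            row[col] = torch.distributions.Categorical(logits=logits).sample().item()
    return row
```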
5. Experimental Results across Domains
Nested diffusion architectures have been validated in a range of generative modeling tasks, with consistent evidence for enhanced quality or flexibility:
- Hierarchical latent priors (image):
- Unconditional ImageNet: FID improves from ~55.4 to ~11.05 as hierarchy levels are added.
- Class-conditional ImageNet: FID improves from 31.13 to 9.87 with the deeper hierarchy, and is further reduced to 3.97 with classifier-free guidance.
- COCO text-to-image: FID of 6.97; with classifier-free guidance, FID of 4.72, outperforming larger models.
- Computational overhead of the added levels: roughly +27% GFLOPs (Zhang et al., 8 Dec 2024).
- Anytime image generation:
- At 20–80% of the full NFE budget, nested sampling produces intermediate previews with far lower FID than a standard diffusion model stopped early (e.g., ~13 vs. 282.9 at 20% of the NFEs) (Elata et al., 2023).
- Final-sample FID remains competitive (nested 3.2 vs. vanilla 2.4), while intermediate outputs are markedly superior.
- Similar improvements are observed in text-to-image (Stable Diffusion) and inverse problems (inpainting, denoising).
- Tabular data synthesis:
- TabDAR achieves 18–45% improvements over prior state-of-the-art on key statistical fidelity metrics at fixed model size and data budget (Zhang et al., 28 Oct 2024).
Ablation studies confirm the necessity of added latent noise, the benefits of advanced semantic encoders, and the tradeoffs in schedule, level count, and noise decay.
6. Limitations, Variants, and Research Directions
Nested diffusion models require careful selection of architectural depth, semantic encoders, and noise schedules. Limitations and open problems include:
- Encoder dependence: Image-based hierarchies rely on pretrained encoders; in other modalities or domains where such models do not exist, quality may suffer or require manual engineering (Zhang et al., 8 Dec 2024).
- Manual design of hierarchy: The number of levels, granularity, and associated schedules are currently hand-designed; automated structure learning remains an open topic.
- Computational scaling: While overhead is minimized due to small abstract latent sizes, the linear growth in parameter count may constrain very deep hierarchies.
- Generality: Extension to video and volumetric data, or to end-to-end jointly-learned encoders and generative models, is a topic for future research (Zhang et al., 8 Dec 2024).
Distinct instantiations suggest a broad applicability: for example, replacing unconditional inner diffusions with conditional versions for inverse problems, or supporting arbitrary user intervention during sampling (Elata et al., 2023). Nested diffusion architectures continue to stimulate work in multi-scale generative modeling, anytime generation, and fine-grained conditional synthesis.
7. Cross-Domain Synthesis and Conceptual Scope
The term "nested diffusion" encompasses a family of related but structurally distinct strategies: strictly hierarchical latent prior models in vision (Zhang et al., 8 Dec 2024), sequential compositions powering anytime generation (Elata et al., 2023), and per-feature-conditional diffusions nested within permutation-invariant transformers for tabular data (Zhang et al., 28 Oct 2024). While technical implementations differ, all exploit the expressiveness of diffusion-based denoising in modular, compositional, or recursive generative structures. This highlights a key trend in modern probabilistic modeling: nesting diffusion sub-processes within larger architectures yields improved modeling fidelity, flexibility, and controllability across data modalities.