Hierarchical Multiscale Diffuser

Updated 13 April 2026

HM-Diffuser is a generative model that uses a hierarchical, multilevel diffusion process to capture both coarse and fine details in structured data.
It applies to image synthesis, trajectory planning, and video generation, leveraging domain-specific hierarchies like spatial patches and temporal subgoals.
Empirical results demonstrate improved inference speed, superior sample quality, and robust out-of-distribution performance compared to single-scale diffusion models.

The Hierarchical Multiscale Diffuser (HM-Diffuser) is a class of generative models that employs a multilevel, scale-adaptive diffusion process for more efficient modeling of high-dimensional, structured data. This paradigm has been instantiated across generative image synthesis, trajectory modeling in reinforcement learning, and high-resolution video generation. Central to HM-Diffuser is the decomposition of the generative process into hierarchical stages, each responsible for progressively finer-grained aspects of the target signal, thereby improving computational efficiency, generalization, and sample quality.

1. Conceptual Foundations and Motivation

HM-Diffuser architectures generalize classical diffusion models by introducing explicit hierarchy in the denoising process. Rather than iteratively removing noise from an entire high-dimensional signal in a flat, single-scale manner, the HM-Diffuser partitions the generative process:

Spatial hierarchy: applied to images and videos via base/residual or patch-based decomposition, enabling modeling of coarse structure before fine detail (Xu et al., 23 Jan 2025, Skorokhodov et al., 2024).
Temporal hierarchy: applied to planning and control, where a high-level (sparse) plan specifies subgoals, and low-level diffusers refine the details between subgoals (Chen et al., 2024, Chen et al., 25 Mar 2025).

This approach draws conceptual parallels to multiresolution analysis techniques in signal processing (e.g., wavelets), but uses deep generative diffusion processes conditioned at each hierarchical stage.

2. Mathematical Formulation and Hierarchical Structure

The HM-Diffuser decomposes the generation target into a set of levels or scales. At each scale $\ell$ (or stage $i$), a diffusion model is trained on a lower-dimensional representation or a sparser abstraction of the data, then refined at the next finer level. The mathematical structure varies by application domain:

Image Synthesis (Two-Level Example):

Let $\mathbf{x}\in\mathbb{R}^{H\times W \times 3}$ be an input image, and $\mathcal{E}$ a VAE encoder.
The latent representation is factorized into a base ($z_\mathrm{base}$, low-frequency) and a residual ($z_\mathrm{res}$, high-frequency) component.
For each, a continuous-time rectified flow loss is minimized:

$\mathcal L_i = \int_{0}^{1} \mathbb{E}\bigl[\|f_\mathrm{res}^{h_i} - z_0 - v_\theta(z_t, t \mid \cdots)\|^2\bigr] dt,$

summed across levels $i$ (Xu et al., 23 Jan 2025).
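
As a rough illustration of the per-level objective above, the sketch below evaluates one Monte Carlo sample of a rectified-flow loss in plain Python. It assumes the common linear path $z_t = (1-t) z_0 + t z_1$ between noise $z_0$ and the level's target latent $z_1$ (standing in for $f_\mathrm{res}^{h_i}$), so the regression target is $z_1 - z_0$. The path convention, the `velocity_model` callable, and all function names are assumptions made for illustration, not details taken from Xu et al.

```python
import random

def interpolate(z0, z1, t):
    """Point on the straight path from noise z0 (t=0) to data z1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def rectified_flow_loss_sample(z1, velocity_model, rng):
    """One Monte Carlo sample of ||(z1 - z0) - v(z_t, t)||^2 for a single level."""
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]    # Gaussian noise sample
    t = rng.random()                           # t drawn uniformly from [0, 1)
    zt = interpolate(z0, z1, t)
    target = [b - a for a, b in zip(z0, z1)]   # constant velocity along the path
    pred = velocity_model(zt, t)               # hypothetical model: (z_t, t) -> velocity
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def total_loss(level_targets, level_models, rng):
    """Sum the per-level losses, mirroring the sum over levels i."""
    return sum(rectified_flow_loss_sample(z1, model, rng)
               for z1, model in zip(level_targets, level_models))
```

In practice the expectation would be averaged over a batch and minimized by gradient descent; the single-sample form here only makes the target and the sum over levels explicit.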

Planning and Trajectory Generation:

At the coarsest temporal scale, a high-level diffusion model generates a sparse sequence of subgoals over long horizons.
At each interval between subgoals, a lower-level diffusion model generates the fine-grained segments, conditioning on the start/end subgoals.
The joint process is formalized as hierarchical denoising within the Denoising Diffusion Probabilistic Model (DDPM) framework, with per-level conditional distributions

$p_{\theta_\ell}(\tau^\ell \mid g_0^\ell, g_{k_\ell}^\ell),$

where $\tau^\ell$ is the subtrajectory at level $\ell$, with its endpoints $g_0^\ell$ and $g_{k_\ell}^\ell$ held fixed (Chen et al., 25 Mar 2025, Chen et al., 2024).
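
The two-level structure described above can be sketched as follows. Both "diffusers" are replaced by deterministic linear-interpolation stubs so the control flow is easy to follow; `high_level_stub`, `low_level_stub`, and `hierarchical_plan` are hypothetical names, and a real system would call trained samplers in their place.

```python
from typing import List

Point = List[float]

def lerp(a: Point, b: Point, w: float) -> Point:
    """Linear interpolation between two states."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]

def high_level_stub(start: Point, goal: Point, num_subgoals: int) -> List[Point]:
    """Stand-in for the sparse high-level diffuser: returns subgoals g_0..g_K."""
    return [lerp(start, goal, k / num_subgoals) for k in range(num_subgoals + 1)]

def low_level_stub(g_start: Point, g_end: Point, steps: int) -> List[Point]:
    """Stand-in for the low-level diffuser; both endpoints are kept fixed."""
    return [lerp(g_start, g_end, s / steps) for s in range(steps + 1)]

def hierarchical_plan(start, goal, num_subgoals, steps_per_segment,
                      high=high_level_stub, low=low_level_stub):
    """Generate subgoals first, then fill in each interval between them."""
    subgoals = high(start, goal, num_subgoals)
    plan = [subgoals[0]]
    for g_a, g_b in zip(subgoals, subgoals[1:]):
        segment = low(g_a, g_b, steps_per_segment)
        plan.extend(segment[1:])   # drop the endpoint shared with the previous segment
    return subgoals, plan
```

Because each segment depends only on its own pair of subgoals, the low-level calls in the loop are independent of one another and could be batched.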

Video Generation (Multiscale Patch Diffusion):

The video is never processed as a whole. Instead, the process runs over several scales, and at each scale it operates on patches of increasingly higher resolution.
Forward and reverse processes are parameterized to diffuse and denoise within and across levels, with tight coupling via context fusion mechanisms (Skorokhodov et al., 2024).
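
One way to picture a patch hierarchy is as a chain of nested windows, coarse to fine. The sketch below assumes that every scale uses the same patch size in pixels and that each finer window covers half the width and height of its parent, so effective resolution doubles per scale. Both choices, and the names `nested_patch_boxes` and `effective_resolution`, are illustrative assumptions rather than the configuration reported by Skorokhodov et al.

```python
import random
from typing import List, Tuple

# (x0, y0, x1, y1) in normalized [0, 1] frame coordinates
Box = Tuple[float, float, float, float]

def nested_patch_boxes(num_scales: int, rng: random.Random) -> List[Box]:
    """Sample one chain of nested patch regions, ordered coarse to fine.

    Scale 0 covers the whole frame; each finer scale covers a randomly
    placed half-size window inside its parent.
    """
    boxes: List[Box] = [(0.0, 0.0, 1.0, 1.0)]
    for _ in range(1, num_scales):
        x0, y0, x1, y1 = boxes[-1]
        half_w, half_h = (x1 - x0) / 2.0, (y1 - y0) / 2.0
        nx0 = x0 + rng.random() * half_w
        ny0 = y0 + rng.random() * half_h
        boxes.append((nx0, ny0, nx0 + half_w, ny0 + half_h))
    return boxes

def effective_resolution(patch_pixels: int, box: Box) -> float:
    """Pixels per unit of frame width when a fixed-size patch spans `box`."""
    return patch_pixels / (box[2] - box[0])
```

With three scales and a 64-pixel patch, the finest window corresponds to a frame that is 256 pixels wide, even though no single model call ever sees more than a 64-pixel patch.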

3. Architectural Variants and Implementation

Implementations of HM-Diffuser vary with domain and target signal:

Image domain: Two-stage transformer backbones are commonly used, with the base model operating on a downsampled (coarse) latent and the residual model refining at higher resolution. Each model can use self-attention and class-conditioning. Notably, residuals are empirically easier to model, allowing substantial reduction in inference steps for the refinement stage (Xu et al., 23 Jan 2025).
Trajectory and planning: Distinct high-level (sparse, subgoal) and low-level (dense, short-segment) diffusers are trained. Hierarchies can be encoded recursively or with explicit parameter sharing. Plan refinement uses parallel or sequential calls to the low-level diffusers for each coarse interval (Chen et al., 25 Mar 2025, Chen et al., 2024).
Video synthesis: At each patch scale, the model includes transformer/U-Net blocks with cross-scale context fusion. Adaptive computation dynamically assigns more resources to coarser levels, exploiting the relative simplicity of fine-detail synthesis. Patches are aligned using coordinate-aware feature propagation (Skorokhodov et al., 2024).
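
For the recursive, parameter-shared planning variant mentioned above, a minimal sketch is a single callable that is conditioned on recursion depth and asked for one intermediate subgoal per interval. The bisection scheme and the `midpoint_stub` placeholder are assumptions for illustration; the source states only that hierarchies can be encoded recursively or with shared parameters.

```python
def refine(waypoints, depth, shared_model):
    """Recursively densify a plan using one model conditioned on depth."""
    if depth == 0:
        return list(waypoints)
    denser = [waypoints[0]]
    for a, b in zip(waypoints, waypoints[1:]):
        denser.append(shared_model(a, b, depth))   # one new subgoal inside (a, b)
        denser.append(b)
    return refine(denser, depth - 1, shared_model)

def midpoint_stub(a, b, depth):
    """Placeholder for a trained, depth-conditioned diffuser; ignores depth."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

# Usage: three levels of refinement turn 2 waypoints into 9.
dense_plan = refine([[0.0], [8.0]], 3, midpoint_stub)
```

Each level roughly doubles the number of waypoints, so a depth of $d$ yields $2^d$ intervals from a single start-goal pair while reusing one set of parameters.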

A summary of variant architectures:

Domain	Hierarchy Type	Backbone	Conditioning
Image	Base/Residual	Transformer	Class tokens
RL/Planning	Temporal (subgoals)	Transformer	Return predictor, subgoals
Video	Patch hierarchy	RIN + Transformer	Deep feature context

4. Training Objectives and Sampling Algorithms

Each HM-Diffuser model is trained with per-level denoising objectives, typically a mean-squared error on the predicted noise (or on the predicted velocity in the rectified-flow case) under a scheduled perturbation:

Image/Latent: Continuous or discrete-time diffusion losses for each decomposition stage, with parameters trained either independently or in a staged/tied fashion.
Planning: Joint training of high- and low-level diffusers, with subgoal conditioning. Return predictors are included for guidance, and the entire plan is composed by concatenating subtrajectories generated at each hierarchical tier (Chen et al., 2024).
Video: Losses are minimized per patch scale with context fusion; total loss sums denoising reconstruction across all scales, with schedule-aligned noise variance (Skorokhodov et al., 2024).
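
The per-level noise-prediction objective can be sketched in plain Python as below. The linear beta schedule with endpoints 1e-4 and 0.02 follows the standard DDPM default and is an assumption here, as are the function names and the `eps_model(x_t, t)` interface; none of them is taken from the cited papers.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, eps, alpha_bar):
    """Forward perturbation x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)]

def noise_mse(x0, eps_model, alpha_bars, rng):
    """Mean-squared error between true and predicted noise at a random step."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = q_sample(x0, eps, alpha_bars[t])
    pred = eps_model(xt, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x0)

def multilevel_loss(level_data, level_models, alpha_bars, rng):
    """Total loss: one denoising term per hierarchy level, summed."""
    return sum(noise_mse(x0, model, alpha_bars, rng)
               for x0, model in zip(level_data, level_models))
```

Training the levels independently corresponds to minimizing each term separately; joint training minimizes the sum.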

Sampling typically follows a coarse-to-fine paradigm. At inference, each model:

Generates a coarse (low-frequency/long-horizon) solution.
Successively refines the output using finer-level diffusers, conditioned on appropriate context (upscaling, subgoals, feature fusion).
Decodes or stitches outputs to obtain the final high-resolution or long-horizon artifact (Xu et al., 23 Jan 2025, Chen et al., 25 Mar 2025, Skorokhodov et al., 2024).
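
The three steps above can be sketched as a generic loop over stages. The example works on a 1-D signal, uses nearest-neighbour upsampling as the conditioning context, and treats every fine stage as predicting an additive residual. The stub samplers, the `(sampler, num_steps)` stage format, and the step counts are hypothetical; they show only that finer stages can be given a smaller step budget.

```python
def upsample_nearest(signal, factor):
    """Repeat each value `factor` times."""
    return [v for v in signal for _ in range(factor)]

def sample_coarse_to_fine(stages, upsample_factor=2):
    """Run stages ordered coarse to fine.

    Each stage is (sampler, num_steps). The first sampler receives no
    context; later samplers receive the upsampled current signal and
    return a residual of the same length.
    """
    sampler, steps = stages[0]
    current = sampler(None, steps)
    for sampler, steps in stages[1:]:
        context = upsample_nearest(current, upsample_factor)
        residual = sampler(context, steps)
        current = [c + r for c, r in zip(context, residual)]
    return current

def base_stub(context, num_steps):
    """Placeholder for the coarse diffuser; ignores its arguments."""
    return [1.0, 2.0]

def residual_stub(context, num_steps):
    """Placeholder for a refinement diffuser: alternating +/-0.25 detail."""
    return [0.25 if i % 2 == 0 else -0.25 for i in range(len(context))]

# Hypothetical budgets: many steps for the base, few for the residual.
result = sample_coarse_to_fine([(base_stub, 50), (residual_stub, 10)])
```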

5. Empirical Performance and Comparative Insights

Reported results across the three domains indicate efficiency and quality gains for HM-Diffuser frameworks over flat, single-level diffusion models.

ImageNet Synthesis (Xu et al., 23 Jan 2025):

Model	Params	FID ↓	IS ↑	Time (ms)
DiT-XL/2	675 M	2.27	278.2	145
HM-Diffuser (MSF)	767 M	2.20	254.7	68

HM-Diffuser achieves roughly 2.1× faster inference (68 ms vs. 145 ms) at slightly better FID (2.20 vs. 2.27), although its IS is lower (254.7 vs. 278.2).

Planning and Control (Chen et al., 2024, Chen et al., 25 Mar 2025):

Maze2D Medium: Diffuser 121, HM-Diffuser 136 (+12%).
AntMaze Large: Diffuser […], HM-Diffuser […].
Planning speed: single-scale Diffuser roughly 10 s/plan; HM-Diffuser roughly 1 s/plan (about a 10× speedup).
Multi-task compositional generalization: HM-Diffuser achieves […] OOD success where all flat Diffusers fail.

High-Resolution Video (Skorokhodov et al., 2024):

Model	FVD ↓	IS ↑
PVDM (’23)	343.6	74.4
HPDM-L	66.32	87.68

On UCF-101, HPDM-L lowers FVD from 343.6 (PVDM) to 66.32, roughly a 5× reduction, and raises IS from 74.4 to 87.68.

Empirically, residual/fine-grained levels require fewer denoising steps, permitting faster generation. Hierarchical structure yields superior generalization, especially to long-horizon or OOD scenarios (Chen et al., 25 Mar 2025).

6. Extensions, Limitations, and Open Directions

Several extensions to hierarchical multiscale diffusion models have been proposed:

Progressive Trajectory Extension (PTE): Data augmentation by stitching short segments into long trajectories, enabling the model to extrapolate to horizons much longer than observed during training (Chen et al., 25 Mar 2025).
Recursive and parameter-shared HM-Diffuser: Conditioning a single model on scale or recursion depth, increasing efficiency and scalability (Chen et al., 25 Mar 2025).
Deep context fusion and adaptive computation: For large-scale video, context-aware feature fusion and per-level block assignment for computational efficiency are effective (Skorokhodov et al., 2024).
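
As a toy illustration of the stitching idea behind PTE, the sketch below greedily extends a seed trajectory with any pool segment whose first state lies within a tolerance of the current last state. The greedy nearest-start rule, the `tol` threshold, and the function names are assumptions for illustration; the actual augmentation procedure in Chen et al. may differ.

```python
import math

def dist(a, b):
    """Euclidean distance between two states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stitch_trajectories(seed, pool, rounds, tol=0.1):
    """Extend `seed` for up to `rounds` rounds with matching pool segments.

    A segment matches when its first state is within `tol` of the
    trajectory's current last state; the first match in pool order is used.
    """
    traj = list(seed)
    for _ in range(rounds):
        match = next((seg for seg in pool if dist(seg[0], traj[-1]) <= tol), None)
        if match is None:
            break                  # nothing connects to the current end
        traj.extend(match[1:])     # skip the overlapping first state
    return traj

# Usage: two short 1-D segments extend a 2-state seed to 4 states.
long_traj = stitch_trajectories(
    seed=[[0.0], [1.0]],
    pool=[[[1.0], [2.0]], [[2.05], [3.0]]],
    rounds=3,
)
```

This also makes the dataset-coverage dependence concrete: if no segment starts near the current end, the extension stops early.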

Limitations include sensitivity to subgoal spacing, potential compounding of artifacts across hierarchical stages, reliance on dataset coverage for effective plan stitching, and open questions on end-to-end vision-to-action integration.

Future work directions comprise adaptive or learned scale selection, improved global guidance, integration with search-based refinement (e.g., MCTS), and more robust patch merging in large-scale video applications.

7. Connections to Related Paradigms

HM-Diffuser methods generalize and unify several prior approaches:

Classical hierarchical planning and control (subgoal decomposition, feudal RL).
Multiresolution and pyramid architectures in vision (wavelets, Laplacian pyramids), now learned and data-driven.
Cascaded diffusion models and staged denoising, with explicit information flow or context propagation between levels.
End-to-end patch-level diffusion for video, contributing a scalable alternative to direct framewise modeling.

Collectively, these advances position HM-Diffuser as a versatile architectural motif for generative modeling where scale, structure, or temporal abstraction is essential.


References

Xu et al. (2025). MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize.

Skorokhodov et al. (2024). Hierarchical Patch Diffusion Models for High-Resolution Video Generation.

Chen et al. (2024). Simple Hierarchical Planning with Diffusion.

Chen et al. (2025). Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion.
