
Hierarchical Multiscale Diffuser

Updated 13 April 2026
  • HM-Diffuser is a generative model that uses a hierarchical, multilevel diffusion process to capture both coarse and fine details in structured data.
  • It applies to image synthesis, trajectory planning, and video generation, leveraging domain-specific hierarchies like spatial patches and temporal subgoals.
  • Empirical results demonstrate improved inference speed, superior sample quality, and robust out-of-distribution performance compared to single-scale diffusion models.

The Hierarchical Multiscale Diffuser (HM-Diffuser) is a class of generative models that employs a multilevel, scale-adaptive diffusion process for more efficient modeling of high-dimensional, structured data. This paradigm has been instantiated across generative image synthesis, trajectory modeling in reinforcement learning, and high-resolution video generation. Central to HM-Diffuser is the decomposition of the generative process into hierarchical stages, each responsible for progressively finer-grained aspects of the target signal, thereby improving computational efficiency, generalization, and sample quality.

1. Conceptual Foundations and Motivation

HM-Diffuser architectures generalize classical diffusion models by introducing explicit hierarchy in the denoising process. Rather than iteratively removing noise from an entire high-dimensional signal in a flat, single-scale manner, the HM-Diffuser partitions the generative process into hierarchical stages, with coarse stages fixing global structure and finer stages adding progressively finer-grained detail.

This approach draws conceptual parallels to multiresolution analysis techniques in signal processing (e.g., wavelets), but uses deep generative diffusion processes conditioned at each hierarchical stage.

2. Mathematical Formulation and Hierarchical Structure

The HM-Diffuser decomposes the generation target into a set of levels or scales. At each scale $\ell$ (or stage $i$), a diffusion model is trained on a lower-dimensional representation or a sparser abstraction of the data, then refined at the next finer level. The mathematical structure varies by application domain:

Image Synthesis (Two-Level Example):

  • Let $\mathbf{x}\in\mathbb{R}^{H\times W \times 3}$ be an input image, and $\mathcal{E}$ a VAE encoder.
  • The latent representation is factorized into a base ($z_\mathrm{base}$, low-frequency) and a residual ($z_\mathrm{res}$, high-frequency) component.
  • For each, a continuous-time rectified flow loss is minimized:

$$\mathcal{L}_i = \int_{0}^{1} \mathbb{E}\bigl[\|f_\mathrm{res}^{h_i} - z_0 - v_\theta(z_t, t \mid \cdots)\|^2\bigr]\, dt,$$

summed across levels $i$ (Xu et al., 23 Jan 2025).
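The two-level factorization and the per-level loss can be illustrated with a minimal NumPy sketch. Several choices here are assumptions made for illustration and are not taken from the cited paper: the base is obtained by average pooling, the residual is the difference from the nearest-neighbour upsampled base, $z_0$ is treated as the Gaussian noise endpoint, and `velocity_fn` is a hypothetical stand-in for the learned network $v_\theta$.

```python
import numpy as np

def split_base_residual(z, factor=2):
    """Split a latent z of shape (H, W, C) into a coarse base and a residual.

    Assumption: average pooling defines the low-frequency base, and the
    residual is whatever the upsampled base fails to explain.
    """
    h, w, c = z.shape
    base = z.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    upsampled = base.repeat(factor, axis=0).repeat(factor, axis=1)
    return base, z - upsampled

def merge_base_residual(base, residual, factor=2):
    """Invert split_base_residual: upsample the base and add the residual."""
    return base.repeat(factor, axis=0).repeat(factor, axis=1) + residual

def rectified_flow_loss(velocity_fn, target, rng, n_samples=64):
    """Monte Carlo estimate of the integral over t of E||(target - z0) - v(z_t, t)||^2.

    The squared error is averaged per element rather than summed.
    """
    total = 0.0
    for _ in range(n_samples):
        z0 = rng.standard_normal(target.shape)   # noise endpoint (assumed)
        t = rng.uniform()                        # t sampled uniformly in [0, 1)
        z_t = (1.0 - t) * z0 + t * target        # straight-line interpolation
        error = (target - z0) - velocity_fn(z_t, t)
        total += np.mean(error ** 2)
    return total / n_samples
```

In this sketch one loss of this form would be evaluated for the base and one for the residual, and the two would be added to give the total objective.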

Planning and Trajectory Generation:

  • At the coarsest temporal scale, a high-level diffusion model generates a sparse sequence of subgoals over long horizons.
  • At each interval between subgoals, a lower-level diffusion model generates the fine-grained segments, conditioning on the start/end subgoals.
  • The joint process is formalized as hierarchical denoising within the Denoising Diffusion Probabilistic Model (DDPM) framework, with per-level conditional distributions

$$p_{\theta_\ell}(\tau^\ell \mid g_0^\ell, g_{k_\ell}^\ell),$$

where $\tau^\ell$ is the subtrajectory at level $\ell$ with its endpoints $g_0^\ell$ and $g_{k_\ell}^\ell$ fixed (Chen et al., 25 Mar 2025, Chen et al., 2024).
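To make the subgoal/segment indexing concrete, the following sketch splits a dense trajectory into a sparse subgoal sequence and the dense segments between consecutive subgoals. The function name `split_hierarchy`, the fixed spacing `k`, and the requirement that the horizon be a multiple of `k` are assumptions of this illustration, not details from the cited papers.

```python
import numpy as np

def split_hierarchy(trajectory, k):
    """Return (subgoals, segments) for a trajectory of shape (T + 1, d).

    subgoals: every k-th state, i.e. the sequence a high-level model would see.
    segments: dense pieces of length k + 1 whose first and last states are
              consecutive subgoals, i.e. what a low-level model would generate.
    """
    horizon = len(trajectory) - 1
    if horizon % k != 0:
        raise ValueError("horizon must be a multiple of the subgoal spacing k")
    subgoals = trajectory[::k]
    segments = [trajectory[i:i + k + 1] for i in range(0, horizon, k)]
    return subgoals, segments
```

Each segment shares its endpoints with two adjacent subgoals, which is what allows a low-level model to be conditioned on the start and end of its interval.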

Video Generation (Multiscale Patch Diffusion):

  • The video is never processed as a whole. Instead, the process runs over several scales, and at each scale $\ell$ it operates on patches of increasingly higher resolution.
  • Forward and reverse processes are parameterized to diffuse and denoise within and across levels, with tight coupling via context fusion mechanisms (Skorokhodov et al., 2024).
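One way to picture a patch hierarchy is as a set of equally sized patches, where the coarsest patch covers the whole frame at low resolution and each finer patch covers a smaller region at higher resolution. The sketch below builds such a pyramid with NumPy. The nesting rule (halving the covered side length per scale), the fixed top-left corner, and the names `avg_pool` and `patch_pyramid` are illustrative assumptions rather than the cited method.

```python
import numpy as np

def avg_pool(frames, f):
    """Average-pool frames of shape (T, H, W, C) spatially by an integer factor f."""
    t, h, w, c = frames.shape
    return frames.reshape(t, h // f, f, w // f, f, c).mean(axis=(2, 4))

def patch_pyramid(video, patch=8, n_scales=3):
    """Return one (T, patch, patch, C) patch per scale, coarsest first.

    Assumes square frames with side patch * 2**(n_scales - 1), so that the
    coarsest patch covers the full frame and the finest is at native resolution.
    """
    t, h, w, c = video.shape
    if h != w or h != patch * 2 ** (n_scales - 1):
        raise ValueError("frame size must equal patch * 2**(n_scales - 1)")
    patches = []
    for s in range(n_scales):
        f = 2 ** (n_scales - 1 - s)          # downsampling factor at this scale
        size = patch * f                     # side of the covered region in pixels
        region = video[:, :size, :size]      # nested regions anchored top-left
        patches.append(avg_pool(region, f))
    return patches
```

Because every patch has the same shape, memory per step stays constant regardless of the target resolution, which is the practical appeal of never processing the full video at once.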

3. Architectural Variants and Implementation

Implementations of HM-Diffuser vary with domain and target signal:

  • Image domain: Two-stage transformer backbones are commonly used, with the base model operating on a downsampled (coarse) latent and the residual model refining at higher resolution. Each model can use self-attention and class-conditioning. Notably, residuals are empirically easier to model, allowing substantial reduction in inference steps for the refinement stage (Xu et al., 23 Jan 2025).
  • Trajectory and planning: Distinct high-level (sparse, subgoal) and low-level (dense, short-segment) diffusers are trained. Hierarchies can be encoded recursively or with explicit parameter sharing. Plan refinement uses parallel or sequential calls to the low-level diffusers for each coarse interval (Chen et al., 25 Mar 2025, Chen et al., 2024).
  • Video synthesis: At each patch scale, the model includes transformer/U-Net blocks with cross-scale context fusion. Adaptive computation dynamically assigns more resources to coarser levels, exploiting the relative simplicity of fine-detail synthesis. Patches are aligned using coordinate-aware feature propagation (Skorokhodov et al., 2024).

A summary of variant architectures:

| Domain | Hierarchy Type | Backbone | Conditioning |
|---|---|---|---|
| Image | Base/Residual | Transformer | Class tokens |
| RL/Planning | Temporal (subgoals) | Transformer | Return predictor, subgoals |
| Video | Patch hierarchy | RIN + Transformer | Deep feature context |

4. Training Objectives and Sampling Algorithms

Each HM-Diffuser model is trained with per-level denoising objectives, typically a mean-squared error on noise prediction (or on velocity prediction in the rectified-flow case) under an appropriately scheduled perturbation:

  • Image/Latent: Continuous or discrete-time diffusion losses for each decomposition stage, with parameters trained either independently or in a staged/tied fashion.
  • Planning: Joint training of high- and low-level diffusers, with subgoal conditioning. Return predictors are included for guidance, and the entire plan is composed by concatenating subtrajectories generated at each hierarchical tier (Chen et al., 2024).
  • Video: Losses are minimized per patch scale with context fusion; total loss sums denoising reconstruction across all scales, with schedule-aligned noise variance (Skorokhodov et al., 2024).
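A per-level noise-prediction objective summed over levels can be sketched as follows. This is a generic DDPM-style loss written under assumptions: a linear beta schedule shared by all levels, one random timestep per call, and hypothetical per-level predictors `eps_fns` standing in for trained networks.

```python
import numpy as np

def alpha_bar_schedule(n_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def noise_prediction_loss(eps_fn, x0, alpha_bar, rng):
    """Single-sample MSE between the true noise and the predicted noise."""
    t = rng.integers(len(alpha_bar))
    eps = rng.standard_normal(x0.shape)
    # Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_fn(x_t, t)) ** 2)

def hierarchical_loss(eps_fns, level_data, alpha_bar, rng):
    """Sum the per-level denoising losses, one predictor per level."""
    return sum(
        noise_prediction_loss(eps_fn, x0, alpha_bar, rng)
        for eps_fn, x0 in zip(eps_fns, level_data)
    )
```

Whether the levels share parameters or are trained independently only changes what `eps_fns` contains; the summed objective has the same form either way.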

Sampling typically follows a coarse-to-fine paradigm. At inference, each model:

  1. Generates a coarse (low-frequency/long-horizon) solution.
  2. Successively refines the output using finer-level diffusers, conditioned on appropriate context (upscaling, subgoals, feature fusion).
  3. Decodes or stitches outputs to obtain the final high-resolution or long-horizon artifact (Xu et al., 23 Jan 2025, Chen et al., 25 Mar 2025, Skorokhodov et al., 2024).
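The three steps above can be sketched for the planning case as a short pipeline. The samplers here are placeholders: `interpolate_stub` simply interpolates between two states (optionally perturbing interior points) and stands in for a trained diffuser, and all function names and signatures are hypothetical.

```python
import numpy as np

def interpolate_stub(a, b, n, rng=None, noise=0.0):
    """Placeholder for a trained diffuser: n + 1 points from a to b.

    Interior points may be perturbed; the endpoints are always kept fixed.
    """
    w = np.linspace(0.0, 1.0, n + 1)[:, None]
    points = (1.0 - w) * a + w * b
    if rng is not None:
        points[1:-1] += noise * rng.standard_normal(points[1:-1].shape)
    return points

def coarse_to_fine_plan(high, low, start, goal, n_intervals, k):
    """Step 1: sample subgoals. Step 2: refine each interval. Step 3: stitch."""
    subgoals = high(start, goal, n_intervals)                 # (n_intervals + 1, d)
    segments = [low(subgoals[i], subgoals[i + 1], k) for i in range(n_intervals)]
    # Drop the duplicated first state of every segment after the first one.
    plan = np.concatenate([segments[0]] + [s[1:] for s in segments[1:]], axis=0)
    return subgoals, plan
```

Since each interval is refined independently once the subgoals are fixed, the low-level calls in step 2 could run in parallel, which is one plausible source of the planning speedups reported for hierarchical variants.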

5. Empirical Performance and Comparative Insights

Reported benchmarks indicate efficiency and quality advantages of HM-Diffuser frameworks over flat, single-level diffusion models.

Image synthesis:

| Model | Params | FID ↓ | IS ↑ | Time (ms) |
|---|---|---|---|---|
| DiT-XL/2 | 675 M | 2.27 | 278.2 | 145 |
| HM-Diffuser (MSF) | 767 M | 2.20 | 254.7 | 68 |

  • HM-Diffuser achieves about 2.1× faster inference at matched or better FID (2.20 at 68 ms).

Planning:

  • Maze2D Medium: Diffuser 121, HM-Diffuser 136 (+12%).
  • AntMaze Large: Diffuser […], HM-Diffuser […].
  • Planning speed: single-scale Diffuser on the order of 10 s/plan; HM-Diffuser on the order of 1 s/plan (about an order of magnitude faster).
  • Multi-task compositional generalization: HM-Diffuser achieves […] OOD success where all flat Diffusers fail.

Video generation:

| Model | FVD ↓ | IS ↑ |
|---|---|---|
| PVDM (’23) | 343.6 | 74.4 |
| HPDM-L | 66.32 | 87.68 |

  • HPDM achieves […] FVD improvement over prior patch or latent diffusion models on UCF-101 […].

Empirically, residual/fine-grained levels require fewer denoising steps, permitting faster generation. Hierarchical structure yields superior generalization, especially to long-horizon or OOD scenarios (Chen et al., 25 Mar 2025).
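The headline ratios can be recomputed from the tabulated numbers as a sanity check. The snippet uses only values reported in this section and assumes the Maze2D Medium baseline score is 121, which is consistent with the stated +12% gain.

```python
# Values copied from the tables and bullets in this section.
dit_ms, hm_ms = 145.0, 68.0            # inference time, DiT-XL/2 vs HM-Diffuser
maze_flat, maze_hm = 121.0, 136.0      # Maze2D Medium scores (baseline assumed to be 121)
fvd_pvdm, fvd_hpdm = 343.6, 66.32      # FVD, PVDM vs HPDM-L

speedup = dit_ms / hm_ms
maze_gain = maze_hm / maze_flat - 1.0
fvd_ratio = fvd_pvdm / fvd_hpdm

print(f"{speedup:.2f}x, {maze_gain:+.1%}, {fvd_ratio:.2f}x")
# prints: 2.13x, +12.4%, 5.18x
```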

6. Extensions, Limitations, and Open Directions

Several extensions to hierarchical multiscale diffusion models have been proposed:

  • Progressive Trajectory Extension (PTE): Data augmentation by stitching short segments into long trajectories, enabling the model to extrapolate to horizons much longer than observed during training (Chen et al., 25 Mar 2025).
  • Recursive and parameter-shared HM-Diffuser: Conditioning a single model on scale or recursion depth, increasing efficiency and scalability (Chen et al., 25 Mar 2025).
  • Deep context fusion and adaptive computation: For large-scale video, context-aware feature fusion and per-level block assignment for computational efficiency are effective (Skorokhodov et al., 2024).
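The stitching idea behind trajectory extension can be sketched as a greedy procedure that appends any segment starting close to where the current trajectory ends. The distance threshold `tol`, the random choice among candidates, and the function name `extend_trajectory` are assumptions for illustration; the cited work may select and join segments differently.

```python
import numpy as np

def extend_trajectory(segments, start_idx, n_rounds, tol, rng):
    """Grow a trajectory by repeatedly appending a compatible segment.

    segments: list of arrays of shape (length, d).
    A segment is compatible when its first state lies within `tol`
    (Euclidean distance) of the current final state.
    """
    traj = segments[start_idx]
    for _ in range(n_rounds):
        last = traj[-1]
        candidates = [s for s in segments if np.linalg.norm(s[0] - last) <= tol]
        if not candidates:
            break                                  # nothing left to stitch
        chosen = candidates[rng.integers(len(candidates))]
        traj = np.concatenate([traj, chosen[1:]], axis=0)   # skip the shared state
    return traj
```

The early exit when no candidate is found reflects the limitation noted below: stitching can only extend trajectories as far as the dataset's coverage allows.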

Limitations include sensitivity to subgoal spacing, potential compounding of artifacts across hierarchical stages, reliance on dataset coverage for effective plan stitching, and open questions on end-to-end vision-to-action integration.

Directions for future work include adaptive or learned scale selection, improved global guidance, integration with search-based refinement (e.g., MCTS), and more robust patch merging in large-scale video applications.

HM-Diffuser methods generalize and unify several prior approaches:

  • Classical hierarchical planning and control (subgoal decomposition, feudal RL).
  • Multiresolution and pyramid architectures in vision (wavelets, Laplacian pyramids), now learned and data-driven.
  • Cascaded diffusion models and staged denoising, with explicit information flow or context propagation between levels.
  • End-to-end patch-level diffusion for video, contributing a scalable alternative to direct framewise modeling.

Collectively, these advances position HM-Diffuser as a versatile architectural motif for generative modeling where scale, structure, or temporal abstraction is essential.
