Hierarchical Diffusion Policy (HDP)

Updated 9 March 2026

HDP is a decision-making framework that decomposes planning into high-level subgoal generation and low-level diffusion control to address long-horizon, multimodal tasks.
It leverages denoising diffusion models at each level to synthesize diverse, context-aware trajectories while ensuring physical feasibility and sample efficiency.
Reported benchmarks show gains over flat diffusion policies in robotic manipulation and navigation, although error propagation and computational overhead remain open challenges.

Hierarchical Diffusion Policy (HDP) denotes a class of decision-making architectures that integrate the temporal abstraction and modularity of hierarchical control with the expressive generative capabilities of denoising diffusion models. Designed to address long-horizon, multimodal tasks in domains such as robotic manipulation and reinforcement learning, HDP decomposes policy learning and trajectory generation into multiple interacting levels, typically separating high-level planning (e.g., task decomposition, subgoal or keypoint generation) from low-level control (e.g., trajectory synthesis, skill execution). Each level usually employs a denoising diffusion probabilistic model (DDPM), often conditioned on upstream subgoals and rich context, to model complex conditional distributions over plans and actions. This composition allows HDP to solve complex multi-stage tasks while maintaining tractability, sample efficiency, and robustness to multimodal action or observation spaces.

1. Hierarchical Policy Structure

HDP architectures consistently leverage explicit hierarchical decompositions, generally with two or three levels. The canonical two-level instantiation comprises:

High-Level Planner: Generates temporally abstracted subgoals, such as key states, end-effector poses, spatial targets, or task-structured anchors, often conditioned on observations and (optionally) task specifications such as language instructions. For example, the Next-Best-Pose (NBP) planner in robotic manipulation produces future end-effector poses and gripper commands based on scene context and instructions (Ma et al., 2024), while in contact-rich tasks, the high-level policy predicts discrete or continuous contact locations (Wang et al., 2024).
Low-Level Diffusion Controller: Produces detailed trajectories or action sequences conditioned on the high-level subgoal, current state, and additional observations. These policies use diffusion-based generative models to ensure diversity, context-awareness, and feasibility with respect to system dynamics, robot kinematics, or physical constraints (Ma et al., 2024, Caro et al., 10 Dec 2025, Lu et al., 12 May 2025). In some instantiations, the low level receives not just subgoals but also multi-scale visual features or key-anchored priors (Lu et al., 12 May 2025, Kim et al., 30 Sep 2025).
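The two-level decomposition above can be sketched as a simple control loop. This is a minimal illustration only: `high_level_planner` and `low_level_controller` are hypothetical stand-ins (evenly spaced subgoals and noisy interpolation), not the learned planners or diffusion controllers of any cited work.

```python
import random
from typing import List

State = List[float]

def high_level_planner(state: State, goal: State, n_subgoals: int) -> List[State]:
    """Hypothetical planner: evenly spaced subgoals between state and goal.
    In an HDP this role is played by a learned model (e.g. a pose predictor)."""
    return [
        [s + (g - s) * (i / n_subgoals) for s, g in zip(state, goal)]
        for i in range(1, n_subgoals + 1)
    ]

def low_level_controller(state: State, subgoal: State, horizon: int,
                         rng: random.Random) -> List[State]:
    """Hypothetical controller standing in for a conditional diffusion sampler.
    Interior waypoints get a small random offset (shared across dimensions);
    the final waypoint gets none, so the segment ends at the subgoal."""
    segment = []
    for t in range(1, horizon + 1):
        alpha = t / horizon
        jitter = 0.0 if t == horizon else rng.gauss(0.0, 0.01)
        segment.append([s + (g - s) * alpha + jitter for s, g in zip(state, subgoal)])
    return segment

def run_hierarchical_policy(start: State, goal: State, n_subgoals: int = 3,
                            horizon: int = 5, seed: int = 0) -> List[State]:
    """Plan subgoals once, then roll out one low-level segment per subgoal."""
    rng = random.Random(seed)
    state = list(start)
    trajectory = [state]
    for subgoal in high_level_planner(state, goal, n_subgoals):
        segment = low_level_controller(state, subgoal, horizon, rng)
        trajectory.extend(segment)
        state = segment[-1]  # the next segment starts where this one ended
    return trajectory

if __name__ == "__main__":
    path = run_hierarchical_policy([0.0, 0.0], [1.0, 2.0])
    print(len(path), path[-1])
```

The sketch plans all subgoals up front; a closed-loop variant would instead re-query the planner after each segment.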

Extensions include deeper hierarchies (e.g., triply-hierarchical structures that combine depth-aware input layering, multi-scale visual encoding, and action-level diffusion (Lu et al., 12 May 2025)), as well as methods that adapt the number, granularity, and semantics of layers based on structural properties of the domain (Zeng et al., 26 Sep 2025).

2. Diffusion Models in Hierarchical Control

Diffusion models in HDP operate by learning a denoising process that maps noisy, corrupted trajectories or subgoals to clean samples conditioned on context. The standard DDPM forward process for a trajectory $x^0$, with diffusion step index $k$ and noise variance $\beta^k$ at step $k$, is

$q(x^k|x^{k-1}) = \mathcal{N}(x^k;\sqrt{1-\beta^k}x^{k-1}, \beta^k I),$

and reverse denoising proceeds via learned or structured kernels, with variants incorporating classifier-free or value-based guidance.
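Composing the forward kernel over steps gives the standard closed-form marginal $q(x^k|x^0) = \mathcal{N}(x^k;\sqrt{\bar\alpha^k}x^0, (1-\bar\alpha^k) I)$ with $\bar\alpha^k = \prod_{i \le k}(1-\beta^i)$. The sketch below checks this numerically by propagating the mean coefficient and variance one step at a time. The linear schedule and its endpoint values are assumptions for illustration, not taken from the cited works.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta^1..beta^K (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar^k = product over i <= k of (1 - beta^i)."""
    values, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        values.append(product)
    return values

def forward_step(x_prev, beta, rng):
    """One sample from q(x^k | x^{k-1}) = N(sqrt(1 - beta) x^{k-1}, beta I)."""
    scale, sigma = math.sqrt(1.0 - beta), math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_marginal(x0, alpha_bar_k, rng):
    """Direct sample from q(x^k | x^0) using the closed-form marginal."""
    scale, sigma = math.sqrt(alpha_bar_k), math.sqrt(1.0 - alpha_bar_k)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

def propagate_moments(betas):
    """Track the mean coefficient and variance of x^k given x^0, step by step."""
    coeff, var = 1.0, 0.0
    for beta in betas:
        coeff *= math.sqrt(1.0 - beta)
        var = (1.0 - beta) * var + beta
    return coeff, var

if __name__ == "__main__":
    betas = linear_beta_schedule(100)
    alpha_bar = cumulative_alpha_bar(betas)
    coeff, var = propagate_moments(betas)
    print(coeff, math.sqrt(alpha_bar[-1]))  # mean coefficients agree
    print(var, 1.0 - alpha_bar[-1])         # variances agree
```

The closed form is what makes training practical: a noisy sample at any step can be drawn in one shot instead of by iterating the chain.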

In the hierarchical setting, the high-level DDPM is trained on demonstration subgoal or keypoint sequences, with context such as multimodal observation, language, or previous subgoal (Ma et al., 2024, Grislain et al., 5 Mar 2026). The low-level DDPM may operate in action space, joint configuration space, or world frame, employing techniques such as inpainting endpoints, differentiable kinematics for physical feasibility, and guidance via value functions or Q-critics (Ma et al., 2024, Wang et al., 2024). Conditioned denoising is essential for goal-directed behavior and is often realized via cross-attention to multi-scale context or dedicated conditioning modules (Lu et al., 12 May 2025).
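Endpoint inpainting, mentioned above, can be illustrated with a toy reverse loop: after every denoising step the first and last waypoints are overwritten with the current state and the high-level subgoal. This is a minimal sketch of one simple variant. `toy_reverse_step` is a hypothetical stand-in for a learned denoiser, and the noise scaling is an arbitrary assumption.

```python
import random

def clamp_endpoints(traj, start, subgoal):
    """Inpainting constraint: overwrite the known first and last waypoints."""
    traj[0] = list(start)
    traj[-1] = list(subgoal)
    return traj

def toy_reverse_step(traj, noise_scale, rng):
    """Hypothetical stand-in for one learned reverse step (NOT a trained model).
    Interior waypoints are pulled toward the midpoint of their neighbours;
    every waypoint, endpoints included, then receives Gaussian noise."""
    last = len(traj) - 1
    new_traj = []
    for i, point in enumerate(traj):
        new_point = []
        for d, value in enumerate(point):
            if 0 < i < last:
                midpoint = 0.5 * (traj[i - 1][d] + traj[i + 1][d])
                value = 0.5 * value + 0.5 * midpoint
            new_point.append(value + noise_scale * rng.gauss(0.0, 1.0))
        new_traj.append(new_point)
    return new_traj

def sample_with_inpainting(start, subgoal, horizon=8, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise, re-imposing endpoints each step."""
    rng = random.Random(seed)
    traj = [[rng.gauss(0.0, 1.0) for _ in start] for _ in range(horizon)]
    traj = clamp_endpoints(traj, start, subgoal)
    for k in range(num_steps, 0, -1):
        noise_scale = 0.1 * (k - 1) / num_steps  # zero noise on the final step
        traj = toy_reverse_step(traj, noise_scale, rng)
        traj = clamp_endpoints(traj, start, subgoal)  # the inpainting step
    return traj

if __name__ == "__main__":
    for waypoint in sample_with_inpainting([0.0, 0.0], [1.0, 1.0]):
        print([round(v, 3) for v in waypoint])
```

Because the clamp is applied after each step, the interior waypoints are repeatedly denoised in the context of fixed endpoints, which is how the subgoal constrains the generated segment.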

Variants further generalize the forward/reverse kernels with non-isotropic, task-structured Gaussian priors derived from motion planning or Gaussian process models, allowing the denoising process to follow a Mahalanobis geometry aligned with both task anchors and dynamic feasibility (Kim et al., 30 Sep 2025, Wang et al., 27 May 2025).
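To make the idea of a non-isotropic prior concrete, the sketch below draws temporally correlated noise from $\mathcal{N}(\mu, \Sigma)$ via a Cholesky factor and evaluates the corresponding squared Mahalanobis distance. The squared-exponential covariance over time steps is an assumption chosen for illustration; it is not the GPMP prior used in the cited works.

```python
import math
import random

def se_kernel_matrix(n, length_scale=1.5, variance=1.0, jitter=1e-4):
    """Assumed squared-exponential covariance over n time steps.
    The jitter on the diagonal keeps the matrix numerically positive definite."""
    return [
        [variance * math.exp(-0.5 * ((i - j) / length_scale) ** 2)
         + (jitter if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - partial)
            else:
                low[i][j] = (a[i][j] - partial) / low[j][j]
    return low

def sample_structured_noise(mean, chol, rng):
    """Draw x = mean + L z with z ~ N(0, I), so that x ~ N(mean, L L^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]

def mahalanobis_sq(x, mean, chol):
    """Squared Mahalanobis distance, via forward substitution on L y = x - mean."""
    y = [0.0] * len(x)
    for i in range(len(x)):
        residual = x[i] - mean[i] - sum(chol[i][k] * y[k] for k in range(i))
        y[i] = residual / chol[i][i]
    return sum(v * v for v in y)

if __name__ == "__main__":
    steps = 6
    mean = [0.0] * steps
    chol = cholesky(se_kernel_matrix(steps))
    noise = sample_structured_noise(mean, chol, random.Random(0))
    print([round(v, 3) for v in noise], mahalanobis_sq(noise, mean, chol))
```

Unlike isotropic noise, neighbouring time steps here are correlated, so sampled perturbations are smooth along the trajectory.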

3. Methodological Variants and Technical Advances

Variant/Technique	High Level (Planner)	Low Level (Controller)	Distinctive Features
(Ma et al., 2024) HDP	PerAct NBP planner (BC, voxel/bins)	RK-Diffuser (joint/pose, kin. distill)	Kinematics-aware, two-chain diffusion, sim+real
(Lu et al., 12 May 2025) H$^3$DP	Depth-layered multi-scale visual plans	Coarse-to-fine action diffusion	Triply-hierarchical, multi-scale conditioning
(Wang et al., 2024) HDP	Contact location diffusion	Action sequence diffusion, Q-learning	Contact guidance, snapshot denoising, promptable
(Zeng et al., 26 Sep 2025) SIHD	Adaptive, info-struct. keypoint splits	Multi-scale diffusion, struct. cond.	Structural entropy, adaptivity, regularization
(Kim et al., 30 Sep 2025) Hier. Diff.	Keypoint/sample structure (GPMP, etc.)	GPMP-conditioned trajectory diffusion	Uncertainty-aware non-isotropic priors
(Caro et al., 10 Dec 2025) HeRD	RL-based high-level goal selector	Diffusion-based 2D trajectory	RL/diffusion hybrid, nonprehensile pushing
(Kim et al., 2024) DuSkill	Latent skill selector (domain disent.)	Guided skill diffusion	Latent skill disentanglement, domain transfer
(Grislain et al., 5 Mar 2026) HD-ExpIt	Visual plan image diffusion (DDPM)	Open-loop chunked action diffusion	Iterative expert iteration, on-policy distill
(Wang et al., 27 May 2025)	Diffusion for subgoal generation (GP)	Off-policy RL for subgoal reaching	Uncertainty-guided hybrid subgoal selection

Notable advances across these works include:

Kinematics-aware control: Joint/pose-space dual diffusion with distillation via differentiable kinematics (Ma et al., 2024).
Multi-scale visual and action representation: Depth-aware input splitting, multi-scale visual features, and hierarchical action conditioning to align perception and control (Lu et al., 12 May 2025).
Contact-guided decomposition: High-level contact prediction directly conditions low-level action denoising, enhancing robustness in multimodal, contact-rich domains (Wang et al., 2024).
Adaptive, information-theoretic hierarchy: Subgoal abstraction and conditioning signals adaptively inferred from structural entropy of observed state graphs (Zeng et al., 26 Sep 2025), yielding flexible multi-scale policies.
Task-conditioned uncertainty-aware priors: GPMP-derived priors and non-isotropic noise models bias the denoising process toward feasible and semantically meaningful trajectories (Kim et al., 30 Sep 2025).
Hybrid RL/Diffusion: RL-based planners select or supervise subgoals for low-level generative policies (Caro et al., 10 Dec 2025, Wang et al., 27 May 2025).
Latent skill disentanglement: Hierarchically disentangled latent space for skill composition and domain adaptation via guided diffusion (Kim et al., 2024).
On-policy iterative refinement: Expert-iteration meta-algorithms (HD-ExpIt) that iteratively refine hierarchical diffusion components through supervised distillation of on-policy rollouts (Grislain et al., 5 Mar 2026).
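One simple way to combine a generative subgoal sampler with a learned value estimate, in the spirit of the hybrid RL/diffusion entries above, is to sample several candidate subgoals and keep the highest-scoring one. The sketch below is a generic sample-and-rank illustration, not the selection rule of any cited paper. `sample_candidate_subgoals` and `toy_value` are hypothetical stand-ins for a diffusion sampler and a critic.

```python
import random

def sample_candidate_subgoals(state, num_candidates, rng, spread=0.5):
    """Hypothetical stand-in for a high-level diffusion sampler:
    Gaussian perturbations around the current state."""
    return [[s + rng.gauss(0.0, spread) for s in state]
            for _ in range(num_candidates)]

def toy_value(subgoal, goal):
    """Hypothetical critic: negative squared distance to the task goal."""
    return -sum((a - b) ** 2 for a, b in zip(subgoal, goal))

def select_subgoal(state, goal, num_candidates=16, seed=0):
    """Sample candidates, score each with the critic, and return the best."""
    rng = random.Random(seed)
    candidates = sample_candidate_subgoals(state, num_candidates, rng)
    best = max(candidates, key=lambda c: toy_value(c, goal))
    return best, candidates

if __name__ == "__main__":
    best, candidates = select_subgoal([0.0, 0.0], [1.0, 1.0])
    print(best, toy_value(best, [1.0, 1.0]))
```

Ranking samples leaves the generative model untouched, whereas gradient-based guidance would instead modify each denoising step.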

4. Empirical Performance and Benchmarking

The cited works report that HDP and its variants outperform flat diffusion policies, conventional hierarchical RL (HRL), and other non-hierarchical baselines across a range of simulated and real-world environments.

Robotic manipulation (RLBench, Franka Panda, CAN, SQUARE, etc.): Significant boosts in task success rates; e.g., HDP achieves 80.2% overall success on RLBench (vs. 71.3% for Diffusion Policy, DP), and HD-ExpIt exceeds 94% on Franka-3Blocks after a single iteration (Ma et al., 2024, Grislain et al., 5 Mar 2026).
Contact-rich and deformable tasks: Average improvements of 20.8% over Diffusion Policy via explicit contact guidance (Wang et al., 2024).
Complex navigation and continuous control (Maze2D, AntMaze, MuJoCo): HDP variants outpace hierarchical and non-hierarchical competitors, e.g., in Maze2D-Large: Diffuser at 123.0 vs. HDP at 155.8; AntMaze-Large: Diffuser fails (0.0) vs. HDP at 83.6 (Chen et al., 2024, Zeng et al., 26 Sep 2025).
Visuomotor and real-world bimanual manipulation: Triply-hierarchical H$^3$DP yields +27.5% average improvement over diffusion baselines across 44 tasks (Lu et al., 12 May 2025).
Robustness and sample efficiency: DuSkill (HDP for skill learning) maintains high performance under substantial cross-domain distribution shift and in low-data regimes (Kim et al., 2024).
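When comparing the figures above, it helps to separate absolute gaps from relative gains, since a percentage "improvement" can mean either. The short sketch below computes both from the numbers quoted in this section. The helper names are illustrative, and the relative gain is left undefined when the baseline score is zero.

```python
def absolute_gain(new, baseline):
    """Difference in raw score (percentage points for success rates)."""
    return new - baseline

def relative_gain(new, baseline):
    """Fractional improvement over the baseline; undefined for a zero baseline."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline

# (hierarchical result, baseline result), as quoted in the list above.
reported = {
    "RLBench success (%)": (80.2, 71.3),   # HDP vs. Diffusion Policy
    "Maze2D-Large score": (155.8, 123.0),  # HDP vs. Diffuser
    "AntMaze-Large score": (83.6, 0.0),    # HDP vs. Diffuser
}

if __name__ == "__main__":
    for name, (new, base) in reported.items():
        gap = absolute_gain(new, base)
        rel = f"{100 * relative_gain(new, base):.1f}%" if base else "undefined"
        print(f"{name}: +{gap:.1f} absolute, relative gain {rel}")
```

The AntMaze-Large row shows why absolute gaps are the safer summary when a baseline fails outright.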

5. Limitations and Open Research Questions

Limitations of HDP architectures, both observed empirically and discussed theoretically, include:

Error propagation: High-level planner errors, such as suboptimal NBP or contact predictions, propagate downstream; low-level controllers struggle to recover (Ma et al., 2024, Wang et al., 2024, Grislain et al., 5 Mar 2026).
Compounding imitation error: Supervised BC, while sample-efficient, is susceptible to compounding errors over long horizons (Ma et al., 2024).
Data reliance and coverage: Out-of-distribution states, rare contact modalities, or unrepresented domains challenge performance and generalization, especially for demonstration-driven learning (Wang et al., 2024, Kim et al., 2024).
Computational complexity: Multi-level diffusion and iterative refinement introduce significant computational and inference overhead, mitigated partially via snapshot denoising or speed-optimized schedulers (Wang et al., 2024, Grislain et al., 5 Mar 2026).
Hybrid and modularity issues: Tighter coupling and joint optimization of hierarchy levels, or hybridization with RL paradigms, remain research targets (Ma et al., 2024, Grislain et al., 5 Mar 2026).
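The computational overhead noted above can be made concrete by counting denoiser network evaluations per episode. The following back-of-the-envelope sketch assumes one high-level sampling pass that yields all subgoals and one low-level pass per subgoal. All step counts are hypothetical and not measured from any cited system.

```python
def flat_policy_evals(denoise_steps, num_segments):
    """Network evaluations for a flat diffusion policy that samples once per segment."""
    return denoise_steps * num_segments

def hierarchical_evals(high_steps, low_steps, num_subgoals):
    """Network evaluations for a two-level policy under the stated assumptions:
    one high-level pass, then one low-level pass per subgoal."""
    return high_steps + num_subgoals * low_steps

if __name__ == "__main__":
    # Hypothetical settings: 100 denoising steps at each level, 4 subgoals.
    print("flat:", flat_policy_evals(100, 4))
    print("hierarchical:", hierarchical_evals(100, 100, 4))
    # Cutting low-level steps to 10 (e.g. with a faster sampler) reduces the total.
    print("hierarchical, 10 low-level steps:", hierarchical_evals(100, 10, 4))
```

Under these assumptions the extra cost of the hierarchy is the high-level pass, and reducing the number of denoising steps at either level lowers the total linearly.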

Future directions prioritize integrated RL-based fine-tuning, end-to-end joint optimization, adaptive horizon planning, broader task-structure induction (e.g., Vision-LLM integration), and sim-to-real transfer with safety and robustness constraints (Ma et al., 2024, Grislain et al., 5 Mar 2026, Wang et al., 2024).

6. Theoretical and Practical Impact

The development of HDP marks a shift toward generative, compositional, and structurally informed control policies in sequential decision-making. The fusion of denoising diffusion models with hierarchical policy paradigms offers:

Direct modeling of highly multimodal and complex distributions over subgoals, plans, and skills, overcoming mode-collapse and representational shortcomings of adversarial or retrieval-based skill learning (Kim et al., 2024, Wang et al., 27 May 2025).
Tractable integration of task and motion priors, physical constraints, and context (e.g., object geometry, language) via structured conditioning and uncertainty-aware noise models (Kim et al., 30 Sep 2025, Lu et al., 12 May 2025).
Empirical gains in sample efficiency, generalization, robustness, and interpretability across domains as diverse as robot manipulation, navigation, and skill-based RL.

The broad applicability and modular extensibility of HDPs suggest continued relevance in domains where generative trajectory abstraction, structural composition, and hierarchical intent modeling are essential. The field is rapidly evolving toward integrated architectures combining learned structure, generative stochastic search, and hierarchical planning (Ma et al., 2024, Zeng et al., 26 Sep 2025, Kim et al., 30 Sep 2025, Wang et al., 2024).