Hierarchical Diffusion Policy
- Hierarchical Diffusion Policy is a structured approach that decomposes sequential decision-making into high- and low-level processes using conditional diffusion models for subgoal and action generation.
- It enhances long-horizon planning and sample efficiency by employing Gaussian denoising, GP prior regularization, and hybrid exploration-exploitation methods.
- Empirical studies demonstrate significant performance improvements in robotics and RL benchmarks, highlighting its robust generalization and interpretability.
Hierarchical Diffusion Policy (HDP) refers to a family of policy architectures in sequential decision-making and control that decompose planning or control into multiple levels of temporal or semantic abstraction, with each level represented by (conditional) diffusion models. At each hierarchical level, diffusion models capture multimodal distributions (over subgoals, skills, action chunks, or contact/semantic states), enabling expressive, flexible, and sample-efficient long-horizon reasoning. Contemporary hierarchical diffusion policy variants span reinforcement learning, imitation learning, offline RL, and robotics domains, integrating uncertainty quantification, structured priors, hybrid generative-planning, and multi-scale representations.
1. Core Hierarchical Diffusion Policy Architecture
Hierarchical Diffusion Policy architectures universally adopt a layered decomposition:
- High-level (“planner”, “manager”, “guider”): generates temporally abstract subgoals or semantic chunks, typically at a fixed temporal interval. The high-level policy is modeled as a conditional denoising diffusion process, producing a distribution over subgoals or intermediate states given the current context or observation.
- Low-level (“controller”, “skill” or “actor”): conditioned on the current subgoal, produces primitive action sequences or skill executions, also often realized as a conditional diffusion policy.
For example, in HIDI (Wang et al., 27 May 2025), the high-level policy emits a subgoal at a fixed interval, and the low-level policy executes actions toward that subgoal, receiving an intrinsic reward based on subgoal proximity. Both levels may leverage conditional diffusion models for their denoising/reconstruction process, ensuring multimodal subgoal (and action) distributions with tractable ELBO or score-matching training.
This two-level paradigm is shared and extended by DuSkill (Kim et al., 2024) (offline RL with skill diffusion), Query-Centric Diffusion Policy (query over assembly skills) (Xu et al., 23 Sep 2025), and HDP for kinematics-aware multi-task robotic manipulation (Ma et al., 2024), among others.
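The intrinsic reward that couples the two levels is commonly a negative distance to the current subgoal; a minimal sketch under the absolute-subgoal convention (the exact shaping used in HIDI may differ):

```python
import numpy as np

def intrinsic_reward(state, subgoal):
    """Dense low-level reward: negative Euclidean distance between the
    reached state and the current subgoal (absolute-subgoal convention)."""
    return -float(np.linalg.norm(subgoal - state))

# Moving closer to the subgoal yields a higher (less negative) reward.
r_far = intrinsic_reward(np.zeros(2), np.array([3.0, 4.0]))            # -5.0
r_near = intrinsic_reward(np.array([2.9, 3.9]), np.array([3.0, 4.0]))
```

Relative-subgoal variants instead score the displacement `state' - state` against the subgoal, but the exploitation of a simple geometric distance is the same.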
2. Diffusion Modeling, Training, and Regularization
Each hierarchical level utilizes a denoising diffusion probabilistic model (DDPM), defined by:
- Forward process (noising): Sequentially applies Gaussian noise to the clean target (subgoal, skill, action, or contact state) over $T$ steps, e.g. $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$.
- Reverse process (denoising): Parameterizes the conditional distribution $p_\theta(x_{t-1} \mid x_t, c)$ with a neural denoising network that reconstructs the original target from noisy inputs (cf. equations (1–3) in (Wang et al., 27 May 2025)).
Training loss typically follows a score-matching [Ho et al. 2020]-style objective, $\mathcal{L} = \mathbb{E}_{x_0,\,t,\,\epsilon \sim \mathcal{N}(0,I)}\big[\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2\big]$, with $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. When used in offline learning or imitation, denoising is supervised against relabeled states or expert trajectories; in RL, it is additionally regularized or updated via RL-based objectives.
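The per-level training objective can be sketched as the standard epsilon-prediction loss of Ho et al. (2020), applied to whichever target the level models (subgoal, skill, or action chunk); the linear schedule and toy denoiser below are illustrative stand-ins, not any paper's exact network:

```python
import torch

def ddpm_loss(denoise_net, x0, cond, alphas_bar):
    """Simplified DDPM training step: noise a clean target x0, then
    regress the injected noise from the noisy sample, the sampled
    diffusion step, and the conditioning context."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (B,))          # random diffusion step
    a_bar = alphas_bar[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                           # forward (noising) draw
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # closed-form q(x_t | x_0)
    eps_hat = denoise_net(x_t, t, cond)                  # reverse-process net
    return torch.nn.functional.mse_loss(eps_hat, eps)

# Tiny illustrative setup: linear noise schedule and a toy stand-in denoiser.
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)
toy_net = lambda x_t, t, cond: torch.zeros_like(x_t)
loss = ddpm_loss(toy_net, torch.randn(8, 4), None, alphas_bar)
```

In the hierarchical setting the same loss is instantiated once per level, differing only in the target variable and the conditioning context `cond`.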
HIDI further introduces GP prior regularization: a high-level GP prior with an RBF kernel conditions the diffusion model both during training (cross-entropy loss vs. GP prediction) and at inference (hybrid selection between GP mean and diffusion sample), enforcing uncertainty-awareness and contraction toward high-confidence subgoals [(Wang et al., 27 May 2025), eqs. 9–15].
DuSkill (Kim et al., 2024) and H³DP (Lu et al., 12 May 2025) extend this with multi-head or multi-scale conditioning, classifier-free guidance, and domain-variant/domain-invariant latent separation.
3. Hierarchy Construction, Conditioning, and Adaptivity
Hierarchical Diffusion Policies feature advanced mechanisms for defining the structure and conditioning of each layer:
- Explicit Two-Level Hierarchy: Most implementations (e.g., HIDI (Wang et al., 27 May 2025), HeRD (Caro et al., 10 Dec 2025), HDP (Wang et al., 2024)) split subgoal and action layers at a single timescale.
- Structural/Adaptive Hierarchies: SIHD (Zeng et al., 26 Sep 2025) adaptively partitions the trajectory hierarchy by maximizing structural entropy on a nearest-neighbor state graph, using community detection over a NN graph to set the number and scales of hierarchy layers, and conditioning each diffusion layer on structural information gain.
- Semantic/Skill Hierarchies: QDP (Xu et al., 23 Sep 2025) and VLM-Diffusion (Peschl et al., 29 Sep 2025) utilize semantic decomposition (e.g., object-skill-contact queries, API code), controlling the hierarchy through vision-language models (VLMs) or query selection modules.
- Multi-Scale Visual Hierarchies: H³DP (Lu et al., 12 May 2025) applies depth-aware input layering, multi-scale representation, and coarse-to-fine diffusion stagewise conditioning, each scale targeting different spatiotemporal frequency bands in action-space.
Conditioning mechanisms are correspondingly diverse: classifier-free guidance, semantic mask/latent-based injection, or point-cloud and proprioceptive context. Snapshots, inpainting, and prompt-guided selection are employed to improve sampling efficiency and controllability (Wang et al., 2024, Lu et al., 12 May 2025).
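Classifier-free guidance, one of the conditioning mechanisms above, trains the denoiser with conditioning randomly dropped and then mixes conditional and unconditional predictions at sampling time; a minimal sketch with a hypothetical `eps_model`:

```python
import torch

def cfg_eps(eps_model, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by `guidance_scale`."""
    eps_uncond = eps_model(x_t, t, None)   # conditioning dropped (null token)
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy model: doubles its output when conditioning is present.
toy = lambda x, t, c: x * (2.0 if c is not None else 1.0)
guided = cfg_eps(toy, torch.ones(3), 0, "subgoal", 1.5)  # 1 + 1.5*(2-1) = 2.5
```

Scale 0 recovers the unconditional model, scale 1 the plain conditional model, and scales above 1 sharpen adherence to the conditioning signal (here, the subgoal).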
4. Policy Inference, Subgoal Selection, and Grounding
Policy execution proceeds as follows:
- The high-level diffusion module samples (or deterministically selects, e.g., via GP mean) a subgoal or semantic plan, using context (state, image, text, or task code).
- The low-level controller, typically a separate conditional diffusion process, generates a trajectory segment or action chunk attempting to reach the subgoal.
- Execution can be open-loop over the chunk duration, with optional replanning if task success is not achieved.
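The three execution steps above can be sketched as a generic rollout loop; `high_policy`, `low_policy`, and the toy environment are placeholders for the method-specific components, not any paper's actual interfaces:

```python
def run_episode(env, high_policy, low_policy, chunk_len, max_steps):
    """Two-level rollout: the high level (re)plans a subgoal every
    `chunk_len` steps; within a chunk the low level runs open-loop,
    conditioned on that subgoal."""
    obs = env.reset()
    for step in range(max_steps):
        if step % chunk_len == 0:          # high-level replanning boundary
            subgoal = high_policy(obs)     # diffusion sample (or GP mean)
        obs, done = env.step(low_policy(obs, subgoal))
        if done:
            break
    return obs

# Toy 1-D environment: success once the state reaches 5.
class ToyEnv:
    def reset(self): self.s = 0; return self.s
    def step(self, a): self.s += a; return self.s, self.s >= 5

final = run_episode(ToyEnv(), lambda o: 5, lambda o, g: 1, chunk_len=3, max_steps=20)
```

Closed-loop variants simply move the `high_policy` call inside every step or trigger it on a failure signal instead of a fixed boundary.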
Hybrid selection strategies, such as in HIDI (Wang et al., 27 May 2025), stochastically interpolate between deterministic GP mean and learned diffusion samples to balance exploration (via sampling) and exploitation (via mean), with theoretically bounded regret and non-decreasing improvement guarantees. In offline RL (SIHD (Zeng et al., 26 Sep 2025)) and skill composition (DuSkill (Kim et al., 2024)), hierarchically pre-trained/frozen decoders support transfer and adaptability.
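HIDI's hybrid GP-mean/diffusion-sample rule can be approximated by an uncertainty-gated mixture; the gating function below is an illustrative stand-in, not the paper's exact rule:

```python
import numpy as np

def select_subgoal(gp_mean, gp_std, diffusion_sample, rng, kappa=1.0):
    """Hybrid exploration-exploitation: pick the stochastic diffusion
    sample with a probability that grows with GP posterior uncertainty,
    otherwise exploit the deterministic GP mean."""
    p_explore = 1.0 - np.exp(-kappa * float(np.mean(gp_std)))
    return diffusion_sample if rng.random() < p_explore else gp_mean

rng = np.random.default_rng(0)
# With zero posterior uncertainty the GP mean is always exploited.
g = select_subgoal(np.zeros(2), np.zeros(2), np.ones(2), rng)
```

The qualitative behavior matches the exploration-exploitation trade-off described above: confident regions contract toward the GP mean, uncertain regions fall back on the multimodal diffusion sampler.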
Recent approaches tackle planner-controller mismatch via iterative on-policy refinement (HD-ExpIt (Grislain et al., 5 Mar 2026)), performing supervised distillation of successful rollouts back into hierarchical diffusion models, thus aligning subgoal distribution with controller feasibility.
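The filter-and-distill loop of such refinement schemes can be sketched generically; `rollout`, `fit`, and the success flag are hypothetical interfaces, not HD-ExpIt's actual API:

```python
def refine(hier_policy, env, n_rollouts, n_rounds):
    """Iterative on-policy refinement: roll out the hierarchical policy,
    keep successful trajectories, and distill them back into the model
    (supervised), aligning subgoals with controller feasibility."""
    for _ in range(n_rounds):
        trajs = [hier_policy.rollout(env) for _ in range(n_rollouts)]
        successes = [t for t in trajs if t["success"]]
        if successes:
            hier_policy.fit(successes)     # supervised distillation step
    return hier_policy

# Toy stand-in policy that records the distillation data it receives.
class ToyPolicy:
    def __init__(self): self.data = []
    def rollout(self, env): return {"success": True}
    def fit(self, trajs): self.data.extend(trajs)

p = refine(ToyPolicy(), env=None, n_rollouts=4, n_rounds=2)
```

Because only controller-achievable rollouts survive the filter, repeated rounds shift the high-level subgoal distribution toward what the low level can actually execute.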
Prompt guidance and open-loop subgoal override enable direct, fine-grained user or module control over the planning process, enhancing interpretability and recoverability (Wang et al., 2024).
5. Empirical Results and Performance Characteristics
Hierarchical Diffusion Policy variants demonstrate state-of-the-art or highly competitive performance across a variety of domains:
- Goal-conditioned continuous control (HIDI (Wang et al., 27 May 2025)): Outperforms HIRO, HRAC, HIGL, SAGA, and HLPS in success rate and sample efficiency on MuJoCo Reacher, Pusher, Point-Maze, and Ant Maze (including sparse and image-based variants). Ablations reveal significant degradation (roughly 15–20 success-rate points) if diffusion or GP regularization is removed.
- Skill transfer/robustness (DuSkill (Kim et al., 2024)): Under domain shifts, DuSkill sustains only a 7.6% reward drop versus 25% for the best baseline, and attains 89% higher reward under online RL adaptation compared to SPiRL-c.
- Contact-rich and assembly robotics (HDP (Wang et al., 2024), QDP (Xu et al., 23 Sep 2025)): Subgoal-guided hierarchical diffusion policies achieve a +20% absolute improvement over the flat Diffusion Policy and a 50% improvement in challenging assembly skills (insert/screw) versus baselines without query conditioning.
- Kinematics-aware manipulation (HDP+RK-Diffuser (Ma et al., 2024)): Achieves 80.2% overall RLBench success, with significant gains over ACT and vanilla Diffusion Policy. The pose-joint coupling via Jacobian loss enables higher accuracy and full kinematics compliance.
- Long-horizon planning and feedback coupling (CHD (Hao et al., 12 May 2025), SIHD (Zeng et al., 26 Sep 2025)): CHD yields higher reward and faster inference versus both uncoupled (SHD, BHD) and flat baselines, while SIHD shows greater trajectory diversity and robustness in offline RL benchmarks, especially in sparse-reward and multi-task settings.
- Triply-hierarchical approaches (H³DP (Lu et al., 12 May 2025)): Deliver +27.5% relative success rate gain over best prior diffusion baseline across 44 simulated tasks and +32% in real-world bimanual robotics, with robust generalization to out-of-distribution objects.
Empirical analyses reveal that hierarchical diffusion orchestrates coarse-to-fine planning, enforces subgoal feasibility, and improves interpretability and sample efficiency.
6. Theoretical Properties, Limitations, and Open Challenges
Hierarchical Diffusion Policies are supported by guarantees on sample-based regret (e.g., Theorems 3.3, 3.4 in (Wang et al., 27 May 2025)), variational bound tightness (CHD, (Hao et al., 12 May 2025)), and subgoal alignment via feedback or hybrid selection mechanisms.
Common limitations include:
- Hyperparameter tuning (diffusion steps, GP/prior scale, hierarchy depth) remains empirical; N=5 diffusion steps and GP-mean mixing are near-optimal in HIDI (Wang et al., 27 May 2025).
- Most approaches rely on fixed or manually-determined hierarchy; data-driven or adaptive segmentation (as in SIHD (Zeng et al., 26 Sep 2025)) is an active research direction.
- Real-world deployment may be sensitive to perception noise, unmodeled dynamics, or OOD conditions (see (Wang et al., 2024, Xu et al., 23 Sep 2025)). Safety-aware or robust diffusion planning is an open area.
- Iterative on-policy refinement and hierarchical-coupled feedback (HD-ExpIt (Grislain et al., 5 Mar 2026), CHD (Hao et al., 12 May 2025)) increase data and compute requirements.
Despite these, the demonstrated capacity for flexible multimodal abstraction, sample-efficient skill generation, and robust planning across complex domains establishes Hierarchical Diffusion Policy as a foundational paradigm in modern sequential decision-making research.