Triply-Hierarchical Diffusion Policy (H3DP)

Updated 9 March 2026

The paper presents a visuomotor framework that couples depth-aware input layering, multi-scale feature extraction, and a hierarchically conditioned diffusion process for action generation, achieving a +27.5% average relative improvement over baselines across 44 simulated tasks.
The three-level hierarchy coordinates visual perception and action control, strengthening semantic alignment between visual features and actions and yielding a +32.3% success-rate gain on challenging bimanual real-world tasks.
Experiments report a 75.6% average success rate in simulation and robust real-world performance in dual-arm manipulation.

Triply-Hierarchical Diffusion Policy (H$^{3}$DP) is a visuomotor learning framework for robotic manipulation that couples perception and action through a three-level hierarchy: (1) depth-aware input layering of RGB-D images, (2) multi-scale visual feature representations, and (3) a hierarchically conditioned diffusion process for action generation. H$^{3}$DP is designed to strengthen semantic alignment between visual features and action outputs by explicitly structuring perception and control in a coordinated fashion. This architecture achieves significant relative improvements in manipulation success rates over previous diffusion policy baselines in both simulation and real-world dual-arm settings, including a +27.5% average relative improvement across 44 simulated benchmarks and +32.3% on challenging bimanual real-world tasks (Lu et al., 12 May 2025).

1. Hierarchical Architecture

H$^{3}$DP introduces a three-level hierarchy coupling visual input and action generation:

Depth-Aware Input Layering: RGB-D observations are split into $N+1$ discrete layers along the depth axis, stratifying the scene into foreground and background sub-images. Each pixel’s depth $d$ is mapped to a layer $m \in \{0, \ldots, N\}$ using the partition:

$m = \left\lfloor -0.5 + 0.5 \sqrt{1 + 4(N+1)(N+2)\frac{d - d_{\min}}{d_{\max} - d_{\min} + \epsilon}} \right\rfloor \tag{1}$

This operation creates $N+1$ RGB-D images $\{I_m\}_{m=0}^{N}$ , spatially organizing the visual input by geometric structure.
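The layer assignment in Eq. (1) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `depth_to_layer` and `split_rgbd` are hypothetical, $d_{\min}$ and $d_{\max}$ are assumed to be taken per image, and pixels outside a layer are assumed to be zeroed out.

```python
import numpy as np

def depth_to_layer(depth, N, eps=1e-6):
    """Map each depth value to a layer index m in {0, ..., N} following Eq. (1)."""
    # Assumption: d_min and d_max are computed per image.
    d_min, d_max = depth.min(), depth.max()
    x = (depth - d_min) / (d_max - d_min + eps)  # normalized depth in [0, 1)
    m = np.floor(-0.5 + 0.5 * np.sqrt(1.0 + 4.0 * (N + 1) * (N + 2) * x))
    # The clip only guards against floating-point rounding at the far end.
    return np.clip(m, 0, N).astype(np.int64)

def split_rgbd(rgbd, N):
    """Return N+1 masked copies of an (H, W, 4) RGB-D image, one per depth layer."""
    layers = depth_to_layer(rgbd[..., 3], N)
    # Assumption: pixels that do not belong to layer m are set to zero.
    return [rgbd * (layers == m)[..., None] for m in range(N + 1)]
```

Solving Eq. (1) for the layer boundaries gives a normalized depth of $m(m+1)/((N+1)(N+2))$ at the start of layer $m$, so layer widths grow linearly with $m$. For $N=3$ the boundaries fall at 0, 0.1, 0.3, and 0.6 of the depth range, giving thinner layers at small depth.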

Multi-Scale Visual Representations: Each depth-layer image $I_m$ is independently encoded at $K$ spatial resolutions. Let $f_{m,k}\in\mathbb{R}^{h_k\times w_k\times C}$ denote the feature map for layer $m$ and scale $k$ . Feature vectors are quantized using a learnable $V$ -entry codebook $\mathcal{Z}_m$ ,

$f_{m,k}^{(i,j)} \leftarrow \arg\min_{z\in\mathcal{Z}_m} \|f_{m,k}^{(i,j)}-z\|_2 \tag{2}$
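The nearest-neighbor lookup in Eq. (2) can be sketched as a brute-force distance computation. The function name `quantize` and the array layout are assumptions made for illustration; a trained encoder would also need a gradient path through the quantization step, which this NumPy sketch does not model.

```python
import numpy as np

def quantize(features, codebook):
    """Replace each C-dim feature vector with its nearest codebook entry (Eq. 2).

    features: (h, w, C) feature map; codebook: (V, C) array of code vectors.
    Returns the quantized map and the (h, w) array of chosen code indices.
    """
    h, w, C = features.shape
    flat = features.reshape(-1, C)
    # Squared L2 distance between every feature vector and every code: (h*w, V)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx].reshape(h, w, C), idx.reshape(h, w)
```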

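Multi-scale features are bilinearly upsampled to a common base size $(h_K, w_K)$ and convolved for cross-scale consistency, where $\mathrm{interp}$ denotes the bilinear upsampling and $\phi_{m,k'}$ the convolution applied at scale $k'$: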

$\hat f_{m,k} = \sum_{k'=1}^k \phi_{m,k'}\left(\mathrm{interp}(f_{m,k'},h_K,w_K)\right)$

The encoder is trained via a bidirectional consistency loss ensuring scale-aligned representations, where $\mathrm{sg}(\cdot)$ is the stop-gradient operator and $\beta$ weights the second term:

$\mathcal{L}_{\rm consistency} = \sum_{m=0}^N \sum_{k=1}^K \left\{ \|\hat f_{m,k}-\mathrm{sg}(f_m)\|_2^2 + \beta \|f_m-\mathrm{sg}(\hat f_{m,k})\|_2^2 \right\} \tag{3}$

Hierarchically Conditioned Diffusion Process: Policy generation uses a denoising diffusion model with $T$ steps. The forward process corrupts the clean action with cosine-scheduled noise, giving the noisy action $a^t\in\mathbb{R}^D$ at diffusion step $t$; the reverse (denoising) process is hierarchically conditioned on the multi-scale features. It partitions the timesteps into $K$ intervals; for $t$ in interval $(\tau_{k-1},\tau_k]$, the policy conditions on the features $\hat f_k=\{\hat f_{m,k}\}_{m=0}^N$ and the proprioceptive embedding $q$:

$\hat \epsilon = \epsilon_\theta^{(t)}\left(a^t \mid \hat f_k, q\right)$

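$a^{t-1}$ is then computed from $\hat\epsilon$ via a standard DPM update. Writing $F$ for the full set of multi-scale features and $F^{(c_t)}$ for the subset selected at step $t$ (the features $\hat f_k$ of the interval containing $t$), the overall joint action distribution is:

$p_\theta(a^{0:T}\mid F) = p(a^T)\prod_{t=T}^{1} p_\theta\left(a^{t-1}\mid a^t, F^{(c_t)}, q\right) \tag{4}$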
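The selection of a feature scale from the denoising timestep can be sketched with a sorted-boundary lookup. The even split of timesteps in `interval_boundaries` is an assumption (the source does not state how the $\tau_k$ are chosen), and both function names are hypothetical; only the half-open interval convention $(\tau_{k-1},\tau_k]$ comes from the text.

```python
from bisect import bisect_left

def interval_boundaries(T, K):
    """Assumed even split of timesteps 1..T into K intervals: tau_1 < ... < tau_K = T."""
    return [round(T * k / K) for k in range(1, K + 1)]

def scale_index(t, taus):
    """Return k such that t lies in (tau_{k-1}, tau_k], with tau_0 = 0."""
    if not 1 <= t <= taus[-1]:
        raise ValueError("t must be in 1..T")
    # bisect_left finds the first boundary >= t, which closes the interval holding t.
    return bisect_left(taus, t) + 1

taus = interval_boundaries(T=100, K=4)   # [25, 50, 75, 100]
schedule = [scale_index(t, taus) for t in range(100, 0, -1)]  # reverse process order
```

Because the reverse process runs from $t=T$ down to $t=1$, `schedule` starts at scale index $K$ and ends at 1 under this convention.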

2. Training Objectives and Optimization

The optimization objective jointly trains the visual encoder, codebooks, and diffusion denoiser network, with $\alpha_{\rm cons}$ weighting the consistency term:

$\mathcal{L} = \mathcal{L}_{\rm diffusion} + \alpha_{\rm cons} \mathcal{L}_{\rm consistency}$

with the denoising loss given by the standard noise-prediction (denoising score matching) objective, weighted per timestep by $\gamma_t$:

$\mathcal{L}_{\rm diffusion} = \mathbb{E}_{a^0,\;\epsilon\sim\mathcal{N}(0,I),\;t} \left[ \gamma_t\,\left\|\epsilon_\theta^{(t)}\left(\sqrt{\alpha_t}a^0 + \sqrt{1-\alpha_t}\epsilon \mid \hat f_K,q\right) - \epsilon \right\|_2^2 \right] \tag{5}$
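A Monte Carlo estimate of Eq. (5) can be sketched as below. Several details are assumptions rather than facts from the source: $\alpha_t$ is treated as the cumulative signal level of the widely used cosine schedule with offset `s = 0.008`, timesteps are sampled uniformly, and `denoiser` is a hypothetical callable standing in for $\epsilon_\theta$ that receives the conditioning as an opaque `cond` argument.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level for t = 0..T under the cosine schedule (assumed form)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def diffusion_loss(a0, denoiser, cond, alpha_bar, rng, gamma=None):
    """Estimate Eq. (5) for a batch of clean actions a0 of shape (B, D)."""
    B = a0.shape[0]
    T = len(alpha_bar) - 1
    t = rng.integers(1, T + 1, size=B)            # one timestep per sample, in 1..T
    eps = rng.standard_normal(a0.shape)           # target noise
    ab = alpha_bar[t][:, None]
    a_t = np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps   # forward noising
    w = np.ones(B) if gamma is None else gamma[t]      # per-timestep weight gamma_t
    err = ((denoiser(a_t, t, cond) - eps) ** 2).sum(axis=1)
    return float((w * err).mean())
```

A denoiser that recovers the injected noise exactly drives this estimate to zero, which is a convenient sanity check when wiring up a training loop.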

Key hyperparameters include diffusion steps $T\in\{50,100\}$ (inference: $T_{\rm inf}\approx20$), depth layers $N\in\{3,4\}$, $K=4$ feature scales at spatial resolutions $(1,1),\ (3,3),\ (5,5),\ (7,7)$, AdamW optimizer with learning rate $10^{-4}$, batch size $128$, cosine learning rate scheduling, and weight decay $10^{-6}$.
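For reference, the reported settings could be collected in a single configuration object, as in this sketch. The class and field names are hypothetical; only the values come from the text, and where a range is reported one option is picked arbitrarily and noted in a comment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class H3DPConfig:
    # Field names are hypothetical; values are those reported above.
    train_diffusion_steps: int = 100      # T; reported as 50 or 100
    inference_steps: int = 20             # T_inf; reported as approximately 20
    depth_layers_n: int = 3               # N; reported as 3 or 4 (gives N + 1 images)
    feature_scales: tuple = ((1, 1), (3, 3), (5, 5), (7, 7))  # K = 4 resolutions
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    batch_size: int = 128
    lr_schedule: str = "cosine"
    weight_decay: float = 1e-6

cfg = H3DPConfig()
```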

3. Experimental Evaluation

Simulated Benchmarks: H$^{3}$DP is evaluated across 44 simulated manipulation tasks spanning MetaWorld (Medium 11, Hard 5, Hard++ 5), ManiSkill (4 deformable, 4 rigid), Adroit (3), DexArt (4), and RoboTwin (8). Baselines include Diffusion Policy (DP), DP with depth, and DP3 (point-cloud). The key metric is success rate, averaged over 3 seeds and 20 evaluation episodes per seed.

Method	DP	DP (w/ depth)	DP3	H$^{3}$DP
Avg. Success (%)	48.1	52.8	59.3	75.6

H$^{3}$DP achieves a +27.5% relative improvement over DP3, the strongest baseline.
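The relative improvement is the gain divided by the baseline value, not the difference in percentage points. The short sketch below applies that formula to the averages in the table above; the helper name `relative_improvement` is illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Average simulation success rates (%) from the table above.
avg_success = {"DP": 48.1, "DP (w/ depth)": 52.8, "DP3": 59.3, "H3DP": 75.6}

for name in ("DP", "DP (w/ depth)", "DP3"):
    gain = relative_improvement(avg_success["H3DP"], avg_success[name])
    print(f"H3DP vs {name}: +{gain:.1f}%")
```

The DP3 row evaluates to (75.6 - 59.3) / 59.3, which rounds to the reported +27.5%.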

Real-World Evaluation: On the Galaxea R1 dual-arm robot, four long-horizon manipulation tasks are used (20 trials each). H$^{3}$DP outperforms DP by +32.3% on average:

Task	DP	H$^{3}$DP
CF	13	51
PJ	6	47
PB	38	52
ST	57	65
Avg.	28.5	51.3

Instance-level generalization (e.g., on Place Bottle and Sweep Trash with varying objects) yields a +15.4% improvement. Success rates indicate robust transfer to real-world settings (Lu et al., 12 May 2025).

4. Component Ablations and Analysis

Ablations measure the impact of each hierarchical module:

Variant	Avg. success (%)
w/o depth layering	46.5
w/o multi-scale representation	48.7
w/o hierarchical action generation	49.0
Full H$^{3}$DP	59.6

Performance degrades substantially when any component is removed, indicating that all three hierarchical levels contribute significantly. A depth-layer count of $N=3$ or $4$ performs best; over-partitioning ($N>4$) reduces performance, though it remains above the non-layered baseline. Spectral analysis of action trajectories shows that low-frequency components are recovered in early denoising steps (coarse), while high-frequency details are added later, which empirically supports the coarse-to-fine generation design.
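One way to run such a spectral check is to measure how much of a trajectory's energy sits in its lowest frequency bins. The sketch below is an assumed procedure, not the analysis used in the paper: the function name, the choice of `cutoff_bin`, and the removal of the DC term are all illustrative choices.

```python
import numpy as np

def low_freq_energy_fraction(traj, cutoff_bin):
    """Fraction of non-DC spectral energy in frequency bins 1..cutoff_bin.

    traj: (H, D) action sequence over H timesteps and D action dimensions.
    """
    spec = np.abs(np.fft.rfft(traj, axis=0)) ** 2   # (H//2 + 1, D) power spectrum
    spec = spec[1:]                                  # drop the DC (mean) term
    total = spec.sum()
    return float(spec[:cutoff_bin].sum() / total) if total > 0 else 0.0

# Synthetic example: a slow and a fast sinusoid over a 64-step horizon.
steps = np.arange(64)
slow = np.sin(2 * np.pi * 1 * steps / 64)[:, None]
fast = np.sin(2 * np.pi * 20 * steps / 64)[:, None]
```

Applied to the intermediate action estimates at successive denoising steps, a fraction that starts high and decreases would match the coarse-to-fine behavior described above.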

5. Implementation and Practical Considerations

The visual encoder uses a VQGAN-style convolutional network per depth layer with codebook size $V=512$ and channel dimension $C=128$. The denoiser $\epsilon_\theta$ is a U-Net with approximately $2$ million parameters that jointly processes the noisy action $a^t$, the visual feature $\hat f_k$, and the MLP-encoded proprioceptive embedding $q$. Training uses 4 $\times$ NVIDIA V100 GPUs for approximately 1 million steps per benchmark, with wall times around 48 hours. Inference runs at $\approx$ 12 fps in simulation and $\approx$ 24 fps in real-world deployment (using asynchronous action queueing).

Runtime strategies for real hardware include asynchronous inference-execution decoupling, temporal ensembling of action sequences, and "p-masking" during early training to promote vision-dependent control.
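Temporal ensembling can be sketched as a weighted average over every action chunk that covers the current control step. The source does not specify the weighting, so the exponential weights below (with the oldest prediction weighted most for positive `decay`) are an assumption, as are the function name and the dictionary layout of `chunks`.

```python
import numpy as np

def temporal_ensemble(chunks, step, decay=0.01):
    """Blend all predictions made for control step `step`.

    chunks: dict mapping the step at which a chunk was predicted to a
            (horizon, D) array of actions for steps start, start + 1, ...
    Weights are exp(-decay * i), with i = 0 for the oldest prediction
    (an assumed scheme; the source does not specify the weighting).
    """
    preds = [chunk[step - start] for start, chunk in sorted(chunks.items())
             if 0 <= step - start < len(chunk)]
    if not preds:
        raise ValueError("no prediction covers this step")
    w = np.exp(-decay * np.arange(len(preds)))
    return (w[:, None] * np.stack(preds)).sum(axis=0) / w.sum()

# Two overlapping 4-step chunks predicted at steps 0 and 1.
chunks = {0: np.zeros((4, 2)), 1: np.ones((4, 2))}
blended = temporal_ensemble(chunks, step=1, decay=0.0)   # plain mean of both
```

With `decay=0.0` the blend is a plain average; larger values favor older predictions, which smooths the executed trajectory at the cost of slower reaction to new observations.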

6. Context and Broader Significance

H$^{3}$DP advances visuomotor policy learning by explicitly integrating depth geometry, semantic multi-scale vision, and hierarchical action generation. The structured coupling is empirically validated to outperform depth-concatenated and point-cloud counterparts in both simulation and physical bimanual manipulation, with substantial generalization to unseen object configurations. The design and results support stronger integration of hierarchical representations and generative modeling in robotic policy architectures (Lu et al., 12 May 2025).

References

Lu et al. (12 May 2025). H$^{3}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning.
