
Triply-Hierarchical Diffusion Policy (H3DP)

Updated 9 March 2026
  • The paper presents a visuomotor framework that couples depth-aware input layering, multi-scale feature extraction, and a hierarchically conditioned diffusion process for action generation, achieving a +27.5% average relative improvement over baselines in simulation.
  • The methodology employs a three-level hierarchical design that coordinates visual perception and action control, leading to enhanced semantic alignment and a +32.3% success rate boost in challenging bimanual tasks.
  • Experimental results demonstrate significant advancements with a 75.6% average simulation success rate and robust real-world performance in dual-arm manipulation settings.

Triply-Hierarchical Diffusion Policy (H$^{3}$DP) is a visuomotor learning framework for robotic manipulation that couples perception and action through a three-level hierarchy: (1) depth-aware input layering of RGB-D images, (2) multi-scale visual feature representations, and (3) a hierarchically conditioned diffusion process for action generation. H$^{3}$DP is designed to strengthen semantic alignment between visual features and action outputs by explicitly structuring perception and control in a coordinated fashion. This architecture achieves significant relative improvements in manipulation success rates over previous diffusion policy baselines in both simulation and real-world dual-arm settings, including a +27.5% average relative improvement across 44 simulated benchmarks and +32.3% on challenging bimanual real-world tasks (Lu et al., 12 May 2025).

1. Hierarchical Architecture

H$^{3}$DP introduces a triple hierarchy coupling visual input and action generation:

  1. Depth-Aware Input Layering: RGB-D observations are split into $N+1$ discrete layers along the depth axis, stratifying the scene into foreground and background sub-images. Each pixel’s depth $d$ is mapped to a layer $m \in \{0, \ldots, N\}$ using the partition:

$$m = \left\lfloor -0.5 + 0.5 \sqrt{1 + 4(N+1)(N+2)\frac{d - d_{\min}}{d_{\max} - d_{\min} + \epsilon}} \right\rfloor \tag{1}$$

This operation creates $N+1$ RGB-D images $\{I_m\}_{m=0}^{N}$, spatially organizing the visual input by geometric structure.

  2. Multi-Scale Visual Representations: Each depth-layer image $I_m$ is independently encoded at $K$ spatial resolutions. Let $f_{m,k}\in\mathbb{R}^{h_k\times w_k\times C}$ denote the feature map for layer $m$ and scale $k$. Feature vectors are quantized using a learnable $V$-entry codebook $\mathcal{Z}_m$:

$$f_{m,k}^{(i,j)} \leftarrow \arg\min_{z\in\mathcal{Z}_m} \|f_{m,k}^{(i,j)}-z\|_2 \tag{2}$$

Multi-scale features are bilinearly upsampled to a common base size and convolved for cross-scale consistency:

$$\hat f_{m,k} = \sum_{k'=1}^{k} \phi_{m,k'}\left(\mathrm{interp}(f_{m,k'},h_K,w_K)\right)$$

The encoder is trained via a bidirectional consistency loss that encourages scale-aligned representations, where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator and $\beta$ weights the second term:

$$\mathcal{L}_{\rm consistency} = \sum_{m=0}^{N} \sum_{k=1}^{K} \left\{ \|\hat f_{m,k}-\mathrm{sg}(f_m)\|_2^2 + \beta \|f_m-\mathrm{sg}(\hat f_{m,k})\|_2^2 \right\} \tag{3}$$

  3. Hierarchically Conditioned Diffusion Process: Policy generation uses a denoising diffusion model of length $T$. At diffusion step $t$, action $a^t\in\mathbb{R}^D$ is produced via a forward process with cosine-scheduled noise and a reverse process hierarchically conditioned on the multi-scale features. The reverse (denoising) process partitions timesteps into $K$ intervals; at $t$ in interval $(\tau_{k-1},\tau_k]$, the policy conditions on features $\hat f_k=\{\hat f_{m,k}\}_{m=0}^{N}$:

$$\hat \epsilon = \epsilon_\theta^{(t)}\left(a^t \mid \hat f_k, q\right)$$

The next iterate $a^{t-1}$ is then computed via a standard DPM update. The overall joint action distribution is:

$$p_\theta(a^{0:T}\mid F) = p(a^{T})\prod_{t=T}^{1} p_\theta\left(a^{t-1}\mid a^{t}, F^{(c_t)}, q\right) \tag{4}$$

2. Training Objectives and Optimization

The optimization objective jointly trains the visual encoder, codebooks, and diffusion denoiser network:

$$\mathcal{L} = \mathcal{L}_{\rm diffusion} + \alpha_{\rm cons} \mathcal{L}_{\rm consistency}$$

with the denoising loss given by standard score matching:

$$\mathcal{L}_{\rm diffusion} = \mathbb{E}_{a^0,\;\epsilon\sim\mathcal{N}(0,I),\;t} \left[ \gamma_t\,\left\|\epsilon_\theta^{(t)}\left(\sqrt{\alpha_t}\,a^0 + \sqrt{1-\alpha_t}\,\epsilon \mid \hat f_K,q\right) - \epsilon \right\|_2^2 \right] \tag{5}$$
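A single Monte Carlo sample of the bracketed term in Eq. (5) can be written out directly. The sketch below assumes that $\alpha_t$ is the cumulative signal coefficient (as the form of Eq. (5) suggests) and treats the denoiser as an arbitrary callable; `noised_action`, `diffusion_loss_term`, and the `denoiser(a_t, t, cond)` signature are hypothetical names chosen for illustration.

```python
import math
import random


def noised_action(a0, eps, alpha_t):
    """Forward-process sample sqrt(alpha_t) * a0 + sqrt(1 - alpha_t) * eps."""
    s, r = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [s * a + r * e for a, e in zip(a0, eps)]


def diffusion_loss_term(denoiser, a0, alpha_t, t, cond, gamma_t=1.0, rng=random):
    """One sample of gamma_t * ||eps_theta(a_t | cond) - eps||^2 from Eq. (5)."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]      # eps ~ N(0, I)
    a_t = noised_action(a0, eps, alpha_t)        # noisy action at step t
    eps_hat = denoiser(a_t, t, cond)             # predicted noise
    return gamma_t * sum((p - e) ** 2 for p, e in zip(eps_hat, eps))


# Toy usage: a denoiser that always predicts zero noise (placeholder only).
def zero_denoiser(a_t, t, cond):
    return [0.0] * len(a_t)


loss = diffusion_loss_term(zero_denoiser, [0.5, -1.0], alpha_t=0.64, t=10,
                           cond=None, rng=random.Random(0))
print(loss >= 0.0)
```

Averaging this quantity over sampled actions, noise draws, and timesteps gives the expectation in Eq. (5); in a real training loop the same computation would run on batched tensors.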

Key hyperparameters include diffusion steps $T\in\{50,100\}$ (inference: $T_{\rm inf}\approx 20$), depth layers $N\in\{3,4\}$, $K=4$ feature scales at spatial resolutions $(1,1)$, $(3,3)$, $(5,5)$, $(7,7)$, the AdamW optimizer with learning rate $10^{-4}$, batch size $128$, cosine learning rate scheduling, and weight decay $10^{-6}$.
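For reference, a common form of cosine learning-rate scheduling is sketched below. The source does not specify warmup, a minimum learning rate, or the schedule length, so `TOTAL_STEPS`, `min_lr`, and the absence of warmup are all assumptions made here for illustration.

```python
import math

BASE_LR = 1e-4           # learning rate listed above
TOTAL_STEPS = 1_000_000  # assumed schedule length, for illustration only


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine-annealed learning rate at a given optimizer step (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step))
```

The rate starts at the base value, passes through half of it at the midpoint, and decays to `min_lr` at the end of the schedule.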

3. Experimental Evaluation

Simulated Benchmarks: H$^{3}$DP is evaluated across 44 simulated manipulation tasks spanning MetaWorld (Medium 11, Hard 5, Hard++ 5), ManiSkill (4 deformable, 4 rigid), Adroit (3), DexArt (4), and RoboTwin (8). Baselines include Diffusion Policy (DP), DP with depth, and DP3 (point-cloud). The key metric is success rate, averaged over 3 seeds and 20 evaluation episodes per seed.

| Method | DP | DP (w/ depth) | DP3 | H$^{3}$DP |
|---|---|---|---|---|
| Avg. Success (%) | 48.1 | 52.8 | 59.3 | 75.6 |

H$^{3}$DP achieves a +27.5% relative improvement over DP3 (75.6% vs. 59.3% average success).
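The relative-improvement figure follows directly from the average success rates in the table above. A short check, using a hypothetical helper `relative_improvement`:

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Average simulation success rates (%) as reported in the table above.
avg_success = {"DP": 48.1, "DP (w/ depth)": 52.8, "DP3": 59.3}
h3dp = 75.6

for name, baseline in avg_success.items():
    print(f"{name}: +{relative_improvement(h3dp, baseline):.1f}%")
```

Against DP3 this gives (75.6 - 59.3) / 59.3, or about 27.5%; the same formula yields larger relative gains over the two image-based DP variants.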

Real-World Evaluation: On the Galaxea R1 dual-arm robot, four long-horizon manipulation tasks are used (20 trials each). H$^{3}$DP outperforms DP by +32.3% on average:

| Task | DP | H$^{3}$DP |
|---|---|---|
| CF | 13 | 51 |
| PJ | 6 | 47 |
| PB | 38 | 52 |
| ST | 57 | 65 |
| Avg. | 28.5 | 51.3 |

Instance-level generalization (e.g., on Place Bottle and Sweep Trash with varying objects) yields a +15.4% improvement. Success rates indicate robust transfer to real-world settings (Lu et al., 12 May 2025).

4. Component Ablations and Analysis

Ablations measure the impact of each hierarchical module:

| Configuration | Avg. success (%) |
|---|---|
| w/o depth layering | 46.5 |
| w/o multi-scale repr. | 48.7 |
| w/o hierarchical act. | 49.0 |
| Full H$^{3}$DP | 59.6 |

Performance degrades substantially when any component is removed, indicating that all three hierarchical levels contribute significantly. A depth-layer count of $N=3$ or $4$ performs best; over-partitioning ($N>4$) reduces performance, which nonetheless remains above the non-layered baseline. Spectral analysis of action trajectories shows that low-frequency components are recovered in early denoising steps (coarse), while high-frequency details are added later, empirically supporting the coarse-to-fine generation design.
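The exact spectral-analysis procedure is not described here. One plausible way to quantify "low-frequency first, high-frequency later" is to measure how much of a trajectory's spectral energy lies above a cutoff frequency. The sketch below uses a plain discrete Fourier transform on a one-dimensional action trajectory; the function names, the cutoff, and the synthetic signals are illustrative assumptions.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the one-sided DFT of a real sequence (bins 0..n//2)."""
    n = len(x)
    return [abs(sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n)))
            for k in range(n // 2 + 1)]


def high_frequency_fraction(x, cutoff):
    """Share of non-DC spectral energy in frequency bins >= cutoff."""
    energy = [m * m for m in dft_magnitudes(x)[1:]]  # drop the DC bin
    total = sum(energy)
    return sum(energy[cutoff - 1:]) / total if total else 0.0


n = 32
# Synthetic stand-ins: a smooth trajectory, and the same one with added detail.
coarse = [math.sin(2 * math.pi * j / n) for j in range(n)]
refined = [c + 0.3 * math.sin(2 * math.pi * 8 * j / n) for j, c in enumerate(coarse)]

print(high_frequency_fraction(coarse, cutoff=4))
print(high_frequency_fraction(refined, cutoff=4))
```

Applied to intermediate denoising iterates, a rising high-frequency fraction over the course of denoising would be consistent with the coarse-to-fine behavior described above.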

5. Implementation and Practical Considerations

The visual encoder uses a VQGAN-style convolutional network per depth layer with codebook size $V=512$ and channel dimension $C=128$. The denoiser $\epsilon_\theta$ is a U-Net with approximately $2$ million parameters that jointly processes the noisy action $a^t$, the visual feature $\hat f_k$, and an MLP-encoded proprioceptive embedding $q$. Training uses $4\times$ NVIDIA V100 GPUs for approximately 1 million steps per benchmark, with wall times of around 48 hours. Inference runs at $\approx 12$ fps in simulation and $\approx 24$ fps in real-world deployment (using asynchronous action queueing).
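The codebook lookup of Eq. (2) amounts to a nearest-neighbor search over the $V$ codebook entries for every spatial feature vector. A minimal pure-Python sketch follows, with a toy codebook ($V=3$, $C=2$) standing in for the real $V=512$, $C=128$ setting; the function names are hypothetical, and a practical encoder would perform this on batched tensors.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` in Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def quantize_map(feature_map, codebook):
    """Replace each spatial feature vector with its nearest codebook entry."""
    return [[codebook[nearest_code(f, codebook)] for f in row] for row in feature_map]


# Toy example: a 1 x 2 feature map with 2 channels and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
feature_map = [[[0.9, 0.8], [1.8, 0.1]]]
print(quantize_map(feature_map, codebook))  # [[[1.0, 1.0], [2.0, 0.0]]]
```

Minimizing the squared distance selects the same entry as minimizing the $\ell_2$ norm in Eq. (2), so the square root can be skipped.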

Runtime strategies for real hardware include asynchronous inference-execution decoupling, temporal ensembling of action sequences, and "p-masking" during early training to promote vision-dependent control.
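Temporal ensembling generally means blending the overlapping predictions that successive action chunks make for the same control step. The source does not give the weighting scheme, so the exponential weights below (and the `decay` parameter) are an assumption borrowed from common practice, not a description of the authors' code.

```python
import math


def temporal_ensemble(predictions, decay=0.1):
    """Weighted average of action vectors predicted for the same control step.

    `predictions` is ordered oldest first. The weights exp(-decay * i) are an
    assumed scheme; decay = 0 reduces to a plain mean.
    """
    weights = [math.exp(-decay * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[j] for w, p in zip(weights, predictions)) / total
            for j in range(dim)]


# Three overlapping predictions of a 2-D action for the same step.
preds = [[0.0, 0.0], [2.0, 4.0], [1.0, 2.0]]
print(temporal_ensemble(preds, decay=0.0))  # [1.0, 2.0]
print(temporal_ensemble(preds, decay=0.5))
```

Smoothing of this kind trades a little responsiveness for less jitter between consecutive action chunks, which matters when inference and execution run asynchronously.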

6. Context and Broader Significance

H$^{3}$DP advances visuomotor policy learning by explicitly integrating depth geometry, semantic multi-scale vision, and hierarchical action generation. The structured coupling is empirically validated to outperform depth-concatenated and point-cloud counterparts in both simulation and physical bimanual manipulation, with substantial generalization to unseen object configurations. The design and results support stronger integration of hierarchical representations and generative modeling in robotic policy architectures (Lu et al., 12 May 2025).

