Triply-Hierarchical Diffusion Policy (H3DP)
- The paper presents a visuomotor framework that couples depth-aware input layering, multi-scale feature extraction, and a hierarchically conditioned diffusion process for action generation, achieving a +27.5% average relative improvement over baselines across 44 simulated tasks.
- The three-level hierarchical design coordinates visual perception and action control, strengthening semantic alignment between visual features and actions and yielding a +32.3% improvement on challenging real-world bimanual tasks.
- Experiments report a 75.6% average success rate in simulation and robust real-world performance in dual-arm manipulation settings.
Triply-Hierarchical Diffusion Policy (H3DP) is a visuomotor learning framework for robotic manipulation that couples perception and action through a three-level hierarchy: (1) depth-aware input layering of RGB-D images, (2) multi-scale visual feature representations, and (3) a hierarchically conditioned diffusion process for action generation. H3DP is designed to strengthen semantic alignment between visual features and action outputs by explicitly structuring perception and control in a coordinated fashion. The architecture improves manipulation success rates over previous diffusion policy baselines in both simulation and real-world dual-arm settings, including a +27.5% average relative improvement across 44 simulated benchmarks and +32.3% on challenging bimanual real-world tasks (Lu et al., 12 May 2025).
1. Hierarchical Architecture
H3DP introduces a three-level hierarchy that couples visual input and action generation:
- Depth-Aware Input Layering: RGB-D observations are split into discrete layers along the depth axis, stratifying the scene into foreground and background sub-images. Each pixel is assigned to a layer according to the depth interval that contains its depth value. This operation creates one RGB-D image per layer, spatially organizing the visual input by geometric structure.
- Multi-Scale Visual Representations: Each depth-layer image is independently encoded at several spatial resolutions, yielding one feature map per layer and scale. Feature vectors are quantized against a learnable codebook. The multi-scale features are bilinearly upsampled to a common base size and convolved for cross-scale consistency. The encoder is trained with a bidirectional consistency loss that keeps the representations aligned across scales.
- Hierarchically Conditioned Diffusion Process: Policy generation uses a denoising diffusion model with a fixed number of steps. The forward process adds cosine-scheduled noise to the action, and the reverse (denoising) process is hierarchically conditioned on the multi-scale features: the timesteps are partitioned into intervals, and at each timestep the policy conditions on the features assigned to the interval that contains it. Each denoised action is computed via a standard DPM update, and the joint action distribution factorizes over these conditioned denoising steps.
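The three levels above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the interior depth boundaries, the Euclidean nearest-neighbor codebook lookup, and the equal-width, coarse-first split of denoising steps are all assumptions made for the example.

```python
from bisect import bisect_right
from math import dist  # Python 3.8+


def layer_index(depth, boundaries):
    """Index of the depth interval containing `depth`.

    `boundaries` holds the interior cut points in increasing order, so
    N - 1 boundaries define N layers (0 = nearest, N - 1 = farthest).
    The boundary values are assumed, not taken from the paper.
    """
    return bisect_right(boundaries, depth)


def split_into_layers(rgbd, boundaries):
    """Split an H x W image of (r, g, b, d) pixels into N layer images.

    A pixel is copied into the layer selected by its depth and left as
    zeros in every other layer.
    """
    n_layers = len(boundaries) + 1
    empty = (0, 0, 0, 0.0)
    layers = [[[empty for _ in row] for row in rgbd] for _ in range(n_layers)]
    for y, row in enumerate(rgbd):
        for x, pixel in enumerate(row):
            layers[layer_index(pixel[3], boundaries)][y][x] = pixel
    return layers


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`."""
    index = min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))
    return index, codebook[index]


def scale_for_step(step, total_steps, n_scales):
    """Map a denoising step to the feature scale that conditions it.

    Steps run from total_steps (pure noise) down to 1. They are cut into
    n_scales equal intervals (an assumption): the noisiest interval uses
    scale 0 (coarsest), the final interval uses scale n_scales - 1 (finest).
    """
    if not 1 <= step <= total_steps:
        raise ValueError("step must lie in [1, total_steps]")
    return (total_steps - step) * n_scales // total_steps
```

For instance, `split_into_layers(image, [0.5, 1.0])` would produce three layer images (near, middle, far), and `scale_for_step(k, 100, 4)` moves from the coarsest to the finest scale as `k` counts down from 100 to 1, matching the coarse-to-fine ordering described in the text.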
2. Training Objectives and Optimization
The optimization objective jointly trains the visual encoder, codebooks, and diffusion denoiser network, with the denoising term given by a standard score-matching loss.
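A common way to realize a score-matching denoising loss is noise prediction with a mean squared error, which is the form sketched below. The specific cosine schedule (with a small offset of 0.008), the noise-prediction parameterization, and all function names are assumptions for illustration; the source only states that the noise is cosine-scheduled and that the loss is standard.

```python
import math
import random


def cosine_alpha_bar(step, total_steps, offset=0.008):
    """Cumulative signal fraction for a cosine noise schedule.

    Equals 1 at step 0 (clean action) and decays towards 0 at total_steps.
    """
    def f(t):
        angle = (t / total_steps + offset) / (1 + offset) * math.pi / 2
        return math.cos(angle) ** 2
    return f(step) / f(0)


def add_noise(action, step, total_steps, rng):
    """Forward process: blend a clean action vector with Gaussian noise."""
    alpha_bar = cosine_alpha_bar(step, total_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
             for a, e in zip(action, noise)]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)
```

In training, a denoiser would receive the noisy action, the step index, and the conditioning features, and its output would be passed to `denoising_loss` together with the injected noise returned by `add_noise`.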
Key hyperparameters include the number of diffusion steps (with a separate setting at inference), the number of depth layers, the number of feature scales and their spatial resolutions, the AdamW learning rate and weight decay, a batch size of 128, and cosine learning-rate scheduling.
3. Experimental Evaluation
Simulated Benchmarks: H3DP is evaluated across 44 simulated manipulation tasks spanning MetaWorld (Medium 11, Hard 5, Hard++ 5), ManiSkill (4 deformable, 4 rigid), Adroit (3), DexArt (4), and RoboTwin (8). Baselines include Diffusion Policy (DP), DP with depth, and DP3 (point-cloud). The key metric is success rate, averaged over 3 seeds with 20 evaluation episodes per seed.
| Method | DP | DP (w/ depth) | DP3 | H3DP |
|---|---|---|---|---|
| Avg. Success (%) | 48.1 | 52.8 | 59.3 | 75.6 |
H3DP achieves a +27.5% relative improvement over DP3, the strongest baseline (75.6% vs. 59.3% average success).
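The relative improvement quoted above follows directly from the averages in the table. The short check below recomputes it; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Average success rates (%) copied from the table above.
avg_success = {"DP": 48.1, "DP (w/ depth)": 52.8, "DP3": 59.3, "H3DP": 75.6}

for name in ("DP", "DP (w/ depth)", "DP3"):
    gain = relative_improvement(avg_success["H3DP"], avg_success[name])
    print(f"H3DP vs {name}: +{gain:.1f}%")
```

Against DP3 this gives (75.6 - 59.3) / 59.3, which rounds to the reported 27.5%. The figure is a relative gain; the absolute gap over DP3 is 16.3 percentage points.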
Real-World Evaluation: On the Galaxea R1 dual-arm robot, four long-horizon manipulation tasks are used (20 trials each). H3DP outperforms DP by +32.3% on average:
| Task | DP | H3DP |
|---|---|---|
| CF | 13 | 51 |
| PJ | 6 | 47 |
| PB | 38 | 52 |
| ST | 57 | 65 |
| Avg. | 28.5 | 51.3 |
Instance-level generalization (e.g., on Place Bottle and Sweep Trash with varying objects) yields a +15.4% improvement. Success rates indicate robust transfer to real-world settings (Lu et al., 12 May 2025).
4. Component Ablations and Analysis
Ablations measure the impact of each hierarchical module:
| Variant | Avg. success (%) |
|---|---|
| w/o depth layering | 46.5 |
| w/o multi-scale repr. | 48.7 |
| w/o hierarchical act. | 49.0 |
| Full H3DP | 59.6 |
Performance degrades substantially when any component is removed, indicating that all three hierarchical levels contribute significantly. A moderate number of depth layers performs best (4 is among the best settings); over-partitioning into more layers reduces performance but remains above the non-layered baseline. Spectral analysis of action trajectories shows that low-frequency components are recovered in early denoising steps (coarse), while high-frequency details are added later, empirically supporting the coarse-to-fine generation design.
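One simple way to run this kind of spectral check is to take a discrete Fourier transform of a one-dimensional action trajectory and measure how much of its energy lies in the lowest frequency bins. The sketch below does this on synthetic signals; the trajectories, the cutoff bin, and the function names are assumptions for illustration and do not reproduce the paper's analysis.

```python
import cmath
import math


def power_spectrum(signal):
    """Squared DFT magnitudes for frequency bins 0 .. len(signal) // 2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def low_frequency_fraction(signal, cutoff_bin):
    """Share of non-DC spectral energy that lies in bins 1 .. cutoff_bin."""
    power = power_spectrum(signal)[1:]  # drop the DC (mean) term
    total = sum(power)
    return sum(power[:cutoff_bin]) / total if total > 0 else 0.0


# Synthetic stand-ins: a slow motion, then the same motion plus a small
# fast correction, mimicking an early and a late denoising output.
n = 64
coarse = [math.sin(2 * math.pi * t / n) for t in range(n)]
fine = [c + 0.2 * math.sin(2 * math.pi * 12 * t / n)
        for t, c in enumerate(coarse)]

print(low_frequency_fraction(coarse, cutoff_bin=4))
print(low_frequency_fraction(fine, cutoff_bin=4))
```

Under the coarse-to-fine picture, intermediate actions from early denoising steps would resemble `coarse` (almost all energy below the cutoff), and the low-frequency share would fall slightly as later steps add detail, as with `fine`.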
5. Implementation and Practical Considerations
The visual encoder uses a VQGAN-style convolutional network per depth layer, with a fixed codebook size and channel dimension. The denoiser is a U-Net with approximately 2 million parameters that jointly processes the noisy action, the visual feature, and an MLP embedding of the proprioceptive state. Training uses 4 NVIDIA V100 GPUs for approximately 1 million steps per benchmark, with wall times of around 48 hours. Inference runs at 12 fps in simulation and 24 fps in real-world deployment (using asynchronous action queueing).
Runtime strategies for real hardware include asynchronous inference-execution decoupling, temporal ensembling of action sequences, and "p-masking" during early training to promote vision-dependent control.
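Temporal ensembling can be illustrated with a small sketch: when several predicted action chunks overlap at the current control step, their proposals for that step are blended into one command. The exponential weighting below follows the scheme popularized by ACT-style policies and is an assumption here, as are the function name, the `decay` parameter, and the data layout; the source does not specify how H3DP weights overlapping predictions.

```python
import math


def temporal_ensemble(predictions, current_step, decay=0.01):
    """Blend every stored prediction for `current_step` into one action.

    `predictions` maps the step at which a chunk was predicted to that
    chunk, a list of action vectors for steps start, start + 1, ...
    Candidates are ordered oldest first and weighted exp(-decay * i),
    so older predictions count slightly more (assumed weighting).
    """
    candidates = []
    for start in sorted(predictions):
        offset = current_step - start
        chunk = predictions[start]
        if 0 <= offset < len(chunk):
            candidates.append(chunk[offset])
    if not candidates:
        raise ValueError("no prediction covers the requested step")
    weights = [math.exp(-decay * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    return [sum(w * c[d] for w, c in zip(weights, candidates)) / total
            for d in range(dim)]


# Two overlapping chunks of 1-D actions, predicted at steps 0 and 1.
chunks = {0: [[0.0], [1.0], [2.0]], 1: [[1.2], [2.2], [3.2]]}
print(temporal_ensemble(chunks, current_step=1))
```

With asynchronous inference, new chunks would be added to `chunks` as they arrive while the controller keeps calling `temporal_ensemble` at its own rate, which smooths the transitions between successive predictions.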
6. Context and Broader Significance
H3DP advances visuomotor policy learning by explicitly integrating depth geometry, semantic multi-scale vision, and hierarchical action generation. The structured coupling outperforms depth-concatenated and point-cloud counterparts in both simulation and physical bimanual manipulation, with substantial generalization to unseen object configurations. The design and results support stronger integration of hierarchical representations and generative modeling in robotic policy architectures (Lu et al., 12 May 2025).