- The paper introduces Hierarchical Diffusion Policy (HDP), a novel imitation learning method that uses a two-layer diffusion model (Guider and Actor) to improve performance and controllability, particularly in contact-rich robot manipulation tasks.
- HDP features technical innovations like snapshot gradient optimization, 3D conditioning, and prompt guidance, leading to superior interpretability and allowing human intervention via specified contact objectives.
- Evaluations demonstrate HDP significantly outperforms Diffusion Policy by an average of 20.8% across various simulated and real-world tasks, showcasing its robustness with different objects and operations.
This paper introduces Hierarchical Diffusion Policy (HDP), a novel imitation learning method for generating robot manipulation trajectories, especially in contact-rich tasks. HDP addresses the limitations of end-to-end policies such as Diffusion Policy, which struggle with contact-rich tasks and offer limited controllability.
HDP employs a hierarchical approach, dividing the policy into two layers: a high-level policy (Guider) and a low-level policy (Actor). The Guider predicts the contact position for the robot's next object manipulation based on 3D information, while the Actor predicts the action sequence required to reach the predicted contact, using observations and contact information as input. Both policies are modeled as conditional denoising diffusion processes. The Actor is optimized using a combination of behavioral cloning and Q-learning, enabling it to accurately guide actions towards the desired contact while learning the conditional probability distribution of actions.
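Below is a minimal, runnable sketch of this two-layer structure at inference time, assuming a plain DDPM-style sampler. The toy modules, dimensions, and the `contact_override` hook for prompt guidance are illustrative stand-ins under those assumptions, not the paper's actual architecture or implementation.

```python
import torch
import torch.nn as nn

T, HORIZON, ACT_DIM, FEAT_DIM = 50, 16, 6, 64

class NoisePredictor(nn.Module):
    """Toy stand-in for a conditional denoiser: predicts the noise added to its input."""
    def __init__(self, x_dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + cond_dim + 1, 128),
                                 nn.ReLU(), nn.Linear(128, x_dim))
    def forward(self, x, cond, t):
        t_feat = torch.full((x.shape[0], 1), t / T)
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def denoise_step(x, eps, t):
    """One simplified DDPM reverse step (noise term dropped at t == 0)."""
    mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean

@torch.no_grad()
def hdp_step(guider, actor, scene_feat, contact_override=None):
    """One decision step: Guider proposes a contact, Actor denoises actions toward it."""
    if contact_override is not None:
        contact = contact_override                   # prompt guidance: human-specified contact
    else:
        contact = torch.randn(1, 3)                  # Guider denoises a 3D contact position
        for t in reversed(range(T)):
            contact = denoise_step(contact, guider(contact, scene_feat, t), t)
    actions = torch.randn(1, HORIZON * ACT_DIM)      # Actor denoises a flattened action sequence
    cond = torch.cat([scene_feat, contact], dim=-1)  # same scene features + (predicted) contact
    for t in reversed(range(T)):
        actions = denoise_step(actions, actor(actions, cond, t), t)
    return contact, actions.view(1, HORIZON, ACT_DIM)

guider = NoisePredictor(3, FEAT_DIM)
actor = NoisePredictor(HORIZON * ACT_DIM, FEAT_DIM + 3)
scene_feat = torch.randn(1, FEAT_DIM)                # stand-in for point-cloud features, encoded once
contact, actions = hdp_step(guider, actor, scene_feat)
```

Passing a `contact_override` skips the Guider entirely, which is how the sketch mirrors human intervention through specified contacts; the Actor's conditioning is otherwise unchanged.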
The paper highlights three key advantages of HDP: superior performance from breaking the task down, greater interpretability thanks to the hierarchical structure, and stronger controllability, allowing human intervention by specifying objective contact positions.
To further improve HDP, the paper presents three technical contributions:
- Snapshot gradient optimization: A faster training method that computes the gradient at only one timestep of the diffusion process per iteration, in contrast to prior work, which computes it at every timestep (see the sketch after this list).
- 3D conditioning: Improves the policy's perception by using 3D point clouds to represent the state. The point cloud representation is extracted once and used throughout the denoising iterations.
- Prompt guidance: Allows manual specification of objective contacts to generate customized operation trajectories, improving human-machine interaction.
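The following is a minimal, self-contained sketch of the single-timestep idea behind snapshot gradient optimization, assuming a standard DDPM noise-prediction objective: each iteration draws one random timestep per sample, forms a one-step estimate of the clean action there, and backpropagates both a behavioral-cloning term and a critic-style guidance term through that single denoiser call instead of unrolling every timestep. The `TinyDenoiser`, `critic`, shapes, and loss weight are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 50
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Toy epsilon-predictor conditioned on observation features and timestep."""
    def __init__(self, x_dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + cond_dim + 1, 128),
                                 nn.ReLU(), nn.Linear(128, x_dim))
    def forward(self, x, cond, t):
        return self.net(torch.cat([x, cond, (t.float() / T).unsqueeze(-1)], dim=-1))

def snapshot_loss(denoiser, critic, clean_actions, cond, q_weight=0.1):
    """Diffusion BC loss plus a critic term, both evaluated at one sampled timestep."""
    b = clean_actions.shape[0]
    t = torch.randint(0, T, (b,))                          # the single "snapshot" timestep
    ab = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(clean_actions)
    noisy = torch.sqrt(ab) * clean_actions + torch.sqrt(1.0 - ab) * noise
    eps = denoiser(noisy, cond, t)
    bc = F.mse_loss(eps, noise)                            # behavioral-cloning term
    x0_hat = (noisy - torch.sqrt(1.0 - ab) * eps) / torch.sqrt(ab)   # one-step clean estimate
    guidance = -critic(torch.cat([x0_hat, cond], dim=-1)).mean()     # Q-style guidance term
    return bc + q_weight * guidance

# Usage with toy data: features from a 3D encoder would be computed once and passed as `cond`.
denoiser = TinyDenoiser(x_dim=6, cond_dim=64)
critic = nn.Sequential(nn.Linear(6 + 64, 128), nn.ReLU(), nn.Linear(128, 1))
loss = snapshot_loss(denoiser, critic, torch.randn(32, 6), torch.randn(32, 64))
loss.backward()
```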
HDP was evaluated across six different tasks and showed significant performance improvements over Diffusion Policy, with an average improvement of 20.8%. Real-world experiments demonstrate that HDP can handle both rigid and deformable objects. The authors also provide a detailed analysis of the algorithm's characteristics and design decisions.
The evaluation includes comparisons with LSTM-GMM, IBC, and BET, in addition to Diffusion Policy. HDP was tested in both simulated and real-world environments, with action-space dimensions ranging from 2-DOF to 6-DOF, prehensile and non-prehensile operations, and rigid as well as deformable objects. Training demonstrations were collected both by single users and by multiple users. The results indicate that HDP outperforms existing methods in success rate and robustness.
The paper also investigates the importance of phased objective contacts, a 3D vision encoder, and Q-learning action steps. Additional benefits of HDP include its ability to decompose multi-modal action distributions and its framework for prompt guidance.