Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient (2406.00681v1)

Published 2 Jun 2024 in cs.LG

Abstract: Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns from scratch multimodal policies parameterized as diffusion models while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, also showcasing proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

The paper "Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient" introduces Deep Diffusion Policy Gradient (DDiffPG), an innovative actor-critic algorithm designed for the online training of multimodal policies using diffusion models. This methodology is notable for its ability to explore, discover, and maintain diverse behavioral modes, addressing the constraints faced by traditional deep reinforcement learning (RL) algorithms in multimodal policy learning.

Key Contributions

  1. Diffusion Policy Gradient: The core advancement is the diffusion policy gradient technique, which enables efficient training of diffusion models against RL objectives. The method computes action gradients from the critic, iteratively updates target actions, and fits the diffusion policy to those targets with a behavioral-cloning objective, keeping training stable and avoiding the vanishing gradients that arise when backpropagating through the full denoising chain (a minimal sketch of this update follows the list).
  2. Mode Discovery through Clustering: DDiffPG autonomously discovers distinct behavioral modes by applying hierarchical clustering to collected trajectories, with exploration driven by novelty-based intrinsic motivation so that diverse regions of the state space are visited (a clustering sketch is given after this list).
  3. Mode-Specific Q-Learning: To prevent mode collapse, where a greedy RL policy converges to a single mode, DDiffPG maintains mode-specific Q-functions. Each mode is optimized independently, and the diffusion policy is trained on a multimodal batch assembled across modes (see the mode-conditioned TD-target sketch after this list).
  4. Mode Control with Latent Embeddings: The final contribution is conditioning the diffusion policy on mode-specific embeddings, allowing explicit control over which behavioral mode to execute. This is particularly beneficial in non-stationary environments, enabling efficient rerouting and performance optimization by selecting the most appropriate mode during execution.
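
To make the diffusion policy gradient concrete, here is a minimal sketch of one actor update, assuming a PyTorch-style diffusion policy with `q_sample` (forward noising) and `denoise` (noise prediction) methods; these interfaces and the hyperparameters are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch of a diffusion-policy-gradient actor update (interfaces are assumed).
import torch
import torch.nn.functional as F

def ddiffpg_actor_update(policy, q_net, states, action_lr=0.05):
    # 1) Sample actions from the current diffusion policy; no gradient is taken
    #    through the denoising chain itself.
    with torch.no_grad():
        actions = policy.sample(states)

    # 2) Compute the critic's action gradient and take a small ascent step to
    #    obtain target actions with higher estimated Q-value.
    actions = actions.clone().requires_grad_(True)
    q_values = q_net(states, actions)
    action_grad = torch.autograd.grad(q_values.sum(), actions)[0]
    target_actions = (actions + action_lr * action_grad).detach().clamp(-1.0, 1.0)

    # 3) Fit the diffusion policy to the improved actions with a standard
    #    denoising (behavioral-cloning) loss.
    t = torch.randint(0, policy.num_timesteps, (states.shape[0],), device=states.device)
    noise = torch.randn_like(target_actions)
    noisy_actions = policy.q_sample(target_actions, t, noise)   # forward diffusion
    pred_noise = policy.denoise(noisy_actions, t, cond=states)  # conditional noise prediction
    return F.mse_loss(pred_noise, noise)
```

Only the denoiser receives gradients from the returned loss, which is what keeps the gradient path short and stable.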
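
For the clustering step, the following sketch shows one way modes could be assigned with off-the-shelf agglomerative clustering from scikit-learn; representing each trajectory by its final state is a simplifying assumption, and a precomputed trajectory-distance matrix could be substituted.

```python
# Illustrative mode discovery: cluster rollout trajectories into behavior modes.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def discover_modes(trajectories, distance_threshold=5.0):
    """Assign a mode label to each trajectory.

    `trajectories` is a list of (T_i, state_dim) arrays. Featurizing each
    trajectory by its final state is purely illustrative.
    """
    features = np.stack([traj[-1] for traj in trajectories])
    clustering = AgglomerativeClustering(
        n_clusters=None,            # let the threshold decide the number of modes
        distance_threshold=distance_threshold,
        linkage="average",
    )
    return clustering.fit_predict(features)
```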
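
Finally, a hypothetical sketch of how mode-specific critics and mode embeddings could combine when forming TD targets; the dictionaries `q_targets` and `mode_embeddings`, the batch keys, and the `cond_embedding` argument are assumed bookkeeping, not the paper's API.

```python
# Hypothetical mode-specific TD targets with a mode-embedding-conditioned policy.
import torch

def mode_specific_td_targets(q_targets, policy, mode_embeddings, batch, gamma=0.99):
    """Compute one TD target per transition, using the critic of its mode.

    `q_targets`: dict mapping mode id -> target critic network.
    `mode_embeddings`: dict mapping mode id -> embedding tensor of shape (d,).
    `batch`: dict of tensors with keys "mode", "reward", "done", "next_state".
    """
    targets = torch.empty_like(batch["reward"])
    for mode, q_tgt in q_targets.items():
        mask = batch["mode"] == mode
        if not mask.any():
            continue
        next_states = batch["next_state"][mask]
        emb = mode_embeddings[mode].unsqueeze(0).expand(next_states.shape[0], -1)
        with torch.no_grad():
            # The policy is conditioned on the mode embedding, so each mode's
            # target uses actions drawn from that mode's own behavior.
            next_actions = policy.sample(next_states, cond_embedding=emb)
            next_q = q_tgt(next_states, next_actions).squeeze(-1)
        targets[mask] = batch["reward"][mask] + gamma * (1.0 - batch["done"][mask]) * next_q
    return targets
```

At execution time the same embedding mechanism lets the user pick a specific mode, for example switching to the embedding of an alternative path when an unseen obstacle blocks the current one.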

Empirical Validation

Through extensive experiments, DDiffPG demonstrated its efficacy in mastering multimodal behaviors across high-dimensional continuous control tasks, including a series of AntMaze navigation problems and complex robotic manipulation scenarios. Key observations include:

  • Multimodal Behavior Mastery: DDiffPG consistently discovered and utilized multiple modes to solve tasks, contrasting sharply with baseline methods that often collapsed to a single mode. In AntMaze-v3, the algorithm successfully learned and executed all four viable paths to the two goals.
  • Exploration Advantage: The intrinsic motivation-driven exploration allowed DDiffPG to cover more state space than other methods, as evidenced by significantly higher state coverage rates and diverse behavior density maps.
  • Robustness to Local Minima: Unlike baseline algorithms, which often got trapped in easier, suboptimal solutions, DDiffPG was able to explore alternative paths and optimize for higher returns. This feature was clearly illustrated in AntMaze-v2, where the algorithm navigated to both the easier and the more rewarding but challenging goal locations.

Implications and Future Directions

The implications of DDiffPG are multifaceted. Practically, this approach enhances the ability of RL agents to operate in dynamically changing environments by leveraging the learned multimodal behaviors. Theoretically, it advocates for a shift towards explicitly discovering and optimizing diverse solutions within the RL framework.

Future work could extend this methodology to larger environments that pose significant exploration challenges. Further refinement of the distance metric used for clustering and the integration of more sophisticated intrinsic-motivation mechanisms could enhance scalability. Moreover, as diffusion models continue to improve in computational efficiency, real-time applications, particularly in robotic control, could greatly benefit from adopting DDiffPG.

In conclusion, the paper paves the way for a new paradigm in the online training of multimodal policies using diffusion models. By addressing the limitations of traditional deep RL and offering robust solutions for maintaining diverse behaviors, DDiffPG stands as a significant advancement in the field of policy learning.

Authors (6)
  1. Zechu Li (7 papers)
  2. Rickmer Krohn (1 paper)
  3. Tao Chen (397 papers)
  4. Anurag Ajay (15 papers)
  5. Pulkit Agrawal (103 papers)
  6. Georgia Chalvatzaki (44 papers)
Citations (4)