Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization (2405.16173v3)
Abstract: Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies, and providing the agent with enhanced exploration capabilities. However, existing works mainly focus on the application of diffusion policies in offline RL, while their incorporation into online RL is less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL due to the unavailability of 'good' actions. This leads to difficulties in conducting diffusion policy improvement. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. To fulfill these conditions, the Q-weight transformation functions are introduced for general scenarios. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, the QVPO algorithm leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo benchmarks. The final results demonstrate that QVPO achieves state-of-the-art performance on both cumulative reward and sample efficiency.
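The idea of a Q-weighted variational loss can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the variational bound is approximated by the usual noise-prediction (denoising) error, and that the weight transformation only needs to produce non-negative weights. The function names `q_weights`, `forward_noise`, and `q_weighted_denoising_loss`, and the clipped-advantage transformation, are hypothetical choices made for illustration.

```python
import numpy as np


def q_weights(q_values, baseline):
    """Hypothetical Q-weight transformation (an assumption, not the paper's):
    subtract a baseline and clip at zero so every weight is non-negative."""
    return np.maximum(np.asarray(q_values, dtype=float) - baseline, 0.0)


def forward_noise(actions, alpha_bar_t, rng):
    """Standard DDPM forward step: mix clean actions with Gaussian noise."""
    eps = rng.standard_normal(actions.shape)
    noisy = np.sqrt(alpha_bar_t) * actions + np.sqrt(1.0 - alpha_bar_t) * eps
    return noisy, eps


def q_weighted_denoising_loss(eps_true, eps_pred, weights):
    """Per-sample squared noise-prediction error, weighted by the Q-weights."""
    per_sample = np.sum((eps_true - eps_pred) ** 2, axis=-1)
    return float(np.mean(weights * per_sample))


# Toy usage: two sampled actions with critic values 1.0 and 3.0, baseline 2.0.
weights = q_weights([1.0, 3.0], baseline=2.0)  # only the second action gets weight
eps_true = np.array([[1.0, 0.0], [0.0, 2.0]])
eps_pred = np.zeros_like(eps_true)             # stand-in for a network output
loss = q_weighted_denoising_loss(eps_true, eps_pred, weights)
```

In this toy setup, actions whose value falls below the baseline contribute nothing to the loss, so the denoising objective is driven only by actions the critic rates as comparatively good.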
Authors:
- Shutong Ding
- Ke Hu
- Zhenhao Zhang
- Kan Ren
- Weinan Zhang
- Jingyi Yu
- Jingya Wang
- Ye Shi