Learning a Diffusion Model Policy from Rewards via Q-Score Matching (2312.11752v4)
Abstract: Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning, owing to their natural ability to optimize over an expressive class of distributions on a continuous space. However, previous works fail to exploit the score-based structure of diffusion models and instead train the actor with a simple behavior cloning term, which limits their ability in the actor-critic setting. In this paper, we present a theoretical framework that links the score of a diffusion model policy to the action gradient of a learned Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and policies converged through Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare it to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm.
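To make the central idea concrete, the following is a minimal toy sketch, not the authors' implementation. It regresses a score model onto the action gradient of a critic and then samples actions by following that score. Everything here is an assumption made for illustration: the critic `q_value` is a fixed analytic function rather than a learned Q-function, the score model is linear rather than a denoising network, the scale `alpha` and the Langevin-style sampler are hypothetical choices, and all function names are invented.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical toy critic (assumption): largest when action equals state.
    return -0.5 * np.sum((action - state) ** 2, axis=-1)

def q_action_grad(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return state - action

def fit_score(states, actions, alpha=1.0, lr=0.1, steps=500):
    # Linear score model psi(s, a) = [s, a, 1] @ W (assumption), trained by
    # gradient descent on the squared error to alpha * grad_a Q(s, a).
    feats = np.concatenate(
        [states, actions, np.ones((len(states), 1))], axis=1
    )
    target = alpha * q_action_grad(states, actions)
    W = np.zeros((feats.shape[1], actions.shape[1]))
    for _ in range(steps):
        residual = feats @ W - target
        W -= lr * feats.T @ residual / len(states)
    return W

def sample_action(W, state, rng, n_steps=200, step=0.05, noise=0.1):
    # Langevin-style sampling (assumption): start from Gaussian noise and
    # repeatedly move along the learned score, adding scaled Gaussian noise.
    action = rng.normal(size=state.shape)
    for _ in range(n_steps):
        feat = np.concatenate([state, action, [1.0]])
        drift = feat @ W
        jitter = noise * np.sqrt(2.0 * step) * rng.normal(size=action.shape)
        action = action + step * drift + jitter
    return action

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(256, 1))
actions = rng.uniform(-1.0, 1.0, size=(256, 1))
W = fit_score(states, actions)
print(sample_action(W, np.array([0.3]), rng))
```

In this toy setting the fitted score points from the current action toward the action that maximizes the critic, so sampled actions concentrate near that maximizer, with the `noise` parameter controlling how much they spread around it.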