
DiffCPS: Diffusion Model based Constrained Policy Search for Offline Reinforcement Learning

Published 9 Oct 2023 in cs.LG (arXiv:2310.05333v2)

Abstract: Constrained policy search (CPS) is a fundamental problem in offline reinforcement learning, which is generally solved by advantage weighted regression (AWR). However, previous methods may still encounter out-of-distribution actions due to the limited expressivity of Gaussian-based policies. Meanwhile, directly applying state-of-the-art expressive generative models (i.e., diffusion models) within the AWR framework is infeasible, since AWR requires exact policy probability densities, which are intractable for diffusion models. In this paper, we propose a novel approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed DiffCPS), which solves diffusion-based constrained policy search via the primal-dual method. The theoretical analysis reveals that strong duality holds for diffusion-based CPS problems, and upon introducing parameter approximation, an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual iterations, where $\epsilon$ denotes the representation ability of the parametrized policy. Extensive experimental results on the D4RL benchmark demonstrate the efficacy of our approach. We empirically show that DiffCPS achieves better or at least competitive performance compared to traditional AWR-based baselines as well as recent diffusion-based offline RL methods. The code is now available at https://github.com/felix-thu/DiffCPS.
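The primal-dual scheme the abstract refers to can be illustrated on a toy constrained optimization problem: alternate gradient ascent on the policy parameters and on the dual variable of the constraint. This is a generic primal-dual gradient sketch, not the DiffCPS implementation; the objective, constraint, and step sizes below are invented for illustration.

```python
# Toy primal-dual gradient method for a CPS-style problem:
#   maximize f(x) = -(x - 2)^2   subject to   g(x) = x^2 <= 1
# DiffCPS applies the same alternating scheme with a diffusion policy in
# place of x and a behavior-distribution constraint in place of g.

def primal_dual(iters=20000, lr_x=0.01, lr_lam=0.05):
    x, lam = 0.0, 0.0  # primal variable and dual (Lagrange) multiplier
    for _ in range(iters):
        # Gradient ascent on the Lagrangian L(x, lam) = f(x) - lam * (g(x) - 1)
        grad_x = -2.0 * (x - 2.0) - lam * 2.0 * x
        x += lr_x * grad_x
        # Dual ascent on the constraint violation, projected onto lam >= 0
        lam = max(0.0, lam + lr_lam * (x * x - 1.0))
    return x, lam

x_star, lam_star = primal_dual()
print(x_star, lam_star)  # converges near the constraint boundary: x = 1, lam = 1
```

The dual variable grows while the constraint is violated and pushes the primal iterate back toward the feasible set; at the KKT point the unconstrained optimum x = 2 is traded off against the constraint, giving x = 1 with multiplier lam = 1.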
