Policy-Guided Diffusion (2404.06356v1)

Published 9 Apr 2024 in cs.LG, cs.AI, and cs.RO

Abstract: In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy. Such a setting naturally leads to distribution shift between the behavior policy and the target policy being trained - requiring policy conservatism to avoid instability and overestimation bias. Autoregressive world models offer a different solution to this by generating synthetic, on-policy experience. However, in practice, model rollouts must be severely truncated to avoid compounding error. As an alternative, we propose policy-guided diffusion. Our method uses diffusion models to generate entire trajectories under the behavior distribution, applying guidance from the target policy to move synthetic experience further on-policy. We show that policy-guided diffusion models a regularized form of the target distribution that balances action likelihood under both the target and behavior policies, leading to plausible trajectories with high target policy probability, while retaining a lower dynamics error than an offline world model baseline. Using synthetic experience from policy-guided diffusion as a drop-in substitute for real data, we demonstrate significant improvements in performance across a range of standard offline reinforcement learning algorithms and environments. Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.

Authors (6)
  1. Matthew Thomas Jackson (7 papers)
  2. Michael Tryfan Matthews (1 paper)
  3. Cong Lu (23 papers)
  4. Benjamin Ellis (12 papers)
  5. Shimon Whiteson (122 papers)
  6. Jakob Foerster (101 papers)
Citations (10)

Summary

Policy-Guided Diffusion for Improving Offline RL with Synthetic Data Generation

Introduction to Policy-Guided Diffusion

Offline reinforcement learning (RL) presents a uniquely challenging setting in which agents must learn solely from a static dataset, usually collected under a different behavior policy, without any further interaction with the environment. This inevitably leads to distribution shift: the learned target policy deviates from the data-collecting behavior policy, causing instability and overestimation due to out-of-sample generalization errors. To address these challenges, the authors introduce Policy-Guided Diffusion (PGD), which generates synthetic, near on-policy experience by applying target-policy guidance to a diffusion model of the behavior distribution. This synthetic experience gives offline RL agents augmented, relevant training data that moves them closer to the desired target policy behavior, without the distribution shift and compounding model errors typically observed in offline settings.

Core Contributions of PGD

PGD stands out by generating entire trajectories that lie closer to the target distribution through a careful balance of guidance from the target policy and adherence to the behavior policy. The approach provides several key benefits:

  • Reduction of Distribution Shift: By generating synthetic data that the target policy is likely to encounter, PGD effectively reduces the distribution shift problem, allowing for more stable and accurate policy optimization.
  • Mitigation of Compounding Errors: Unlike traditional model-based methods, whose autoregressive rollouts accumulate error step by step, PGD denoises entire trajectories jointly rather than state by state, keeping dynamics error low even when generating off-policy (guided) data.
  • Performance Improvement Across Benchmarks: PGD has demonstrated substantial improvements in several standard offline RL algorithms and environments, showcasing its versatility and effectiveness as a novel data generation methodology.

Theoretical Foundation and Practical Implementation

At the heart of PGD lies a diffusion process guided by the gradient of the target policy's log action likelihood. This guidance steers the generation of synthetic trajectories toward higher likelihood under the target policy, effectively sampling from a regularized form of the target distribution that balances target and behavior policy action likelihoods. The approach is grounded in a theoretical derivation of this behavior-regularized target distribution, yielding a method that does not suffer from the classical pitfalls of autoregressive model-based offline RL.
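As a rough sketch of the underlying idea (the notation here is illustrative and may differ from the paper's), let p_b(τ) denote the trajectory distribution learned from the behavior data and π(a_t | s_t) the target policy. Classifier-style guidance adds the policy's score to the diffusion model's score at each denoising step:

\[
\nabla_{\tau} \log \tilde{p}(\tau) \;\approx\; \nabla_{\tau} \log p_b(\tau) \;+\; \lambda \sum_{t} \nabla_{\tau} \log \pi(a_t \mid s_t),
\]

where \lambda controls the guidance strength. Sampling under this modified score approximately targets the behavior-regularized distribution \tilde{p}(\tau) \propto p_b(\tau) \prod_t \pi(a_t \mid s_t)^{\lambda}, which captures the balance between target and behavior action likelihoods described above.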

Practically, PGD generates synthetic trajectories by modifying each diffusion denoising step with a term derived from the target policy, ensuring that the synthetic data remains relevant and beneficial for training the agent. The procedure includes mechanisms for controlling the strength of policy guidance and for stabilizing the guided diffusion process to mitigate variance issues, making PGD a robust and flexible tool for offline RL.
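The following is a minimal sketch of what a single guided reverse-diffusion step might look like, assuming a DDPM-style trajectory denoiser and a differentiable target-policy log-likelihood; the function names, arguments, and exact update rule are illustrative assumptions rather than the authors' implementation:

    import torch

    def guided_denoise_step(denoiser, policy_log_prob, traj, t,
                            alpha_t, alpha_bar_t, sigma_t, guidance_scale):
        """One DDPM-style reverse step with target-policy guidance (sketch).

        denoiser(traj, t):     predicts the noise in the trajectory at step t
        policy_log_prob(traj): scalar sum of log pi(a | s) over the trajectory,
                               differentiable with respect to traj
        traj:                  noisy trajectory, shape (horizon, state_dim + action_dim)
        """
        traj = traj.detach().requires_grad_(True)

        # Standard DDPM posterior mean computed from the predicted noise.
        eps = denoiser(traj, t)
        mean = (traj - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5

        # Policy guidance: shift the mean toward higher target-policy action
        # likelihood, scaled by the step variance (classifier-guidance style).
        grad = torch.autograd.grad(policy_log_prob(traj), traj)[0]
        mean = mean + guidance_scale * sigma_t ** 2 * grad

        # Draw the next, less noisy trajectory (no noise at the final step).
        noise = torch.randn_like(traj) if t > 0 else torch.zeros_like(traj)
        return (mean + sigma_t * noise).detach()

In this sketch, guidance_scale plays the role of the guidance-strength coefficient discussed above: setting it to zero recovers unguided sampling from the behavior distribution, while larger values push trajectories further on-policy at the cost of potentially higher dynamics error.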

Implications and Future Prospects

The application of PGD offers an exciting avenue for the development of more robust, efficient, and performant offline RL systems. By providing a mechanism to generate on-policy, high-quality synthetic data, PGD not only improves the immediate performance of existing offline RL algorithms but also sets the stage for new methodological advancements that could further exploit this synthetic data generation capability.

Looking ahead, the robustness of PGD to target policy variations and its ability to mitigate dynamics errors open up several research directions, including automated techniques for tuning the guidance strength and extensions to more complex environments and policy structures. Furthermore, integrating PGD with other offline RL components, such as advanced regularization techniques or novel policy optimization strategies, could yield even greater performance gains.

In summary, Policy-Guided Diffusion represents a substantial step forward for offline RL, bridging the gap between theoretical insight and practical applicability. Its ability to generate synthetic, on-policy training data in a controlled manner addresses some of the most pressing challenges in offline RL, offering a scalable solution that enhances the performance and reliability of offline RL systems across a range of settings.