World Models via Policy-Guided Trajectory Diffusion (2312.08533v4)

Published 13 Dec 2023 in cs.LG and cs.AI

Abstract: World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. "in imagination". Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. Prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion. For short trajectories, PolyGRAD obtains similar errors to autoregressive diffusion, but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for accurate on-policy world modelling without autoregressive sampling.


Summary

  • The paper presents the innovative PolyGRAD method, replacing autoregressive sampling with diffusion-based trajectory generation to significantly reduce cumulative prediction errors.
  • It employs a denoising model with dynamic policy guidance to create entire on-policy trajectories in a single computational pass.
  • Experiments on MuJoCo continuous control environments validate PolyGRAD's competitive prediction accuracy and show that it can train performant policies via on-policy RL in imagination.

Analyzing "World Models via Policy-Guided Trajectory Diffusion"

The paper "World Models via Policy-Guided Trajectory Diffusion" introduces an innovative approach to world modeling in reinforcement learning (RL) that challenges the traditional autoregressive paradigm. The authors propose a method called Policy-Guided Trajectory Diffusion (PolyGRAD) which leverages diffusion models to generate on-policy trajectories in a single computational pass, thereby circumventing the error accumulation typically associated with autoregressive models.

Overview

PolyGRAD addresses a critical shortcoming of existing world models used in reinforcement learning: their reliance on autoregressive rollouts that interleave state prediction with policy-based action sampling. In such frameworks, prediction errors compound as the trajectory length increases, undermining the quality of the synthetic data used for policy optimization. PolyGRAD eschews autoregressive sampling, instead using a denoising model in tandem with the policy to diffuse initially random trajectories into coherent on-policy sequences.
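To make the compounding-error problem concrete, below is a minimal sketch of a conventional autoregressive imagination rollout. It is illustrative only: `dynamics_model` and `policy` are hypothetical callables, not the paper's code.

```python
import torch

def autoregressive_rollout(dynamics_model, policy, s0, horizon):
    """Illustrative autoregressive world-model rollout (hypothetical interfaces).

    dynamics_model(state, action) -> predicted next state
    policy(state)                 -> a torch.distributions.Distribution over actions
    """
    states, actions = [s0], []
    state = s0
    for _ in range(horizon):
        action = policy(state).sample()        # sample the next action from the policy...
        state = dynamics_model(state, action)  # ...then predict the next state from a *predicted* state
        actions.append(action)
        states.append(state)
    # Every step consumes the previous prediction, so small one-step errors
    # compound as the horizon grows.
    return torch.stack(states), torch.stack(actions)
```

PolyGRAD removes this loop entirely: the whole trajectory is produced by a single denoising process over the full sequence, as described in the next section.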

The PolyGRAD Approach

The core innovation in PolyGRAD lies in applying diffusion models to RL world modeling, allowing entire on-policy trajectories to be created without sequential sampling. The methodology involves three components (a simplified sketch of the resulting sampling loop follows the list):

  1. Denoising Model Training: The model learns to predict the noise added to state and reward sequences, conditioned on actions. This differs from standard autoregressive approaches by considering entire trajectories rather than one-step transitions.
  2. Policy Guidance: Instead of sampling actions step by step from a neural policy model, PolyGRAD guides trajectory diffusion via the gradient of the policy's action distribution, in a manner related to classifier-guided diffusion. This yields trajectories consistent with the policy's action distribution without requiring iterative state-by-state prediction.
  3. Automatic Tuning: The magnitude of action updates is dynamically adjusted to ensure that the synthetic trajectories preserve the distributional properties of the policy, maintaining on-policy characteristics throughout training.
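Putting these components together, the following is a minimal sketch of what a PolyGRAD-style sampling loop might look like. It assumes a DDPM-style noise schedule, a trained `denoiser(states, actions, t)` that predicts the noise on the state sequence, and a policy returning a `torch.distributions` object; all names, signatures, and the fixed `guidance_scale` are illustrative assumptions (the paper tunes the action-update magnitude automatically), not the authors' exact procedure.

```python
import torch

def policy_guided_trajectory_diffusion(denoiser, policy, state_dim, action_dim,
                                       horizon, betas, guidance_scale=1.0):
    """Hypothetical sketch: diffuse a random trajectory into an on-policy one.

    denoiser(states, actions, t) -> predicted noise on the state sequence
    policy(states)               -> torch.distributions.Distribution over actions
    betas                        -> 1-D tensor holding a DDPM noise schedule
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure noise for both states and actions -- no sequential rollout.
    states = torch.randn(horizon, state_dim)
    actions = torch.randn(horizon, action_dim)

    for t in reversed(range(len(betas))):
        # (1) DDPM-style reverse step on the state sequence, conditioned on the actions.
        eps = denoiser(states, actions, t)
        states = (states - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            states = states + torch.sqrt(betas[t]) * torch.randn_like(states)

        # (2) Policy guidance: nudge the actions along the gradient of the policy's
        #     log-probability at the partially denoised states.
        actions = actions.detach().requires_grad_(True)
        log_prob = policy(states.detach()).log_prob(actions).sum()
        grad = torch.autograd.grad(log_prob, actions)[0]
        actions = (actions + guidance_scale * grad).detach()

    return states, actions
```

The structural contrast with the autoregressive rollout sketched earlier is that no predicted state is ever fed back into a one-step model; the entire sequence of states and actions is refined jointly across the denoising steps.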

Through extensive experimentation, the authors demonstrate that PolyGRAD outperforms most baselines on short-trajectory prediction error while matching state-of-the-art performance on longer trajectories. This suggests that PolyGRAD offers a computationally efficient alternative to world models that rely on autoregressive sampling, particularly in environments like MuJoCo.

Implications and Future Directions

The implications of this work span both theoretical and practical domains. On the theoretical side, the application of diffusion models in RL provides a framework that bypasses the iterative error accumulation inherent in traditional methods, potentially leading to more robust policy optimization. Practically, the reduced computational demand of PolyGRAD could enhance the scalability of RL applications where resources are constrained.

Nevertheless, PolyGRAD's difficulty in handling low-entropy policy distributions suggests room for further refinement. Future work could improve the method's robustness across policy entropy levels or extend its application to more complex, high-dimensional domains beyond the MuJoCo suite tested.

Overall, "World Models via Policy-Guided Trajectory Diffusion" presents a significant step forward in the evolution of model-based reinforcement learning, establishing a foundation for future advancements in using non-autoregressive sampling methods in synthetic trajectory generation.