Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning (2402.03570v4)
Abstract: We introduce Diffusion World Model (DWM), a conditional diffusion model that predicts multi-step future states and rewards concurrently. Unlike traditional one-step dynamics models, DWM produces long-horizon predictions in a single forward pass, eliminating the need for recursive queries. We integrate DWM into model-based value estimation, where the short-term return is simulated from future trajectories sampled from DWM. In the context of offline reinforcement learning, DWM can be viewed as a conservative value regularizer realized through generative modeling; alternatively, it can be seen as a data source that enables offline Q-learning with synthetic data. Our experiments on the D4RL dataset confirm the robustness of DWM to long-horizon simulation. In absolute performance, DWM significantly surpasses one-step dynamics models with a $44\%$ performance gain, and matches or slightly surpasses its model-free counterparts.
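To make the value-estimation step concrete, here is a minimal sketch of how an $H$-step return could be computed from a single trajectory sampled by a diffusion world model: the short-term return is the discounted sum of predicted rewards, bootstrapped with a learned Q-function at the final imagined state. All names and values here (`sample_trajectory`, `critic`, `actor`, the dimensions, the horizon) are illustrative stand-ins under assumed interfaces, not the paper's actual implementation.

```python
# Sketch of diffusion-model-based value estimation (hypothetical interfaces).
import torch

HORIZON = 7          # length of the imagined rollout (H), illustrative value
GAMMA = 0.99         # discount factor
STATE_DIM, ACT_DIM = 17, 6

def sample_trajectory(state, action, horizon):
    """Stand-in for the diffusion world model: one conditional forward pass
    returns all future states and rewards at once, with no step-by-step
    recursion. Here it just returns random tensors of the right shapes."""
    states = torch.randn(horizon, STATE_DIM)   # s_{t+1}, ..., s_{t+H}
    rewards = torch.randn(horizon)             # r_t, ..., r_{t+H-1}
    return states, rewards

def critic(state, action):
    """Stand-in Q-network."""
    return torch.zeros(())

def actor(state):
    """Stand-in policy network."""
    return torch.zeros(ACT_DIM)

def dwm_value_estimate(state, action):
    """H-step return from one sampled trajectory, bootstrapped with Q."""
    states, rewards = sample_trajectory(state, action, HORIZON)
    discounts = GAMMA ** torch.arange(HORIZON, dtype=torch.float32)
    short_term = (discounts * rewards).sum()           # sum of gamma^h r_{t+h}
    terminal = critic(states[-1], actor(states[-1]))   # Q(s_{t+H}, pi(s_{t+H}))
    return short_term + GAMMA ** HORIZON * terminal

target = dwm_value_estimate(torch.randn(STATE_DIM), torch.randn(ACT_DIM))
```

The design point this illustrates is that `sample_trajectory` returns the entire imagined segment in one call, so model error is not compounded across recursive one-step rollouts.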
- Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
- On the model-based stochastic value gradient for continuous reinforcement learning. In Learning for Dynamics and Control, pp. 6–20. PMLR, 2021.
- Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.
- TransDreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481, 2022.
- Decision transformer: Reinforcement learning via sequence modeling. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=a7APmM4B9d.
- Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
- IQL-TD-MPC: Implicit Q-learning for hierarchical model predictive control. arXiv preprint arXiv:2306.00867, 2023.
- On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679, 2020.
- A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
- Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023.
- Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023.
- RvS: What is essential for offline RL via supervised learning? arXiv preprint arXiv:2112.10751, 2021.
- Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- A minimalist approach to offline reinforcement learning. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Q32U7dzWXpc.
- Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR, 2019.
- Extreme Q-learning: MaxEnt RL without entropy. arXiv preprint arXiv:2301.02328, 2023.
- Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters. Advances in Neural Information Processing Systems, 35:18267–18281, 2022.
- World models. arXiv preprint arXiv:1803.10122, 2018.
- Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a.
- Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. PMLR, 2019b.
- Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
- Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- MoDem: Accelerating visual model-based reinforcement learning with demonstrations. arXiv preprint arXiv:2212.05698, 2022a.
- Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022b.
- TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023.
- IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019.
- γ-models: Generative temporal difference learning for infinite-horizon prediction. Advances in Neural Information Processing Systems, 33:1724–1735, 2020.
- Offline reinforcement learning as one big sequence modeling problem. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=wgeK563QgSw.
- Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
- Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
- Chain-of-thought predictive control. arXiv preprint arXiv:2304.00776, 2023.
- Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
- MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
- Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
- Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.
- Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
- Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
- Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637, 2022.
- Multi-game decision transformers. arXiv preprint arXiv:2205.15241, 2022.
- Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- Offline pre-trained multi-agent decision transformer: One big sequence model conquers all StarCraft II tasks. arXiv preprint arXiv:2112.02845, 2021.
- Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588, 2022.
- Misra, D. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681, 2019.
- ReorientDiff: Diffusion model based reorientation for object manipulation. arXiv preprint arXiv:2303.12700, 2023.
- Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.
- Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. arXiv preprint arXiv:2303.05479, 2023.
- ConserWeightive behavioral cloning for reliable offline reinforcement learning. arXiv preprint arXiv:2210.05158, 2022.
- Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
- Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023.
- U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241. Springer, 2015.
- Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
- High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. PMLR, 2014.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
- Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Consistency models. arXiv preprint arXiv:2303.01469, 2023.
- Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224. Elsevier, 1990.
- Sutton, R. S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
- Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, pp. 255–263. Psychology Press, 2014.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
- Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
- Q-learning. Machine Learning, 8:279–292, 1992.
- Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.
- Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
- Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
- Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206, 2019.
- Controllable video generation by learning the underlying dynamical system with neural ODE. arXiv preprint arXiv:2303.05323, 2023.
- Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline RL. In International Conference on Machine Learning, pp. 38989–39007. PMLR, 2023.
- Mastering Atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021.
- MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
- Online decision transformer. arXiv preprint arXiv:2202.05607, 2022.
- Semi-supervised offline reinforcement learning with action-free trajectories. In International Conference on Machine Learning, pp. 42339–42362. PMLR, 2023a.
- Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443, 2023b.