Distributional Successor Features Enable Zero-Shot Policy Optimization (2403.06328v4)
Abstract: Intelligent agents must be generalists, capable of quickly adapting to a variety of tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative: by modeling a policy's long-term state occupancy, they reduce policy evaluation under new rewards to linear regression. Yet zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), a novel class of models that learn a distribution over the successor features of a stationary dataset's behavior policy, along with a policy that acts to realize the different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and demonstrate their efficacy as a new class of transferable models, both theoretically and empirically, across a variety of simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
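To make the zero-shot recipe in the abstract concrete, here is a minimal sketch of the two steps it describes: fitting reward weights by linear regression on state features, and scoring sampled successor features to select the best achievable outcome. This is an illustration under assumptions, not the paper's released implementation; the function names (`fit_reward_weights`, `select_outcome`) and the synthetic arrays are hypothetical, and the actual method additionally trains a diffusion model over outcomes and an outcome-conditioned readout policy.

```python
import numpy as np

# Sketch of zero-shot policy optimization with successor features.
# Assumed (hypothetical) inputs:
#   features    -- phi(s) for a batch of dataset states, shape (n, d)
#   rewards     -- task rewards labeling those states, shape (n,)
#   psi_samples -- candidate successor features, e.g. drawn from a
#                  learned diffusion model over outcomes, shape (m, d)

def fit_reward_weights(features, rewards, reg=1e-6):
    """Ridge regression r(s) ~= w . phi(s); reg stabilizes the solve."""
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(A, features.T @ rewards)

def select_outcome(psi_samples, w):
    """Score each achievable outcome by its predicted return
    V = w . psi and pick the best -- no autoregressive rollout,
    hence no compounding model error."""
    returns = psi_samples @ w
    return psi_samples[np.argmax(returns)]

# Usage with stand-in data (shapes only):
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 16))       # phi(s) on dataset states
rewards = features @ rng.normal(size=16)    # rewards for the new task
w = fit_reward_weights(features, rewards)

psi_samples = rng.normal(size=(64, 16))     # outcomes from the outcome model
psi_star = select_outcome(psi_samples, w)
# psi_star would then condition the policy pi(a | s, psi_star) to act.
```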