Vision-Language Models as a Source of Rewards (2312.09187v3)

Published 14 Dec 2023 in cs.LG

Abstract: Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.

Introduction to Vision-Language Models as Rewards

Building versatile AI agents that can navigate and accomplish objectives in complex environments is a major focus of reinforcement learning (RL). A substantial obstacle is the need for a separate reward function for each goal an agent should learn. The paper investigates vision-language models (VLMs) as an alternative source of rewards: pre-trained VLMs such as CLIP are used to produce reward signals without any fine-tuning on environment-specific data. The approach is demonstrated in two different visual domains, and the results indicate that larger VLMs provide more accurate rewards, which in turn yields more capable RL agents.

Related Work and Methodological Foundations

There has been growing research interest in using VLMs to construct reward functions. Pre-trained VLMs have already demonstrated strong performance on tasks such as visual detection, classification, and question answering. The paper reviews prior efforts in which CLIP-based models were fine-tuned on video and text from Minecraft to produce effective shaping rewards, enabling agents to complete specific tasks more efficiently.

The proposed method uses a contrastive VLM to produce a simple binary reward for RL. The VLM's image encoder embeds the agent's visual observation and its text encoder embeds the language goal; the similarity between the two embeddings is then converted into a reward indicating whether the goal has been achieved. Within a partially observable Markov decision process (POMDP), this VLM-derived reward stands in for an explicitly programmed ground-truth reward.
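
To make the reward construction concrete, the sketch below shows one way to turn contrastive embeddings into a binary reward. It is a minimal illustration under stated assumptions, not the paper's exact recipe: the encoder interfaces, the use of a raw cosine-similarity threshold, and the threshold value itself are all assumptions made for the example.

```python
import numpy as np

def binary_vlm_reward(image_embedding, goal_embedding, threshold=0.5):
    """Return 1.0 if the observation appears to achieve the language goal.

    Both arguments are assumed to be L2-normalized embedding vectors from
    a contrastive VLM (a CLIP-style image encoder for the observation and
    text encoder for the goal), so their dot product equals the cosine
    similarity. The threshold value is illustrative, not from the paper.
    """
    similarity = float(np.dot(image_embedding, goal_embedding))
    return 1.0 if similarity > threshold else 0.0

# Hypothetical usage with stand-in encoders (placeholders, not real APIs):
# img_emb = image_encoder(observation)      # shape (d,), L2-normalized
# txt_emb = text_encoder("open the fridge") # shape (d,), L2-normalized
# reward = binary_vlm_reward(img_emb, txt_emb)
```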

Empirical Evaluations and Results

The empirical evaluation examines how well VLM-derived rewards correlate with the underlying ground-truth rewards and how performance changes as the VLM is scaled up. The key research questions are whether optimizing the VLM reward also increases the ground-truth reward, and whether larger VLMs yield a more accurate reward function.

The experimental setup follows standard online RL, using environments such as Playhouse and AndroidEnv, where the agent is given tasks like locating objects or opening apps. The central finding is that training agents to maximize the VLM-derived reward also maximizes the actual ground-truth reward. Moreover, increasing the size of the VLM improves both its accuracy in offline evaluation and its effectiveness as a reward signal during RL training.
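
As a rough illustration of how such a reward slots into online RL training, the wrapper below substitutes the VLM-derived reward for the environment's reward while keeping the ground-truth reward around for evaluation only. The environment interface and the `reward_fn` callable are hypothetical placeholders, not the paper's actual training infrastructure.

```python
class VLMRewardWrapper:
    """Train on a VLM-derived reward; log the ground-truth reward for eval.

    `env` is assumed to be any object whose step(action) returns
    (observation, reward, done, info) with an image observation, and
    `reward_fn` maps (observation, goal_text) -> float. Both are
    illustrative placeholders.
    """

    def __init__(self, env, reward_fn, goal_text):
        self.env = env
        self.reward_fn = reward_fn
        self.goal_text = goal_text

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, ground_truth_reward, done, info = self.env.step(action)
        # The agent only ever sees the VLM reward; the ground-truth reward
        # is stashed in `info` so it can be tracked during evaluation.
        info["ground_truth_reward"] = ground_truth_reward
        return obs, self.reward_fn(obs, self.goal_text), done, info
```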

Conclusion and Practical Implications

The paper demonstrates that off-the-shelf VLMs can provide accurate rewards for visual tasks specified by language goals. As the scale of the VLM increases, the accuracy of its reward predictions improves, which in turn produces better-performing RL agents. These findings suggest that, as VLMs continue to improve, it may become feasible to train generalist agents in visually rich environments without environment-specific fine-tuning, a step toward more adaptable and capable AI systems.

Authors (27)
  1. Kate Baumli
  2. Satinder Baveja
  3. Feryal Behbahani
  4. Harris Chan
  5. Gheorghe Comanici
  6. Sebastian Flennerhag
  7. Maxime Gazeau
  8. Kristian Holsheimer
  9. Dan Horgan
  10. Michael Laskin
  11. Clare Lyle
  12. Hussain Masoom
  13. Kay McKinney
  14. Volodymyr Mnih
  15. Alexander Neitz
  16. Fabio Pardo
  17. Jack Parker-Holder
  18. John Quan
  19. Tim Rocktäschel
  20. Himanshu Sahni