Introduction to Vision-Language Models as Rewards
Building versatile AI agents that can pursue many different goals in complex environments is a central ambition of reinforcement learning (RL). A major obstacle is the need for a separate reward function for every goal an agent should learn. The paper examines vision-language models (VLMs) as a way to generate rewards for RL agents. Specifically, it uses off-the-shelf pre-trained VLMs, such as CLIP, to produce reward signals without any fine-tuning on environment-specific data. The approach is demonstrated in two distinct visual domains, and the results indicate that larger VLMs yield more accurate rewards, which in turn produce more capable RL agents.
Related Work and Methodological Foundations
Using VLMs to construct reward functions is an active line of research. Pre-trained VLMs have already shown strong performance on tasks such as visual detection, classification, and question answering. The paper reviews prior efforts in which CLIP-based models were fine-tuned on paired video and text from Minecraft to produce effective shaping rewards, enabling agents to complete specific tasks more reliably.
The proposed method uses an off-the-shelf contrastive VLM to produce a simple binary reward for RL. The VLM's image encoder embeds environment observations and its text encoder embeds a language-specified goal; the similarity between the two embeddings determines the reward signal. These rewards indicate whether the specified goal has been achieved within a partially observable Markov decision process (POMDP), so the agent is trained on this derived reward rather than on an explicitly programmed ground-truth reward.
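A minimal sketch of this reward computation is given below, assuming the open-source CLIP package, a "ViT-B/32" checkpoint, and an illustrative similarity threshold `tau`; the paper's exact thresholding and prompt handling may differ (for example, it may compare the goal text against negative prompts rather than using a raw similarity cutoff).

```python
# Sketch: deriving a binary reward from a contrastive VLM (CLIP-style).
# The checkpoint name and threshold are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed checkpoint

def vlm_reward(observation: Image.Image, goal_text: str, tau: float = 0.3) -> float:
    """Return 1.0 if the image-goal similarity exceeds a threshold, else 0.0."""
    image = preprocess(observation).unsqueeze(0).to(device)
    text = clip.tokenize([goal_text]).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(text)
        # Cosine similarity between the observation and the language goal
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        similarity = (image_emb @ text_emb.T).item()
    return 1.0 if similarity > tau else 0.0
```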
Empirical Evaluations and Results
The empirical evaluation examines how well VLM-derived rewards correlate with the underlying ground-truth rewards and how the reward behaves as the VLM is scaled up. The key research questions are whether maximizing the VLM reward also yields higher ground-truth reward, and whether larger VLMs make the reward function more accurate.
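The offline side of this evaluation can be viewed as a standard classification check: compare the binary VLM reward against ground-truth success labels on held-out observation-goal pairs. The sketch below is an assumed illustration of that metric computation, not the paper's evaluation code.

```python
# Sketch: measuring how well binary VLM rewards match ground-truth success
# labels. `pairs` is a hypothetical iterable of (vlm_reward, label) values.
from typing import Iterable, Tuple

def reward_accuracy(pairs: Iterable[Tuple[float, int]]) -> dict:
    """Each pair is (vlm_reward, ground_truth_label), both in {0, 1}."""
    tp = fp = fn = tn = 0
    for pred, label in pairs:
        if pred >= 0.5 and label == 1:
            tp += 1
        elif pred >= 0.5 and label == 0:
            fp += 1
        elif pred < 0.5 and label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}
```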
The experimental setup follows standard online RL, using the Playhouse and AndroidEnv environments to pose tasks such as finding objects or opening apps. The central finding is that training agents to maximize the VLM-derived reward also maximizes the actual ground-truth reward. Moreover, scaling up the VLM improves both its reward accuracy in offline evaluations and its effectiveness as a reward signal during RL training.
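To make this setup concrete, the sketch below shows an online interaction loop in which the environment's ground-truth reward is logged but never used for learning. Here `env`, `agent`, and `render_observation` are hypothetical placeholders, and `vlm_reward` refers to the earlier sketch; this is an illustration of the training scheme, not the paper's implementation.

```python
# Sketch: online RL loop where the agent learns only from the VLM reward.
# `env`, `agent`, and `render_observation` are hypothetical placeholders.
def run_episode(env, agent, goal_text: str, max_steps: int = 1000):
    """Run one episode; return totals of the VLM and ground-truth rewards."""
    obs = env.reset()
    total_vlm_reward = 0.0
    total_ground_truth = 0.0
    for _ in range(max_steps):
        action = agent.act(obs, goal_text)
        next_obs, ground_truth_reward, done, _info = env.step(action)
        # The agent learns only from the VLM-derived reward; the ground-truth
        # reward is tracked separately to measure how well the two correlate.
        reward = vlm_reward(render_observation(next_obs), goal_text)
        agent.observe(obs, action, reward, next_obs, done)
        total_vlm_reward += reward
        total_ground_truth += ground_truth_reward
        obs = next_obs
        if done:
            break
    return total_vlm_reward, total_ground_truth
```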
Conclusion and Practical Implications
The paper demonstrates that off-the-shelf VLMs can provide accurate rewards for visual tasks specified by language goals. As the scale of the VLM increases, the accuracy of its reward predictions improves significantly, which in turn leads to better-performing RL agents. These findings suggest that, as VLMs continue to improve, it may become feasible to train generalist agents in visually rich environments without any environment-specific fine-tuning, a step toward more adaptable and capable AI systems.