Vision-Language Models as Reward Models for Reinforcement Learning
The paper under discussion investigates the potential of vision-language models (VLMs) as zero-shot reward models in reinforcement learning (RL). RL typically relies on explicit reward functions, which can be impractical to specify by hand, or on task-specific learned reward models, which often require large amounts of human feedback. This work takes a different route: it leverages pretrained VLMs, specifically CLIP, as general-purpose reward models that interpret tasks from natural language descriptions.
The authors introduce VLM-RM (Vision-Language Model as Reward Model), a methodology in which a VLM provides reward signals directly from language prompts. Using CLIP, which encodes images and text into a shared latent space, the paper demonstrates VLM-RMs by training a MuJoCo humanoid to perform complex tasks specified with simple language instructions. Notably, tasks such as kneeling or doing the splits are specified with minimal effort, highlighting the flexibility and efficiency of the approach.
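To make the mechanics concrete, the following is a minimal sketch of a CLIP-based reward: the rendered observation and the task prompt are embedded into the shared latent space and scored by cosine similarity. The use of the open_clip library, the ViT-B-32 checkpoint, and the example prompt are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a CLIP-style reward: cosine similarity between the embedding
# of a rendered frame and the embedding of a natural-language task prompt.
# Model checkpoint and prompt below are illustrative assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_reward(frame: Image.Image, prompt: str) -> float:
    """Score how well a rendered frame matches a text description."""
    with torch.no_grad():
        img = model.encode_image(preprocess(frame).unsqueeze(0))
        txt = model.encode_text(tokenizer([prompt]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# Example usage (hypothetical environment and prompt):
# reward = clip_reward(Image.fromarray(env.render()), "a humanoid robot kneeling")
```

Because the reward depends only on a rendered frame and a prompt, the same function can be dropped into any standard RL training loop without task-specific reward engineering.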
Validating on classic RL benchmarks such as CartPole and MountainCar, the authors show that the CLIP-derived rewards correlate well with the environments' ground-truth rewards. They further introduce "goal-baseline regularization," a technique that uses a second, neutral baseline prompt to factor out task-irrelevant features of the observation, which improves the reward model's robustness.
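One way to read goal-baseline regularization is that the embeddings of a neutral baseline prompt and of the goal prompt define a direction in latent space, and the observation embedding is partially projected onto that direction before being compared with the goal, suppressing variation unrelated to the task. The sketch below, including the alpha parameter and the exact projection, reflects this reading and should be treated as an assumption rather than the paper's verbatim formulation.

```python
import torch

def goal_baseline_reward(state_emb: torch.Tensor,
                         goal_emb: torch.Tensor,
                         baseline_emb: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Goal-baseline regularized reward (sketch, under assumed formulation).

    alpha = 0 recovers the plain goal-similarity reward; alpha = 1 fully
    projects the observation embedding onto the baseline->goal line,
    discarding variation orthogonal to the task direction.
    """
    s = state_emb / state_emb.norm()
    g = goal_emb / goal_emb.norm()
    b = baseline_emb / baseline_emb.norm()
    # Unit direction of the line through the baseline and goal embeddings.
    d = (g - b) / (g - b).norm()
    # Projection of the observation embedding onto that line.
    proj = b + torch.dot(s - b, d) * d
    # Interpolate between the raw and projected embedding, then score by
    # closeness to the goal embedding (for unit vectors and alpha = 0 this
    # reduces to cosine similarity with the goal prompt).
    mixed = alpha * proj + (1 - alpha) * s
    return 1 - 0.5 * torch.norm(mixed - g) ** 2
```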
One of the defining contributions of the paper is its exploration of scaling effects. The research finds a strong positive relationship between a VLM's scale, in both model size and training data and compute, and its effectiveness as a reward model. The largest CLIP model tested, ViT-bigG-14, showed markedly superior capability and enabled successful humanoid training, suggesting that future, larger VLMs may broaden the scope of zero-shot reward modeling across RL domains.
The practical implications are significant. The paper outlines a pathway in which RL tasks that would otherwise demand intricate reward engineering can instead be specified through natural language, allowing RL systems to scale without arduous manual reward design or extensive human feedback. At the same time, the failure modes of current VLMs, such as weak spatial reasoning, point to directions for future work on improving reward model accuracy and robustness.
Looking ahead, the findings suggest promising avenues for deploying VLMs in real-world robotics and language-driven AI applications, potentially enabling more intuitive human-AI interaction. As VLM architectures and scale continue to improve, they are expected to deliver increasing utility across a diverse range of RL tasks, driving progress in autonomous and adaptive systems. The paper lays a foundational framework for integrating language and vision in RL and underscores the importance of continued model scaling.