Vision-Language Models Provide Promptable Representations for Reinforcement Learning
This paper presents a methodology for leveraging vision-language models (VLMs) as an integral component in initializing reinforcement learning (RL) policies. The central proposition is to use VLMs as a source of task-relevant, promptable representations imbued with extensive world knowledge. The approach is evaluated in two visually complex and substantially different domains: Minecraft and Habitat.
Methodological Approach
The core of this approach, termed "Promptable Representations for Reinforcement Learning" (PR2L), involves priming VLMs with contextual prompts related to the RL task at hand: the VLM is asked a task-relevant question about the current observation, and the embeddings it produces while answering are provided to the policy as its state representation. This diverges from traditional uses of VLMs in RL, which often either underutilize the knowledge encoded in the VLM by treating it as a static, unprompted image encoder, or employ it rudimentarily for instruction following.
Key Contributions:
- Task-Relevant Prompt Design: Unlike prior work that relies on static, unprompted embeddings or queries the model directly for actions, PR2L uses tailored prompts to elicit VLM responses that encode task-relevant semantic information.
- Use of InstructBLIP and Prismatic VLMs: The authors instantiate PR2L with specific VLMs, namely InstructBLIP and Prismatic, showing how visual context and language prompts can be combined into representations for downstream RL (a minimal code sketch follows this list).
- Embodied Decision-Making: By leveraging VLMs, agents can draw on extensive background knowledge, boosting their performance in scenarios that require situational awareness and entity recognition.
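To make the representation-extraction step concrete, the following is a minimal sketch using the Hugging Face `transformers` interface to InstructBLIP. The prompt wording, checkpoint, and choice of which hidden states to keep are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to(device)

def promptable_representation(image: Image.Image, prompt: str) -> torch.Tensor:
    """Embed an observation by asking the VLM a task-relevant question about it."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)

    # 1) Let the VLM generate an answer to the task-relevant prompt.
    gen_ids = model.generate(**inputs, max_new_tokens=20)
    answer = processor.batch_decode(gen_ids, skip_special_tokens=True)[0]

    # 2) Re-encode prompt + answer and keep the language model's final hidden states
    #    as the promptable representation (which layers/tokens to keep is a design
    #    choice; attribute names may vary slightly across transformers versions).
    full = processor(images=image, text=prompt + " " + answer,
                     return_tensors="pt").to(device, torch.float16)
    with torch.no_grad():
        out = model(**full, output_hidden_states=True, return_dict=True)
    return out.language_model_outputs.hidden_states[-1]  # (1, seq_len, hidden_dim)

# Hypothetical usage on a single Minecraft observation frame.
frame = Image.open("frame.png")  # placeholder image path
rep = promptable_representation(frame, "What creature is visible in this scene?")
```

In a training loop, this representation would replace or augment the image features passed to the RL policy at each step; since the VLM is kept frozen, the embeddings can be computed once per observation.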
Evaluation Domains: Minecraft and Habitat
Minecraft:
The authors evaluate PR2L on tasks such as "combat spider," "milk cow," and "shear sheep." These tasks are long-horizon and involve dynamic interactions with various entities in the Minecraft environment. The primary comparison is between PR2L and several baselines:
- Non-Promptable VLM Representations: Using general, unprompted image embeddings from VLMs.
- Instruction-Following: Directly querying the VLM for actions to take.
- Model-Based RL: The DreamerV3 algorithm, a sample-efficient model-based baseline.
- Non-Promptable Control-Specific Representations: Embeddings from control-oriented visual encoders such as VC-1 and R3M.
- Domain-Specific Representations: Representations trained on Minecraft data, such as MineCLIP and VPT.
PR2L outperformed the non-oracle baselines, indicating that task-specific prompts substantially improve the quality of VLM representations for RL. In particular, PR2L surpassed equivalent policies trained on unprompted, vision-only embeddings from the same VLM, underscoring the value of injecting task context into the representation used for policy learning.
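Because the promptable representation is a variable-length sequence of token embeddings, the policy must condense it into a fixed-size input before acting. Below is a minimal PyTorch sketch of one plausible actor-critic head over such embeddings; the aggregation scheme, layer sizes, and action count are illustrative assumptions, not necessarily the paper's exact architecture.

```python
from typing import Optional

import torch
import torch.nn as nn

class PromptableRepPolicy(nn.Module):
    """Actor-critic head over variable-length VLM token embeddings (illustrative sizes)."""

    def __init__(self, vlm_dim: int = 4096, d_model: int = 512, n_actions: int = 8):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.actor = nn.Linear(d_model, n_actions)   # action logits
        self.critic = nn.Linear(d_model, 1)          # state-value estimate

    def forward(self, token_embs: torch.Tensor, pad_mask: Optional[torch.Tensor] = None):
        # token_embs: (B, T, vlm_dim); pad_mask: (B, T), True at padded positions.
        x = self.encoder(self.proj(token_embs), src_key_padding_mask=pad_mask)
        if pad_mask is not None:
            keep = (~pad_mask).unsqueeze(-1).float()
            pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        else:
            pooled = x.mean(dim=1)                    # average over prompt + answer tokens
        return self.actor(pooled), self.critic(pooled)

# Hypothetical usage: 24 prompt+answer tokens with Vicuna-7B-sized embeddings.
policy = PromptableRepPolicy()
logits, value = policy(torch.randn(1, 24, 4096))
```

One natural setup is to train such a head with a standard actor-critic algorithm while keeping the VLM frozen, so that only the lightweight policy on top of the promptable representation is learned.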
Habitat:
In the Habitat domain, PR2L is applied to the ObjectNav task suite, requiring the agent to navigate household scenes to find specified objects. The evaluation here focuses on:
- Generalization Capabilities: Examined via success rates in navigating to objects across various unseen household environments.
- Chain-of-Thought (CoT) Prompting: Whether eliciting CoT-style responses from the VLM further enriches the semantic representations used by the RL policy.
Findings indicate that PR2L with CoT prompting outperformed all baselines, with a marked improvement in generalization to unseen scenes. PR2L was also more sample-efficient than policies trained on representations such as VC-1, highlighting its practical viability in settings with limited training data.
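To illustrate how CoT prompting can be layered onto the same mechanism, here is a sketch of a two-step query chain for ObjectNav: the VLM first describes the scene, then answers a target-oriented question conditioned on that description, and the embeddings from both steps form the policy's observation features. The prompt wording is illustrative, and `query_vlm` is a hypothetical wrapper around the generate-and-extract procedure shown in the earlier InstructBLIP sketch (returning the answer text and its token embeddings).

```python
import torch

def cot_representation(image, target_object: str, query_vlm):
    """Two-step chain-of-thought query; returns concatenated token embeddings."""
    # Step 1: ask for an open-ended description of the visible scene.
    description, desc_emb = query_vlm(
        image, "Briefly describe the room and the objects you can see."
    )

    # Step 2: condition a target-oriented question on the first answer.
    follow_up = (
        f"The scene contains: {description}. "
        f"Is a {target_object} likely to be found near here? Answer yes or no, then explain."
    )
    _, answer_emb = query_vlm(image, follow_up)

    # Concatenate token embeddings from both steps as the policy's observation features.
    return torch.cat([desc_emb, answer_emb], dim=1)
```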
Implications and Future Directions
Theoretical: This research underscores the value of using foundation models in RL not merely as static feature extractors but as dynamic, promptable resources that can draw on a vast store of prior knowledge. It proposes a pathway by which RL can benefit from the semantic depth encoded in VLMs even when these models are not tuned for control tasks.
Practical: By integrating prompts into VLMs, PR2L simplifies the initialization of RL policies across varying tasks, reducing the need for extensive task-specific pre-training. This can substantially lower training cost, particularly in computationally expensive domains such as robotics and large-scale simulation.
Future Work:
- Automation of Prompt Design: While current prompt designs are hand-engineered, future work could explore algorithmic approaches to optimize prompt selection, potentially leveraging meta-learning frameworks.
- Integration with Advanced Models: Future investigations could incorporate more sophisticated models trained on physical interactions, such as diffusion models, to further generalize PR2L’s applicability to physical and dynamic environments.
- Computational Efficiency: Given that VLMs can be computationally intensive, exploring more efficient architectures or approximation methods would be beneficial for scalable deployment.
In conclusion, the paper presents a compelling case for the use of VLMs as a versatile and potent resource for enhancing RL policies through promptable, context-driven representations. This approach has far-reaching implications, providing a framework for more intelligent, adaptive, and knowledge-rich RL agents.