- The paper introduces LanGWM to integrate textual prompts with masked visual features for enhanced out-of-distribution generalization in RL.
- It employs a transformer-based masked autoencoder to predict masked objects and pixel reconstructions guided by language descriptions.
- Experiments on iGibson PointGoal navigation demonstrate state-of-the-art performance at 100K interaction steps.
Introduction
Reinforcement learning (RL) has achieved strong results across a variety of domains. One such domain is visual control, where an agent must interpret visual observations to navigate and act within its environment. However, these models tend to struggle to generalize to situations not observed during training, known as out-of-distribution (OoD) conditions.
Addressing OoD Generalization with Language Grounding
The Language Grounded World Model (LanGWM) has been proposed to address this challenge by using language to ground visual perception during learning. Unlike images, language describes high-level concepts and context in an abstract way, capturing regularities that can help a model generalize across visually different scenarios.
LanGWM specifically targets model-based RL, in which a model of the world is learned and used for planning. The technique pairs language prompts with masked regions in images, on the hypothesis that semantically related concepts expressed in language help the agent generalize when it encounters OoD examples. Concretely, the model is trained to predict masked objects and reconstruct their pixels from the corresponding text descriptions, tying visual features to language.
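As a minimal, illustrative sketch of this idea (not the authors' implementation), the snippet below shows a language-conditioned masked autoencoder in PyTorch. The `LanguageConditionedMAE` class, the module sizes, and the choice of prepending the text embedding as an extra token are assumptions made for clarity; the point is simply that masked image patches are reconstructed with the help of a text embedding.

```python
# Illustrative sketch only: a masked autoencoder whose encoder also attends to a
# language embedding, so masked object regions are reconstructed with text guidance.
# Module sizes and the conditioning scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LanguageConditionedMAE(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, embed_dim=256, text_dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)             # embed flattened image patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learned placeholder for masked patches
        self.text_proj = nn.Linear(text_dim, embed_dim)                # project the language prompt embedding
        enc_layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.decoder = nn.Linear(embed_dim, patch_dim)                 # reconstruct the pixels of each patch

    def forward(self, patches, mask, text_emb):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = masked; text_emb: (B, text_dim)
        tokens = self.patch_embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        # prepend the projected prompt as an extra token the encoder can attend to
        lang_token = self.text_proj(text_emb).unsqueeze(1)
        tokens = torch.cat([lang_token, tokens], dim=1)
        encoded = self.encoder(tokens)[:, 1:]                          # drop the language token
        return self.decoder(encoded)                                   # predicted pixels for every patch


# Reconstruction loss computed only on the masked patches
model = LanguageConditionedMAE()
patches = torch.randn(2, 64, 16 * 16 * 3)   # 2 images, 64 flattened 16x16 RGB patches each
mask = torch.rand(2, 64) < 0.75             # mask roughly 75% of the patches
text_emb = torch.randn(2, 256)              # stand-in for an encoded prompt such as "a wooden chair"
pred = model(patches, mask, text_emb)
loss = ((pred - patches)[mask] ** 2).mean()
```

Because the loss is applied only to the masked patches, the encoder is pushed to infer the hidden object from its visible surroundings and from the language prompt rather than simply copying pixels.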
Architecture and Experimental Setup
The LanGWM architecture consists of three interlinked components: unsupervised language-grounded representation learning, a world model capable of predicting future environment states, and a controller that leverages these predictions to optimize actions.
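To make the division of labor concrete, here is a hedged sketch of how such components could fit together in PyTorch. The class names (`RepresentationLearner`, `WorldModel`, `Controller`), the latent and action dimensions, and the GRU-based dynamics are placeholder assumptions for illustration, not the paper's architecture.

```python
# Rough sketch of the three-component layout described above; all sizes and
# module choices are assumptions made for illustration.
import torch
import torch.nn as nn


class RepresentationLearner(nn.Module):
    """Language-grounded encoder: maps an observation to a latent feature vector."""
    def __init__(self, obs_dim=3 * 64 * 64, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, latent_dim), nn.ReLU())

    def forward(self, obs):
        return self.net(obs)


class WorldModel(nn.Module):
    """Predicts the next latent state and the reward from the current latent and action."""
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, latent, action):
        next_latent = self.dynamics(torch.cat([latent, action], dim=-1), latent)
        return next_latent, self.reward_head(next_latent)


class Controller(nn.Module):
    """Policy that picks actions from (predicted) latent states."""
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                    nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, latent):
        return self.policy(latent)


# One imagined step: encode the observation, pick an action, predict its consequences in latent space.
encoder, world_model, controller = RepresentationLearner(), WorldModel(), Controller()
obs = torch.randn(1, 3, 64, 64)
latent = encoder(obs)
action = controller(latent)
next_latent, predicted_reward = world_model(latent, action)
```

In model-based agents of this kind, the controller is typically trained on rollouts imagined by the world model in latent space, which is far cheaper than collecting additional experience in the real environment.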
In the experiments, a modified transformer-based masked autoencoder abstracts features from the visual observations, reconstructing the masked regions under the guidance of text prompts, while the world model uses the resulting representations to predict future environment states and rewards. LanGWM was evaluated in the iGibson 1.0 environment, focusing on PointGoal navigation tasks to test OoD generalization.
Results and Conclusion
LanGWM achieved state-of-the-art performance on OoD generalization at 100K interaction steps in the iGibson navigation tasks. Notably, explicit language-grounded visual representation learning significantly improved the model's robustness when navigating environments and textures not encountered during training.
In summary, LanGWM shows how integrating language with visual inputs can bolster RL models, particularly in handling OoD scenarios, paving the way for more generalizable agents in autonomous systems and human-robot interaction.