- The paper introduces LanGWM to integrate textual prompts with masked visual features for enhanced out-of-distribution generalization in RL.
- It employs a transformer-based masked autoencoder to predict masked objects and pixel reconstructions guided by language descriptions.
- Experiments on iGibson PointGoal navigation demonstrate state-of-the-art performance at 100K interaction steps.
Introduction
Reinforcement learning (RL) has achieved strong results across a variety of domains. One such domain is visual control, where an agent must interpret visual observations to navigate and act within its environment. However, these models tend to struggle to generalize to situations not observed during training, known as out-of-distribution (OoD) conditions.
Addressing OoD Generalization with Language Grounding
The Language Grounded World Model (LanGWM) has been proposed to address this challenge by using language to ground visual perception during learning. Unlike images, language describes high-level concepts and context in an abstract way, capturing regularities that can help a model generalize across visually different scenarios.
LanGWM specifically targets model-based RL, in which a model of the world is learned and used for planning. The technique pairs language prompts with masked regions in images, on the hypothesis that semantically related concepts expressed in language help the agent generalize when it encounters OoD examples. Concretely, the model is trained to predict masked objects and reconstruct their pixels from the corresponding text descriptions, tying visual features to language.
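As a minimal, illustrative sketch of this idea (not the authors' implementation), the snippet below shows a language-conditioned masked autoencoder in PyTorch. The `LanguageConditionedMAE` class, the module sizes, and the choice of prepending the text embedding as an extra token are assumptions made for clarity; the point is simply that masked image patches are reconstructed with the help of a text embedding.

```python
# Illustrative sketch only: a masked autoencoder whose encoder also attends to a
# language embedding, so masked object regions are reconstructed with text guidance.
# Module sizes and the conditioning scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LanguageConditionedMAE(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, embed_dim=256, text_dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)             # embed flattened image patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learned placeholder for masked patches
        self.text_proj = nn.Linear(text_dim, embed_dim)                # project the language prompt embedding
        enc_layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.decoder = nn.Linear(embed_dim, patch_dim)                 # reconstruct the pixels of each patch

    def forward(self, patches, mask, text_emb):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = masked; text_emb: (B, text_dim)
        tokens = self.patch_embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        # prepend the projected prompt as an extra token the encoder can attend to
        lang_token = self.text_proj(text_emb).unsqueeze(1)
        tokens = torch.cat([lang_token, tokens], dim=1)
        encoded = self.encoder(tokens)[:, 1:]                          # drop the language token
        return self.decoder(encoded)                                   # predicted pixels for every patch


# Reconstruction loss computed only on the masked patches
model = LanguageConditionedMAE()
patches = torch.randn(2, 64, 16 * 16 * 3)   # 2 images, 64 flattened 16x16 RGB patches each
mask = torch.rand(2, 64) < 0.75             # mask roughly 75% of the patches
text_emb = torch.randn(2, 256)              # stand-in for an encoded prompt such as "a wooden chair"
pred = model(patches, mask, text_emb)
loss = ((pred - patches)[mask] ** 2).mean()
```

Because the loss is applied only to the masked patches, the encoder is pushed to infer the hidden object from its visible surroundings and from the language prompt rather than simply copying pixels.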
Architecture and Experimental Setup
The LanGWM architecture consists of three interlinked components: unsupervised language-grounded representation learning, a world model capable of predicting future environment states, and a controller that leverages these predictions to optimize actions.
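To make the division of labor concrete, here is a hedged sketch of how such components could fit together in PyTorch. The class names (`RepresentationLearner`, `WorldModel`, `Controller`), the latent and action dimensions, and the GRU-based dynamics are placeholder assumptions for illustration, not the paper's architecture.

```python
# Rough sketch of the three-component layout described above; all sizes and
# module choices are assumptions made for illustration.
import torch
import torch.nn as nn


class RepresentationLearner(nn.Module):
    """Language-grounded encoder: maps an observation to a latent feature vector."""
    def __init__(self, obs_dim=3 * 64 * 64, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, latent_dim), nn.ReLU())

    def forward(self, obs):
        return self.net(obs)


class WorldModel(nn.Module):
    """Predicts the next latent state and the reward from the current latent and action."""
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, latent, action):
        next_latent = self.dynamics(torch.cat([latent, action], dim=-1), latent)
        return next_latent, self.reward_head(next_latent)


class Controller(nn.Module):
    """Policy that picks actions from (predicted) latent states."""
    def __init__(self, latent_dim=128, action_dim=2):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                    nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, latent):
        return self.policy(latent)


# One imagined step: encode the observation, pick an action, predict its consequences in latent space.
encoder, world_model, controller = RepresentationLearner(), WorldModel(), Controller()
obs = torch.randn(1, 3, 64, 64)
latent = encoder(obs)
action = controller(latent)
next_latent, predicted_reward = world_model(latent, action)
```

In model-based agents of this kind, the controller is typically trained on rollouts imagined by the world model in latent space, which is far cheaper than collecting additional experience in the real environment.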
In the experiments, a modified transformer-based masked autoencoder abstracts features from the visual observations, reconstructing the masked regions under the guidance of text prompts, while the world model uses the resulting representations to predict future environment states and rewards. LanGWM was evaluated in the iGibson 1.0 environment, focusing on PointGoal navigation tasks to test OoD generalization.
Results and Conclusion
LanGWM achieved state-of-the-art performance on OoD generalization at 100K interaction steps in the iGibson navigation tasks. Notably, explicit language-grounded visual representation learning significantly improved the model's robustness when navigating environments and textures not encountered during training.
In summary, LanGWM shows how integrating language with visual inputs can bolster RL models, particularly in handling OoD scenarios, paving the way for more generalizable agents in autonomous systems and human-robot interaction.