Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
This paper addresses the challenging problem of executing natural language instructions in environments perceived through raw visual data. Traditional methods generally rely on structured environment representations or an orchestration of separately trained models for tasks such as language understanding and visual reasoning. In contrast, this work proposes an integrated approach by learning a single model capable of converting both linguistic inputs and visual observations directly into actions.
Methodology
The proposed approach assumes no intermediate representation layers, planning procedures, or separate training phases for multiple models, setting it apart from established pipelines of individually trained components. Instead, a single neural network agent is trained with reinforcement learning (RL) in a contextual bandit setting, where each action receives an immediate reward rather than delayed, sequence-level feedback. Exploration is guided by reward shaping strategies that incorporate varying levels of supervision, ranging from full demonstrations to goal-state annotations.
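To make the contextual bandit formulation concrete, the sketch below shows a single policy-gradient update in which the agent samples one action for the current context (a fused image-and-instruction representation), receives an immediate shaped reward, and updates the policy on that single triple. The network sizes, action count, and reward function here are illustrative assumptions, not the paper's exact architecture or reward definition.

```python
import torch
import torch.nn as nn

class BanditPolicy(nn.Module):
    """Maps a fused image/instruction context vector to a distribution over
    discrete actions. Layer sizes and the action count are placeholders."""
    def __init__(self, context_dim=256, num_actions=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, context):
        return torch.distributions.Categorical(logits=self.net(context))

def bandit_update(policy, optimizer, context, reward_fn):
    """One contextual-bandit step: sample an action, observe an immediate
    shaped reward, and take a policy-gradient step on that single
    (context, action, reward) triple -- no multi-step credit assignment."""
    dist = policy(context)
    action = dist.sample()
    reward = reward_fn(context, action)          # immediate, shaped reward
    loss = -(dist.log_prob(action) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action, reward
```

The key property illustrated is that the update depends only on the reward observed for the sampled action in the current context, which is what distinguishes the contextual bandit setting from full sequential RL.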
The test domain is a block-world environment in which each instruction describes moving a single block. Given the instruction and raw RGB observations of the world state, the agent must predict a sequence of actions that carries out the instruction, reasoning jointly about the visual input and the language to select an appropriate action at each step.
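As a rough illustration of how the visual and linguistic inputs might be jointly encoded, the sketch below combines a small CNN over the RGB observation with an LSTM over instruction tokens and fuses the two into a single context vector. The specific layers, dimensions, and fusion-by-concatenation choice are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InstructionImageEncoder(nn.Module):
    """Encodes an RGB observation and a tokenized instruction into one
    context vector (illustrative layer sizes, not the paper's)."""
    def __init__(self, vocab_size, embed_dim=64, context_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),   # -> 32 * 4 * 4 = 512
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 128, batch_first=True)
        self.fuse = nn.Linear(512 + 128, context_dim)

    def forward(self, image, tokens):
        # image: (B, 3, H, W) RGB observation; tokens: (B, T) word ids
        visual = self.cnn(image)
        _, (hidden, _) = self.lstm(self.embed(tokens))    # final hidden state
        return torch.relu(self.fuse(torch.cat([visual, hidden[-1]], dim=-1)))
```

The resulting context vector can then feed a policy head such as the bandit policy sketched above, which produces a distribution over actions at each step.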
Numerical Results and Analysis
In experiments, the proposed model significantly outperforms baselines that rely solely on supervised learning or on standard reinforcement learning techniques. Within the block-world environment, the approach reduces execution error by 24% relative to supervised learning baselines and by 34-39% relative to other common RL variants.
The paper highlights how policy gradient methods combined with well-designed reward shaping enable effective exploration and successful instruction execution. Comparisons with deep Q-learning and with the REINFORCE policy gradient algorithm underscore both the size of the exploration space and the importance of an informative reward signal.
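One common way to supply such an informative signal, consistent with the levels of supervision described above, is potential-based reward shaping: the sparse task reward is augmented with the change in a potential function, for instance the negative distance to an annotated goal state or to a demonstrated trajectory. The sketch below is a generic illustration under that assumption, not the paper's exact reward definition; the potential function shown is hypothetical.

```python
def shaped_reward(base_reward, phi_prev, phi_next, gamma=1.0):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).
    With phi defined as, e.g., negative distance to an annotated goal state,
    any step that makes progress toward the goal earns positive shaping."""
    return base_reward + gamma * phi_next - phi_prev

def goal_distance_potential(state_coords, goal_coords):
    """Hypothetical potential: the closer the current block coordinates are
    to the goal configuration, the higher the potential value."""
    return -sum(abs(s - g) for s, g in zip(state_coords, goal_coords))
```

Because the shaping term telescopes over a trajectory, it steers exploration toward promising actions without changing which policies are ultimately optimal.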
Theoretical and Practical Implications
The implications of this research are both practical and theoretical. Practically, it demonstrates a way to reduce reliance on hand-engineered pipelines and encourages unified learning models for complex task environments involving natural language. Theoretically, it invites further work on integrating semantic parsing with neural architectures to improve instruction-following capabilities and generalization in noisy, real-world environments.
Future Developments
While the model's results on raw visual input are significant, future research could extend the approach to more complex, multi-step instructions, improve robustness to varied natural language expressions, and scale to diverse real-world settings. Attention mechanisms or memory-augmented networks could also be examined for their potential to handle extended instruction sequences and improve on the single-step attention underlying the current architecture.
The comprehensive exploration of combined reward shaping with contextual bandit reinforcement learning showcases the potential for this approach to elevate the field of instruction-following agents, pointing towards broader applications in AI-driven robotics and autonomous systems.