
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation (1811.10092v2)

Published 25 Nov 2018 in cs.CV, cs.AI, cs.CL, and cs.RO

Abstract: Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).

Authors (8)
  1. Xin Wang (1307 papers)
  2. Qiuyuan Huang (23 papers)
  3. Asli Celikyilmaz (81 papers)
  4. Jianfeng Gao (344 papers)
  5. Dinghan Shen (34 papers)
  6. Yuan-Fang Wang (18 papers)
  7. William Yang Wang (254 papers)
  8. Lei Zhang (1689 papers)
Citations (495)

Summary

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

The paper "Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation" addresses three pivotal challenges in Vision-Language Navigation (VLN): cross-modal grounding, ill-posed feedback, and generalization to unseen environments. The authors introduce a Reinforced Cross-Modal Matching (RCM) approach coupled with a Self-Supervised Imitation Learning (SIL) method to enhance the performance of VLN agents in real 3D environments.

Methodology

The RCM framework combines reinforcement learning (RL) with imitation learning (IL) to ground navigation in both the visual and linguistic modalities. It comprises two modules: a reasoning navigator and a matching critic. The reasoning navigator learns an RL-based policy that selects actions from cross-modal context in the local visual scene, while the matching critic provides an intrinsic reward by evaluating how well the executed trajectory aligns with the given instructions.

  • Reasoning Navigator: This component processes the visual and textual inputs at each step to decide the next navigation action. It combines a history context, a visually conditioned textual context, and a textually conditioned visual context to predict the navigation path; dot-product attention mechanisms and an LSTM are central to keeping the agent's current state aligned with the linguistic instructions (a minimal sketch of this attention follows the list).
  • Matching Critic: The critic delivers intrinsic rewards based on the cycle-reconstruction probability, measuring how well the trajectory aligns with the instructional intent. This probability serves as an intrinsic reward that aids in tuning the navigator's actions beyond what is possible through extrinsic environment feedback alone.
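
The sketch below illustrates the kind of dot-product attention the navigator could use to form these cross-modal contexts. The tensor names, shapes, and the particular two-stage attention order are assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    """Scaled dot-product attention over a sequence.

    query:  (batch, dim)       e.g. the navigator's LSTM history state
    keys:   (batch, seq, dim)  e.g. encoded instruction words or view features
    values: (batch, seq, dim)  same source as keys
    Returns a context vector of shape (batch, dim).
    """
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)     # (batch, seq)
    weights = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)  # attention weights
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1)   # weighted sum

# Illustrative shapes (assumed, not from the paper):
batch, n_words, n_views, dim = 2, 12, 36, 512
h_t = torch.randn(batch, dim)                  # history context from the LSTM
word_enc = torch.randn(batch, n_words, dim)    # encoded instruction tokens
view_feat = torch.randn(batch, n_views, dim)   # panoramic visual features

# A history-conditioned textual context, then a textually conditioned visual
# context, each obtained by attending with the previously computed vector.
text_ctx = dot_product_attention(h_t, word_enc, word_enc)
visual_ctx = dot_product_attention(text_ctx, view_feat, view_feat)
```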

Learning Framework

The proposed learning framework begins with supervised learning to rapidly approximate a baseline policy using demonstration actions. Subsequently, reinforcement learning refines this policy using both extrinsic and intrinsic reward functions. The extrinsic reward is derived from the agent's navigation accuracy and its proximity to the target, while the intrinsic reward stems from the matching critic's evaluation of trajectory-instruction alignment.
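
As a concrete illustration of how these two signals might be combined, the sketch below computes an intrinsic reward as the normalized likelihood of reconstructing the instruction from the executed trajectory and mixes it with an extrinsic reward in a REINFORCE-style loss. The function names, tensor shapes, and the mixing weight delta are assumptions for this sketch, not values from the paper.

```python
import torch
import torch.nn.functional as F

def cycle_reconstruction_reward(decoder_logits, instruction_ids):
    """Intrinsic reward sketch: likelihood of reconstructing the instruction
    from the executed trajectory.

    decoder_logits:  (num_words, vocab) scores from a trajectory-conditioned
                     instruction decoder (assumed to exist for this sketch)
    instruction_ids: (num_words,) token ids of the original instruction
    Returns a float in (0, 1]; higher means better alignment.
    """
    log_probs = F.log_softmax(decoder_logits, dim=-1)
    token_ll = log_probs.gather(1, instruction_ids.unsqueeze(1)).squeeze(1)
    return token_ll.mean().exp().item()

def mixed_policy_loss(action_log_probs, extrinsic, intrinsic, delta=0.5):
    """REINFORCE-style loss mixing extrinsic and intrinsic rewards.

    action_log_probs: (T,) log-probabilities of the sampled actions
    extrinsic:        scalar environment reward (goal progress / success bonus)
    intrinsic:        scalar matching-critic reward from the function above
    delta:            mixing weight; the value here is an assumption
    """
    reward = extrinsic + delta * intrinsic     # rewards are constants w.r.t. the policy
    return -(reward * action_log_probs).sum()  # minimize negative expected return
```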

Self-Supervised Imitation Learning (SIL): To address the significant gap between performance in seen and unseen environments, SIL lets the agent explore new environments without ground-truth supervision. The agent samples trajectories for unlabeled instructions, stores the best ones (as scored by the matching critic) in a replay buffer, and imitates them to continuously refine its policy.
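
A minimal sketch of the SIL bookkeeping described above: per instruction, keep only the best self-generated trajectory and imitate it with a behavior-cloning loss. Class and function names here are illustrative, not the authors' code.

```python
class SILBuffer:
    """Replay buffer for Self-Supervised Imitation Learning (sketch).

    For each instruction, keep only the agent's best self-generated
    trajectory, scored by the matching critic; these trajectories are
    later imitated.
    """

    def __init__(self):
        self.best = {}  # instruction_id -> (critic_score, trajectory)

    def add(self, instruction_id, trajectory, critic_score):
        # Keep the new trajectory only if it beats the stored one.
        stored = self.best.get(instruction_id)
        if stored is None or critic_score > stored[0]:
            self.best[instruction_id] = (critic_score, trajectory)

    def trajectories(self):
        # Pairs of (instruction_id, best trajectory) for the imitation update.
        return [(key, traj) for key, (_, traj) in self.best.items()]


def sil_loss(action_log_probs):
    """Behavior-cloning objective on the agent's own best past actions:
    maximize their log-likelihood under the current policy
    (action_log_probs is a tensor of per-step log-probabilities)."""
    return -action_log_probs.sum()
```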

Results

The methodology was evaluated on the Room-to-Room (R2R) dataset, a benchmark for vision-language navigation. Experimental results show that the RCM model outperforms prior methods by 10% on SPL, establishing a new state of the art. Furthermore, SIL dramatically reduces the success-rate gap between seen and unseen environments from 30.7% to 11.7%, demonstrating its effectiveness in improving generalization and policy efficiency.

Implications and Future Directions

The implications of this research are significant for developing autonomous systems requiring dynamic interaction within their environments, such as in-home robots and personal assistants. The integration of cross-modal reinforcement learning with self-supervised strategies exemplifies a compelling approach to refining agent adaptability in unexplored terrain.

Future research could explore more sophisticated reward mechanisms and alternative self-supervised learning strategies to further bridge the generalization gap inherent in navigation tasks. Moreover, expanding this framework to accommodate more complex instruction and environment contexts could broaden its practical applications, potentially guiding advancements in AI’s situated language understanding and interaction capabilities.