- The paper addresses the lack of instruction fidelity evaluation in Vision-and-Language Navigation (VLN) and proposes new methods to improve it.
- It introduces Coverage weighted by Length Score (CLS) as a new metric to better evaluate how well an agent adheres to the entire path specified in instructions, unlike traditional goal-focused metrics.
- The authors propose the Room-for-Room (R4R) dataset, an extension of R2R with longer, more complex paths, demonstrating that agents trained with instruction fidelity rewards perform better.
Instruction Fidelity in Vision-and-Language Navigation
The research paper "Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation" addresses a gap in how Vision-and-Language Navigation (VLN) is evaluated. VLN tasks require agents to interpret natural language instructions, ground them in visual observations of the environment, and navigate accordingly. Despite notable progress in VLN, existing evaluation metrics largely emphasize goal achievement rather than adherence to the route the instructions actually describe. This paper examines that shortcoming and proposes new methodology to measure and improve instruction fidelity in VLN tasks.
Advances in Metrics
Traditional VLN metrics such as Success Rate (SR) and Success weighted by Path Length (SPL) are concerned primarily with whether the agent reaches its goal location; they say little about whether the agent followed the path specified by the instructions. The paper critiques these conventional metrics for failing to account for instruction fidelity and introduces a new metric, Coverage weighted by Length Score (CLS). CLS scores how closely an agent's entire trajectory conforms to the full reference path, combining a path-coverage term (how much of the reference path the trajectory visits) with a length score (how well the trajectory's length matches the coverage-weighted reference length), giving a more comprehensive evaluation of how well an agent adheres to its instructions.
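Following the published definition, CLS is the product of a path-coverage term PC and a length score LS. The sketch below is a minimal Python rendering under some simplifying assumptions: paths are lists of (x, y) waypoints, coverage uses node-to-node distances, and the soft threshold `d_th` value is illustrative rather than taken from the paper's setup.

```python
import math

def path_length(path):
    """Total Euclidean length of a path given as (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def coverage(pred, ref, d_th=3.0):
    """PC(P, R): mean soft coverage of each reference node by the
    predicted trajectory, decaying with distance at scale d_th."""
    return sum(
        math.exp(-min(math.dist(r, p) for p in pred) / d_th) for r in ref
    ) / len(ref)

def cls_score(pred, ref, d_th=3.0):
    """Coverage weighted by Length Score: CLS = PC * LS, where LS
    compares the predicted length to the expected path length PC * len(R)."""
    pc = coverage(pred, ref, d_th)
    epl = pc * path_length(ref)  # expected path length
    denom = epl + abs(epl - path_length(pred))
    ls = epl / denom if denom > 0 else 0.0
    return pc * ls
```

A trajectory identical to the reference scores 1.0; a trajectory that covers every reference node but takes a detour twice as long is penalized by the length score.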
Room-for-Room (R4R) Dataset
The Room-to-Room (R2R) dataset consists of direct-to-goal shortest paths, which makes it a poor testbed for instruction adherence: on a shortest path, an agent that simply heads for the goal is indistinguishable from one that follows the instructions. The authors therefore propose the Room-for-Room (R4R) dataset, which algorithmically extends R2R by joining pairs of existing paths into longer, more challenging routes. Because R4R paths include twists and turns rather than direct routes, they offer a far better opportunity to assess whether an agent actually follows its instructions.
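The joining step described above can be sketched as follows. The episode schema (dicts with `path` and `instruction` keys) and the exact-endpoint matching rule are simplifying assumptions for illustration, not the dataset's actual format:

```python
def join_paths(r2r):
    """Sketch of the R4R-style construction: concatenate pairs of R2R
    episodes where the first path ends at the node where the second
    begins, joining their instructions as well."""
    joined = []
    for a in r2r:
        for b in r2r:
            if a is b:
                continue
            # join when the first path's endpoint matches the second's start
            if a["path"][-1] == b["path"][0]:
                joined.append({
                    "path": a["path"] + b["path"][1:],  # drop shared node
                    "instruction": a["instruction"] + " " + b["instruction"],
                })
    return joined
```

Note that the joined path may double back on itself, which is exactly what makes R4R routes harder: reaching the final goal no longer implies having followed the described route.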
Using R4R, the paper demonstrates that agents rewarded for instruction fidelity outperform those rewarded solely for goal completion. In particular, Reinforced Cross-modal Matching (RCM) models trained with a fidelity-oriented reward signal showed marked improvement in both CLS and navigation error. This underlines the importance of building instruction adherence into both VLN evaluation metrics and agent training.
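One way to picture a fidelity-oriented reward is as shaping that pays the agent for staying near the reference path rather than only for reaching the goal. The sketch below is an illustrative potential-based scheme under assumed (x, y) positions; it is not the paper's exact reward formulation:

```python
import math

def fidelity_reward(prev_pos, new_pos, ref, d_th=3.0):
    """Illustrative fidelity shaping reward (an assumption, not the
    paper's exact signal): the change in soft proximity to the nearest
    reference waypoint when the agent moves from prev_pos to new_pos."""
    def potential(pos):
        # soft proximity to the closest reference waypoint
        return math.exp(-min(math.dist(pos, r) for r in ref) / d_th)
    return potential(new_pos) - potential(prev_pos)
```

Under this scheme a step toward the reference path earns a positive reward and a step away earns a negative one, regardless of progress toward the goal, which captures the qualitative difference from a purely goal-oriented signal.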
Implications for Future Research
The findings carry theoretical and practical implications for VLN systems and for AI navigation tasks more broadly. By demonstrating the importance of instruction fidelity, the research prompts a reevaluation of how agents are trained and evaluated. Adopting metrics like CLS and datasets like R4R could improve interaction between humans and AI, especially in scenarios that demand precise adherence to detailed instructions.
Overall, "Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation" makes a persuasive case that the full content of language instructions, not merely the goal they lead to, deserves a central place in VLN evaluation and training, offering a critical perspective for future work in AI navigation.