- The paper proposes a candidate waypoint predictor that bridges the performance gap between Vision-and-Language Navigation agents trained in discrete environments and those deployed in continuous ones.
- From visual observations, the predictor produces mixture-of-Gaussian maps over nearby navigable positions in continuous space, so the agent can continue to make high-level action decisions.
- Empirical results show the method improves navigation performance in continuous environments (e.g., an 11.76% SPL increase for CMA), achieving new state-of-the-art results on the R2R-CE and RxR-CE datasets.
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
The paper addresses a significant challenge in Vision-and-Language Navigation (VLN): the gap between navigating in discrete and in continuous environments. Discrete environments, characterized by predefined connectivity graphs such as those provided with the Matterport3D (MP3D) simulator, let agents navigate efficiently with high-level actions that jump between adjacent graph nodes. In contrast, continuous environments, such as those simulated in Habitat, have no such predefined structure, forcing agents to infer navigability and execute low-level controls. This discrepancy makes it difficult and inefficient to transfer agents and findings from the discrete to the continuous setting.
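To make the contrast concrete, the minimal sketch below (illustrative names, not the paper's code) shows the two action interfaces; the step sizes follow the common Habitat defaults of a 0.25 m forward step and 15-degree turns.

```python
# Illustrative contrast between high-level graph actions and low-level continuous controls.
import math
from typing import Dict, List, Tuple


def discrete_step(connectivity: Dict[str, List[str]], node: str, choice: int) -> str:
    """High-level action: teleport to one of the current node's neighbours on the graph."""
    return connectivity[node][choice]


def continuous_step(position: Tuple[float, float], heading: float, action: str):
    """Low-level action: update the agent's pose by a fixed increment."""
    x, y = position
    if action == "MOVE_FORWARD":          # 0.25 m step (assumed Habitat default)
        x += 0.25 * math.cos(heading)
        y += 0.25 * math.sin(heading)
    elif action == "TURN_LEFT":           # 15-degree turn
        heading += math.radians(15)
    elif action == "TURN_RIGHT":
        heading -= math.radians(15)
    return (x, y), heading
```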
To bridge this gap, the authors propose a candidate waypoint predictor, which estimates navigable positions in continuous environments, allowing an agent trained with the high-level actions of discrete environments to be adapted to continuous ones. By deploying the waypoint predictor, the work emulates the benefits of discrete high-level action decisions, such as rapid learning convergence and effective path planning, within continuous spaces.
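The control loop this enables can be summarised as follows. The sketch is a hypothetical interface, with `predict_waypoints`, `choose`, and `follow` standing in for the predictor, the VLN agent, and a low-level point-goal controller; none of these names are the authors' actual API.

```python
from typing import Callable, Dict, List, Tuple

Waypoint = Tuple[float, float]  # (relative heading in radians, distance in metres)


def navigate_with_waypoints(
    reset: Callable[[], Dict],
    predict_waypoints: Callable[[Dict], List[Waypoint]],
    choose: Callable[[str, Dict, List[Waypoint]], int],
    follow: Callable[[Waypoint], Dict],
    instruction: str,
    max_decisions: int = 15,
) -> None:
    """Run one episode in which every decision is a high-level waypoint choice."""
    obs = reset()
    for _ in range(max_decisions):
        candidates = predict_waypoints(obs)         # local, dynamic "connectivity graph"
        idx = choose(instruction, obs, candidates)  # same interface as picking a graph neighbour
        if idx < 0:                                 # convention here: negative index means STOP
            break
        obs = follow(candidates[idx])               # low-level turn/forward control to the waypoint
```

Because `choose` only ever sees a short list of candidate positions, a policy trained with discrete graph actions can be reused largely unchanged.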
The waypoint predictor uses visual observations to infer a local set of navigable waypoints, which serves as a dynamic connectivity graph centred on the agent's current location. It is trained on a version of the MP3D graph refined to fit the continuous Habitat environments, with ground-truth waypoints represented as a mixture of Gaussian probability maps. The authors further augment the predicted waypoints during training to diversify views and paths, which improves the agent's generalization.
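A rough sketch of how such a mixture-of-Gaussians target could be constructed is shown below; the heading/distance discretisation and the Gaussian widths are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np


def gaussian_target_map(
    neighbours,                   # list of (heading_rad, distance_m) of adjacent graph nodes
    n_heading_bins: int = 120,    # assumed discretisation, for illustration only
    n_distance_bins: int = 12,
    max_distance: float = 3.0,
    sigma_heading: float = 2.0,   # std-devs in bin units (illustrative values)
    sigma_distance: float = 1.0,
) -> np.ndarray:
    """Build a mixture-of-Gaussians heatmap marking nearby navigable waypoints.

    Each ground-truth neighbour contributes one Gaussian bump centred at its
    (heading, distance) bin; a predictor would be trained to regress this map
    from panoramic visual observations.
    """
    target = np.zeros((n_heading_bins, n_distance_bins), dtype=np.float32)
    h_idx = np.arange(n_heading_bins)[:, None]
    d_idx = np.arange(n_distance_bins)[None, :]
    for heading, distance in neighbours:
        h_c = (heading % (2 * np.pi)) / (2 * np.pi) * n_heading_bins
        d_c = np.clip(distance, 0.0, max_distance) / max_distance * (n_distance_bins - 1)
        # Circular distance along the heading dimension.
        dh = np.minimum(np.abs(h_idx - h_c), n_heading_bins - np.abs(h_idx - h_c))
        bump = np.exp(-0.5 * ((dh / sigma_heading) ** 2 + ((d_idx - d_c) / sigma_distance) ** 2))
        # Combine bumps with an element-wise max so each peak stays at 1.
        target = np.maximum(target, bump)
    return target
```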
Empirical results demonstrate the effectiveness of this approach. When agents navigate continuous environments through predicted waypoints rather than raw low-level control actions, the paper reports a marked reduction in the discrete-to-continuous performance gap: Success weighted by Path Length (SPL) improves by 11.76% for the cross-modal attention (CMA) agent and by 18.24% for the recurrent VLN-BERT agent.
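For reference, SPL (Anderson et al., 2018) weights each successful episode by the ratio of the shortest-path length to the length of the path actually taken; a minimal computation looks like this:

```python
def spl(successes, shortest_lengths, path_lengths) -> float:
    """Success weighted by Path Length: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += float(s) * l / max(p, l)
    return total / len(successes)
```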
The ability to act through high-level decisions in continuous spaces not only broadens the applicability of existing agents but also yields new state-of-the-art performance on the R2R-CE and RxR-CE datasets, outperforming prior methods, including those that address the discrete and continuous settings independently.
This research points to significant future developments in embodied AI, where agents interact with real-world environments. Bridging the discrete and continuous navigation paradigms can improve interaction strategies, learning efficiency, and navigation performance across contexts ranging from virtual simulations to practical robotic applications. Future directions might explore state-conditioned waypoint generators and adaptive agents that refine their predictions using accumulated navigation experience. Overall, the paper contributes a critical piece towards more realistic and adaptable VLN models.