Scaling Data Generation in Vision-and-Language Navigation (2307.15644v2)

Published 28 Jul 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

Citations (37)

Summary

  • The paper demonstrates that generating extensive, fully connected navigation graphs significantly improves path sampling for VLN tasks.
  • It employs Co-Modulated GANs to restore visual fidelity, leading to notable improvements in agent navigation success rates.
  • Empirical results reveal that the proposed methods reduce the generalization gap and achieve up to 80% success on various VLN benchmarks.

Scaling Data Generation in Vision-and-Language Navigation: A Comprehensive Overview

The manuscript "Scaling Data Generation in Vision-and-Language Navigation" explores the pressing issue of data scarcity in Vision-and-Language Navigation (VLN) tasks and presents an extensive strategy for data augmentation to improve the learning process of navigational agents. This document underscores the significance of having diverse visual-language datasets to enhance generalization to unseen environments and introduces a methodological approach leveraging 1200+ environments from HM3D and Gibson datasets, producing a rich corpus of 4.9 million instruction-trajectory pairs.

Methodological Evaluation

The authors address several pivotal sub-problems intrinsic to scaling data generation:

  1. Creation of Navigation Graphs: The paper distinguishes its approach by generating fully connected navigation graphs with extensive coverage of open space, enhancing the feasibility of path sampling within each environment. The graphs are constructed via a dense viewpoint sampling and aggregation algorithm designed for practical scene navigation, a departure from prior work such as AutoVLN, whose graphs are sparse and occasionally impractical.
  2. Image Quality Enhancement: The team's use of Co-Modulated GANs to restore the fidelity of images rendered from the HM3D and Gibson datasets marks a crucial advance in mitigating the visual noise that could impede agent training. This enhancement yields measurable gains in agent performance, confirming the GAN's efficacy.
  3. Modeling and Data Utilization: By employing a traditional LSTM-based model to generate navigation instructions, the paper provides robust evidence for the utility of the generated dataset, demonstrating significant improvements in task-completion success rate (SR) and in navigation of unseen environments.
  4. Empirical Insights: A clear delineation of results is provided, showing that with the proposed ScaleVLN paradigm, agents reach up to 80% SR on the R2R test split and set new state-of-the-art results on CVDN, REVERIE, and R2R in continuous environments. This substantial improvement underscores the positive impact of a large-scale, diverse training set coupled with high-fidelity visual data.
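The graph-construction step described above can be sketched in miniature: densely sample candidate viewpoints over free space, merge near-duplicates, connect nearby viewpoints with edges, and verify that the result is a single connected component. This is an illustrative toy version, not the paper's implementation; all function names, radii, and the grid map are assumptions.

```python
# Hypothetical sketch of dense viewpoint sampling and aggregation for a
# navigation graph over a 2D free-space map. Thresholds (merge_radius,
# edge_radius) are illustrative placeholders, not the paper's values.
from itertools import combinations
import math

def sample_viewpoints(free_cells, step=1):
    """Densely sample candidate viewpoints from traversable cells."""
    return [c for c in free_cells if c[0] % step == 0 and c[1] % step == 0]

def aggregate(points, merge_radius=0.9):
    """Merge viewpoints closer than merge_radius, keeping one per cluster."""
    kept = []
    for p in points:
        if all(math.dist(p, q) > merge_radius for q in kept):
            kept.append(p)
    return kept

def build_graph(points, edge_radius=1.5):
    """Connect viewpoints within edge_radius. A real pipeline would also
    check line-of-sight / navigability between the two endpoints."""
    edges = {p: set() for p in points}
    for p, q in combinations(points, 2):
        if math.dist(p, q) <= edge_radius:
            edges[p].add(q)
            edges[q].add(p)
    return edges

def is_connected(edges):
    """BFS check that every viewpoint sits in one traversable component."""
    if not edges:
        return True
    start = next(iter(edges))
    seen, stack = {start}, [start]
    while stack:
        for n in edges[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen) == len(edges)

# Toy 4x4 open room: every cell is traversable.
free = [(x, y) for x in range(4) for y in range(4)]
graph = build_graph(aggregate(sample_viewpoints(free)))
```

On this toy map every sampled viewpoint survives aggregation and the resulting graph is fully connected; in a real scan, the aggregation and navigability checks are what separate a dense, usable graph from the sparse ones the paper criticizes.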

Implications and Future Direction

The results reveal a reduction in the generalization gap to less than 1% between seen and unseen scenes, a notable milestone in VLN research. This highlights the potential of the proposed data-driven strategies not only to raise baseline performance but also to equalize agents' navigation reliability across varied environments.

The deeper implication lies in validating the influence of data scale and diversity on VLN tasks, reinforcing the hypothesis that environments and high-quality sensory inputs significantly shape navigation policies. Future research may further refine navigation graphs and incorporate advanced models for path prediction and instruction generation that exploit the particular characteristics of these varied environments.

Moreover, the paper suggests extensions to navigation in continuous environments, where data generated in discrete settings still transfers successfully, an exciting avenue for bridging simulation-to-real-world discrepancies with AI capable of nuanced environmental interaction.

In conclusion, "Scaling Data Generation in Vision-and-Language Navigation" offers a methodologically sound and thoroughly evaluated approach to building datasets that go beyond the current state of the art in VLN, suggesting clear paths for the academic and applied progression of AI navigation tasks. The synergy between technical augmentation and empirical analysis fortifies the paper's contributions to the fields of artificial intelligence and robotics.
