History Aware Multimodal Transformer for Vision-and-Language Navigation (2110.13309v2)

Published 25 Oct 2021 in cs.CV and cs.AI

Abstract: Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history. It, then, jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks including single step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.

History Aware Multimodal Transformer for Vision-and-Language Navigation

The paper "History Aware Multimodal Transformer for Vision-and-Language Navigation" presents an innovative approach to vision-and-language navigation (VLN) by introducing a History Aware Multimodal Transformer (HAMT). The principal aim of VLN tasks is to train autonomous agents capable of interpreting instructions and navigating through complex environments. While conventional methods predominantly employ recurrent states to track histories, HAMT leverages a robust transformer architecture to incorporate comprehensive historical context into its decision-making process.

Methodological Advancements

HAMT distinguishes itself by encoding a long-horizon history with a hierarchical vision transformer (ViT). Each panoramic observation is encoded in three stages: individual view images are first embedded with ViT, spatial relations among views within the same panorama are then modeled, and finally temporal relations across the panoramas in the history are captured. The resulting history representation is jointly processed with the textual instruction and the current visual observation to predict the next navigation action.
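To make the hierarchy concrete, the sketch below outlines the three encoding stages in PyTorch. Module names, layer counts, and the mean-pooling step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalHistoryEncoder(nn.Module):
    """Minimal sketch of hierarchical history encoding: per-view ViT features
    -> spatial transformer over views in a panorama -> temporal transformer
    over panoramas in the history. Dimensions and depths are illustrative."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Models spatial relations among the views of one panorama.
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers=2)
        # Models temporal relations among the panoramas visited so far.
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers=2)

    def forward(self, view_feats):
        # view_feats: (T, V, D) = (history length, views per panorama, ViT dim),
        # where the per-view features are assumed to come from a ViT backbone.
        pano_tokens = self.spatial_encoder(view_feats)           # (T, V, D)
        pano_emb = pano_tokens.mean(dim=1)                       # pool views -> (T, D)
        history_emb = self.temporal_encoder(pano_emb.unsqueeze(0))  # (1, T, D)
        return history_emb.squeeze(0)                            # (T, D) history tokens

# Example: a history of 5 panoramas with 36 views each and 768-dim features.
encoder = HierarchicalHistoryEncoder()
history_tokens = encoder(torch.randn(5, 36, 768))  # -> shape (5, 768)
```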

The training regimen for HAMT has two stages. The model is first trained end-to-end with supervised proxy tasks such as single-step action prediction and spatial relation prediction; these tasks stabilize learning given the vast search space and sparse rewards typical of reinforcement learning in navigation. Reinforcement learning is then used to fine-tune the navigation policy, improving decision-making across diverse scenarios.
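A hedged sketch of this two-stage regimen is shown below. The model interface, batch fields, loss weights, and rollout details are hypothetical placeholders standing in for the paper's proxy-task and RL objectives.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, batch, optimizer, sprel_weight=0.5):
    """Stage 1: supervised proxy tasks. `model`, `batch` fields, and the
    loss weighting are assumed interfaces, not the authors' code."""
    # Single-step action prediction: cross-entropy over candidate actions.
    action_logits = model(batch["instruction"], batch["history"], batch["observation"])
    sap_loss = F.cross_entropy(action_logits, batch["teacher_action"])
    # Spatial relation prediction, sketched here as a classification term.
    sprel_logits = model.predict_spatial_relation(batch["view_pairs"])
    sprel_loss = F.cross_entropy(sprel_logits, batch["relation_labels"])
    loss = sap_loss + sprel_weight * sprel_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_finetuning_step(model, episode, optimizer, il_weight=0.2):
    """Stage 2: policy-gradient fine-tuning of the navigation policy, mixed
    with an imitation term for stability (a common VLN recipe)."""
    log_probs, rewards, il_loss = episode  # produced by a hypothetical rollout
    # Undiscounted reward-to-go for each step (discounting omitted for brevity).
    returns = torch.cumsum(torch.flip(rewards, [0]), 0).flip(0)
    rl_loss = -(log_probs * returns.detach()).sum()
    loss = rl_loss + il_weight * il_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```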

Quantitative and Qualitative Performance

HAMT achieves state-of-the-art results across several VLN benchmarks and is particularly strong on long-horizon tasks. It outperforms prior models on datasets such as R2R, RxR, R4R, and the newly proposed R2R-Back. Its high success rates and efficient paths are credited to its ability to encode and exploit extensive past observations. The model also performs well with high-level instructions, where detailed step-by-step guidance is absent, demonstrating its ability to generalize to new, unseen environments.

The paper highlights an improvement of roughly 6-10% in SPL (Success weighted by Path Length) over contemporary baselines. This gain is attributed to HAMT's multimodal fusion and its inclusion of extensive historical context through the hierarchical ViT architecture.
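For reference, SPL (Anderson et al., 2018) weights binary episode success by path efficiency. A minimal implementation of the standard formula is sketched below; the function name is illustrative.

```python
def success_weighted_path_length(successes, shortest_lengths, path_lengths):
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is
    binary success, l_i the shortest-path length from start to goal, and
    p_i the length of the path the agent actually took."""
    assert len(successes) == len(shortest_lengths) == len(path_lengths)
    total = sum(
        s * (l / max(p, l))
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    )
    return total / len(successes)
```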

Implications and Future Directions

The proposed innovations in HAMT extend beyond immediate VLN tasks, potentially impacting broader AI fields that integrate sequence-based decision making, such as autonomous driving, robotics, and resource-constrained exploration domains. HAMT's hierarchical encoding paves the way for comprehensive historical context integration in diverse AI systems, leading to more refined and context-aware autonomous agents.

Future work could leverage larger VLN datasets for further pretraining, strengthening HAMT's generalization. In addition, extending the model to continuous action spaces, a current limitation of its discrete decisions over navigable viewpoints, would broaden its applicability, moving from navigation over predefined viewpoint graphs toward more fluid, adaptive real-world interaction.

In summary, HAMT represents a substantive step forward in multimodal transformers' application to autonomous navigation tasks, pushing the boundaries for what current systems can achieve in terms of interpretative depth and navigational precision.

Authors (4)
  1. Shizhe Chen (52 papers)
  2. Pierre-Louis Guhur (6 papers)
  3. Cordelia Schmid (206 papers)
  4. Ivan Laptev (99 papers)
Citations (189)