
BEVBert: Multimodal Map Pre-training for Language-guided Navigation (2212.04385v2)

Published 8 Dec 2022 in cs.CV, cs.AI, cs.CL, and cs.RO

Abstract: Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. Thus, we propose a new map-based, spatially aware pre-training paradigm for VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances VLN's demand for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
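The hybrid map idea in the abstract can be sketched in a few lines: a local metric map that fuses overlapping view features by grid cell (removing duplicates via averaging), plus a global topological map over visited viewpoints for long-horizon planning. This is a minimal illustrative sketch, not BEVBert's actual implementation; the cell size, averaging fusion rule, and BFS planner are all simplifying assumptions.

```python
from collections import defaultdict, deque


class LocalMetricMap:
    """Toy egocentric metric map.

    Observations landing in the same grid cell are averaged, mirroring the
    paper's idea of explicitly aggregating incomplete, duplicate panorama
    views. Cell size and the mean-fusion rule are illustrative assumptions.
    """

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # cell index -> list of feature vectors

    def _key(self, xy):
        return (int(xy[0] // self.cell_size), int(xy[1] // self.cell_size))

    def add_observation(self, xy, feature):
        self.cells[self._key(xy)].append(feature)

    def cell_feature(self, xy):
        """Fused (mean) feature for the cell containing position xy."""
        feats = self.cells[self._key(xy)]
        dim = len(feats[0])
        return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]


class TopoMap:
    """Toy global topological map: viewpoints as nodes, traversability as edges."""

    def __init__(self):
        self.edges = defaultdict(set)

    def connect(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def shortest_hops(self, start, goal):
        """BFS hop count between viewpoints; stands in for long-term planning."""
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == goal:
                return dist
            for nb in self.edges[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, dist + 1))
        return None
```

For example, two overlapping views of the same 0.5 m cell fuse into one feature, while the topological map answers "how many hops to the goal viewpoint" for planning.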

Authors (7)
  1. Dong An (43 papers)
  2. Yuankai Qi (46 papers)
  3. Yangguang Li (44 papers)
  4. Yan Huang (180 papers)
  5. Liang Wang (512 papers)
  6. Tieniu Tan (119 papers)
  7. Jing Shao (109 papers)
Citations (33)