Cross-modal Map Learning for Vision and Language Navigation (2203.05137v3)

Published 10 Mar 2022 in cs.CV and cs.RO

Abstract: We consider the problem of Vision-and-Language Navigation (VLN). The majority of current VLN methods are trained end-to-end using either unstructured memory such as an LSTM or cross-modal attention over the agent's egocentric observations. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
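As a rough illustration of the cross-modal attention the abstract describes, the sketch below fuses encoded instruction tokens into each cell of a flattened egocentric map. This is a minimal sketch, not the authors' released code: the single-head design, module names, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of language-to-map cross-modal attention (assumed design,
# not the paper's implementation). Map cells act as queries; instruction
# tokens supply keys and values.
import torch
import torch.nn as nn

class CrossModalMapAttention(nn.Module):
    def __init__(self, map_dim=128, lang_dim=256, hidden_dim=128):
        super().__init__()
        self.q = nn.Linear(map_dim, hidden_dim)   # queries from map cells
        self.k = nn.Linear(lang_dim, hidden_dim)  # keys from language tokens
        self.v = nn.Linear(lang_dim, hidden_dim)  # values from language tokens

    def forward(self, map_feats, lang_feats):
        # map_feats:  (B, H*W, map_dim)  flattened egocentric map features
        # lang_feats: (B, T, lang_dim)   encoded instruction tokens
        q = self.q(map_feats)                     # (B, H*W, hidden)
        k = self.k(lang_feats)                    # (B, T, hidden)
        v = self.v(lang_feats)                    # (B, T, hidden)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5  # (B, H*W, T)
        attn = torch.softmax(scores, dim=-1)
        return attn @ v                           # language-grounded map features

# Usage: language-grounded map features would then feed prediction heads,
# e.g. for semantics of unobserved regions and goal waypoints (hypothetical).
fuser = CrossModalMapAttention()
map_feats = torch.randn(2, 24 * 24, 128)  # 24x24 egocentric map (assumed size)
lang_feats = torch.randn(2, 40, 256)      # 40 instruction tokens (assumed)
fused = fuser(map_feats, lang_feats)      # (2, 576, 128)
```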

Authors (7)
  1. Georgios Georgakis (19 papers)
  2. Karl Schmeckpeper (19 papers)
  3. Karan Wanchoo (2 papers)
  4. Soham Dan (41 papers)
  5. Eleni Miltsakaki (4 papers)
  6. Dan Roth (222 papers)
  7. Kostas Daniilidis (119 papers)
Citations (52)
