
Less is More: Generating Grounded Navigation Instructions from Landmarks (2111.12872v4)

Published 25 Nov 2021 in cs.CV and cs.CL

Abstract: We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator, which is a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 971k English, Hindi, and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.

Authors (10)
  1. Su Wang (66 papers)
  2. Ceslee Montgomery (4 papers)
  3. Jordi Orbay (4 papers)
  4. Vighnesh Birodkar (16 papers)
  5. Aleksandra Faust (60 papers)
  6. Izzeddin Gur (23 papers)
  7. Natasha Jaques (32 papers)
  8. Austin Waters (10 papers)
  9. Jason Baldridge (45 papers)
  10. Peter Anderson (30 papers)
Citations (55)