
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation (2401.07314v3)

Published 14 Jan 2024 in cs.AI, cs.CV, and cs.RO

Abstract: Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that our MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on R2R and REVERIE simultaneously (~10% and ~12% improvements in SR), and showcasing the newly emergent global thinking and path planning abilities of the GPT.
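To make the map-guided prompting idea concrete, below is a minimal sketch of how an online topological map might be verbalized into node descriptions and connectivity relations for a GPT prompt. This is not the authors' implementation: the `Node` structure, the `build_map_prompt` function, and the prompt wording are illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch of turning an online topological map into a linguistic
# prompt, in the spirit of MapGPT's map-guided prompting. All names and prompt
# wording are assumptions, not the paper's actual implementation.
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    caption: str                 # brief visual description of the viewpoint
    visited: bool = False
    neighbors: list[str] = field(default_factory=list)  # connected node ids


def build_map_prompt(nodes: dict[str, Node], current_id: str, instruction: str) -> str:
    """Verbalize node information and topological relations into a text prompt."""
    lines = [f"Instruction: {instruction}", "Map:"]
    for node in nodes.values():
        status = "visited" if node.visited else "unexplored"
        links = ", ".join(node.neighbors) if node.neighbors else "none"
        lines.append(f"- Node {node.node_id} ({status}): {node.caption}; connected to: {links}")
    lines.append(f"You are currently at node {current_id}.")
    lines.append("Plan a multi-step path over the map toward the goal, "
                 "then output the next node to move to.")
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = {
        "1": Node("1", "a hallway with a wooden door", visited=True, neighbors=["2", "3"]),
        "2": Node("2", "a kitchen with a marble counter", neighbors=["1"]),
        "3": Node("3", "a living room with a sofa", neighbors=["1"]),
    }
    print(build_map_prompt(nodes, current_id="1",
                           instruction="Walk to the kitchen and stop by the counter."))
```

In the paper's setting, a prompt of this kind would be sent to GPT-4 or GPT-4V at each step, and the adaptive planning mechanism would maintain and revise a multi-step plan over candidate nodes or sub-goals rather than only choosing among locally visible viewpoints.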

Authors (6)
  1. Jiaqi Chen (89 papers)
  2. Bingqian Lin (19 papers)
  3. Ran Xu (89 papers)
  4. Zhenhua Chai (55 papers)
  5. Xiaodan Liang (318 papers)
  6. Kwan-Yee K. Wong (51 papers)
Citations (11)
