SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments (2309.04077v4)

Published 8 Sep 2023 in cs.RO and cs.AI

Abstract: Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. Succeeding in these tasks requires a large amount of the common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from LLMs for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment and feeds it to LLMs as input for generating feasible and contextually appropriate high-level navigation plans. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on the multi-object navigation (MultiON) task, which requires the agent to utilize a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. We also introduce a benchmark dataset for the MultiON task built with the ProcTHOR framework, which provides large photo-realistic indoor environments with a variety of objects. SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in success rate, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. The code, benchmark dataset, and demonstration videos are available at https://www.sri.com/ics/computer-vision/saynav.
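
The abstract outlines SayNav's core loop: ground newly perceived objects in an incrementally built 3D scene graph, prompt an LLM with that graph to produce a short-horizon high-level plan, execute each planned step with a pre-trained point-goal navigator, and replan as perception reveals more of the environment. The Python sketch below is a minimal, hypothetical illustration of that loop; SceneGraph, llm_plan, and point_goal_nav (and the toy two-room layout) are made-up stand-ins, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a SayNav-style planning loop. All interfaces
# below (SceneGraph, llm_plan, point_goal_nav) are illustrative stand-ins and
# the two-room "world" is fake; this is not the authors' released code.

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Incrementally built scene graph: room name -> object labels seen there."""
    rooms: dict[str, set[str]] = field(default_factory=dict)

    def update(self, room: str, objects: list[str], neighbours: list[str]) -> None:
        # Ground newly perceived objects and newly visible rooms in the graph.
        self.rooms.setdefault(room, set()).update(objects)
        for n in neighbours:
            self.rooms.setdefault(n, set())

    def to_prompt(self) -> str:
        # Serialize the explored environment as plain text for the LLM prompt.
        return "\n".join(f"{r}: {', '.join(sorted(o))}" for r, o in self.rooms.items())


def llm_plan(graph_text: str, targets: list[str]) -> list[str]:
    """Stand-in for the LLM call: given the scene-graph prompt and the remaining
    targets, return feasible high-level steps (here, just room names to visit).
    A real system would prompt an LLM and parse steps out of its reply."""
    return [line.split(":")[0] for line in graph_text.splitlines()]


def point_goal_nav(room: str) -> tuple[list[str], list[str]]:
    """Stand-in for the pre-trained low-level planner: navigate to the room and
    return (objects perceived there, adjacent rooms now visible)."""
    fake_objects = {"kitchen": ["sink", "mug"], "office": ["laptop", "chair"]}
    fake_doors = {"kitchen": ["office"], "office": ["kitchen"]}
    return fake_objects.get(room, []), fake_doors.get(room, [])


def saynav_loop(targets: list[str], start_room: str, max_replans: int = 5) -> set[str]:
    graph = SceneGraph()
    objects, neighbours = point_goal_nav(start_room)
    graph.update(start_room, objects, neighbours)
    remaining = set(targets) - set(objects)
    for _ in range(max_replans):
        if not remaining:
            break
        for room in llm_plan(graph.to_prompt(), sorted(remaining)):
            objects, neighbours = point_goal_nav(room)   # one point-goal sub-task
            graph.update(room, objects, neighbours)      # ground new perception
            found = remaining & set(objects)
            if found:
                remaining -= found
                break                                    # replan with the updated graph
    return set(targets) - remaining                      # objects successfully located


if __name__ == "__main__":
    print(saynav_loop(["mug", "laptop"], start_room="kitchen"))
```

The sketch only shows the ground-plan-execute-replan structure; in the actual system the scene graph is built from 3D perception and the LLM is re-prompted with it whenever new information arrives, rather than iterating over a fixed toy layout.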
