SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments (2309.04077v4)
Abstract: Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. Succeeding in these tasks requires a large amount of the common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment and provides it as input to LLMs, which then generate feasible and contextually appropriate high-level plans for navigation. The LLM-generated plan is executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on the multi-object navigation (MultiON) task, which requires the agent to utilize a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. We also introduce a benchmark dataset for the MultiON task built on the ProcTHOR framework, which provides large photo-realistic indoor environments with a variety of objects. SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in success rate, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. The code, benchmark dataset, and demonstration videos are accessible at https://www.sri.com/ics/computer-vision/saynav.
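The pipeline the abstract describes (incremental 3D scene graph → LLM high-level planner → pre-trained point-goal executor → replanning on new observations) can be summarized as a plan-and-execute loop. The following is a minimal, illustrative Python sketch; every class, function, and return format here (`SceneGraph`, `llm_plan`, `point_goal_nav`) is a hypothetical stand-in for the components the paper names, not the authors' released code.

```python
# Illustrative sketch of a SayNav-style planning loop. All names here are
# hypothetical placeholders for the components described in the abstract.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Incrementally built 3D scene graph of the explored environment."""
    rooms: dict = field(default_factory=dict)  # room -> list of (object, position)

    def update(self, observations):
        # Fold newly perceived (room, object, position) triples into the graph.
        for room, obj, pos in observations:
            self.rooms.setdefault(room, []).append((obj, pos))

    def to_prompt(self):
        # Serialize the graph into text that an LLM can condition on.
        return "\n".join(
            f"{room}: " + ", ".join(o for o, _ in objs)
            for room, objs in self.rooms.items()
        )


def llm_plan(scene_text, targets, found):
    """Hypothetical LLM call: propose the next short-distance subgoal.

    A real system would prompt an LLM with the scene-graph text and ask for
    contextually plausible search steps; here we stub the call out.
    """
    remaining = [t for t in targets if t not in found]
    return [("explore_nearest_room", t) for t in remaining[:1]]


def point_goal_nav(subgoal):
    """Stand-in for a pre-trained low-level point-goal navigation policy.

    Returns (success, observations) after attempting one planned step.
    """
    return True, [("living_room", subgoal[1], (1.0, 2.0, 0.0))]


def saynav_loop(targets, max_steps=20):
    graph, found = SceneGraph(), set()
    for _ in range(max_steps):
        if found == set(targets):
            return True  # all objects located
        # High-level: the LLM proposes the next step from the current graph.
        plan = llm_plan(graph.to_prompt(), targets, found)
        if not plan:
            break
        # Low-level: execute one planned step as point-goal navigation,
        # then refine future steps with the newly perceived information.
        ok, obs = point_goal_nav(plan[0])
        if not ok:
            continue  # replan from the unchanged scene graph
        graph.update(obs)
        found |= {o for _, o, _ in obs if o in targets}
    return found == set(targets)


if __name__ == "__main__":
    print(saynav_loop(["laptop", "mug", "book"]))
```

The key design point this sketch tries to capture is the interleaving: the LLM is re-queried after every executed step, so the plan stays conditioned on the growing scene graph rather than on a stale initial view.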
- Ahn, M.; et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- Anderson, P.; et al. 2018. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.
- Armeni, I.; et al. 2019. 3D scene graph: A structure for unified semantics, 3D space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Brown, T.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.
- Chaplot, D. S.; et al. 2020. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33: 4247–4258.
- Cho, K.; et al. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Chung, H. W.; et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Deitke, M.; et al. 2022. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. Advances in Neural Information Processing Systems, 35: 5982–5994.
- Driess, D.; et al. 2023. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
- He, K.; et al. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- Huang, W.; et al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
- Hughes, N.; Chang, Y.; and Carlone, L. 2022. Hydra: A real-time spatial perception engine for 3D scene graph construction and optimization. In Robotics: Science and Systems.
- Khandelwal, A.; et al. 2022. Simple but effective: CLIP embeddings for embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14829–14838.
- Kim, U.-H.; et al. 2020. 3-D scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE Transactions on Cybernetics, 50(12): 4921–4933.
- Kolve, E.; et al. 2017. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
- Lapata, M. 2006. Automatic evaluation of information ordering: Kendall’s tau. Computational Linguistics, 32(4): 471–484.
- Liu, B.; et al. 2023. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.
- Mishkin, D.; Dosovitskiy, A.; and Koltun, V. 2019. Benchmarking classic and learned navigation in complex 3D environments. arXiv preprint arXiv:1901.10915.
- Ouyang, L.; et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Peng, B.; et al. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
- Ramrakhya, R.; et al. 2023. PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17896–17906.
- Ramrakhya, R.; et al. 2022. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5173–5183.
- Rosinol, A.; et al. 2021. Kimera: From SLAM to spatial perception with 3D dynamic scene graphs. The International Journal of Robotics Research, 40(12–14): 1510–1546.
- Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings.
- Shah, D.; et al. 2023. ViNT: A Foundation Model for Visual Navigation. arXiv preprint arXiv:2306.14846.
- Singh, I.; et al. 2023. ProgPrompt: Generating situated robot task plans using large language models. In IEEE International Conference on Robotics and Automation (ICRA).
- Song, C. H.; et al. 2023. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. arXiv preprint arXiv:2212.04088.
- Szot, A.; et al. 2021. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems, 34: 251–266.
- Wald, J.; et al. 2020. Learning 3D semantic scene graphs from 3D indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Weihs, L.; et al. 2020. AllenAct: A framework for embodied AI research. arXiv preprint arXiv:2008.12760.
- Wijmans, E.; et al. 2019. DD-PPO: Learning near-perfect PointGoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357.
- Wu, S.-C.; et al. 2021. SceneGraphFusion: Incremental 3D scene graph prediction from RGB-D sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Wu, Y.; and He, K. 2018. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.
- Yu, B.; et al. 2023. Leveraging Large Language Models for Visual Target Navigation. arXiv preprint arXiv:2304.05501.
- Zhao, W. X.; et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
- Zhu, Y.; et al. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA).