EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning (2410.23968v1)
Abstract: Recent advances in LLMs have facilitated exciting progress for robotic planning in real, open-world environments. 3D scene graphs (3DSGs) offer a promising environment representation for grounding such LLM-based planners because they are compact and semantically rich. However, as the robot's environment scales (e.g., number of entities tracked) and the complexity of scene graph information increases (e.g., maintaining more attributes), providing the 3DSG as-is to an LLM-based planner quickly becomes infeasible due to input token count limits and attentional biases present in LLMs. Inspired by the successes of Retrieval-Augmented Generation (RAG) methods that retrieve query-relevant document chunks for LLM question answering, we adapt the paradigm to our embodied domain. Specifically, we propose a 3D scene subgraph retrieval framework, called EmbodiedRAG, which augments an LLM-based planner for executing natural language robotic tasks. Notably, our retrieved subgraphs adapt to changes in the environment as well as changes in task relevance as the robot executes its plan. We demonstrate EmbodiedRAG's ability to significantly reduce input token counts (by an order of magnitude) and planning time (up to 70% reduction in average time per planning step) while improving success rates on AI2Thor simulated household tasks with a single-arm, mobile manipulator. Additionally, we implement EmbodiedRAG on a quadruped with a manipulator to highlight the performance benefits for robot deployment at the edge in real environments.
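To make the idea of task-driven subgraph retrieval concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy scene graph schema (node ids mapped to free-text descriptions, edges as triples) and swaps in a simple bag-of-words cosine similarity where a real system would likely use learned embeddings and a vector store. The names `retrieve_subgraph`, `tokenize`, and `cosine`, the one-hop expansion rule, and the example scene are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word counts; an assumed stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_subgraph(nodes, edges, task, k=2):
    """Return the k most task-relevant nodes plus their one-hop neighbors.

    nodes: dict mapping node id -> free-text description (hypothetical schema)
    edges: list of (source id, relation, target id) triples
    """
    query = tokenize(task)
    ranked = sorted(
        nodes, key=lambda n: cosine(query, tokenize(nodes[n])), reverse=True
    )
    top = set(ranked[:k])
    keep = set(top)
    # Expand by one hop so the planner also sees where relevant objects are.
    for src, _, dst in edges:
        if src in top or dst in top:
            keep.update((src, dst))
    sub_nodes = {n: d for n, d in nodes.items() if n in keep}
    sub_edges = [e for e in edges if e[0] in keep and e[2] in keep]
    return sub_nodes, sub_edges


# Hypothetical household scene, for illustration only.
nodes = {
    "mug_1": "ceramic coffee mug, dirty",
    "sink_1": "kitchen sink with faucet",
    "sofa_1": "grey fabric sofa",
    "tv_1": "television, off",
    "counter_1": "kitchen counter",
}
edges = [
    ("mug_1", "on", "counter_1"),
    ("sofa_1", "faces", "tv_1"),
    ("sink_1", "next_to", "counter_1"),
]

sub_nodes, sub_edges = retrieve_subgraph(
    nodes, edges, "wash the dirty mug in the sink"
)
print(sorted(sub_nodes))
```

In this sketch only the mug, the sink, and the counter that connects them would be serialized into the planner prompt; the sofa and television are left out. Calling `retrieve_subgraph` again at each planning step, with an updated graph or an updated task description, is one simple way to let the retrieved subgraph follow changes in the environment and in task relevance.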