OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following (2403.03017v1)
Abstract: Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent work has increasingly employed LLMs within framework-centric approaches to enhance performance on embodied learning tasks, including EIF. Despite these efforts, a unified understanding of how individual components, ranging from visual perception to action execution, affect task performance is still lacking. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we deploy a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
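To make the Observer/Planner/Executor decomposition concrete, below is a minimal Python sketch of one perceive-plan-act loop. All class names, method signatures, and the stubbed LLM call are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of an Observer/Planner/Executor loop for EIF.
# Every name below is a hypothetical placeholder, not the OPEx codebase.

@dataclass
class Observation:
    """Egocentric observation summarized as text (e.g., detected objects)."""
    description: str

class Observer:
    """Turns raw egocentric input into a textual scene description."""
    def observe(self, raw_frame: str) -> Observation:
        # A real Observer would run visual perception (detection, depth, etc.).
        return Observation(description=f"objects visible: {raw_frame}")

class Planner:
    """Decomposes the instruction into subgoals, typically via an LLM prompt
    (stubbed out here with a fixed plan)."""
    def plan(self, instruction: str, obs: Observation) -> List[str]:
        # Placeholder for an LLM call conditioned on instruction + scene text.
        return ["find the target object", "interact with it"]

class Executor:
    """Maps each subgoal to low-level actions executable in the environment."""
    def execute(self, subgoal: str, obs: Observation) -> str:
        # A real Executor would emit navigation / manipulation actions.
        return f"ACTION for '{subgoal}' given {obs.description}"

def run_episode(instruction: str, raw_frames: List[str]) -> List[str]:
    """One simplified perceive-plan-act pass over the first frame."""
    observer, planner, executor = Observer(), Planner(), Executor()
    obs = observer.observe(raw_frames[0])
    return [executor.execute(subgoal, obs) for subgoal in planner.plan(instruction, obs)]

if __name__ == "__main__":
    print(run_episode("put a clean mug on the desk", ["mug, sink, desk"]))
```

The sketch only illustrates the division of responsibilities the abstract describes: perception is isolated in the Observer, high-level reasoning in the Planner, and low-level grounding in the Executor, which is what allows the paper's component-wise analysis of where errors arise.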
Authors: Haochen Shi, Zhiyuan Sun, Xingdi Yuan, Marc-Alexandre Côté, Bang Liu