
OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following (2403.03017v1)

Published 5 Mar 2024 in cs.AI

Abstract: Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing LLMs within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there is no unified understanding of how various components, ranging from visual perception to action execution, affect task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.

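The abstract describes OPEx as a decomposition of an EIF agent into an Observer (visual perception), a Planner (LLM-driven subgoal generation), and an Executor (low-level action execution). The sketch below is a minimal, hypothetical illustration of such a decomposition and its perceive-plan-act loop; all class names, method signatures, and placeholder behaviors are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of an Observer / Planner / Executor decomposition for an
# Embodied Instruction Following agent. Names and interfaces are illustrative
# only; they are not taken from the OPEx codebase.
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Symbolic view of the world produced by the Observer."""
    visible_objects: List[str]
    agent_location: str


class Observer:
    """Turns raw egocentric input into a symbolic observation."""

    def perceive(self, raw_frame) -> Observation:
        # A real system would run detection, segmentation, and depth
        # estimation here; this placeholder returns a fixed scene.
        return Observation(visible_objects=["apple", "table"],
                           agent_location="kitchen")


class Planner:
    """Maps the instruction and current observation to high-level subgoals.

    In an LLM-centric design this component would prompt a language model
    with the instruction, the observation, and prior feedback."""

    def plan(self, instruction: str, obs: Observation) -> List[str]:
        # Placeholder plan: locate the first visible object, then act on it.
        subgoals = [f"find {obj}" for obj in obs.visible_objects[:1]]
        return subgoals + ["pick it up"]


class Executor:
    """Grounds each subgoal into low-level actions in the environment."""

    def execute(self, subgoal: str) -> bool:
        # Placeholder for navigation / manipulation primitives.
        print(f"executing: {subgoal}")
        return True


def run_episode(instruction: str, raw_frames) -> bool:
    """Perceive-plan-act loop tying the three components together."""
    observer, planner, executor = Observer(), Planner(), Executor()
    for frame in raw_frames:
        obs = observer.perceive(frame)
        for subgoal in planner.plan(instruction, obs):
            if not executor.execute(subgoal):
                return False  # a fuller agent would replan here
    return True


if __name__ == "__main__":
    run_episode("put an apple on the table", raw_frames=[None])
```

Separating the components this way is what allows the paper's component-wise analysis: each of the Observer, Planner, and Executor can be swapped or ablated independently to measure its contribution to task success.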
Authors (5)
  1. Haochen Shi (34 papers)
  2. Zhiyuan Sun (53 papers)
  3. Xingdi Yuan (46 papers)
  4. Marc-Alexandre Côté (42 papers)
  5. Bang Liu (93 papers)
Citations (5)