
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks (2411.00081v1)

Published 31 Oct 2024 in cs.RO and cs.AI

Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using LLMs, incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.

Summary

  • The paper presents PARTNR, a benchmark that assesses embodied AI planning and reasoning over 100,000 tasks in detailed household environments.
  • It employs an LLM-driven, simulation-in-the-loop pipeline to generate diverse, realistic tasks and matching evaluation functions while filtering out hallucinated or infeasible instructions.
  • Findings indicate that state-of-the-art LLMs fall well short of human-human collaboration on these tasks, and that fine-tuned smaller models offer efficient alternatives to much larger ones.

Evaluation of PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-Agent Tasks

The paper "PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-Agent Tasks" presents a comprehensive benchmark aimed at evaluating and advancing the capabilities of embodied AI agents in collaborative human-robot scenarios. PARTNR stands as a significant contribution to the field by providing a diverse and large-scale dataset designed to test the planning and reasoning abilities of AI agents over a range of household tasks that require both spatial and temporal understanding.

The benchmark consists of 100,000 unique natural language tasks set within 60 intricately modeled houses, containing 5,819 distinct objects, thereby offering a rich environment for studying embodied AI. It categorizes tasks into four main types: constraint-free, spatial, temporal, and heterogeneous, with a combination of these characteristics leading to complex task scenarios that necessitate effective collaboration between agents. The diverse task set challenges agents to reason beyond simple navigation and object manipulation, requiring dynamic interaction and collaboration in partially observable environments.
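
To make this taxonomy concrete, the following sketch models a PARTNR-style task as a small Python data structure. The class and field names (PartnrTask, Subtask, ConstraintType, and so on) are illustrative assumptions for exposition, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional

class ConstraintType(Enum):
    """The four task families described in the paper."""
    CONSTRAINT_FREE = auto()  # any agent, any order
    SPATIAL = auto()          # e.g., "next to the lamp"
    TEMPORAL = auto()         # e.g., "after clearing the table"
    HETEROGENEOUS = auto()    # needs a capability only one agent has

@dataclass
class Subtask:
    instruction: str                           # natural-language step
    constraint: ConstraintType
    depends_on: Optional[int] = None           # subtask that must finish first (temporal)
    required_capability: Optional[str] = None  # e.g., "wash" (heterogeneous)

@dataclass
class PartnrTask:
    description: str   # full natural-language instruction
    house_id: str      # one of the 60 simulated houses
    subtasks: List[Subtask] = field(default_factory=list)

# Example: a task combining temporal and heterogeneous constraints.
task = PartnrTask(
    description="After clearing the table, wash the dishes and put them away.",
    house_id="house_042",
    subtasks=[
        Subtask("clear the table", ConstraintType.CONSTRAINT_FREE),
        Subtask("wash the dishes", ConstraintType.HETEROGENEOUS,
                depends_on=0, required_capability="wash"),
        Subtask("put the dishes away", ConstraintType.TEMPORAL, depends_on=1),
    ],
)
```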

The authors introduce a method for large-scale task generation using LLMs, integrated with a simulation-in-the-loop mechanism for grounding. This approach reduces errors such as hallucinated objects and physically infeasible tasks, enabling efficient generation of realistic yet challenging instructions that demand creative problem-solving from AI agents. The evaluation functions accompanying these tasks are likewise LLM-generated, ensuring they capture the complexity and nuance of real-world success criteria.
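
The control flow of this pipeline can be summarized in a short sketch. Every function below (propose_task, oracle_can_solve, and so on) is a hypothetical stand-in for a pipeline component; the sketch only illustrates the shape of the simulation-in-the-loop filtering the paper describes, not its actual implementation.

```python
def generate_grounded_tasks(llm, simulator, scene, num_tasks):
    """Hypothetical sketch of simulation-in-the-loop task generation.

    An LLM proposes a task and an evaluation function; the simulator
    then checks that every referenced object exists in the scene and
    that the task is achievable, discarding hallucinated or
    infeasible proposals.
    """
    tasks = []
    while len(tasks) < num_tasks:
        # 1. Prompt the LLM with the scene description (rooms, objects).
        candidate = llm.propose_task(scene.describe())

        # 2. Reject tasks that mention objects absent from the scene.
        if not all(obj in scene.object_names
                   for obj in candidate.referenced_objects):
            continue

        # 3. Ask the LLM for an evaluation function for this task.
        eval_fn = llm.propose_eval_function(candidate)

        # 4. Verify feasibility by executing an oracle solution in simulation.
        if simulator.oracle_can_solve(candidate, eval_fn):
            tasks.append((candidate, eval_fn))
    return tasks
```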

One of the paper's most significant findings concerns the limitations of current state-of-the-art LLMs on collaborative tasks. Despite recent advances in natural language processing, these models struggle considerably with planning in decentralized multi-agent settings: when paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human working alone. Concretely, the authors report that pairs of human-controlled avatars solve 93% of tasks, whereas LLM-based agents complete only 30% under non-privileged conditions (i.e., without ground-truth perception), owing to failures in coordination, task tracking, and error recovery.
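
For readers tracking these headline numbers, the sketch below shows how such collaboration metrics could be computed from episode logs. The dictionary field names are assumptions for illustration, not the benchmark's actual logging format.

```python
from statistics import mean

def success_rate(episodes):
    """Fraction of episodes whose evaluation function reported success."""
    return mean(1.0 if ep["success"] else 0.0 for ep in episodes)

def relative_steps(team_episodes, baseline_episodes):
    """Average steps taken by a team, normalized by a baseline team
    (e.g., LLM+human vs. human+human); >1.0 means less efficient."""
    return (mean(ep["num_steps"] for ep in team_episodes)
            / mean(ep["num_steps"] for ep in baseline_episodes))

# e.g., relative_steps(llm_human_eps, human_human_eps) -> ~1.5 in the paper
```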

The paper also examines the effect of model size, demonstrating that a smaller LLM fine-tuned on planning data can match the performance of an untuned model 9 times its size while running 8.6x faster at inference. This points to a promising direction for future research: fine-tuning may allow smaller, less resource-intensive models to approach the efficacy of their larger counterparts.
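
As a rough illustration of how such parameter-efficient fine-tuning might be set up, the sketch below applies LoRA via the Hugging Face transformers and peft libraries. The base model name, adapter hyperparameters, and training details are placeholder assumptions; the paper's exact distillation recipe may differ.

```python
# Minimal LoRA fine-tuning setup, assuming planning traces have been
# collected as (instruction, plan) text pairs. Model name and
# hyperparameters are illustrative assumptions, not the paper's recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with low-rank adapters; only these small
# matrices are trained, keeping fine-tuning cheap.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of total weights

# From here one would train on the planning traces with a standard
# causal-LM loss, e.g. via transformers.Trainer.
```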

Practically, the PARTNR benchmark is positioned to drive further study of collaborative dynamics in embodied agents, identifying perception, task division, and error recovery as central challenges to address. It amounts to a call to action for the AI community to develop models capable of autonomously reasoning through complex sequences of actions, embodying the nuanced capabilities that genuine human-robot collaboration requires.

In conclusion, the PARTNR benchmark, through its robust design and comprehensive approach, is expected to stimulate significant progress in research on embodied multi-agent systems. The need for effective collaboration across diverse settings and task types underscores its potential for spurring innovation and improving the integration of AI into everyday human environments. Future research can build on this benchmark to explore new paradigms of multi-agent coordination and reasoning, closing the gap to more autonomous and seamless human-robot interaction in the real world.
