PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks (2411.00081v1)
Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR), designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent-capability constraints. We employ a semi-automated task generation pipeline that uses LLMs with simulation in the loop for grounding and verification. PARTNR is the largest benchmark of its kind, comprising 100,000 natural language tasks spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks along the axes of planning, perception, and skill execution. The analysis reveals significant limitations in these models, such as poor coordination and failures in task tracking and recovery from errors. When paired with real humans, LLMs require 1.5x as many steps as two humans collaborating and 1.1x as many steps as a single human working alone, underscoring the room for improvement in these models. We further show that fine-tuning smaller LLMs on planning data can achieve performance on par with models 9x larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
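The key technical component the abstract describes is the semi-automated generation pipeline: an LLM drafts candidate tasks, and a simulator checks that each task is grounded in an actual scene before it enters the benchmark. Below is a minimal sketch of that propose-then-verify loop. All names here (`llm_propose_task`, `simulator_verifies`, the `Task` fields) are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Hypothetical sketch of LLM task generation with simulation-in-the-loop
# verification: candidates that fail grounding checks are rejected and
# regenerated. Names and logic are illustrative, not from the paper.
import random
from dataclasses import dataclass


@dataclass
class Task:
    instruction: str    # natural-language instruction, e.g. "Bring the mug..."
    objects: list[str]  # object handles the instruction refers to
    house_id: str       # scene the task is grounded in


def llm_propose_task(house_id: str, objects: list[str]) -> Task:
    """Stand-in for an LLM call that drafts a candidate task for a scene."""
    target = random.choice(objects)
    return Task(f"Bring the {target} to the kitchen table.", [target], house_id)


def simulator_verifies(task: Task, scene_objects: set[str]) -> bool:
    """Stand-in for the simulation check: every object the instruction
    references must actually exist in the scene (grounding); a real
    pipeline would also verify the goal is achievable (verification)."""
    return all(obj in scene_objects for obj in task.objects)


def generate_benchmark(houses: dict[str, set[str]], per_house: int) -> list[Task]:
    """Accumulate verified tasks per house, rejecting ungrounded candidates."""
    tasks = []
    for house_id, scene_objects in houses.items():
        accepted = 0
        while accepted < per_house:
            candidate = llm_propose_task(house_id, sorted(scene_objects))
            if simulator_verifies(candidate, scene_objects):
                tasks.append(candidate)
                accepted += 1
    return tasks


if __name__ == "__main__":
    houses = {"house_00": {"mug", "plate", "book"}}
    print(generate_benchmark(houses, per_house=3))
```

The design choice the abstract implies is that the LLM is only trusted to propose, never to validate: the simulator is the sole arbiter of whether a generated task is grounded, which is what makes scaling to 100,000 tasks feasible without manual review of every instruction.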