RoCoBench: Multi-Robot Collaboration Benchmark
- RoCoBench is a benchmark suite for evaluating multi-robot collaboration and reasoning in both simulated and text-only scenarios.
- It comprises diverse tasks such as Sweep Floor and Arrange Cabinet that test parallel, sequential, and adaptive planning.
- It benchmarks LLM-driven approaches with both simulation metrics and text-based reasoning evaluations of multi-agent performance.
RoCoBench is a benchmark suite for multi-robot collaboration and reasoning, introduced in "RoCo: Dialectic Multi-Robot Collaboration with LLMs" (Mandi et al., 2023). It is designed to evaluate both high-level communication and low-level motion planning in LLM-driven multi-agent settings, within a standardized set of tabletop manipulation scenarios. The benchmark covers a broad range of collaborative challenges, agent configurations, and interaction protocols, and includes a text-only dataset for evaluating agent reasoning without physical simulation.
1. Scope and Motivation
RoCoBench addresses the need for systematic, challenging benchmarks that capture the complexities of multi-robot collaboration, including parallel and sequential task decomposition, asymmetric and shared observations, and the interaction between agent dialogue and continuous motion execution. Traditional benchmarks either focus on isolated planning or multi-agent reasoning in abstracted environments; RoCoBench uniquely bridges high-level task allocation and real-world continuous domains where physical overlap, collision constraints, and distributed information fundamentally shape cooperative behavior (Mandi et al., 2023).
2. Task Suite Description
RoCoBench consists of six tasks, each implemented in MuJoCo with varying numbers of robots, gripper types, and object sets:
| Task Name | # Agents | Collaboration Pattern | Observation | Workspace Overlap |
|---|---|---|---|---|
| Sweep Floor | 2 | Parallel | Asymmetric | Medium |
| Pack Grocery | 2 | Parallel | Shared | Medium |
| Move Rope | 2 | Parallel | Shared | High |
| Arrange Cabinet | 3 | Sequential | Asymmetric | High |
| Make Sandwich | 2 | Sequential | Asymmetric | Low |
| Sort Cubes | 3 | Sequential (with help) | Shared | Low |
Each scenario is tightly specified: robots have distinct capabilities and workspace coverage, and the object set and goals either permit parallel execution (e.g., Pack Grocery) or force sequential or cooperative behavior (e.g., Sweep Floor, Arrange Cabinet, Sort Cubes).
Examples:
- Sweep Floor: Bob sweeps cubes into Alice's dustpan; Alice must then dump them into the bin. Each agent observes only its local state.
- Arrange Cabinet: Alice and Bob open cabinet doors; Chad removes cups/mugs and places them on coasters, requiring communication to solve asymmetric knowledge constraints.
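The task attributes in the table above can be captured in a small data structure. The following is an illustrative schema, not the official RoCoBench API; the class and field names are assumptions:

```python
from dataclasses import dataclass

# Hypothetical task schema mirroring the table above
# (not the released RoCoBench code).
@dataclass(frozen=True)
class TaskSpec:
    name: str
    num_agents: int
    pattern: str        # "parallel" or "sequential"
    observation: str    # "shared" or "asymmetric"
    overlap: str        # workspace overlap: "low", "medium", "high"

ROCOBENCH_TASKS = [
    TaskSpec("Sweep Floor", 2, "parallel", "asymmetric", "medium"),
    TaskSpec("Pack Grocery", 2, "parallel", "shared", "medium"),
    TaskSpec("Move Rope", 2, "parallel", "shared", "high"),
    TaskSpec("Arrange Cabinet", 3, "sequential", "asymmetric", "high"),
    TaskSpec("Make Sandwich", 2, "sequential", "asymmetric", "low"),
    TaskSpec("Sort Cubes", 3, "sequential", "shared", "low"),
]
```

A spec of this kind makes it easy to filter the suite, e.g. selecting only asymmetric-observation tasks when studying communication.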
3. The RoCoBench-Text Dataset
RoCoBench includes a text-only dataset, RoCoBench-Text, to evaluate an LLM’s grasp of agent representation, self-knowledge, communication, and adaptation in the absence of simulation:
- Self-Knowledge: Capability (reachability questions), Memory (recall information from prior dialogue rounds).
- Communication: Inquiry (choosing information-seeking questions), Responsiveness (deciding when/how to help).
- Adaptation: Handling unexpected events (e.g., gripper failure, missing objects), selecting contextually appropriate responses.
All questions are posed in multiple-choice or structured-output form and are derived from real RoCoBench dialogue and state transcripts.
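A RoCoBench-Text item can be pictured as a record pairing a transcript excerpt with answer choices, scored by per-category accuracy. The field names and transcript below are illustrative assumptions, not the released schema:

```python
from collections import defaultdict

# Hypothetical record format for a RoCoBench-Text multiple-choice item;
# field names and content are illustrative, not the released dataset schema.
question = {
    "category": "Self-Knowledge",       # or "Communication", "Adaptation"
    "subtype": "Capability",            # e.g. a reachability question
    "context": "Alice: I cannot reach the bin from my side of the table.",
    "prompt": "Which object can Alice reach?",
    "choices": ["the dustpan", "the bin", "the cabinet door"],
    "answer_idx": 0,
}

def score(items, predictions):
    """Per-category answer accuracy over multiple-choice items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        cat = item["category"]
        total[cat] += 1
        correct[cat] += int(pred == item["answer_idx"])
    return {c: correct[c] / total[c] for c in total}
```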
4. Evaluation Protocol and Metrics
Evaluation involves both simulated robotics and textual reasoning sub-benchmarks:
- Simulator-based protocols: Each episode runs for a finite horizon T rounds; in each round, up to K replan attempts are permitted, each validated via parsing, task constraint checking, inverse kinematics, collision checking, and waypoint feasibility.
- High-level planner variants:
  - Central Plan: an oracle LLM plans all agents' actions jointly, given the global observation.
  - Dialog: each agent acts from its own prompt, with multi-round dialogue and agent-specific history.
  - Ablations: Dialog without history, or without detailed feedback.
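The simulator-side protocol above can be sketched as a control loop: T rounds, up to K replan attempts per round, each candidate plan passed through a chain of validators. This is a minimal sketch under assumed interfaces; the planner, validator, and executor callables are placeholders, not RoCo's actual implementation:

```python
# Minimal sketch of the episode loop described above. In RoCo,
# propose_plan would be an LLM call, validators would include parsing,
# task-constraint checking, inverse kinematics, collision checking, and
# waypoint feasibility, and execute would step the MuJoCo simulation.

def validate(plan, validators):
    for check in validators:
        ok, msg = check(plan)
        if not ok:
            return False, msg   # feedback string is returned to the planner
    return True, ""

def run_episode(propose_plan, validators, execute, T=15, K=5):
    env_steps, replans = 0, 0
    for round_idx in range(T):
        plan, feedback = None, ""
        for _ in range(K):
            candidate = propose_plan(round_idx, feedback)
            ok, feedback = validate(candidate, validators)
            if ok:
                plan = candidate
                break
            replans += 1        # failed attempt triggers a replan with feedback
        if plan is None:        # exhausted K attempts this round
            return {"success": False, "env_steps": env_steps, "replans": replans}
        done = execute(plan)
        env_steps += 1
        if done:
            return {"success": True, "env_steps": env_steps, "replans": replans}
    return {"success": False, "env_steps": env_steps, "replans": replans}
```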
Metrics:
- SuccessRate: Fraction of episodes completed within the horizon; higher is better.
- EnvSteps: Mean steps to completion, lower is better.
- Replans: Average replan attempts.
- For text-only benchmarks: answer accuracy (Self-Knowledge, Communication, Adaptation categories).
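The simulator metrics above aggregate naturally over per-episode logs. The following sketch assumes a simple per-episode result dict (field names are illustrative, not the official evaluation code):

```python
# Illustrative aggregation of SuccessRate, EnvSteps, and Replans from
# per-episode result dicts; the dict keys are assumed conventions.

def aggregate(results):
    n = len(results)
    return {
        "SuccessRate": sum(r["success"] for r in results) / n,  # higher is better
        "EnvSteps": sum(r["env_steps"] for r in results) / n,   # lower is better
        "Replans": sum(r["replans"] for r in results) / n,      # lower is better
    }
```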
5. Experimental Results and Baseline Comparisons
Performance comparisons reveal tradeoffs between centralized and decentralized planning/control:
- Central Plan achieves uniformly high success on parallel and low-overlap tasks, reaching success rates such as 1.00 on Sweep Floor and 0.82 on Pack Grocery.
- Dialog is more flexible on tasks demanding sequential, collaborative, or adaptive coordination, e.g., a 0.93 success rate on Sort Cubes (versus 0.70 for Central Plan), though not uniformly: on Arrange Cabinet, Dialog reaches 0.75 versus Central Plan's 0.90.
- Dialog variants without history or feedback underperform, highlighting the value of persistent memory and detailed replanning information during multi-agent interaction.
In path and motion planning components, LLM-generated waypoints for pick/place operations in high-overlap scenes can outperform hand-coded or linearly interpolated waypoints, doubling planning efficiency.
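The linearly interpolated baseline mentioned above is simply a straight-line path in Cartesian space. A minimal sketch (the function name and signature are assumptions for illustration):

```python
import numpy as np

# Sketch of the linear-interpolation waypoint baseline: evenly spaced
# Cartesian waypoints between a pick pose and a place pose. In high-overlap
# scenes such straight-line paths often collide with the other robot,
# which is why task-aware (e.g. LLM-proposed) waypoints can reduce
# failed planning attempts.

def linear_waypoints(start, goal, n=5):
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    alphas = np.linspace(0.0, 1.0, n)
    return [(1 - a) * start + a * goal for a in alphas]
```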
On RoCoBench-Text, GPT-4 demonstrates 60–90% multiple-choice question accuracy, with substantial variance across reasoning types.
6. Significance, Extensions, and Descendant Benchmarks
RoCoBench establishes criteria for evaluating multi-robot LLM-based systems in both simulation and abstracted reasoning. It underpins subsequent benchmarks, including:
- Tool-RoCo (Zhang et al., 26 Nov 2025): Extends RoCoBench principles to long-term, tool-call-based multi-agent settings, introducing agent-as-tool formalism and focusing on distributed autonomy and adaptive team organization.
RoCoBench’s design—emphasizing asymmetric information, multi-round intent signaling, and physically grounded plan validation—has influenced methodological paradigms for emergent communication and reasoning in multi-agent LLM research. The composition of diverse manipulation and reasoning sub-tasks uniquely positions it for benchmarking progress in AI-driven collaborative robotics.
7. Limitations and Open Challenges
While RoCoBench covers a wide range of collaborative scenarios and incorporates both language-based reasoning and continuous-space planning, several open challenges persist:
- Generality: The benchmark is currently limited to tabletop manipulation and small-team scenarios; scaling to more heterogeneous agents or to dynamic, partially observable environments remains an open research problem.
- Real-world deployment: End-to-end transfer from simulation to physical robots, including robust conversational grounding and error recovery, requires further investigation.
- Dialog grounding: While the evaluative focus on language-driven interaction is a strength, modeling ambiguity, intent uncertainty, and emergent protocol learning remains only partially addressed.
A plausible implication is that RoCoBench and its descendants set a high standard for reproducibility and breadth in multi-robot/agent LLM evaluation, while revealing important gaps in current systems’ robustness and adaptive reasoning (Mandi et al., 2023, Zhang et al., 26 Nov 2025).