RoCoBench: Multi-Robot Collaboration Benchmark
- RoCoBench is a benchmark suite for evaluating multi-robot collaboration and reasoning in both simulated and text-only scenarios.
- It comprises diverse tasks such as Sweep Floor and Arrange Cabinet that test parallel, sequential, and adaptive planning.
- It benchmarks LLM-driven approaches with both simulation metrics and text-based reasoning evaluations of multi-agent performance.
RoCoBench is a benchmark suite for multi-robot collaboration and reasoning, introduced in "RoCo: Dialectic Multi-Robot Collaboration with LLMs" (Mandi et al., 2023). It is designed to evaluate both high-level communication and low-level motion planning in LLM-driven multi-agent settings, within a standardized set of tabletop manipulation scenarios. The benchmark covers a broad range of collaborative challenges, agent configurations, and interaction protocols, and includes a text-only dataset for evaluating agent reasoning without physical simulation.
1. Scope and Motivation
RoCoBench addresses the need for systematic, challenging benchmarks that capture the complexities of multi-robot collaboration, including parallel and sequential task decomposition, asymmetric and shared observations, and the interaction between agent dialogue and continuous motion execution. Traditional benchmarks either focus on isolated planning or multi-agent reasoning in abstracted environments; RoCoBench uniquely bridges high-level task allocation and real-world continuous domains where physical overlap, collision constraints, and distributed information fundamentally shape cooperative behavior (Mandi et al., 2023).
2. Task Suite Description
RoCoBench consists of six tasks, each implemented in MuJoCo with varying numbers of robots, gripper types, and object sets:
| Task Name | # Agents | Collaboration Pattern | Observation | Workspace Overlap |
|---|---|---|---|---|
| Sweep Floor | 2 | Parallel | Asymmetric | Medium |
| Pack Grocery | 2 | Parallel | Shared | Medium |
| Move Rope | 2 | Parallel | Shared | High |
| Arrange Cabinet | 3 | Sequential | Asymmetric | High |
| Make Sandwich | 2 | Sequential | Asymmetric | Low |
| Sort Cubes | 3 | Sequential (with help) | Shared | Low |
Each scenario is tightly specified: robots have distinct capabilities and workspace coverage, and the object set and goals either permit parallel execution (e.g., Pack Grocery) or force sequential or cooperative behavior (e.g., Sweep Floor, Arrange Cabinet, Sort Cubes).
Examples:
- Sweep Floor: Bob sweeps cubes into Alice's dustpan; Alice must then dump them into the bin. Each agent observes only its local state.
- Arrange Cabinet: Alice and Bob open cabinet doors; Chad removes cups/mugs and places them on coasters, requiring communication to solve asymmetric knowledge constraints.
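The task attributes in the table above can be captured in a small data structure. The following is an illustrative schema, not the official RoCoBench API; the class and field names are assumptions:

```python
from dataclasses import dataclass

# Hypothetical task schema mirroring the table above
# (not the released RoCoBench code).
@dataclass(frozen=True)
class TaskSpec:
    name: str
    num_agents: int
    pattern: str        # "parallel" or "sequential"
    observation: str    # "shared" or "asymmetric"
    overlap: str        # workspace overlap: "low", "medium", "high"

ROCOBENCH_TASKS = [
    TaskSpec("Sweep Floor", 2, "parallel", "asymmetric", "medium"),
    TaskSpec("Pack Grocery", 2, "parallel", "shared", "medium"),
    TaskSpec("Move Rope", 2, "parallel", "shared", "high"),
    TaskSpec("Arrange Cabinet", 3, "sequential", "asymmetric", "high"),
    TaskSpec("Make Sandwich", 2, "sequential", "asymmetric", "low"),
    TaskSpec("Sort Cubes", 3, "sequential", "shared", "low"),
]
```

A spec of this kind makes it easy to filter the suite, e.g. selecting only asymmetric-observation tasks when studying communication.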
3. The RoCoBench-Text Dataset
RoCoBench includes a text-only dataset, RoCoBench-Text, to evaluate an LLM’s grasp of agent representation, self-knowledge, communication, and adaptation in the absence of simulation:
- Self-Knowledge: Capability (reachability questions), Memory (recall information from prior dialogue rounds).
- Communication: Inquiry (choosing information-seeking questions), Responsiveness (deciding when/how to help).
- Adaptation: Handling unexpected events (e.g., gripper failure, missing objects), selecting contextually appropriate responses.
All questions are posed in multiple-choice or structured-output form and are derived from real RoCoBench dialogue and state transcripts.
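A RoCoBench-Text item can be pictured as a record pairing a transcript excerpt with answer choices, scored by per-category accuracy. The field names and transcript below are illustrative assumptions, not the released schema:

```python
from collections import defaultdict

# Hypothetical record format for a RoCoBench-Text multiple-choice item;
# field names and content are illustrative, not the released dataset schema.
question = {
    "category": "Self-Knowledge",       # or "Communication", "Adaptation"
    "subtype": "Capability",            # e.g. a reachability question
    "context": "Alice: I cannot reach the bin from my side of the table.",
    "prompt": "Which object can Alice reach?",
    "choices": ["the dustpan", "the bin", "the cabinet door"],
    "answer_idx": 0,
}

def score(items, predictions):
    """Per-category answer accuracy over multiple-choice items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        cat = item["category"]
        total[cat] += 1
        correct[cat] += int(pred == item["answer_idx"])
    return {c: correct[c] / total[c] for c in total}
```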
4. Evaluation Protocol and Metrics
Evaluation involves both simulated robotics and textual reasoning sub-benchmarks:
- Simulator-based protocols: Each episode runs for a finite horizon T rounds; in each round, up to K replan attempts are permitted, each validated via parsing, task constraint checking, inverse kinematics, collision checking, and waypoint feasibility.
- High-level planner variants:
  - Central Plan: an oracle LLM plans all agents' actions jointly, given the global observation.
  - Dialog: each agent acts from its own prompt, with multi-round dialogue and agent-specific history.
  - Ablations: Dialog without history, or without detailed feedback.
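The simulator-side protocol above can be sketched as a control loop: T rounds, up to K replan attempts per round, each candidate plan passed through a chain of validators. This is a minimal sketch under assumed interfaces; the planner, validator, and executor callables are placeholders, not RoCo's actual implementation:

```python
# Minimal sketch of the episode loop described above. In RoCo,
# propose_plan would be an LLM call, validators would include parsing,
# task-constraint checking, inverse kinematics, collision checking, and
# waypoint feasibility, and execute would step the MuJoCo simulation.

def validate(plan, validators):
    for check in validators:
        ok, msg = check(plan)
        if not ok:
            return False, msg   # feedback string is returned to the planner
    return True, ""

def run_episode(propose_plan, validators, execute, T=15, K=5):
    env_steps, replans = 0, 0
    for round_idx in range(T):
        plan, feedback = None, ""
        for _ in range(K):
            candidate = propose_plan(round_idx, feedback)
            ok, feedback = validate(candidate, validators)
            if ok:
                plan = candidate
                break
            replans += 1        # failed attempt triggers a replan with feedback
        if plan is None:        # exhausted K attempts this round
            return {"success": False, "env_steps": env_steps, "replans": replans}
        done = execute(plan)
        env_steps += 1
        if done:
            return {"success": True, "env_steps": env_steps, "replans": replans}
    return {"success": False, "env_steps": env_steps, "replans": replans}
```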
Metrics:
- SuccessRate: Fraction of episodes completed within the horizon; higher is better.
- EnvSteps: Mean steps to completion, lower is better.
- Replans: Average replan attempts.
- For text-only benchmarks: answer accuracy (Self-Knowledge, Communication, Adaptation categories).
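The simulator metrics above aggregate naturally over per-episode logs. The following sketch assumes a simple per-episode result dict (field names are illustrative, not the official evaluation code):

```python
# Illustrative aggregation of SuccessRate, EnvSteps, and Replans from
# per-episode result dicts; the dict keys are assumed conventions.

def aggregate(results):
    n = len(results)
    return {
        "SuccessRate": sum(r["success"] for r in results) / n,  # higher is better
        "EnvSteps": sum(r["env_steps"] for r in results) / n,   # lower is better
        "Replans": sum(r["replans"] for r in results) / n,      # lower is better
    }
```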
5. Experimental Results and Baseline Comparisons
Performance comparisons reveal tradeoffs between centralized and decentralized planning/control:
- Central Plan achieves uniformly high success on parallel and low-overlap tasks, reaching success rates such as 1.00 on Sweep Floor and 0.82 on Pack Grocery.
- Dialog is more flexible on tasks demanding sequential, collaborative, or adaptive coordination, e.g., a 0.93 success rate on Sort Cubes (versus 0.70 for Central Plan), though not uniformly: on Arrange Cabinet, Dialog reaches 0.75 versus Central Plan's 0.90.
- Dialog variants without history or feedback underperform, highlighting the value of persistent memory and detailed replanning information during multi-agent interaction.
In path and motion planning components, LLM-generated waypoints for pick/place operations in high-overlap scenes can outperform hand-coded or linearly interpolated waypoints, doubling planning efficiency.
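The linearly interpolated baseline mentioned above is simply a straight-line path in Cartesian space. A minimal sketch (the function name and signature are assumptions for illustration):

```python
import numpy as np

# Sketch of the linear-interpolation waypoint baseline: evenly spaced
# Cartesian waypoints between a pick pose and a place pose. In high-overlap
# scenes such straight-line paths often collide with the other robot,
# which is why task-aware (e.g. LLM-proposed) waypoints can reduce
# failed planning attempts.

def linear_waypoints(start, goal, n=5):
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    alphas = np.linspace(0.0, 1.0, n)
    return [(1 - a) * start + a * goal for a in alphas]
```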
On RoCoBench-Text, GPT-4 demonstrates 60–90% multiple-choice question accuracy, with substantial variance across reasoning types.
6. Significance, Extensions, and Descendant Benchmarks
RoCoBench establishes criteria for evaluating multi-robot LLM-based systems in both simulation and abstracted reasoning. It underpins subsequent benchmarks, including:
- Tool-RoCo (Zhang et al., 26 Nov 2025): Extends RoCoBench principles to long-term, tool-call-based multi-agent settings, introducing agent-as-tool formalism and focusing on distributed autonomy and adaptive team organization.
RoCoBench’s design—emphasizing asymmetric information, multi-round intent signaling, and physically grounded plan validation—has influenced methodological paradigms for emergent communication and reasoning in multi-agent LLM research. The composition of diverse manipulation and reasoning sub-tasks uniquely positions it for benchmarking progress in AI-driven collaborative robotics.
7. Limitations and Open Challenges
While RoCoBench covers a wide range of collaborative scenarios and incorporates both language-based reasoning and continuous-space planning, several open challenges persist:
- Generality: The benchmark is currently limited to tabletop manipulation and small-team scenarios; scaling to more heterogeneous agents or to dynamic, partially observable environments remains an open research problem.
- Real-world deployment: End-to-end transfer from simulation to physical robots, including robust conversational grounding and error recovery, requires further investigation.
- Dialog grounding: While the evaluative focus on language-driven interaction is a strength, modeling ambiguity, intent uncertainty, and emergent protocol learning remains only partially addressed.
A plausible implication is that RoCoBench and its descendants set a high standard for reproducibility and breadth in multi-robot/agent LLM evaluation, while revealing important gaps in current systems’ robustness and adaptive reasoning (Mandi et al., 2023, Zhang et al., 26 Nov 2025).