
ColBench: Collaborative Agent Benchmark

Updated 25 January 2026
  • ColBench is a family of large-scale benchmarks designed for evaluating multi-turn reinforcement learning in collaborative AI-human tasks.
  • It leverages diverse, procedurally generated tasks with partial observability and sparse rewards to rigorously assess adaptation, reasoning, and communication.
  • ColBench supports robust research by integrating automated reward evaluation, fine-grained credit assignment, and real-time teaming scenarios.

ColBench (Collaborative Agent Benchmark) denotes a family of principled, large-scale benchmarks for studying collaboration between AI agents (including LLMs and embodied agents) and humans—real or simulated—across complex, multi-turn, real-world tasks. As of 2026, the dominant instantiations of ColBench set the foundation for research on multi-turn reinforcement learning (RL) for LLM agents in collaborative artifact creation (Zhou et al., 19 Mar 2025), and for real-time human-AI teaming in embodied environments emphasizing reactive adaptation and natural language communication (Liu et al., 2024).

1. Motivation and Scope

ColBench addresses the inadequacy of prior LLM agent and embodied agent benchmarks along several dimensions:

  • Sufficient task diversity for RL training without overfitting, with large sets (>10,000) of procedurally varied tasks.
  • Intrinsic task complexity that necessitates challenging reasoning, handling of partial observability, and generalization to edge cases.
  • Minimal engineering overhead, leveraging LLM-based human simulators and automated, functionally grounded reward evaluators.
  • Fine-grained, step-level or turn-level credit assignment and behavior assessment, supporting quantitative study of both adaptation and communicative ability.

ColBench currently comprises two principal paradigms: artifact-creation collaboration with LLMs (backend programming, frontend design), and real-time, embodied collaboration in Overcooked-AI kitchen environments, focused on subtask/path-level adaptation and language-based coordination.

2. ColBench for Collaborative Reasoning LLM Agents

The version of ColBench developed in (Zhou et al., 19 Mar 2025) targets multi-turn LLM agents that partner with simulated humans on realistic content-creation tasks:

  • Backend Programming Collaboration: The agent is tasked with implementing a Python function (≤50 lines) that matches a hidden reference. The interaction protocol grants up to 10 rounds for the agent to ask clarifying questions (natural language) and iteratively refine its code. The human simulator, instantiated as a large LLM with reference access, supplies only informational responses—never code.
  • Frontend Design Collaboration: The agent generates an HTML+Tailwind-CSS snippet (~100 lines) that visually matches a secret reference web page. After an initial high-level description, up to 10 turns allow the agent to submit HTML proposals; the simulator provides difference feedback based on rendered image comparison.

Both domains structure each task as a finite-horizon partially observable Markov decision process (POMDP),

\mathcal{M} = (\mathcal{O}, \mathcal{C}, \mathcal{A}, \mathcal{T}, \mu_1, r, N),

where $\mathcal{C}$ is the hidden reference, $\mathcal{O}$ is the observation space (the interaction history), $\mathcal{A}$ is the agent’s action space, $\mathcal{T}$ is the transition function, $\mu_1$ is the initial state distribution, $N$ is the horizon, and $r$ is the sparse reward delivered only at the final turn.
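The finite-horizon POMDP structure can be sketched as a simple interaction loop. The class and function names below are illustrative only (not from the benchmark code), and the reward function here is a stand-in for the benchmark's unit-test or rendering-based evaluators:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ColBenchTask:
    """One collaboration task: a hidden reference c, a horizon N, and a
    sparse reward computed only from the final artifact (illustrative)."""
    hidden_reference: str
    horizon: int = 10
    # Stand-in reward: exact match; the real benchmark uses unit tests
    # (backend) or CLIP similarity of renderings (frontend).
    reward_fn: Callable[[str, str], float] = lambda artifact, ref: float(artifact == ref)


def rollout(task: ColBenchTask,
            agent: Callable[[List[str]], str],
            simulator: Callable[[str, str], str]) -> Tuple[List[str], float]:
    """Run up to `task.horizon` turns; reward is granted only at the end."""
    history: List[str] = []   # observation o_t: the full interaction history
    artifact = ""
    for _ in range(task.horizon):
        action = agent(history)                              # question or revised artifact
        artifact = action
        feedback = simulator(action, task.hidden_reference)  # reference-grounded reply
        history += [action, feedback]
    return history, task.reward_fn(artifact, task.hidden_reference)
```

With stub policies, a rollout returns the full history plus the sparse terminal reward, mirroring the credit-assignment difficulty the benchmark is built to study.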

Environment and data generation rely on:

  • Automated extraction of backend functions/unit tests via Llama-3.1-70B prompt chains over GitHub-style code.
  • Selection of frontend tasks from the WebSight corpus, paired with design prompts.
  • LLM-based simulators as reliable, reference-grounded human models.
  • 10,000 train and 1,000–1,500 test tasks per domain for robust RL and evaluation.

Training data collection is primarily offline: 15,000 backend and 6,000 frontend trajectories are gathered via zero-shot prompting.

3. ColBench for Real-Time Embodied Collaboration

ColBench as formalized in (Liu et al., 2024) enables quantitative evaluation of real-time adaptive collaboration and language communication in embodied agent settings, exemplified by the Overcooked-AI environment:

  • Two-agent teaming: An AI agent and a simulated human (greedy+auto-unstuck policy) jointly complete recipe-driven kitchen tasks with obstacles, involving path collisions and subtask conflicts.
  • Three benchmark modes:
    • Overall Testing: 22 kitchen layouts with varied teaming-fluency levels.
    • Path Adaptation Testing: 43 curated frames where the AI or human must reroute to avoid collision.
    • Subtask Adaptation Testing: 41 frames requiring correct subtask yielding or role switching.

Human behaviors in these settings are controllable and well-labeled, supporting explicit “self-adapt,” “other-adapt,” and “both-ok” adaptation variants.

4. Metrics and Evaluation Protocols

ColBench applies rigorous and domain-specific metrics.

Artifact-Creation Domain (Zhou et al., 19 Mar 2025):

  • Backend
    • Test-Pass Rate: fraction of hidden unit tests passed.
    • Success Rate: proportion of tasks with all tests passed.
  • Frontend
    • Cosine Similarity (CosSim): between CLIP embeddings of agent vs. reference renderings, in $[0,1]$.
    • Win Rate: fraction of tasks outperforming a baseline model.
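Aggregating the backend metrics is straightforward; the sketch below assumes one reasonable data layout (a per-task list of hidden-test outcomes) and a macro-average for the test-pass rate, neither of which is specified here by the source:

```python
from typing import Dict, List


def backend_metrics(results: List[List[bool]]) -> Dict[str, float]:
    """results[i][j] = whether task i passed hidden unit test j.

    Test-pass rate: mean of per-task pass fractions (macro-average,
    one plausible reading of "fraction of hidden unit tests passed").
    Success rate: fraction of tasks where *all* hidden tests passed.
    """
    pass_rate = sum(sum(r) / len(r) for r in results) / len(results)
    success_rate = sum(all(r) for r in results) / len(results)
    return {"test_pass_rate": pass_rate, "success_rate": success_rate}
```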

Embodied/Real-Time Domain (Liu et al., 2024):

  • Reactive Adaptability (A): mean adaptation indicator $\delta_t$ (whether the agent changes plan at atomic step $t$), compared to expert ground truth for correctness.
  • Communication Fidelity (C): average GPT-4o embedding similarity between agent and reference utterances, time-decayed for misaligned response timing.
  • Stuck Time: cumulative steps spent unable to progress due to conflicts.
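Scored against expert ground truth, reactive adaptability reduces to per-step agreement between predicted and labeled plan changes. A minimal sketch (the function name and scoring rule are illustrative):

```python
from typing import Sequence


def reactive_adaptability(predicted: Sequence[bool],
                          expert: Sequence[bool]) -> float:
    """Fraction of atomic steps where the agent's adaptation decision
    (delta_t: did the plan change at step t?) matches the expert label."""
    assert len(predicted) == len(expert), "one label per atomic step"
    return sum(p == e for p, e in zip(predicted, expert)) / len(predicted)
```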

Evaluation protocols ensure comprehensive sampling:

  • Full-layout runs (multiple 5-minute episodes, varied agents).
  • Short-horizon scenario repeats.
  • Explicit, frame-wise labeling/ground truth for adaptation scenarios.

5. Baseline Algorithms and Results

LLM Artifact-Creation Benchmarks (Zhou et al., 19 Mar 2025):

  • Baselines: Zero-Shot; Rejection Fine-Tuning (SFT) on successful demos; Multi-Turn DPO; and SWEET-RL (Step-WisE Evaluation from Training-time information).
  • SWEET-RL, leveraging a training-time critic for step-level reward assignment, yields a 6 percentage point absolute improvement in success/win rates relative to previous multi-turn RL methods, with Llama-3.1-8B matching or exceeding GPT-4o on ColBench tasks.
  • Table of test-set metrics across models:

| Method | Backend: %Tests / Success | Frontend: CosSim / Win Rate |
|---|---|---|
| Llama-3.1-8B (1-turn) | 11.8 / 6.9% | 63.1 / 13.6% |
| GPT-4o (Multi-turn) | 54.6 / 40.4% | 78.1 / 50.0% |
| SWEET-RL (Llama-3.1-8B) | 56.8 / 40.4% | 77.7 / 48.2% |
| Multi-Turn DPO | 48.0 / 34.4% | 76.9 / 42.8% |

Embodied Agent Benchmarks (Liu et al., 2024):

  • The Monitor-then-Adapt (MonTA) framework, comprising a high-frequency Monitor (System 1) and dual Adapters (System 2) for subtask and path adaptation reasoning, establishes state-of-the-art adaptability and communication fidelity.
  • On challenging layouts (e.g., Layout 27 with 16.7 % fluency), MonTA (GPT-4o) achieves 76.6 mean reward versus 5 (SAA) and 0 (greedy+auto-unstuck).
  • Path adaptation success approaches 100 % across scenario types for MonTA (GPT-4o).
  • Human expert evaluations yield mean reasonability ≈ 3.5/5 and consistency ≈ 3.9/5 for MonTA-generated instructions; ≈75% of scenarios judged both reasonable and consistent.

6. Key Challenges and Limitations

Central challenges highlighted by ColBench benchmarks include:

  • Partial Observability and Sparse Reward: Agents must infer hidden task parameters through limited interaction. Multi-step credit assignment is correspondingly complex.
  • Sample and Data Efficiency: The number of fine-tuning trajectories (15k for backend, 6k for frontend) is small relative to pretraining datasets, necessitating algorithms capable of generalizing from limited data.
  • Human-Simulator Gap: Simulated users, while consistent and cost-effective, lack the full unpredictability, error-proneness, and implicitness of real human collaborators.
  • Latency Constraints: LLM-based adapters for real-time embodied tasks introduce latency (e.g., 2.1 s for subtask adaptation in GPT-4o) which may not suffice for physical robot settings.
  • Domain Scope: Artifact creation benchmarks focus on code and HTML; embodied agent benchmarks are currently restricted to Overcooked-AI.

Bridging sim-to-real gaps, scaling to richer modalities, supporting inconsistent or noisy human feedback, and achieving sub-100 ms agent response remain open directions.

7. Prospective Extensions and Impact

Possible next steps, motivated by ColBench’s structure, include:

  • Expanding task domains to multi-page web apps, diagram or slide generation, and device control.
  • Studying online RL with human-in-the-loop data acquisition.
  • Generalizing to more complex embodied domains (search-and-rescue, assembly, multi-agent settings).
  • Developing robust, efficient monitors/adapters suitable for on-robot deployment.
  • Investigating adversarial robustness and safety in multi-turn collaboration.

ColBench is positioned as a scalable, reproducible, and engineering-light platform central to benchmarking progress in multi-turn RL for collaborative AI—both in the language-model and embodied agent paradigms (Zhou et al., 19 Mar 2025, Liu et al., 2024).
