MultiAgentBench: LLM Multi-Agent Benchmark
- MultiAgentBench is a benchmark suite designed to evaluate LLM-based multi-agent systems through diverse cooperative and adversarial scenarios.
- It employs milestone-based KPIs and metrics like communication, planning, and coordination scores to assess agent performance.
- The benchmark compares various coordination protocols and planning strategies, revealing insights on scalability, function-call reliability, and emergent behaviors.
MultiAgentBench is a comprehensive benchmark suite for evaluating LLM-based multi-agent systems in scenarios requiring collaboration, coordination, and competition. Distinct from single-agent or narrowly scoped evaluation, MultiAgentBench introduces diverse, interactive tasks spanning mutual-goal (collaborative) and conflicting-goal (competitive) environments, accompanied by novel milestone-based key performance indicators designed to measure not just task completion but also qualitative aspects of agent interaction and emergent multi-agent behavior (Zhu et al., 3 Mar 2025).
1. Benchmark Scope and Scenario Design
MultiAgentBench targets the assessment of multi-agent LLM systems across a spectrum of domains and task types, including both cooperative and adversarial dynamics. Its core scenario suite comprises:
- Task-Oriented (Mutual-Goal) Scenarios
- Research Collaboration: N LLM agents with distinct research profiles jointly co-author a proposal using a structured "5-question" (5q) format. Agents access tools for literature retrieval and co-author network analysis. Milestones are keyed to the stages of 5q formation, improvement, and finalization. The process involves task planning, sub-question division, iterative discussion, literature lookup, and rubric-based evaluation (dimensions: innovation, safety, feasibility).
- Minecraft Building: Agents collaborate within a text-based Mineflayer world to assemble target structures described by block types/locations. Tasks involve blueprint parsing, assignment of construction regions, cooperative material acquisition, and distributed block placement within seeded episode limits, with environment-provided hit-rate metrics.
- Database Error Analysis: Each of five specialized agents diagnoses different database anomalies in a live PostgreSQL instance, conducting queries, sharing discoveries, and converging on root-cause hypotheses.
- Coding Collaboration: Agents occupy specialized code roles (e.g., debugging, test writing) to jointly solve SRDD-derived challenges using modular planning, iterative implementation, and review cycles.
- Social Simulation (Conflicting-Goal) Scenarios
- Bargaining: Two buyers and two sellers—each endowed with a Big-Five personality profile—negotiate over Amazon products, utilizing naturalistic communication tools and scored by effectiveness, progress, and interaction quality.
- Werewolf (Social Deduction): Agents assigned to roles in the standard village-versus-werewolf setting operate according to game rules and deduced social strategies, evaluated both on daily event scores and net win/loss outcomes (Zhu et al., 3 Mar 2025).
Each scenario suite is constructed with explicit data splits (100 scenarios each for research, Minecraft, bargaining, and Werewolf; 10 for database error analysis; 50–100 for coding) and supports variable agent counts, encouraging the study of scaling effects.
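For concreteness, the following minimal sketch shows how one such scenario instance might be represented programmatically; the dataclass fields (`domain`, `cooperative`, `num_agents`, `milestones`, `max_episode_length`) are illustrative assumptions rather than the benchmark's actual configuration schema.

```python
from dataclasses import dataclass

@dataclass
class ScenarioSpec:
    """Illustrative container for one MultiAgentBench scenario instance.

    Field names are assumptions for exposition; the released benchmark
    defines its own configuration schema.
    """
    domain: str                  # e.g. "research", "minecraft", "bargaining"
    cooperative: bool            # mutual-goal (True) vs. conflicting-goal (False)
    num_agents: int              # scenarios support variable agent counts
    milestones: list[str]        # task-relevant milestones used for KPI scoring
    max_episode_length: int      # per-domain episode cap (e.g. 5 for research)

# Example: a mutual-goal research-collaboration scenario with three agents.
research_scenario = ScenarioSpec(
    domain="research",
    cooperative=True,
    num_agents=3,
    milestones=["5q_formation", "5q_improvement", "5q_finalization"],
    max_episode_length=5,
)

if __name__ == "__main__":
    print(research_scenario)
```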
2. Milestone-Based Evaluation and Key Performance Indicators
MultiAgentBench introduces a milestone-driven KPI framework in which agent contribution to task-relevant milestones serves as the foundation for collaboration/competition assessment:
- For an episode with $M$ milestones and $N$ agents, agent $j$'s KPI is $\mathrm{KPI}_j = m_j / M$, where $m_j$ is the number of milestones to which agent $j$ contributed. The overall KPI is the average over agents, $\mathrm{KPI} = \frac{1}{N}\sum_{j=1}^{N} \mathrm{KPI}_j$.
- Secondary metrics per episode include:
- Communication Score and Planning Score: Both in [0, 5], automatically rated via LLM evaluation of agent logs.
- Coordination Score (CS): Defined as the mean of the communication and planning scores.
- Competition Scores for adversarial domains: derived from process-level metrics (e.g., Werewolf net score, bargaining concession balance), scaled to [0, 100].
Milestone attribution and metric aggregation are automated via prompt-based assistants, permitting fine-grained, longitudinal comparison of agent contributions and planning/coordination quality.
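The scoring described above can be illustrated with a short sketch, assuming a binary milestone-contribution matrix as input and treating the LLM-rated communication and planning scores as given; the helper names are hypothetical, not the benchmark's API.

```python
from typing import Sequence

def agent_kpi(contributions: Sequence[Sequence[bool]], j: int) -> float:
    """KPI_j = m_j / M, where m_j is the number of the M milestones agent j
    contributed to. `contributions[j][k]` is True if agent j contributed to
    milestone k (attribution is done by a prompt-based assistant in the
    benchmark; here it is taken as input)."""
    M = len(contributions[j])
    return sum(contributions[j]) / M

def overall_kpi(contributions: Sequence[Sequence[bool]]) -> float:
    """Average of the per-agent KPIs over all N agents."""
    N = len(contributions)
    return sum(agent_kpi(contributions, j) for j in range(N)) / N

def coordination_score(communication: float, planning: float) -> float:
    """Coordination score: mean of the LLM-rated communication and planning
    scores, each in [0, 5]."""
    return (communication + planning) / 2.0

# Example: 3 agents, 4 milestones.
contrib = [
    [True, True, False, True],    # agent 0 contributed to 3 of 4 milestones
    [True, False, False, False],  # agent 1 contributed to 1 of 4
    [False, True, True, True],    # agent 2 contributed to 3 of 4
]
print("per-agent KPIs:", [round(agent_kpi(contrib, j), 2) for j in range(3)])
print(f"overall KPI:    {overall_kpi(contrib):.2f}")
print(f"coordination:   {coordination_score(4.5, 4.0):.2f}")
```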
3. Coordination Protocols and Topological Variants
MultiAgentBench systematically compares four coordination topologies for agent communication and synchronization:
- Star (Centralized): One planner supervises all actors; bidirectional communication; low parallelism, strong oversight.
- Tree (Hierarchical): Root planner distributes subtasks to intermediate sub-planners—enabling controlled parallelism with increased communication overhead.
- Chain (Sequential): Agents pass state in strict sequence; supports dependency chaining but limited parallel execution.
- Graph-Mesh (Fully Decentralized): All agents exchange messages pairwise, supporting high communication bandwidth and the potential for consensus mechanisms via voting or weighted averaging.
Ablation studies reveal that the graph-mesh topology yields the best task score (TS), planning efficiency, and moderate token consumption, outperforming both hierarchical and chain structures. Star topology achieves comparable TS to graph-mesh in some contexts but generally with reduced parallelism. Tree topology performs worst, and chain structure is intermediate (Zhu et al., 3 Mar 2025).
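The four protocols can be pictured as message-routing graphs. The sketch below constructs illustrative adjacency lists for each topology; the constructor names and representation are assumptions for exposition, not the benchmark's implementation.

```python
def star(n: int) -> dict[int, set[int]]:
    """Centralized: agent 0 is the planner, linked bidirectionally to all actors."""
    edges = {i: set() for i in range(n)}
    for i in range(1, n):
        edges[0].add(i)
        edges[i].add(0)
    return edges

def tree(n: int, branching: int = 2) -> dict[int, set[int]]:
    """Hierarchical: root planner delegates to sub-planners, which delegate onward."""
    edges = {i: set() for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // branching
        edges[parent].add(i)
        edges[i].add(parent)
    return edges

def chain(n: int) -> dict[int, set[int]]:
    """Sequential: each agent passes state to the next in strict order."""
    edges = {i: set() for i in range(n)}
    for i in range(n - 1):
        edges[i].add(i + 1)
        edges[i + 1].add(i)
    return edges

def mesh(n: int) -> dict[int, set[int]]:
    """Fully decentralized: every pair of agents exchanges messages directly."""
    return {i: {j for j in range(n) if j != i} for i in range(n)}

# Example: routing structure for 5 agents under each protocol.
for name, build in [("star", star), ("tree", tree), ("chain", chain), ("graph-mesh", mesh)]:
    print(name, build(5))
```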
4. Agent Strategies: Planning, Group Discussion, and Self-Evolving Protocols
MultiAgentBench evaluates not only static planning but also adaptive communication and planning paradigms:
- Group Discussion: Agents are prompted to propose subtasks and constraints, with a planner aggregating these into consensus plans. The iterative process fosters explicit negotiation at the cost of communication overhead.
- Cognitive Self-Evolving Planning: Inspired by Reflexion, planners generate expected outcomes and sub-milestones, agents execute and log outcomes, and the planner contrasts execution with plan, updating an experience memory for future iterations via retrieval-augmented prompting. This paradigm improves coordination score by approximately 3% over vanilla planning (Zhu et al., 3 Mar 2025).
Results demonstrate that cognitive self-evolving planning achieves the highest coordination score (~4.8/5) with milestone achievement rates comparable to chain-of-thought prompting, while group discussion underperforms due to increased overhead.
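A minimal sketch of such a plan–execute–reflect loop is given below, assuming placeholder `llm` and `execute` callables, a naive recency-based retrieval over the experience memory, and free-text lessons; the benchmark's actual implementation may differ.

```python
from typing import Callable

def self_evolving_planning(
    llm: Callable[[str], str],       # placeholder LLM call: prompt -> completion
    task: str,
    execute: Callable[[str], str],   # placeholder executor: plan -> execution log
    iterations: int = 3,
) -> list[str]:
    """Reflexion-inspired loop: plan with retrieved experience, execute,
    contrast outcome with expectation, and store the lesson for next time."""
    experience_memory: list[str] = []
    logs = []
    for _ in range(iterations):
        # Retrieval-augmented prompting: condition the planner on past lessons.
        retrieved = "\n".join(experience_memory[-3:])  # naive recency-based retrieval
        plan = llm(
            f"Task: {task}\nPast lessons:\n{retrieved}\n"
            "Propose sub-milestones and expected outcomes."
        )
        outcome = execute(plan)
        # The planner contrasts execution with the plan and distills a lesson.
        lesson = llm(
            f"Plan:\n{plan}\nActual outcome:\n{outcome}\n"
            "State what to do differently next iteration."
        )
        experience_memory.append(lesson)
        logs.append(outcome)
    return logs

# Example with stub functions standing in for a real LLM and environment.
if __name__ == "__main__":
    stub_llm = lambda prompt: f"[LLM response to {len(prompt)} chars of prompt]"
    stub_exec = lambda plan: f"[execution log for plan of {len(plan)} chars]"
    print(self_evolving_planning(stub_llm, "co-author a research proposal", stub_exec))
```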
5. LLM Agent Configuration and Ablation Findings
Agents in MultiAgentBench are instantiated using both open-source (Meta-Llama-3.1-8B, Meta-Llama-3.3-70B) and closed-source (gpt-3.5-turbo-0125, gpt-4o-mini) models. Configurations use max_token_num=1024, temperature=0.7, and top_p=1.0. Each scenario enforces its own limits on maximum episode length (e.g., 5 for research, 20 for Minecraft) and communication rounds (e.g., 5).
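As an illustration, these reported settings could be bundled into a run configuration as follows; the dictionary layout and helper function are assumptions, with only the listed models, decoding parameters, and example episode limits taken from the text.

```python
# Models evaluated (open- and closed-source), as listed above.
MODELS = [
    "Meta-Llama-3.1-8B",
    "Meta-Llama-3.3-70B",
    "gpt-3.5-turbo-0125",
    "gpt-4o-mini",
]

# Shared decoding parameters reported for all runs.
SAMPLING = {"max_token_num": 1024, "temperature": 0.7, "top_p": 1.0}

# Per-scenario episode caps mentioned in the text (illustrative layout;
# other scenarios follow their own definitions).
EPISODE_LIMITS = {"research": 5, "minecraft": 20}
COMMUNICATION_ROUNDS = 5

def run_config(model: str, scenario: str) -> dict:
    """Assemble one hypothetical run configuration from the pieces above."""
    return {
        "model": model,
        "scenario": scenario,
        **SAMPLING,
        "max_episodes": EPISODE_LIMITS.get(scenario),
        "communication_rounds": COMMUNICATION_ROUNDS,
    }

print(run_config("gpt-4o-mini", "research"))
```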
Key experimental findings:
- Model Comparison: gpt-4o-mini achieves the highest average task and coordination scores in research and coding scenarios. Coordination score correlates strongly with task outcomes in most, but not all, scenarios; model function-call reliability significantly impacts task success (e.g., Llama-3.1-70B reaches high CS but low TS when function-call failures occur).
- Coordination Protocols: Graph-mesh generally yields the best results across metrics; tree topology is least effective.
- Planning Strategy: Cognitive evolution improves milestone completion by +3% over vanilla planning.
- Iteration and Agent Count: Task and coordination scores typically rise with additional iterations up to a threshold, after which performance plateaus or dips. Increasing agent count tends to decrease per-agent contribution but increases total task score up to a saturation point (Zhu et al., 3 Mar 2025).
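Because function-call reliability can decouple coordination scores from task scores, an evaluation harness might track call-parsing failures explicitly. The sketch below is a generic retry-and-accounting wrapper offered as an illustration, not the benchmark's code.

```python
import json
from typing import Callable, Optional

def call_with_retries(
    raw_call: Callable[[], str],   # placeholder: returns the model's JSON tool call
    max_retries: int = 2,
) -> tuple[Optional[dict], int]:
    """Try to parse a model-emitted function call; count failures so that
    task score can later be analyzed against call reliability."""
    failures = 0
    for _ in range(max_retries + 1):
        try:
            return json.loads(raw_call()), failures
        except json.JSONDecodeError:
            failures += 1
    return None, failures

# Example: a flaky stub that returns malformed JSON on the first attempt.
attempts = iter(['{"name": "lookup", "args":', '{"name": "lookup", "args": {}}'])
parsed, n_failed = call_with_retries(lambda: next(attempts))
print(parsed, "after", n_failed, "failed parse(s)")
```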
6. Emergent Behaviors, Limitations, and Extensions
MultiAgentBench facilitates the observation of emergent multi-agent behaviors, including:
- Strategic Information Sharing: Suboptimal transparency (e.g., delayed disclosures in Werewolf) can lead to task failure.
- Trust Polarization: Internal distrust among collaborating agents is exploitable by adversaries, as seen in social deduction tasks.
- Role-Driven Dynamics: Agents with privileged information (e.g., Seer/Witch in Werewolf) transition from passive to leadership roles across iterations.
Limitations include reliance on heuristic toolkits for planning evaluation, possible overfitting to procedural task specifications, and limited coverage of the scenario distribution. Generalization across open- and closed-source LLMs remains an open question, as function-call robustness and communication pacing affect benchmark outcomes.
7. Position within the Multi-Agent LLM Benchmark Ecosystem
MultiAgentBench is designed to fill the gap left by single-agent or domain-specific benchmarks, providing a multi-domain, interactive, and deeply instrumented testbed. Its hallmark contributions are the formalization of milestone-based KPIs, coverage of both cooperative and adversarial scenarios, systematic evaluation of protocol topologies, and rigorous ablation studies of coordination and planning strategies. All code and datasets are public, enabling reproducibility and extension (Zhu et al., 3 Mar 2025).
By contrast, related benchmarks such as BattleAgentBench (Wang et al., 28 Aug 2024) focus on sequential game-theoretic dynamics in a single domain, POGEMA (Skrynnik et al., 20 Jul 2024) targets multi-agent pathfinding, and MOMAland (Felten et al., 23 Jul 2024) addresses multi-objective MARL. MultiAgentBench's breadth of agent interaction and scenario diversity, together with its explicit design for evaluating emergent LLM-driven collaboration and competition, establishes it as a foundational resource for the next generation of multi-agent LLM research.