MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents (2503.01935v1)

Published 3 Mar 2025 in cs.MA, cs.AI, cs.CL, and cs.CY

Abstract: LLMs have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.

The paper introduces MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across various interactive scenarios. The framework measures task completion and the quality of collaboration and competition using milestone-based key performance indicators (KPIs). It also evaluates coordination protocols, such as star, chain, tree, and graph topologies, and strategies like group discussion and cognitive planning. The authors introduce the MARBLE (Multi-agent cooRdination Backbone with LLM Engine) framework, which rigorously evaluates LLM-based multi-agent systems in six diverse interactive scenarios, capturing both collaborative and competitive dynamics.

The evaluation framework, MARBLE, uses interconnected modules for collaboration, communication, and task execution. The Coordination Engine initializes and synchronizes the other modules, including the Agent Graph Module and the Cognitive Module.

The Agent Graph Module converts configuration data into a structured graph $G = (\mathcal{A}, E)$, where $\mathcal{A} = \{a_1, a_2, \dots, a_n\}$ is the set of agents, and each edge in $E$ is a triple $(a_i, r, a_j)$ with $r \in \mathcal{R}$ representing the relationship between agents $a_i$ and $a_j$ (a minimal code sketch follows the symbol list below).

  • $\mathcal{A}$: Set of agents
  • $E$: Set of edges
  • $a_i, a_j$: Agents
  • $r$: Relationship between agents
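
As a concrete illustration of this structure, the following minimal sketch represents the agent graph as a relation-labeled edge list. The class and method names are illustrative assumptions, not the actual MARBLE data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AgentGraph:
    # A = {a_1, ..., a_n}: the set of agents
    agents: List[str] = field(default_factory=list)
    # E: relation-labeled edges (a_i, r, a_j)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

    def neighbors(self, agent: str) -> List[Tuple[str, str]]:
        """Return (relation, other_agent) pairs for edges leaving `agent`."""
        return [(r, dst) for src, r, dst in self.edges if src == agent]

# Example: a planner supervising two collaborating researcher agents
g = AgentGraph(agents=["planner", "agent_1", "agent_2"])
g.add_edge("planner", "supervises", "agent_1")
g.add_edge("planner", "supervises", "agent_2")
g.add_edge("agent_1", "collaborates_with", "agent_2")
```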

The Cognitive Module maintains an internal state that includes each agent’s persona, inter-agent relationships, and reasoning strategies, mirroring human cognitive processes.

The Coordination Engine orchestrates the system execution flow, initializing agents, tasks, and relationships via a Configuration Module, constructing the Agent Graph, and distinguishing between planners and actors. It supports four coordination protocols: star, tree, graph, and chain. The planner supports vanilla prompting, Chain-of-Thought (CoT), group discussion, and cognitive self-evolving planning.
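To make the four topologies concrete, the sketch below builds the corresponding edge sets over a small list of agents. The helper functions are assumptions for illustration, not MARBLE's API.

```python
from typing import List, Tuple

Edge = Tuple[str, str]

def star_edges(agents: List[str]) -> List[Edge]:
    """Star: the first agent acts as a hub connected to every other agent."""
    hub, rest = agents[0], agents[1:]
    return [(hub, a) for a in rest]

def chain_edges(agents: List[str]) -> List[Edge]:
    """Chain: a linear pipeline where each agent talks only to its successor."""
    return list(zip(agents, agents[1:]))

def tree_edges(agents: List[str]) -> List[Edge]:
    """Tree: a binary tree rooted at the first agent."""
    return [(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))]

def graph_edges(agents: List[str]) -> List[Edge]:
    """Graph (mesh): every pair of agents may interact directly."""
    return [(a, b) for i, a in enumerate(agents) for b in agents[i + 1:]]

agents = ["a1", "a2", "a3", "a4"]
print(star_edges(agents))   # [('a1', 'a2'), ('a1', 'a3'), ('a1', 'a4')]
print(graph_edges(agents))  # all 6 unordered pairs
```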

The benchmark includes diverse scenarios spanning task-oriented and social-simulation-based environments. Task-oriented scenarios include research tasks, Minecraft-based building tasks, database error analysis, and coding challenges. Social-simulation-based scenarios include Werewolf and Bargaining. Each scenario enforces distinct agent roles and defines specific graph relationships. Each task is segmented into flexible milestones monitored by an LLM-based detector.
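A milestone detector of this kind could be sketched as below; the prompt wording and the generic `llm` callable (prompt in, text out) are assumptions for illustration, not the paper's exact detector.

```python
from typing import Callable, List

def detect_completed_milestones(
    llm: Callable[[str], str],
    activity_log: str,
    milestones: List[str],
) -> List[bool]:
    """Ask the LLM, per milestone, whether the agent log shows it was achieved."""
    results = []
    for milestone in milestones:
        prompt = (
            "You are a milestone detector for a multi-agent task.\n"
            f"Milestone: {milestone}\n"
            f"Agent activity log:\n{activity_log}\n"
            "Answer strictly YES or NO: has this milestone been achieved?"
        )
        answer = llm(prompt).strip().upper()
        results.append(answer.startswith("YES"))
    return results
```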

The evaluation considers Task Completion Performance and Coordination. Task Completion Metrics include milestone-based KPIs and a separate task-based score. The overall KPI is defined as (a short computation sketch follows the symbol list):

$$\text{KPI}_{\text{overall}} = \frac{1}{N}\sum_{j=1}^{N}\text{KPI}_j = \frac{1}{NM}\sum_{j=1}^{N} n_j$$

  • $N$: Number of agents
  • $\text{KPI}_j$: Individual KPI for agent $j$
  • $M$: Total number of milestones
  • $n_j$: Number of milestones agent $j$ contributes to
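
A minimal computation of the overall KPI, following the formula above, might look as follows; the function name and argument layout are illustrative.

```python
from typing import List

def overall_kpi(contributions: List[int], num_milestones: int) -> float:
    """KPI_overall = (1 / (N * M)) * sum_j n_j, with contributions[j] = n_j."""
    n_agents = len(contributions)  # N
    return sum(contributions) / (n_agents * num_milestones)

# Example: 3 agents and 10 milestones; agents contribute to 4, 6, and 5 milestones
print(overall_kpi([4, 6, 5], num_milestones=10))  # 0.5
```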

Coordination Metrics evaluate communication and planning capabilities, with a Communication Score ($C_{\text{score}}$) and a Planning Score ($P_{\text{score}}$) derived from LLM-based evaluations on a five-point scale. The overall Coordination Score (CS) is the average of these two sub-scores.

The experiment settings involve three open-source models (Meta-Llama-3.3-70B, Meta-Llama-3.1-70B-Instruct-Turbo, and Meta-Llama-3.1-8B-Instruct-Turbo) and two closed-source models (GPT-3.5-turbo-0125 and GPT-4o-mini). The models are configured with a maximum of 1024 tokens, a temperature of 0.7, and a top_p of 1.0. The maximum number of iterations is set to 5 for the research scenario and 20 for Minecraft. The evaluation assesses models along two primary axes: Task Score (TS) and Coordination Score (CS). A graph-mesh coordination protocol is adopted to facilitate interactions.
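These reported settings can be summarized in a small configuration snippet; the dictionary keys are illustrative assumptions, not the actual MARBLE configuration schema.

```python
# Illustrative configuration mirroring the reported experiment settings.
EXPERIMENT_CONFIG = {
    "models": [
        "Meta-Llama-3.3-70B",
        "Meta-Llama-3.1-70B-Instruct-Turbo",
        "Meta-Llama-3.1-8B-Instruct-Turbo",
        "gpt-3.5-turbo-0125",
        "gpt-4o-mini",
    ],
    "generation": {"max_tokens": 1024, "temperature": 0.7, "top_p": 1.0},
    "max_iterations": {"research": 5, "minecraft": 20},
    "coordination_protocol": "graph",  # graph-mesh protocol used for interactions
}
```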

gpt-4o-mini shows superior task performance across multiple tasks. For example, in the Research scenario it obtains a TS of 84.13%, outperforming other models such as Meta-Llama-3.1-8B (80.87%) and Meta-Llama-3.1-70B (80.80%).

The impact of different collaboration protocols (Star, Tree, Graph, and Chain) on model performance in the Research scenario is investigated. The graph-based protocol excels in research scenarios, while the tree-based protocol performs poorly. Cognitive Evolving Planning demonstrates superior coordination, and the group discussion method scores the worst across all metrics.

Ablation studies identify key modules and parameters that affect performance. Both task and coordination scores increase from 1 to 7 iterations, but then drop sharply at 10 iterations. Increasing the number of agents leads to a decrease in the overall KPI.

In MultiAgentBench, goal-driven emergent behaviors are pivotal to team coordination. Under information asymmetry and role conflicts, agents display three key patterns: strategic information sharing, trust-polarized collaboration, and role-driven strategy iteration.

The authors conclude by highlighting the need for expanding scenario and model coverage, enhancing ablation studies, advancing competition mechanisms, and handling open-ended and ill-defined tasks.

Authors (11)
  1. Kunlun Zhu
  2. Hongyi Du
  3. Zhaochen Hong
  4. Xiaocheng Yang
  5. Shuyi Guo
  6. Zhe Wang
  7. Zhenhailong Wang
  8. Cheng Qian
  9. Xiangru Tang
  10. Heng Ji
  11. Jiaxuan You