The paper introduces MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse interactive scenarios. The framework measures not only task completion but also the quality of collaboration and competition using milestone-based key performance indicators (KPIs). It also evaluates coordination protocols, such as star, chain, tree, and graph topologies, and strategies like group discussion and cognitive planning. The authors additionally introduce MARBLE (Multi-agent cooRdination Backbone with LLM Engine), a framework that evaluates LLM-based multi-agent systems in six diverse interactive scenarios, capturing both collaborative and competitive dynamics.
The evaluation framework, MARBLE, uses interconnected modules for collaboration, communication, and task execution. The Coordination Engine initializes and synchronizes the other modules, including the Agent Graph Module and the Cognitive Module.
The Agent Graph Module converts configuration data into a structured graph $G = (V, E)$, where $V$ is the set of agents and each edge in $E$ is a triple $(v_i, v_j, r_{ij})$, with $r_{ij}$ representing the relationship between agents $v_i$ and $v_j$ (a construction sketch follows the list):
- $V$: set of agents
- $E$: set of edges
- $v_i, v_j$: agents
- $r_{ij}$: relationship between agents $v_i$ and $v_j$
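The sketch below shows how configuration data might be turned into such a graph. The class, method, and config field names are illustrative assumptions, not MARBLE's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentGraph:
    """Illustrative container for the agent graph G = (V, E)."""
    agents: set[str] = field(default_factory=set)                     # V: set of agents
    edges: list[tuple[str, str, str]] = field(default_factory=list)   # E: (v_i, v_j, r_ij) triples

    @classmethod
    def from_config(cls, config: dict) -> "AgentGraph":
        # The config is assumed to list agents and their pairwise relationships.
        graph = cls()
        graph.agents = {a["id"] for a in config["agents"]}
        for rel in config.get("relationships", []):
            graph.edges.append((rel["source"], rel["target"], rel["relation"]))
        return graph

# Hypothetical configuration with three agents and two typed relationships.
config = {
    "agents": [{"id": "planner"}, {"id": "coder"}, {"id": "reviewer"}],
    "relationships": [
        {"source": "planner", "target": "coder", "relation": "supervises"},
        {"source": "coder", "target": "reviewer", "relation": "collaborates_with"},
    ],
}
graph = AgentGraph.from_config(config)
```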
The Cognitive Module maintains an internal state that includes each agent’s persona, inter-agent relationships, and reasoning strategies, mirroring human cognitive processes.
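The shape of that internal state can be illustrated with a small, hypothetical data structure; the field names are assumptions rather than MARBLE's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CognitiveState:
    """Illustrative per-agent internal state (field names are assumptions)."""
    persona: str                                                   # the agent's role description
    relationships: dict[str, str] = field(default_factory=dict)   # other agent id -> relationship
    reasoning_strategy: str = "chain_of_thought"                   # e.g. vanilla, CoT, group discussion
```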
The Coordination Engine orchestrates the system execution flow, initializing agents, tasks, and relationships via a Configuration Module, constructing the Agent Graph, and distinguishing between planners and actors. It supports four coordination protocols: star, tree, graph, and chain. The planner supports vanilla prompting, Chain-of-Thought (CoT), group discussion, and cognitive self-evolving planning.
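To illustrate how the four topologies differ structurally, the sketch below builds a communication edge set for each protocol over a list of agents. The helper is hypothetical: it assumes the first agent acts as the hub (star) or root (tree), and treats the graph protocol as a fully connected mesh.

```python
import itertools

def build_edges(agents: list[str], protocol: str) -> list[tuple[str, str]]:
    """Return communication edges for a given coordination protocol (illustrative only)."""
    if protocol == "star":
        # Central planner connected to every other agent.
        return [(agents[0], a) for a in agents[1:]]
    if protocol == "chain":
        # Each agent talks only to its successor.
        return list(zip(agents, agents[1:]))
    if protocol == "tree":
        # Simple binary tree rooted at agents[0].
        return [(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))]
    if protocol == "graph":
        # Fully connected mesh: every pair of agents may interact.
        return list(itertools.combinations(agents, 2))
    raise ValueError(f"unknown protocol: {protocol}")

print(build_edges(["planner", "a1", "a2", "a3"], "star"))
```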
The benchmark includes diverse scenarios spanning task-oriented and social-simulation-based environments. Task-oriented scenarios include research tasks, Minecraft-based building tasks, database error analysis, and coding challenges. Social-simulation-based scenarios include Werewolf and Bargaining. Each scenario enforces distinct agent roles and defines specific graph relationships. Each task is segmented into flexible milestones monitored by an LLM-based detector.
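A milestone detector of this kind can be approximated with one LLM judgment per milestone; the prompt and the `llm` callable below are placeholders for whatever client and prompt the benchmark actually uses.

```python
def detect_milestones(transcript: str, milestones: list[str], llm) -> dict[str, bool]:
    """Ask an LLM judge whether each milestone is satisfied by the interaction transcript.

    `llm` is assumed to be any callable mapping a prompt string to a text response.
    """
    results = {}
    for milestone in milestones:
        prompt = (
            "You are evaluating a multi-agent task.\n"
            f"Milestone: {milestone}\n"
            f"Transcript:\n{transcript}\n"
            "Has this milestone been achieved? Answer YES or NO."
        )
        results[milestone] = llm(prompt).strip().upper().startswith("YES")
    return results
```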
The evaluation considers Task Completion Performance and Coordination. Task Completion Metrics include milestone-based KPIs and a separate task-based score. The overall KPI is defined as
$$\mathrm{KPI} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KPI}_i, \qquad \mathrm{KPI}_i = \frac{m_i}{M},$$
where the symbols are defined as follows (a computation sketch follows the list):
- $N$: number of agents
- $\mathrm{KPI}_i$: individual KPI for agent $i$
- $M$: total number of milestones
- $m_i$: number of milestones agent $i$ contributes to
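Under this definition the overall KPI is simply the mean of the per-agent milestone fractions. A minimal sketch, with variable names mirroring the symbols above:

```python
def overall_kpi(milestone_contributions: list[int], total_milestones: int) -> float:
    """Average of the individual KPIs KPI_i = m_i / M over the N agents."""
    n = len(milestone_contributions)          # N: number of agents
    if n == 0 or total_milestones == 0:
        return 0.0
    return sum(m_i / total_milestones for m_i in milestone_contributions) / n

# Example: 3 agents, 10 milestones, contributing to 6, 4, and 8 of them.
print(overall_kpi([6, 4, 8], total_milestones=10))  # 0.6
```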
Coordination Metrics evaluate communication and planning capabilities, with a Communication Score and a Planning Score derived from LLM-based evaluations on a five-point scale. The overall Coordination Score (CS) is the average of these two sub-scores.
The experiment settings involve three open-source models (Meta-Llama-3.3-70B, Meta-Llama-3.1-70B-Instruct-Turbo, and Meta-Llama-3.1-8B-Instruct-Turbo) and two closed-source models (GPT-3.5-turbo-0125 and GPT-4o-mini). The models are configured with a maximum of 1024 generated tokens, a temperature of 0.7, and a top_p of 1.0. The overall maximum number of iterations is set to 5 for the research scenario and 20 for Minecraft. The evaluation assesses models along two primary axes: Task Score (TS) and Coordination Score (CS). A graph-mesh coordination protocol is adopted to facilitate interactions.
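These generation settings correspond to a sampling configuration along the following lines; the keys follow common OpenAI-style parameter names and are only illustrative, not the benchmark's actual config files.

```python
# Hypothetical generation settings mirroring the reported experiment configuration.
GENERATION_CONFIG = {
    "max_tokens": 1024,   # maximum number of generated tokens per call
    "temperature": 0.7,
    "top_p": 1.0,
}

# Overall iteration caps per scenario, as reported in the experiment settings.
MAX_ITERATIONS = {
    "research": 5,
    "minecraft": 20,
}
```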
gpt-4o-mini achieves superior task performance across multiple tasks. For example, in the Research scenario it obtains a TS of 84.13\%, outperforming models such as Meta-Llama-3.1-8B (80.87\%) and Meta-Llama-3.1-70B (80.80\%).
The impact of different collaboration protocols (star, tree, graph, and chain) on model performance in the Research scenario is investigated. The graph-based protocol performs best, while the tree-based protocol performs poorly. Cognitive self-evolving planning demonstrates superior coordination, whereas the group discussion method scores worst across all metrics.
Ablation studies identify key modules and parameters that affect performance. Both task and coordination scores improve as the number of iterations increases from 1 to 7, but then drop sharply at 10 iterations. Increasing the number of agents leads to a decrease in the overall KPI.
In MultiAgentBench, goal-driven emergent behaviors are pivotal to team coordination. Under information asymmetry and role conflicts, agents display three key patterns: strategic information sharing, trust-polarized collaboration, and role-driven strategy iteration.
The authors conclude by highlighting the need for expanding scenario and model coverage, enhancing ablation studies, advancing competition mechanisms, and handling open-ended and ill-defined tasks.