MAFBench: Multi-Agent Framework Benchmark

Updated 10 April 2026

MAFBench is a unified evaluation suite for multi-agent LLM frameworks that isolates architectural effects from model quality and prompt engineering.
It standardizes benchmarks across memory, planning, specialization, and coordination to compare orchestration overhead and design choices in diverse environments.
Empirical results reveal that framework-level decisions dramatically influence latency, throughput, and task success, offering actionable insights for system optimization.

MAFBench (Multi-Agent Framework Benchmark) is a unified evaluation suite designed for systematic, controlled comparison of multi-agent LLM frameworks. Its primary aim is to elucidate how framework-level architectural choices—distinct from LLM model quality or prompt engineering—govern system performance, cost, and robustness. MAFBench integrates previously isolated benchmarks under a standardized pipeline, allowing direct measurement of the impact of orchestration, memory abstractions, planning interfaces, agent specialization, and coordination mechanisms across heterogeneous multi-agent environments. By fixing underlying model parameters, prompts, and data, and varying only the framework layer, MAFBench isolates the architectural effects that induce substantial variations in system behavior and efficiency (Orogat et al., 3 Feb 2026).

1. Definition, Motivation, and Scope

MAFBench addresses the lack of standardized, framework-level benchmarks for multi-agent LLM systems. While existing benchmarks focus on capabilities such as tool use, retrieval, or reasoning at the agent level, they are not designed for cross-framework comparison. MAFBench brings together benchmarks covering memory, planning, specialization, and coordination within a single execution and logging environment.

MAFBench is motivated by the need to compare architectural choices such as orchestration overhead, memory strategies, and interaction topologies, which alone can cause order-of-magnitude differences in latency, throughput, accuracy, and scalability. It does so by holding the LLM(s), prompts, and input data constant and varying only the framework implementation.

Core to MAFBench are two formal definitions:

Agent: $a = (\mathcal{R}, \mathcal{Y}, \mathcal{P}, \mathcal{S}, \mathcal{T}, f)$, where $\mathcal{R}$ is role/specialization, $\mathcal{Y}$ objectives, $\mathcal{P}$ planning, $\mathcal{S}$ storage/memory, $\mathcal{T}$ tools, and $f$ the LLM reasoning function.
Framework: $\mathbf{F} = (\{a_i\}_{i=1}^n, \mathcal{O}, \mathcal{C}, \mathcal{E})$, with $\{a_i\}$ agents, $\mathcal{O}$ orchestration/control flow, $\mathcal{C}$ communication topology, and $\mathcal{E}$ optional environment.
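
As an illustration, the two tuples can be written down as plain data structures. The following is a minimal Python sketch: the class names, field names, and the `echo_llm` stand-in are assumptions chosen to mirror the symbols above, not part of the MAFBench codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Agent:
    """a = (R, Y, P, S, T, f). Field names are illustrative assumptions."""
    role: str                                # R: role/specialization
    objectives: list[str]                    # Y: objectives
    planning: Optional[str]                  # P: planning interface, None if absent
    memory: dict[str, str]                   # S: storage/memory
    tools: dict[str, Callable[..., object]]  # T: tools, keyed by name
    reason: Callable[[str], str]             # f: LLM reasoning function


@dataclass
class Framework:
    """F = ({a_i}, O, C, E). Field names are illustrative assumptions."""
    agents: list[Agent]                      # {a_i}: the agents
    orchestration: str                       # O: orchestration/control flow
    topology: dict[int, set[int]]            # C: adjacency sets over agent indices
    environment: Optional[dict] = None       # E: optional environment


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    return f"echo: {prompt}"


solver = Agent("solver", ["answer the question"], None, {}, {}, echo_llm)
checker = Agent("checker", ["verify the answer"], None, {}, {}, echo_llm)

# Two agents that can message each other, with no explicit environment.
framework = Framework(
    agents=[solver, checker],
    orchestration="manager-worker",
    topology={0: {1}, 1: {0}},
)
```

Under this reading, a controlled comparison keeps `reason` (the model) and the prompts fixed and varies only the `Framework` fields.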

2. Architectural Taxonomy

MAFBench classifies frameworks along two axes: Architectural Paradigms and Design Dimensions.

Architectural Paradigms

Graph-Based (e.g., LangGraph): Explicit DAG workflows determine control/data flow.
Role-Based (e.g., CrewAI, AutoGen, OpenAI SDK): Coordination via textual role specifications and manager–worker delegation.
GABM (e.g., Concordia): Environment-mediated agent interactions without direct peer messaging.

Fundamental Design Dimensions

Orchestration/control flow: Fixed DAG; role-conditioned; emergent loops.
Memory abstractions:
- LTM (Long-Term Memory)
- STM (Short-Term Memory)
- EM (Entity Memory)
- WM (Working Memory)
- EK (External Knowledge)
Planning interface: None; schema-constrained Crew-Plan; free-form LLM-Plan injection.
Specialization: Identity framing; abstract planning; procedural guidance.
Coordination & interaction: Network topology (small-world, scale-free, star), communication patterns (edge propagation, manager–worker, environment hub), explicit collaboration primitives.
Environment modeling: Implicit (execution context); explicit world state.

3. Benchmark Components and Evaluation Pipeline

MAFBench orchestrates five complementary evaluations under a standardized pipeline:

Benchmark Domain	Subcomponents/Evaluation	Key Focus
Memory	MemoryAgentBench (AR, TTL, LRU, SF)	Memory retention, retrieval, forgetting
Planning	GSM8K, CSQA, MATH-100; NoPlan, Crew-Plan, LLM-Plan interfaces	Planning mechanism and interface effects
Specialization	CatDB tasks (Utility, WiFi, EU-IT, Yelp, Volkert); role/planning/expert strategies	Agent conditioning
Tool Use	StableToolBench (integrated, qualitative only)	Not quantitatively reported
Coordination	AGENTSNET (Coloring, Matching, VertexCover, LeaderElection, Consensus); graph/topology variants	Multi-agent interaction success

In every evaluation, MAFBench fixes LLM model, prompt templates, concurrency settings, session budgets, logging schema, and scoring logic, ensuring direct, architecture-level comparability.
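
The controlled-comparison idea can be sketched as a run plan in which every setting except the framework is frozen. This is a hypothetical Python illustration: the `FixedSettings` fields, the placeholder values, and `build_runs` are assumptions, not MAFBench's actual configuration schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FixedSettings:
    """Settings held constant across frameworks (placeholder fields)."""
    model: str
    prompt_template: str
    concurrency: int
    session_budget: int


def build_runs(settings: FixedSettings, frameworks: list[str]) -> list[dict]:
    """Create one run description per framework, sharing all fixed settings."""
    return [{"framework": name, **asdict(settings)} for name in frameworks]


settings = FixedSettings(
    model="example-model",            # placeholder, not a real model name
    prompt_template="Q: {question}\nA:",
    concurrency=4,
    session_budget=100,
)
runs = build_runs(settings, ["LangGraph", "CrewAI", "Concordia"])

# Sanity check: the framework is the only key whose value differs across runs.
varying = {key for key in runs[0] if len({run[key] for run in runs}) > 1}
```

Checking that `varying` contains only `"framework"` is a cheap guard against accidentally changing a prompt or budget between frameworks.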

4. Empirical Findings and Performance Metrics

MAFBench’s controlled experiments reveal that framework-level design choices can dramatically alter system performance. The quantitative metrics include latency (total runtime / #queries), throughput (#queries / total runtime), accuracy (#correct / #total), planning accuracy (#correct_plans / #total_plans), and coordination success (#successful_runs / #total_runs).
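
The ratio metrics above are simple to compute from a run log. The sketch below is a minimal Python illustration: the `RunLog` structure and the example numbers are made up for demonstration and are not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one benchmark run."""
    total_runtime_s: float
    num_queries: int
    num_correct: int


def latency(log: RunLog) -> float:
    """Average seconds per query: total runtime / #queries."""
    return log.total_runtime_s / log.num_queries


def throughput(log: RunLog) -> float:
    """Queries per second: #queries / total runtime."""
    return log.num_queries / log.total_runtime_s


def accuracy(log: RunLog) -> float:
    """Fraction of correct answers: #correct / #total."""
    return log.num_correct / log.num_queries


def latency_multiplier(framework: RunLog, baseline: RunLog) -> float:
    """Framework latency expressed as a multiple of a direct-LLM baseline."""
    return latency(framework) / latency(baseline)


# Made-up numbers: 50 queries answered directly vs. through a framework.
direct = RunLog(total_runtime_s=100.0, num_queries=50, num_correct=40)
wrapped = RunLog(total_runtime_s=250.0, num_queries=50, num_correct=40)

print(latency(direct), latency(wrapped))    # 2.0 5.0
print(latency_multiplier(wrapped, direct))  # 2.5
print(accuracy(wrapped))                    # 0.8
```

Planning accuracy and coordination success follow the same pattern, with plans or runs in place of queries.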

Dimension	Metric	Best Observed	Worst Observed
Orchestration	Latency (× direct LLM)	1.3×	117×
Orchestration	Throughput (req/s)	8.9	0.01
Memory	Memory Score	23.8%	6.1%
Planning	Accuracy Δ	+15pp	−30pp
Planning	Runtime Multiplier	1.2×	30×
Specialization	F1 Score Δ	+58	≈0
Coordination	Success (large n)	>90%	<30%

Notable findings:

Orchestration overhead led to latency increases of more than 100× and throughput on the order of 0.1 req/s in GABM frameworks.
Retrieval-centric memory architectures (LangGraph, AR 44.9%) substantially outperformed accumulation-only approaches (OpenAI SDK, 33%) on memory recall; all frameworks were deficient in selective forgetting (SF).
Schema-constrained Crew-Plan interfaces reduced planning accuracy by 30pp (e.g., MATH from 80% → 48%), induced a 7–30× runtime increase, and suffered up to 85% formatting failures; free-form LLM-Plan preserved or improved accuracy at only 1.2–6.6× runtime cost.
Specialization via expert-guided procedural prompts raised F1 by 58 points on classification; role/planning-based conditioning alone was ineffective (ΔF1 ≈ 0).
Coordination: Local tasks (Coloring, Matching) succeeded on sparse topologies (around 97% success); global tasks (VertexCover, LeaderElection, Consensus) failed (<30% success) except on fully-connected/star topologies.
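
The formatting-failure effect reported for schema-constrained planning can be illustrated with a toy parser. The Python sketch below is not the Crew-Plan interface: the schema in `REQUIRED_KEYS`, the two parser functions, and the sample outputs are all hypothetical. It only shows how a strict parser can reject an output whose reasoning is correct.

```python
import json
import re
from typing import Optional

REQUIRED_KEYS = {"steps", "final_answer"}  # hypothetical plan schema


def parse_strict(output: str) -> Optional[str]:
    """Accept only a JSON object with exactly the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None  # not valid JSON: counted as a formatting failure
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None  # valid JSON but wrong shape
    return str(data["final_answer"])


def parse_permissive(output: str) -> Optional[str]:
    """Take the last number appearing in free-form text as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return numbers[-1] if numbers else None


free_form = "First add 12 and 30 to get 42. The answer is 42"
structured = '{"steps": ["12 + 30 = 42"], "final_answer": 42}'

print(parse_strict(free_form))      # None
print(parse_permissive(free_form))  # 42
print(parse_strict(structured))     # 42
```

Both outputs reach the same answer, but only the structured one survives the strict parser. A permissive extractor trades that rigidity for the risk of picking up the wrong number, so some lightweight validation is still needed.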

5. Actionable Design Principles and Framework Selection

The empirical study motivates several design principles:

Orchestration Overhead: Scalability is dominated by orchestration depth. Prefer shallow control flows unless multi-round interactions are essential.
Task-Semantic Memory: Architect memory to match task semantics, combining retrieval-first mechanisms for recall/abstraction with bounded accumulation for session-specific learning.
Permissive Planning: Rigid schema interfaces should be avoided, as they induce high overhead and convert correct reasoning into parse failures.
Procedural Specialization: Effective specialization demands embedding expert procedural guidance; role labels do not suffice.
Topology-Task Alignment: Communication topology must align with the information-flow needs of the task; simply increasing rounds or model size does not compensate.
Interface Primacy: System interfaces and architectural choices dominate multi-agent behavior; prompt design cannot rectify poor execution semantics.
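
One way to build intuition for topology-task alignment is to count how many hops separate the two most distant agents in a communication graph. In a synchronous model where each round moves information across one edge, an agent cannot learn about another agent in fewer rounds than their graph distance. The Python sketch below compares a star and a ring; the graph builders and the synchronous-round assumption are illustrative and do not describe the AGENTSNET protocol.

```python
from collections import deque


def star(n: int) -> dict[int, set[int]]:
    """Star topology: node 0 is the hub, nodes 1..n-1 are leaves."""
    return {i: (set(range(1, n)) if i == 0 else {0}) for i in range(n)}


def ring(n: int) -> dict[int, set[int]]:
    """Ring topology: each node is linked to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def eccentricity(graph: dict[int, set[int]], source: int) -> int:
    """Largest hop count from `source` to any node, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())


def diameter(graph: dict[int, set[int]]) -> int:
    """Largest hop count between any pair of nodes in a connected graph."""
    return max(eccentricity(graph, source) for source in graph)


print(diameter(star(16)))  # 2
print(diameter(ring(16)))  # 8
```

With 16 agents, any two agents in the star are at most 2 hops apart, while the ring needs up to 8. Tasks that require global agreement therefore need more rounds on sparse, high-diameter graphs.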

6. Limitations and Future Research Directions

MAFBench highlights several unresolved directions:

Principled Memory Editing: Introduction of explicit, dependency-aware deletion and revision primitives for selective forgetting and knowledge updating.
Robust Planning Interfaces: Development of lightweight validation/supervision layers to handle interface variability while ensuring correctness.
Adaptive Topologies: Runtime reconfiguration of communication graphs with theoretical convergence and bounded cost guarantees.
Automated Compilation: High-level task specification compilation into optimized orchestration, memory, planning, and coordination layouts (e.g., ORCA-like abstractions).
Formal Scalability Analysis: Analytic cost models relating topology, orchestration depth, and memory semantics to performance metrics.

Together, these directions suggest that future multi-agent LLM systems will require holistic architectural and interface optimization, beyond LLM and prompt improvements, to achieve robust, efficient, and scalable agentic behavior (Orogat et al., 3 Feb 2026).

References (1)

Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis (2026)
