MAFBench: Multi-Agent Framework Benchmark
- MAFBench is a unified evaluation suite for multi-agent LLM frameworks that isolates architectural effects from model quality and prompt engineering.
- It standardizes benchmarks across memory, planning, specialization, and coordination to compare orchestration overhead and design choices in diverse environments.
- Empirical results reveal that framework-level decisions dramatically influence latency, throughput, and task success, offering actionable insights for system optimization.
MAFBench (Multi-Agent Framework Benchmark) is a unified evaluation suite designed for systematic, controlled comparison of multi-agent LLM frameworks. Its primary aim is to elucidate how framework-level architectural choices—distinct from LLM model quality or prompt engineering—govern system performance, cost, and robustness. MAFBench integrates previously isolated benchmarks under a standardized pipeline, allowing direct measurement of the impact of orchestration, memory abstractions, planning interfaces, agent specialization, and coordination mechanisms across heterogeneous multi-agent environments. By fixing underlying model parameters, prompts, and data, and varying only the framework layer, MAFBench isolates the architectural effects that induce substantial variations in system behavior and efficiency (Orogat et al., 3 Feb 2026).
1. Definition, Motivation, and Scope
MAFBench addresses the lack of standardized, framework-level benchmarks for multi-agent LLM systems. While existing benchmarks focus on capabilities such as tool use, retrieval, or reasoning at the agent level, they are not designed for cross-framework comparison. MAFBench brings together benchmarks covering memory, planning, specialization, and coordination within a single execution and logging environment.
MAFBench is motivated by the need to compare architectural choices, such as orchestration overhead, memory strategies, and interaction topologies, which alone can cause order-of-magnitude differences in latency, throughput, accuracy, and scalability. It does so by holding the LLM(s), prompts, and input data constant and varying only the framework implementation.
Core to MAFBench are two formal definitions:
- Agent: a six-component tuple comprising role/specialization, objectives, planning, storage/memory, tools, and the LLM reasoning function.
- Framework: a four-component tuple comprising a set of agents, orchestration/control flow, a communication topology, and an optional environment.
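The two definitions above can be sketched as Python dataclasses. This is only an illustrative reading of the tuples: the class names, field names, and types (for example, memory as a plain list and topology as adjacency lists) are assumptions made here, not structures taken from MAFBench.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Agent:
    """One agent, with one field per component of the definition."""
    role: str                                  # role/specialization
    objectives: list[str]                      # what the agent should achieve
    llm: Callable[[str], str]                  # LLM reasoning function: prompt -> text
    planner: Optional[str] = None              # planning interface; None means no planning
    memory: list[str] = field(default_factory=list)   # storage/memory (assumed: a list)
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)


@dataclass
class Framework:
    """A framework: agents plus how they are driven and connected."""
    agents: list[Agent]
    orchestration: str                         # control-flow style, e.g. "graph" or "role"
    topology: dict[int, list[int]]             # adjacency lists over agent indices
    environment: Optional[dict] = None         # optional explicit world state


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return f"echo: {prompt}"


solver = Agent(role="solver", objectives=["answer the query"], llm=echo_llm)
checker = Agent(role="checker", objectives=["verify the answer"], llm=echo_llm)
fw = Framework(agents=[solver, checker], orchestration="role",
               topology={0: [1], 1: [0]})
```

Under this reading, a controlled comparison keeps every `Agent.llm` identical and changes only the `Framework` fields.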
2. Architectural Taxonomy
MAFBench classifies frameworks along two axes: Architectural Paradigms and Design Dimensions.
Architectural Paradigms
- Graph-Based (e.g., LangGraph): Explicit DAG workflows determine control/data flow.
- Role-Based (e.g., CrewAI, AutoGen, OpenAI SDK): Coordination via textual role specifications and manager–worker delegation.
- GABM (generative agent-based modeling; e.g., Concordia): Environment-mediated agent interactions without direct peer messaging.
Fundamental Design Dimensions
- Orchestration/control flow: Fixed DAG; role-conditioned; emergent loops.
- Memory abstractions: Retrieval-centric (e.g., LangGraph); accumulation-only (e.g., OpenAI SDK).
- Planning interface: None; schema-constrained Crew-Plan; free-form LLM-Plan injection.
- Specialization: Identity framing; abstract planning; procedural guidance.
- Coordination & interaction: Network topology (small-world, scale-free, star), communication patterns (edge propagation, manager–worker, environment hub), explicit collaboration primitives.
- Environment modeling: Implicit (execution context); explicit world state.
3. Benchmark Components and Evaluation Pipeline
MAFBench orchestrates five complementary evaluations under a standardized pipeline:
| Benchmark Domain | Subcomponents/Evaluation | Key Focus |
|---|---|---|
| Memory | MemoryAgentBench (AR, TTL, LRU, SF) | Memory retention, retrieval, forgetting |
| Planning | GSM8K, CSQA, MATH-100; NoPlan, Crew-Plan, LLM-Plan interfaces | Planning mechanism and interface effects |
| Specialization | CatDB tasks (Utility, WiFi, EU-IT, Yelp, Volkert); role/planning/expert strategies | Agent conditioning |
| Tool Use | StableToolBench (integrated, qualitative only) | Not quantitatively reported |
| Coordination | AGENTSNET (Coloring, Matching, VertexCover, LeaderElection, Consensus); graph/topology variants | Multi-agent interaction success |
In every evaluation, MAFBench fixes LLM model, prompt templates, concurrency settings, session budgets, logging schema, and scoring logic, ensuring direct, architecture-level comparability.
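A minimal sketch of this controlled setup, assuming each framework is wrapped behind a common adapter function. The names `FixedSettings`, `run_comparison`, and the two toy adapters are hypothetical and do not come from MAFBench; the adapters are stubs that never call a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FixedSettings:
    """Everything held constant across frameworks (frozen, so it cannot drift)."""
    model: str
    prompt_template: str
    concurrency: int
    session_budget: int


# Hypothetical adapter signature: (settings, query) -> answer text.
Adapter = Callable[[FixedSettings, str], str]


def run_comparison(settings: FixedSettings,
                   adapters: Dict[str, Adapter],
                   dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score every framework on the same data with the same exact-match logic."""
    scores = {}
    for name, adapter in adapters.items():
        correct = sum(1 for query, gold in dataset
                      if adapter(settings, query).strip() == gold)
        scores[name] = correct / len(dataset)
    return scores


TOY_ANSWERS = {"2+2": "4", "3*3": "9"}


def direct_adapter(settings: FixedSettings, query: str) -> str:
    """Stub framework that passes the answer through unchanged."""
    return TOY_ANSWERS[query]


def wrapping_adapter(settings: FixedSettings, query: str) -> str:
    """Stub framework that wraps the same answer in extra text."""
    return "Answer: " + TOY_ANSWERS[query]


settings = FixedSettings(model="stub-model", prompt_template="Q: {query}\nA:",
                         concurrency=1, session_budget=10)
dataset = [("2+2", "4"), ("3*3", "9")]
scores = run_comparison(settings,
                        {"direct": direct_adapter, "wrapping": wrapping_adapter},
                        dataset)
print(scores)  # {'direct': 1.0, 'wrapping': 0.0}
```

Both stubs know the same answers, so the score gap in this toy run comes entirely from the framework layer, which is the kind of effect the fixed pipeline is meant to expose.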
4. Empirical Findings and Performance Metrics
MAFBench’s controlled experiments reveal that framework-level design choices can dramatically alter system performance. The quantitative metrics are latency (total runtime / #queries), throughput (#queries / total runtime), accuracy (#correct / #total), planning accuracy (#correct_plans / #total_plans), and coordination success (#successful_runs / #total_runs).
| Dimension | Metric | Best Observed | Worst Observed |
|---|---|---|---|
| Orchestration | Latency (× direct LLM) | 1.3× | 117× |
| Orchestration | Throughput (req/s) | 8.9 | ≈0.01 |
| Memory | Memory Score | 23.8% | 6.1% |
| Planning | Accuracy Δ | +15pp | -30pp |
| Planning | Runtime Multiplier | 1.2× | 30× |
| Specialization | F1 Score Δ | +58 | ≈0 |
| Coordination | Success (large n) | >90% | <30% |
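The ratio metrics defined before the table can be written out directly. The function names and the sample numbers below are illustrative assumptions, not values from the study.

```python
def latency(total_runtime_s: float, n_queries: int) -> float:
    """Mean seconds per query: total runtime / #queries."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    return total_runtime_s / n_queries


def throughput(total_runtime_s: float, n_queries: int) -> float:
    """Queries per second: #queries / total runtime."""
    if total_runtime_s <= 0:
        raise ValueError("total_runtime_s must be positive")
    return n_queries / total_runtime_s


def success_ratio(n_success: int, n_total: int) -> float:
    """Shared form of accuracy, planning accuracy, and coordination success."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_success / n_total


# Made-up run: 50 queries finished in 25 seconds, 40 of them correct.
print(latency(25.0, 50))      # 0.5
print(throughput(25.0, 50))   # 2.0
print(success_ratio(40, 50))  # 0.8
```

Latency and throughput as defined here are reciprocals of each other, so a framework that multiplies per-query latency divides throughput by the same factor.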
Notable findings:
- Orchestration overhead led to >100× latency increases and <0.1 req/s throughput in GABM frameworks.
- Retrieval-centric memory architectures (LangGraph, AR 44.9%) substantially outperformed accumulation-only approaches (OpenAI SDK, about 33%) on memory recall; all frameworks were deficient in selective forgetting (SF).
- Schema-constrained Crew-Plan interfaces reduced planning accuracy by about 30pp (e.g., MATH from 80% to 48%), induced a 7–30× runtime increase, and suffered up to 85% formatting failures; free-form LLM-Plan preserved or improved accuracy at only 1.2–6.6× runtime cost.
- Specialization via expert-guided procedural prompts raised F1 by +58 points on classification; role/planning-based conditioning alone was ineffective (ΔF1 ≈ 0).
- Coordination: Local tasks (Coloring, Matching) succeeded on sparse topologies (about 97% success); global tasks (VertexCover, LeaderElection, Consensus) failed (<30% success) except on fully-connected/star topologies.
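One structural way to see why topology matters for global tasks is to compare how many hops separate the farthest pair of agents. The sketch below computes graph diameter with breadth-first search for a star and for a ring. The ring is used here only as a simple stand-in for a sparse topology and is an assumption of this example; it is not one of the benchmark's graph families, and this is not the study's own analysis.

```python
from collections import deque


def star(n: int) -> dict[int, list[int]]:
    """Star topology: node 0 is the hub, every other node links only to it."""
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        adj[0].append(i)
        adj[i].append(0)
    return adj


def ring(n: int) -> dict[int, list[int]]:
    """Ring topology (n >= 3): each node links to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def eccentricity(adj: dict[int, list[int]], src: int) -> int:
    """Largest hop distance from src to any reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj: dict[int, list[int]]) -> int:
    """Largest hop distance between any two nodes of a connected graph."""
    return max(eccentricity(adj, s) for s in adj)


print(diameter(star(16)))  # 2
print(diameter(ring(16)))  # 8
```

In the star, any agent's message can reach any other in two hops through the hub, whereas in the ring the distance grows with the number of agents. Tasks such as Consensus depend on information from every agent, so the hop count is a rough lower bound on the message-passing steps needed before any coordination logic can succeed.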
5. Actionable Design Principles and Framework Selection
The empirical study motivates several design principles:
- Orchestration Overhead: Scalability is dominated by orchestration depth. Prefer shallow control flows unless multi-round interactions are essential.
- Task-Semantic Memory: Architect memory to match task semantics, combining retrieval-first mechanisms for recall/abstraction with bounded accumulation for session-specific learning.
- Permissive Planning: Rigid schema interfaces should be avoided, as they induce high overhead and convert correct reasoning into parse failures.
- Procedural Specialization: Effective specialization demands embedding expert procedural guidance; role labels do not suffice.
- Topology-Task Alignment: Communication topology must align with the information-flow needs of the task; simply increasing rounds or model size does not compensate.
- Interface Primacy: System interfaces and architectural choices dominate multi-agent behavior; prompt design cannot rectify poor execution semantics.
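The Permissive Planning principle can be illustrated with two toy plan parsers. The schema (`steps`, `final_answer`), the function names, and the sample model output are all assumptions made for this sketch; they are not the Crew-Plan or LLM-Plan interfaces themselves.

```python
import json

REQUIRED_KEYS = ("steps", "final_answer")  # hypothetical plan schema


def parse_strict(text: str) -> dict:
    """Schema-constrained parsing: reject anything that is not a conforming JSON object."""
    try:
        plan = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(plan, dict) or any(key not in plan for key in REQUIRED_KEYS):
        raise ValueError("plan is missing required keys")
    return plan


def parse_permissive(text: str) -> dict:
    """Free-form parsing: always keep the plan text, attach structure only if it parses."""
    try:
        structured = parse_strict(text)
    except ValueError:
        structured = None
    return {"plan_text": text, "structured": structured}


# A correct plan written as prose rather than JSON.
free_form = "First compute 12 * 4 = 48, then add 2. Final answer: 50"

try:
    parse_strict(free_form)
    strict_ok = True
except ValueError:
    strict_ok = False

permissive = parse_permissive(free_form)
print(strict_ok)                # False
print(permissive["plan_text"])  # the original prose plan, unchanged
```

The reasoning in `free_form` is sound, yet the strict parser turns it into a failure, while the permissive parser passes the text along for injection into the next prompt.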
6. Limitations and Future Research Directions
MAFBench highlights several unresolved directions:
- Principled Memory Editing: Introduction of explicit, dependency-aware deletion and revision primitives for selective forgetting and knowledge updating.
- Robust Planning Interfaces: Development of lightweight validation/supervision layers to handle interface variability while ensuring correctness.
- Adaptive Topologies: Runtime reconfiguration of communication graphs with theoretical convergence and bounded cost guarantees.
- Automated Compilation: High-level task specification compilation into optimized orchestration, memory, planning, and coordination layouts (e.g., ORCA-like abstractions).
- Formal Scalability Analysis: Analytic cost models relating topology, orchestration depth, and memory semantics to performance metrics.
This suggests that future multi-agent LLM systems will require holistic architectural and interface optimization, beyond LLM and prompt improvements, to achieve robust, efficient, and scalable agentic behavior (Orogat et al., 3 Feb 2026).