LLM-based Multi-Agent Systems
- LLM-based MAS are distributed architectures where multiple LLM-powered agents collaborate through explicit messaging protocols to solve complex tasks.
- These systems use modular design, dynamic agent selection, and standardized benchmarking to ensure reproducibility and performance across diverse applications.
- LLM-based MAS improve efficiency, security, and fault tolerance in domains such as urban prediction, data marketplaces, and medical decision support.
LLM-based multi-agent systems (LLM-based MAS) are distributed intelligent architectures in which multiple agents, each instantiated as or powered by an LLM, collaborate to solve tasks requiring collective reasoning, planning, tool use, critique, or specialized workflows. These systems are engineered with explicit message-passing protocols, shared environments, rigorous benchmarking, and often complex topological structures that reflect problem requirements and optimization considerations. The field has evolved from simple static role-play and chain-of-thought pipelines to dynamic, graph-adaptive ensembles, modular frameworks, and highly specialized inter-agent protocols. This article surveys state-of-the-art methodologies, architectures, benchmarks, domain applications, optimization and scaling trends, and current limitations, drawing directly from foundational open-source infrastructures and controlled experimental studies.
1. Unified Software Architectures and Messaging Protocols
Modern LLM-MAS frameworks are characterized by a modular, extensible architecture that treats every agent as a subclass or runtime instance inheriting from a common base class. MASLab (Ye et al., 22 May 2025) exemplifies this structure: all MAS algorithms subclass a unified BaseMAS, which manages LLM invocations (hosted APIs or vLLM), token/time tracking, logging, and debugging. Agents are composable modules ('Planner', 'Critic', 'Executor') that communicate via standardized message buffers. The system enforces explicit message-passing APIs:
```
AgentA.send(message) → AgentB.receive()
```
with shared schemas for role assignment, dynamic prompt templates, and tool invocation routines (e.g., code execution, web search, image analysis).
Inter-agent communication can be centralized, decentralized, or layered; topologies (A ⊆ Agents × Agents) are programmed either statically in code, or determined adaptively at runtime (see AMAS (Leong et al., 2 Oct 2025) and DynaSwarm (Leong et al., 31 Jul 2025)). Benchmark environments and tasks are encapsulated behind uniform Environment interfaces with query generation, ground-truth labeling, and standardized evaluation entrypoints, permitting fair, direct comparison across methods and model backends.
Global configurations enforce uniform LLM settings for all agents (model selection, temperature, token limits), ensuring that only algorithmic differences influence outcomes. Extending the system requires subclassing the base MAS, registering new methods and benchmarks, and using provided validation workflows—supporting rapid scaling, fair ablation studies, and reproducibility.
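The base-class-plus-message-buffer pattern described above can be sketched in a few lines. This is an illustrative minimal skeleton, not MASLab's actual API: the class name `BaseMAS`, the buffer layout, and the word-count token proxy are all assumptions for exposition.

```python
from collections import deque

class BaseMAS:
    """Minimal sketch of a unified MAS base class (illustrative, not
    MASLab's actual API). Routes messages through per-agent inboxes and
    tracks a crude global token count."""

    def __init__(self):
        self.buffers = {}     # agent name -> inbox queue
        self.tokens_used = 0  # global token/cost tracking

    def register(self, name):
        self.buffers[name] = deque()

    def send(self, sender, receiver, message):
        # Standardized message schema: sender plus content.
        self.tokens_used += len(message.split())  # word count as token proxy
        self.buffers[receiver].append({"from": sender, "content": message})

    def receive(self, name):
        # Pop the oldest pending message, or None if the inbox is empty.
        return self.buffers[name].popleft() if self.buffers[name] else None

# Usage mirroring AgentA.send(message) -> AgentB.receive():
mas = BaseMAS()
mas.register("Planner")
mas.register("Executor")
mas.send("Planner", "Executor", "solve step 1")
msg = mas.receive("Executor")
```

Concrete frameworks add role schemas, prompt templates, and tool routing on top of this core send/receive contract.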
2. MAS Methodological Taxonomy: Workflows and Topologies
LLM-based MAS architectures are broadly grouped into functional domains and workflow classes:
- Single-Agent Baselines: Vanilla LLM, Chain-of-Thought (CoT) prompting for stepwise reasoning (Ye et al., 22 May 2025).
- Collaborative Planning: Fixed roles (CAMEL), User–Assistant loops (AutoGen) (Ye et al., 22 May 2025).
- Debate and Critique: Multi-agent debate (MAD, LLM-Debate): iterative rounds where proposer and opponent agents exchange arguments; a judge agent aggregates and decides. Empirical gains scale positively with task depth and width (Tang et al., 5 Oct 2025).
- Workflow Generation: Systems like AgentVerse dynamically recruit, plan, and critique, with agent selection loops implemented as LLM calls. MAS-GPT reframes MAS design as executable program synthesis by a meta-LLM (Ye et al., 5 Mar 2025).
- Optimization-Based MAS: GPTSwarm, ADAS, and AFlow treat the agent interaction graph as learnable parameters, using gradient or actor-critic RL to optimize the collaboration topology for maximum downstream utility (Leong et al., 31 Jul 2025, Leong et al., 2 Oct 2025).
- Tool-Augmented MAS: Agents invoke external APIs or toolkits (code, vision, web) within a message-passing protocol, as in OWL-Roleplaying or ReAct-MASLab (Ye et al., 22 May 2025).
- Domain-Specific Agentic MAS: Custom agent libraries for tasks such as medical therapy (MedAgents), mathematical reasoning (MACM), scientific prediction (Urban-MAS), and simulation of strategic marketplaces (Sashihara et al., 17 Nov 2025, Lou, 30 Oct 2025, Wu et al., 15 Jul 2025).
Recent architectures exploit parallelized planning-acting to maximize real-time responsiveness and interruptibility, deploying dual-thread structures synchronized by centralized memory (Li et al., 5 Mar 2025). Blackboard systems implement shared, evolving memory objects with dynamic agent selection and consensus extraction, achieving improved token efficiency (Han et al., 2 Jul 2025).
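Among the workflow classes above, the debate-and-critique loop is the easiest to make concrete. The sketch below stubs out the model call with a placeholder `llm()` function; the role names, round structure, and judge aggregation are illustrative of MAD-style debate, not any specific system's implementation.

```python
# Sketch of a multi-agent debate (MAD-style) loop. llm() is a stub
# standing in for a real model call; roles and prompts are illustrative.
def llm(role, prompt):
    # Placeholder: a real system would invoke a hosted or local LLM here.
    return f"[{role}] response to: {prompt[:40]}"

def debate(question, rounds=2):
    transcript = []
    argument = llm("proposer", question)
    transcript.append(argument)
    for _ in range(rounds):
        # Opponent critiques the latest argument; proposer revises.
        rebuttal = llm("opponent", argument)
        transcript.append(rebuttal)
        argument = llm("proposer", question + " | rebuttal: " + rebuttal)
        transcript.append(argument)
    # A judge agent aggregates the full transcript and decides.
    return llm("judge", "\n".join(transcript))

verdict = debate("Is 17 prime?", rounds=1)
```

Swapping the stub for real LLM calls, and the round count for an adaptive stopping rule, yields the iterative proposer/opponent/judge pattern described above.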
3. Benchmarks, Metrics, and Standardized Evaluation
Benchmarks are critical to empirical evaluation, capturing both domain generality and method-specific strengths. MASLab (Ye et al., 22 May 2025) includes 10+ tasks: symbolic math (MATH, AQUA-RAT, AIME), science QA (SciBench, GPQA), common code benchmarks (HumanEval, MBPP, GAIA), and medically oriented QA (MedMCQA).
Key metrics:
- Accuracy: Fraction of queries answered correctly.
- Average Rank: Mean rank of a method across all benchmark tasks (lower is better).
- Token Cost/Latency: Total tokens divided by number of queries; mean wall-clock time.
- Reward: Pass-rate on tool-augmented tasks, win-rate or secondary criteria in MAS-driven games (WiS Platform (Hu et al., 2024)).
- Robustness: Error rate, resilience, and recovery time post-fault (Owotogbe, 6 May 2025).
- Semantic Agreement: Human–LLM protocol matching, reported for both LLM-based and the best rule-based extraction (Ye et al., 22 May 2025).
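The first three metrics in the list above reduce to simple aggregate statistics. The helper functions below use the standard definitions (correct-answer fraction, mean per-task rank, tokens per query); the function names are ours, not tied to any framework's API.

```python
# Illustrative computation of the core benchmark metrics
# (standard definitions; function names are our own).
def accuracy(preds, labels):
    """Fraction of queries answered correctly."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def average_rank(per_task_ranks):
    """Mean of a method's rank across benchmark tasks (lower is better)."""
    return sum(per_task_ranks) / len(per_task_ranks)

def token_cost(total_tokens, num_queries):
    """Average tokens consumed per query."""
    return total_tokens / num_queries

acc = accuracy(["A", "B", "C"], ["A", "B", "D"])  # 2 of 3 correct
rank = average_rank([1, 3, 2])                    # ranks on three tasks
cost = token_cost(12000, 100)                     # tokens per query
```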
MAS frameworks such as MAESTRO (Ma et al., 1 Jan 2026) supply run-to-run reproducibility signals: structural (Jaccard edge overlap), order-aware (LCS) similarity, and full execution traces (OpenTelemetry). Failure analysis distinguishes explicit system faults from silent semantic errors.
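The two reproducibility signals just mentioned have direct implementations: Jaccard overlap over the sets of call-graph edges produced by two runs, and an order-aware score from the longest common subsequence (LCS) of their event traces. The sketch below shows both; the function names and normalization choices are illustrative, not MAESTRO's actual interface.

```python
# Sketch of run-to-run reproducibility signals: structural similarity via
# Jaccard overlap of call-graph edges, and order-aware similarity via LCS
# over event sequences. Names and normalization are illustrative.
def jaccard_edges(edges_a, edges_b):
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def lcs_similarity(seq_a, seq_b):
    # Classic dynamic-programming LCS, normalized by the longer trace.
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if seq_a[i] == seq_b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

# Two runs share one edge out of three distinct edges overall.
s = jaccard_edges([("planner", "critic"), ("critic", "executor")],
                  [("planner", "critic"), ("planner", "executor")])
# Two traces share a common subsequence of length 2.
o = lcs_similarity(["plan", "act", "critique"], ["plan", "critique"])
```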
4. Optimization, Adaptation, and Scaling
Leading LLM-MAS systems incorporate advanced optimization mechanisms, adversarial training, cross-task experiential learning, and dynamic adaptation:
- Token/Brevity Optimization: Optima (Chen et al., 2024) trains agents to maximize task accuracy while minimizing token output using reward-penalized supervised or preference optimization; conversations are modeled as trees, with Monte Carlo Tree Search methods used to diversify training traces.
- Dynamic Topology Selection: Actor-critic RL (A2C) is used in DynaSwarm (Leong et al., 31 Jul 2025) to refine edge selection in collaboration graphs; AMAS (Leong et al., 2 Oct 2025) leverages a lightweight LoRA-adapted LLM to select among small pools of RL-optimized graphs per query, achieving statistically significant gains over static topologies.
- Cross-Task Experiential Learning: MAEL (Li et al., 29 May 2025) equips each agent with an experience pool (state, action, reward triples) harvested in training. At inference, agents retrieve high-reward, context-similar exemplars via text embeddings and cosine similarity, improving sample efficiency and convergence.
- Cooperative MARL: MAGRPO (Liu et al., 6 Aug 2025) models collaborative dialogue as a Dec-POMDP, optimizing joint policies via group-relative advantages and clipped PPO-style objectives. This method produces measurable improvements in both writing and code collaboration settings.
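The experiential-retrieval step described for MAEL can be sketched concretely: each agent keeps (state embedding, action, reward) triples and, at inference, scores stored experiences by cosine similarity to the query embedding, weighted by reward. The scoring rule and pool layout below are illustrative assumptions, not MAEL's exact retrieval formula.

```python
import math

# Sketch of MAEL-style experience retrieval: stored (embedding, action,
# reward) triples are ranked by cosine similarity to the query state,
# weighted by reward. Scoring scheme is an illustrative assumption.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(pool, query_emb, top_k=1):
    # Prefer experiences that are both context-similar and high-reward.
    scored = sorted(pool,
                    key=lambda e: cosine(e["emb"], query_emb) * e["reward"],
                    reverse=True)
    return scored[:top_k]

pool = [
    {"emb": [1.0, 0.0], "action": "decompose task", "reward": 0.9},
    {"emb": [0.0, 1.0], "action": "call web search", "reward": 0.8},
]
best = retrieve(pool, [0.9, 0.1])[0]  # query closest to the first exemplar
```

In a full system the embeddings would come from a text encoder over agent states, and the retrieved exemplars would be injected into the agent's prompt as in-context demonstrations.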
Scaling trends show monotonic improvements in accuracy and reliability for most methods with increasing LLM size and sample count; however, some architectures (AgentVerse) exhibit threshold effects, with format errors dominating at small model sizes (Ye et al., 22 May 2025).
5. Domain Applications: Urban Prediction, Data Marketplaces, Therapy Recommendation
LLM-based MAS have demonstrated applicability across heterogeneous real-world domains:
- Urban AI: Urban-MAS (Lou, 30 Oct 2025) integrates deep-research agents for factor prioritization, extraction agents for robust feature acquisition (consistency checks and re-extraction by similarity), and fusion agents for multi-dimensional inference (running amount, perception metrics), yielding up to 51% error reduction in perception tasks.
- Data Marketplaces: Simulated environments with strategic buyer/seller agents (Sashihara et al., 17 Nov 2025). Agents reason in natural language (“think aloud” via chain-of-thought), execute budgeted transactions, update prices, and model demand. Distributional metrics (e.g., purchase counts, buyer repeat rate) closely mirror real-world market trends.
- Medical Decision Support: Multi-disciplinary therapy MAS (Wu et al., 15 Jul 2025) enables conflict resolution in multimorbidity cases by partitioning tasks among specialist LLMs. Evaluation uses correctness/completeness, DDI ratio, conflict ratio, medication burden, and clinical goals met, with detailed error analyses and ablation studies comparing single-agent to MAS protocols.
Domain-specific MAS consistently outperform single-agent baselines on complex, multi-factor problems, particularly when agent decomposition and inter-agent verification protocols are rigorously designed.
6. Robustness, Security, and Fault Tolerance
Robustness is assessed via chaos engineering (LLM hallucination, crash, communication fault injection) and security-oriented adversarial analysis:
- Chaos Framework: Quantifies error rates, resilience, recovery times under controlled agent failures and message loss (Owotogbe, 6 May 2025). Agent redundancy and token-level semantic checks halve error rates and decrease recovery times by over 70%.
- Securing MAS: AgentShield (Wang et al., 28 Nov 2025) introduces a three-layer defense: critical-node auditing via combined graph-centrality and task-contribution scoring; lightweight token auditing through strict-sentry models; and two-round consensus with heavyweight arbiters. This achieves a 92.5% recovery rate with 70% lower overhead compared to traditional majority voting.
- Topology-Guided Remediation: G-Safeguard (Wang et al., 16 Feb 2025) constructs multi-agent utterance graphs, applies edge-featured GNN detection, and prunes compromised nodes’ outgoing edges, recovering >40% lost accuracy under prompt injection.
- Intention-Hiding Threats: AgentXposed (Xie et al., 7 Jul 2025) develops HEXACO-based drift scoring and adaptive interrogation protocols to detect malicious agents across centralized, decentralized, and layered topologies, highlighting unique vulnerabilities and cost-inflation strategies.
Security-focused MAS development now routinely involves trace-level logging, topological vulnerability analysis, and distributed, decentralized auditing to counter both overt and covert agent compromise.
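The topology-guided remediation step is mechanically simple once detection has run: given an utterance graph and a set of nodes flagged as compromised, prune those nodes' outgoing edges so their messages stop propagating. The sketch below stubs the detector (G-Safeguard uses an edge-featured GNN; here the flagged set is given directly) and shows only the pruning step.

```python
# Sketch of topology-guided remediation in the spirit of G-Safeguard:
# drop the outgoing edges of nodes flagged as compromised so their
# messages no longer propagate. Detection itself is stubbed out.
def prune_outgoing(edges, compromised):
    """Keep only edges whose source is not a flagged node."""
    return [(src, dst) for src, dst in edges if src not in compromised]

edges = [("A", "B"), ("B", "C"), ("C", "A")]  # utterance graph
flagged = {"B"}                               # detector output (stubbed)
clean = prune_outgoing(edges, flagged)        # B can still receive, not send
```

Note that incoming edges to the flagged node are kept, so it can still be monitored or audited while its influence on downstream agents is cut off.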
7. Observability, Evaluation Tools, and Future Directions
Recent frameworks, such as MAESTRO (Ma et al., 1 Jan 2026), have standardized MAS execution and instrumented controlled comparison, integrating third-party MAS via adapters, exporting complete telemetry, and rigorously analyzing call-graph stability, resource consumption, and failure signatures.
Empirical findings underscore that MAS architecture exerts a greater influence on cost, latency, accuracy, and reproducibility than base model upgrades or tool choices. Richer workflows increase semantic failure rates and may penalize final accuracy unless carefully designed and debugged. End-to-end signal harvesting enables fine-grained triage and reproducible optimization.
Limitations remain: coverage of novel MAS methods is incomplete, and many frameworks lack challenging long-horizon planning benchmarks or comprehensive user studies. Ongoing research targets scaling to larger agent societies, deeper retrieval-augmented models, hierarchical consensus, and benchmark diversification.
The methodological and empirical rigor enforced by unified environments, dynamic topologies, principled security, and quantitative observability provides a practical and theoretical foundation for designing robust, efficient, and high-performance LLM-based MAS. As new agent classes, benchmarks, security threats, and optimization paradigms emerge, research-centric codebases (MASLab, MAESTRO) and advanced frameworks (AMAS, DynaSwarm, Optima) will continue to track, validate, and extend the collective intelligence frontier of LLM-based multi-agent systems.