LLM-Based Multi-Agent Systems
- LLM-based MAS are systems where multiple autonomous agents, powered by LLMs, coordinate to solve tasks beyond a single model's capacity.
- They leverage diverse architectures and specialized roles to enhance reasoning, planning, and decision-making across complex workflows.
- Robustness and scalability are pursued through modular designs, adaptive communication, and rigorous evaluation frameworks.
LLM-based Multi-Agent Systems (LLM-MAS) are systems in which multiple autonomous agents—each controlled or augmented by LLMs—cooperate, coordinate, or compete to solve complex tasks that exceed the capability or reliability of any single model. These systems form the basis of scalable AI architectures for reasoning, planning, decision-making, software generation, formalization, and collaboration across diverse domains.
1. Fundamental Architectures and Design Principles
Topologies and Communication Patterns
LLM-MAS rely on explicit or dynamically constructed collaboration structures. Conventional approaches use fixed, handcrafted topologies—such as centralized (hub-and-spoke), decentralized (peer-to-peer), hierarchical (layered), or chain/graph-based workflows—to define information flow, agent roles, and aggregation strategies (Leong et al., 2 Oct 2025, Leong et al., 31 Jul 2025). Dynamic architectures have emerged, including:
- Adaptive Graphs: Dynamic graph designers (AMAS, DynaSwarm) select optimal agent communication topologies for each input or task instance using parameter-efficient adaptation (LoRA) of LLM backbones (Leong et al., 2 Oct 2025, Leong et al., 31 Jul 2025).
- Blackboard Architectures: All agent communications are mediated through a global shared blackboard, enabling full transparency and dynamic, context-driven agent orchestration (LbMAS) (Han et al., 2 Jul 2025).
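To make the blackboard pattern concrete, the sketch below mediates every agent message through a shared board and lets an orchestrator choose the next agent from the current board state; the class and function names are illustrative, not LbMAS's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Blackboard:
    """Globally shared message store; every agent reads from and writes to it."""
    entries: List[dict] = field(default_factory=list)

    def post(self, author: str, content: str) -> None:
        self.entries.append({"author": author, "content": content})

    def view(self) -> str:
        return "\n".join(f"[{e['author']}] {e['content']}" for e in self.entries)

# An "agent" here is just a callable mapping the current board state to a message.
AgentFn = Callable[[str], str]

def run_blackboard_mas(task: str,
                       agents: Dict[str, AgentFn],
                       orchestrate: Callable[[str, List[str]], str],
                       max_rounds: int = 8) -> str:
    """Context-driven orchestration: each round, the orchestrator inspects the
    full blackboard and selects which specialist agent should act next."""
    board = Blackboard()
    board.post("user", task)
    for _ in range(max_rounds):
        next_agent = orchestrate(board.view(), list(agents))
        if next_agent == "DONE":
            break
        board.post(next_agent, agents[next_agent](board.view()))
    return board.view()
```

Because every exchange passes through the board, the full interaction history is transparent and the topology is decided per round rather than fixed in advance.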
Heterogeneity and Role Specialization
While early LLM-MAS assigned the same LLM to every agent, recent frameworks deploy heterogeneous LLMs, assigning different models to roles based on domain or functional specialization (chatbot, reasoner, planner, etc.). This approach, exemplified by X-MAS, leverages LLM diversity and yields measurable improvements (up to +47% on AIME-2024) over homogeneous MAS without structural redesign (Ye et al., 22 May 2025).
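A minimal sketch of this heterogeneity, assuming a generic chat-completion client; the role names, model identifiers, and `complete` helper are placeholders rather than X-MAS's configuration:

```python
from typing import Callable

# Hypothetical role-to-backbone mapping: each functional role is served by the
# model that profiles best for it, rather than one shared backbone.
ROLE_TO_MODEL = {
    "planner":    "model-a-70b",    # long-horizon task decomposition
    "reasoner":   "model-b-math",   # step-by-step derivations
    "coder":      "model-c-code",   # tool and code generation
    "aggregator": "model-d-chat",   # fluent synthesis of partial answers
}

def call_role(role: str, prompt: str,
              complete: Callable[..., str]) -> str:
    """Route a role's prompt to its dedicated backbone via a generic client."""
    return complete(model=ROLE_TO_MODEL[role], prompt=prompt)
```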
Modular and Extensible Design
Frameworks emphasize plug-and-play extensibility through abstract base classes for agents, LLMs, knowledge bases, retrievers, and tools (e.g., MASA’s architecture for autoformalization (Zhang et al., 10 Oct 2025)). Composability and shared libraries facilitate fast adaptation to evolving tasks, new agent types, or external symbolic resources (e.g., theorem provers, code sandboxing).
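A plug-and-play skeleton in this spirit might look as follows; the class names and method signatures are illustrative assumptions, not MASA's published interfaces:

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional

class BaseLLM(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class BaseTool(ABC):
    """External symbolic resource, e.g. a theorem prover or code sandbox."""
    @abstractmethod
    def run(self, payload: str) -> str: ...

class BaseRetriever(ABC):
    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> List[str]: ...

class BaseAgent(ABC):
    """Agents are composed from swappable LLMs, tools, and retrievers."""
    def __init__(self, llm: BaseLLM,
                 tools: Optional[List[BaseTool]] = None,
                 retriever: Optional[BaseRetriever] = None):
        self.llm, self.tools, self.retriever = llm, tools or [], retriever

    @abstractmethod
    def step(self, task: str, context: Any) -> str: ...
```

New agent types, model backends, or external provers then plug in by subclassing the relevant base class without touching the rest of the system.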
2. Coordination, Workflow, and Optimization
Coordination Patterns
LLM-MAS coordination can be instantiated through direct engineering (chains, trees, pipelines), through process models adapted from software engineering, or through learned optimization. Distinct paradigms include:
- Classical Process Modeling: Software engineering-inspired patterns (Waterfall, V-Model, Agile) map to agent handoff, validation, and iteration, each entailing trade-offs between speed, robustness, and code/output quality (Ha et al., 17 Sep 2025).
- Iterative and Refined Optimization: OMAC provides a theoretically grounded framework for optimizing both agent functionality and collaboration structure across five dimensions, using LLM-based semantic exploration and contrastive evaluation loops to synthesize higher-performing and more coherent MAS (Li et al., 17 May 2025).
- Parallelized and Interleaved Execution: High-performance real-time systems deploy dual-thread architectures for planning and acting with interruptibility, central memory, and skill libraries, reducing latency and supporting coordinated adaptation to dynamic environments (e.g., in Minecraft) (Li et al., 5 Mar 2025).
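A stripped-down version of such a dual-thread design, assuming user-supplied `plan` and `act` callables and a queue for interruptible hand-off; this is a sketch of the general pattern, not the cited system's implementation:

```python
import queue
import threading
import time
from typing import Callable, List

def run_dual_thread(plan: Callable[[], List[str]],
                    act: Callable[[str], None],
                    horizon_s: float = 10.0) -> None:
    """Planner continuously refreshes a plan; the actor executes the newest plan
    and is effectively interrupted whenever a fresher plan arrives."""
    plans: "queue.Queue[List[str]]" = queue.Queue()
    stop = threading.Event()

    def planner() -> None:
        while not stop.is_set():
            plans.put(plan())           # push an updated plan (list of skills)
            time.sleep(0.5)             # replanning cadence

    def actor() -> None:
        current: List[str] = []
        while not stop.is_set():
            try:                        # always prefer the freshest plan
                current = plans.get_nowait()
            except queue.Empty:
                pass
            if current:
                act(current.pop(0))     # execute the next skill from the plan
            else:
                time.sleep(0.05)

    threads = [threading.Thread(target=planner), threading.Thread(target=actor)]
    for t in threads:
        t.start()
    time.sleep(horizon_s)
    stop.set()
    for t in threads:
        t.join()
```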
Experience and Continual Learning
MAEL introduces experience replay and cross-task memory: agents accumulate reward-annotated task-step traces and retrieve them as few-shot exemplars. Step-wise or task-wise retrieval improves convergence and output quality, especially on structurally recurring or long-horizon tasks (Li et al., 29 May 2025).
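A minimal sketch of reward-weighted experience retrieval in this style; the embedding function, data layout, and scoring rule are assumptions rather than MAEL's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class Experience:
    task: str
    step_trace: str   # reasoning/action trace for one task step
    reward: float     # quality score assigned after task completion

class ExperiencePool:
    """Cross-task memory: store reward-annotated traces and retrieve them
    later as few-shot exemplars for similar tasks."""

    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.pool: List[Tuple[np.ndarray, Experience]] = []

    def add(self, exp: Experience) -> None:
        self.pool.append((self.embed(exp.task), exp))

    def retrieve(self, task: str, k: int = 3) -> List[Experience]:
        """Rank stored experiences by task similarity weighted by reward."""
        q = self.embed(task)

        def score(vec: np.ndarray, exp: Experience) -> float:
            cos = float(np.dot(q, vec) /
                        (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-8))
            return cos * exp.reward

        ranked = sorted(self.pool, key=lambda p: -score(*p))
        return [exp for _, exp in ranked[:k]]
```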
3. Robustness, Safety, and Security
Chaos Engineering for Robustness
LLM-MAS are subject to unique vulnerabilities such as hallucination propagation, cascading agent failures, and emergent communication breakdowns. Chaos engineering addresses these by systematically injecting faults (agent dropouts, message corruption, hallucination triggers, resource contention) into sandboxed environments to uncover, and subsequently mitigate, weak points in agent design, inter-agent protocols, and overall system observability (Owotogbe, 6 May 2025).
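As an illustration, a fault-injection wrapper of the kind used in such experiments might look like this; the probabilities and corruption scheme are arbitrary assumptions, not the cited methodology:

```python
import random
from typing import Callable, Optional

AgentFn = Callable[[str], str]

def inject_faults(agent: AgentFn,
                  p_drop: float = 0.1,
                  p_corrupt: float = 0.1,
                  seed: Optional[int] = None) -> AgentFn:
    """Wrap an agent so that, with some probability, its message is dropped
    or corrupted before reaching the rest of the system (sandbox use only)."""
    rng = random.Random(seed)

    def faulty(message: str) -> str:
        if rng.random() < p_drop:
            return ""                           # simulated agent dropout
        reply = agent(message)
        if rng.random() < p_corrupt:            # simulated message corruption
            cut = max(1, len(reply) // 2)
            return reply[:cut] + " [CORRUPTED]"
        return reply

    return faulty

# Usage: run the same benchmark with and without wrapped agents and compare
# task success and recovery behavior to locate brittle protocols.
```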
Security Threat Modeling and Defense
LLM-MAS are vulnerable to novel intention-hiding attacks, including suboptimal fixation, reframing misalignment, fake information injection, and execution delays (Xie et al., 7 Jul 2025). These attacks can degrade collaborative outcomes without obvious misbehavior. Psychological profiling (e.g., HEXACO-based AgentXposed) and graph-topological anomaly detection (G-Safeguard via GNNs) provide robust defense by monitoring agent behavioral deviations and interrupting the spread of malicious content (Xie et al., 7 Jul 2025, Wang et al., 16 Feb 2025).
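The graph-topological defense can be sketched as follows; the `anomaly_score` function stands in for a learned detector (e.g., a GNN as in G-Safeguard), and the pruning rule is an illustrative simplification:

```python
from typing import Callable, Dict, List, Set, Tuple

def quarantine_suspects(edges: List[Tuple[str, str]],
                        messages: Dict[str, str],
                        anomaly_score: Callable[[str], float],
                        threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Score each agent's outgoing messages and prune edges leaving agents
    whose behavior looks anomalous, limiting the spread of malicious or
    misleading content through the MAS communication graph."""
    suspects: Set[str] = {agent for agent, msg in messages.items()
                          if anomaly_score(msg) > threshold}
    return [(src, dst) for src, dst in edges if src not in suspects]
```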
Evaluation Platforms and Behavioral Analysis
To benchmark LLM-MAS beyond synthetic tasks, open platforms such as WiS use structured games ("Who is Spy?") to evaluate reasoning, deception, defense, and collaboration, with dynamic leaderboards and granular, behavior-tracking analytics (Hu et al., 4 Dec 2024).
4. Learning, Adaptation, and Scalability
Reinforcement and Preference Optimization
Communication efficiency and task effectiveness are jointly optimized in frameworks such as Optima, which uses an iterative generate-rank-select-train cycle with multi-objective reward functions that penalize verbosity and encourage performance, interpretability, and token efficiency (Chen et al., 10 Oct 2024). Techniques include supervised fine-tuning, direct preference optimization, and MCTS-inspired sampling for mining paired preferences in tree-structured MAS dialogs.
Reward function:
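A representative sketch of such a multi-objective reward; the λ weights, token normalization, and readability term are assumed notation rather than the paper's exact formulation:

$$
R \;=\; R_{\text{task}} \;-\; \lambda_{\text{token}}\,\frac{N_{\text{tokens}}}{N_{\max}} \;+\; \lambda_{\text{read}}\, R_{\text{read}},
$$

where $R_{\text{task}}$ scores task success, the middle term penalizes verbose exchanges, and $R_{\text{read}}$ rewards fluent, interpretable messages (e.g., via language-model likelihood).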
Group-Based RL for MAS Optimization
MHGPO introduces critic-free, group-based MARL for LLM-MAS, using relative rewards within agent rollout groups to robustly and scalably estimate policy gradients. Sampling strategies (independent, fork-on-first, round-robin) trade off sample diversity and efficiency, consistently outperforming standard MAPPO in task performance and resource use across multi-hop QA and search tasks (Chen et al., 3 Jun 2025).
Group advantage:
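MHGPO's exact estimator is defined in (Chen et al., 3 Jun 2025); the description above parallels a GRPO-style group-relative advantage, sketched here in assumed notation:

$$
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon},
$$

where $r_1,\dots,r_G$ are the rewards of the $G$ rollouts in one group; normalizing within the group removes the need for a learned critic.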
Hybrid MARL-alignment methods (MAGRPO) and Dec-POMDP formalizations further address the challenge of learning cooperative policies in decentralized, partially observable environments with language-based action spaces (Liu et al., 6 Aug 2025).
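For reference, such formalizations build on the standard Dec-POMDP tuple, with natural-language messages playing the role of actions and observations:

$$
\mathcal{M} \;=\; \langle \mathcal{I},\, \mathcal{S},\, \{\mathcal{A}_i\}_{i \in \mathcal{I}},\, P,\, R,\, \{\Omega_i\}_{i \in \mathcal{I}},\, O,\, \gamma \rangle,
$$

with agent set $\mathcal{I}$, states $\mathcal{S}$, per-agent actions $\mathcal{A}_i$ and observations $\Omega_i$, transition kernel $P$, shared reward $R$, observation function $O$, and discount $\gamma$.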
5. Evaluation, Benchmarking, and Generalization
Theoretical Task Complexity and MAS Gains
A principled framework decomposes task complexity into depth (sequential reasoning steps) and width (required parallel capabilities), showing that LLM-MAS gain the most over single-agent LLMs (LLM-SAS) on tasks of high depth. The benefit from collaborating agents increases with both depth and width, but the effect is more pronounced with depth, owing to error correction and diversification across agents (Tang et al., 5 Oct 2025).
Success Rate (MAS vs. SAS):
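The paper's exact comparison appears in (Tang et al., 5 Oct 2025); the depth effect can be illustrated with a simplified independence model (assumed notation, not the paper's result):

$$
\frac{\Pr[\text{success}_{\text{MAS}}]}{\Pr[\text{success}_{\text{SAS}}]} \;\approx\; \left(\frac{p'}{p}\right)^{d},
$$

where a task of depth $d$ is treated as $d$ independent steps, $p$ is a single agent's per-step success probability, and $p' > p$ is the per-step probability after multi-agent error correction; the ratio grows geometrically with $d$, consistent with the depth-dominant gains described above.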
Unified Codebases and Standardized Evaluation
MASLab provides a rigorously validated, open-source codebase with standardized implementations, benchmarks, and evaluation protocols for 20+ MAS methods—removing confounds from data wrangling, parameterization, or brittle rule-based grading. This enables meaningful, transparent comparison and advances reproducibility (Ye et al., 22 May 2025).
Generative MAS Construction
Rather than manual MAS design or prompt engineering, generative paradigms (MAS-GPT) cast MAS construction as a language modeling task—producing executable code for query-adaptive MAS generation in one shot, reducing both development effort and inference cost, with consistent, robust out-of-domain performance (Ye et al., 5 Mar 2025).
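The generative paradigm can be pictured as follows; the prompt, `llm_generate` callable, and direct `exec` call are illustrative assumptions rather than MAS-GPT's actual interface:

```python
from typing import Callable

def build_and_run_mas(query: str, llm_generate: Callable[[str], str]) -> str:
    """One-shot generative MAS construction: ask a generator model to emit
    executable Python defining a query-specific multi-agent workflow, then
    execute it. Sandboxing of the generated code is omitted for brevity."""
    prompt = (
        "Write a Python function `solve(query: str) -> str` that defines a "
        "small multi-agent workflow (roles, message passing, aggregation) "
        f"tailored to answering: {query!r}. Return only code."
    )
    code = llm_generate(prompt)
    namespace: dict = {}
    exec(code, namespace)          # in practice, run inside a sandbox
    return namespace["solve"](query)
```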
6. Creative and Specialized Applications
Creative and Open-ended Generation
LLM-MAS are extensively used for creative tasks (writing, art, ideation), structured around explicit agent personas, divergent exploration, iterative refinement, and collaborative synthesis to maximize output novelty, diversity, and coherence. These systems require dedicated coordination and persona modeling along with creativity-oriented evaluation metrics; open challenges include standardization, conflict resolution, and bias mitigation (Lin et al., 27 May 2025).
Specialized Domains: Software Engineering and Mathematics
Frameworks such as MASA (for mathematical autoformalization) orchestrate LLM agents for generation, hard/soft critique, refinement, and tool integration (theorem provers, KBs), demonstrating robust modularity, interpretable workflows, and significant gains in syntactic/semantic correctness on formal mathematics tasks (Zhang et al., 10 Oct 2025). Process model alignment in code generation and complex software projects (MetaGPT with Waterfall, V-Model, Agile) exposes trade-offs between artifacts, cost, and quality (Ha et al., 17 Sep 2025).
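A generate-critique-refine loop of this shape can be sketched as follows; the callables and stopping rule are assumptions, not MASA's actual orchestration code:

```python
from typing import Callable

def autoformalize(statement: str,
                  generate: Callable[[str], str],
                  critique: Callable[[str, str], str],
                  prover_check: Callable[[str], bool],
                  max_rounds: int = 3) -> str:
    """Generate-critique-refine loop: a generator drafts a formalization,
    a hard check (theorem prover / type checker) gates syntax, and a soft
    critic suggests semantic fixes that feed into the next round."""
    draft = generate(statement)
    for _ in range(max_rounds):
        if prover_check(draft):           # hard critique: does it type-check?
            feedback = critique(statement, draft)   # soft critique: semantics
            if feedback.strip().upper() == "OK":
                return draft
        else:
            feedback = "The formalization fails to type-check; fix the syntax."
        draft = generate(f"{statement}\n\nPrevious attempt:\n{draft}\n"
                         f"Reviewer feedback:\n{feedback}\nRevise accordingly.")
    return draft
```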
LLM-based Multi-Agent Systems have undergone rapid evolution, moving from static topologies and heuristic coordination to adaptive, optimized, and robust architectures that leverage both the diversity of LLM backbones and dynamic, context-sensitive collaboration. State-of-the-art research now focuses on scaling to diverse, complex tasks; enabling unified, extensible platforms; optimizing for robustness and security; and exploring principled frameworks for understanding when and how multi-agent configurations provide decisive performance gains over single-agent approaches.