LLM-Based Multi-Agent Systems Overview
- LLM-MAS are distributed AI architectures where multiple LLM-powered agents interact via structured communication protocols to collaboratively solve complex tasks.
- These systems implement diverse architectures, from flat to hierarchical, and are supported by emerging evaluation platforms, reinforcement learning methods, and security frameworks that optimize and safeguard performance.
- LLM-MAS demonstrate practical applications in domains like chemical engineering and pest management, showcasing benefits such as modularity, collective intelligence, and adaptive coordination.
LLM-based multi-agent systems (LLM-MAS) are distributed artificial intelligence architectures in which multiple agents—each powered by one or more LLMs—interact, collaborate, and/or compete to solve complex tasks. LLM-MAS leverage the compositional, communicative, and strategic emergent behaviors of multiple LLM agents organized via explicit communication protocols and workflows. These systems are advancing rapidly across technical domains due to their modularity, scalability, and potential for collective intelligence, as reviewed and implemented in recent research across evaluation platforms, security frameworks, reinforcement learning, and domain-specific applications.
1. Foundations and Collaboration Frameworks
LLM-MAS are formally structured as collections of interacting LLM-driven agents, with the system behavior determined by the agents, their objectives, the environment, and inter-agent collaboration channels. Each agent is characterized as a tuple $a_i = (L_i, o_i, e_i, x_i, y_i)$, representing its LLM, current objective, environment, input perceptions, and output, respectively. The collection of agents $A = \{a_1, \dots, a_n\}$ and communication channels $C$ define the overall system $S = (A, C)$. The collaborative process is captured by the formalism:

$$y_i = L_i(o_i, e_i, x_i), \qquad x_i \supseteq \{\, y_j : (a_j \to a_i) \in C \,\},$$

i.e., each agent's output is produced by its LLM from its objective, environment, and perceptions, which in turn include the outputs of the agents connected to it through $C$.
Channels can be tailored for cooperation (shared objectives), competition (conflicting objectives), or coopetition (hybrid scenarios), and system topology may be centralized, decentralized, or hierarchical (Tran et al., 10 Jan 2025). The design of coordination protocols—ranging from static rule-based to dynamic, role-based, or probabilistic policies—remains an open research area, particularly as workflows scale in heterogeneity and complexity.
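The agent-tuple and channel formalism above can be sketched in code. This is a minimal illustrative skeleton, not any cited framework's API; all class and method names (`Agent`, `System`, `step`, `round`) are hypothetical, and the `llm` method is a stub standing in for a real model call.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """An LLM-driven agent: (LLM, objective, environment, perceptions, output)."""
    name: str
    objective: str

    def llm(self, prompt: str) -> str:
        # Placeholder for a real LLM call; here we echo a canned reply.
        return f"{self.name} addressing '{self.objective}': {prompt[:40]}"

    def step(self, perceptions: list[str]) -> str:
        # y_i = L_i(o_i, e_i, x_i): produce an output from objective + inputs.
        return self.llm(" | ".join(perceptions))

@dataclass
class System:
    """S = (A, C): agents plus directed communication channels."""
    agents: dict[str, Agent]
    channels: list[tuple[str, str]]  # (sender, receiver) pairs

    def round(self, query: str) -> dict[str, str]:
        outputs: dict[str, str] = {}
        for name, agent in self.agents.items():
            # x_i includes outputs of agents wired to this one via C.
            inbox = [outputs[s] for (s, r) in self.channels
                     if r == name and s in outputs]
            outputs[name] = agent.step([query] + inbox)
        return outputs

planner = Agent("planner", "decompose the task")
solver = Agent("solver", "solve subtasks")
sys_ = System({"planner": planner, "solver": solver},
              channels=[("planner", "solver")])
result = sys_.round("Design a heat exchanger")
```

Swapping the channel list changes the topology (centralized, decentralized, hierarchical) without touching agent internals, which is the modularity the formalism is meant to capture.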
2. Communication, Information Flow, and Architecture
System-level and internal communication underpin the operation of LLM-MAS. Architectures fall into five major classes: flat (peer-to-peer), hierarchical, team-oriented, society-inspired, and hybrid forms (Yan et al., 20 Feb 2025). Internal communication is implemented via sequential (one-by-one), simultaneous, and summarizer-mediated strategies. Paradigms include:
- Message Passing: Natural language or structured data exchanges.
- Speech Act: Performative language triggering actions or state changes.
- Blackboard Model: Shared memory for posting, updating, and synthesizing information, with specialized agents (e.g., planners, critics, deciders, cleaners) accessing public/private blackboard spaces governed by control units (Han et al., 2 Jul 2025).
This modularity allows dynamic agent roles and iterative consensus formation. Strengths of these architectures include modularity, flexible interaction modalities, and support for both explicit and implicit communication. Limitations manifest as scalability constraints in flat architectures, bottlenecks in centralized or blackboard systems, and vulnerabilities in open or hybrid communication channels.
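The blackboard paradigm can be made concrete with a small sketch, assuming a fixed control schedule and stub agents (the role functions and topic names here are hypothetical simplifications of the planner/critic/decider roles described above):

```python
from collections import defaultdict

class Blackboard:
    """Shared memory: agents post entries under topics; a control unit
    decides who acts next (here: a fixed planner -> critic -> decider order)."""
    def __init__(self):
        self.entries = defaultdict(list)

    def post(self, topic, author, content):
        self.entries[topic].append((author, content))

    def read(self, topic):
        return list(self.entries[topic])

def planner(bb, task):
    bb.post("plan", "planner", f"steps for: {task}")

def critic(bb, task):
    for author, plan in bb.read("plan"):
        bb.post("critique", "critic", f"review of {author}'s plan")

def decider(bb, task):
    # Only commit once at least one critique has been posted.
    if bb.read("critique"):
        bb.post("decision", "decider", "accept plan")

bb = Blackboard()
for agent in (planner, critic, decider):  # control unit: fixed schedule
    agent(bb, "synthesize report")
```

Real systems replace the fixed schedule with a learned or rule-based control unit and partition the board into public and private spaces; the centralized store is also where the bottleneck risk noted above arises.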
3. Evaluation Methodologies and Benchmarking
LLM-MAS demand novel evaluation protocols due to the high dimensionality of agent interactions and emergent behaviors. The WiS platform (Hu et al., 4 Dec 2024) exemplifies open, scalable benchmarking via game-based adversarial scenarios (“Who is Spy?”), supporting Hugging Face models through unified interfaces, real-time leaderboards, and multidimensional evaluation metrics that span win/loss, attack and defense strategies, and explicit reasoning analysis. Key experimental findings show:
- Significant behavioral differentiation between models in both civilian and spy roles.
- Efficacy of explicit chain-of-thought reasoning in enhancing both individual and collective civilian performance.
- Vulnerability to adversarial prompt injections affecting vote accuracy and foul rates.
MASLab (Ye et al., 22 May 2025) addresses reproducibility and fair comparison by standardizing over 20 MAS frameworks in a unified codebase with rigorous output validation and a consistent evaluation interface, reducing each method to a function $f(\text{input}; \theta) = \text{output}$. Benchmarks cover mathematics, coding, science, knowledge, and medical tasks; results show that no single method dominates, that standardized evaluation protocols are essential for fair comparison, and that methods exhibit distinct trade-offs between performance scaling and computational cost.
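The benefit of a single evaluation interface can be illustrated with a toy harness; this is a hedged sketch of the idea, not MASLab's actual code, and the stand-in "methods" are trivial lambdas rather than real MAS frameworks:

```python
from typing import Callable

def evaluate(methods: dict[str, Callable[[str], str]],
             benchmark: list[tuple[str, str]]) -> dict[str, float]:
    """Run every method f(input) -> output through one shared scorer,
    so comparisons differ only in the method, never in the protocol."""
    def score(pred: str, gold: str) -> float:
        return float(pred.strip().lower() == gold.strip().lower())
    return {name: sum(score(f(q), a) for q, a in benchmark) / len(benchmark)
            for name, f in methods.items()}

# Two toy 'MAS methods' standing in for real frameworks.
methods = {"echo": lambda q: q, "reverse": lambda q: q[::-1]}
bench = [("paris", "paris"), ("42", "42")]
scores = evaluate(methods, bench)
```

Because the scorer is fixed and shared, any accuracy gap between entries in `scores` is attributable to the method itself, which is exactly the fairness property a standardized codebase buys.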
4. Security, Robustness, and Vulnerability Management
The compositional nature of LLM-MAS introduces unique security threat surfaces, including inter-agent communication, trust relationships, and tool integrations (He et al., 2 Jun 2025). Notably, vulnerabilities in one component may propagate through message passing, causing system-wide disruptions. Formally, attacker objectives are framed as optimization over malicious perturbations:

$$\max_{\delta \in \Delta} \; \mathcal{J}\big(S'(q, \delta),\, g\big),$$

where $\Delta$ defines the attack subspace, $\mathcal{J}$ quantifies malicious objectives (e.g., harmful outputs, resource exhaustion), $S'$ denotes the compromised system, $q$ the initial query, and $g$ the adversarial goal.
Defensive mechanisms include:
- G-Safeguard: Topology-guided, GNN-based anomaly detection and graph intervention (edge pruning) to cut off malicious message propagation, recovering over 40% performance against prompt injection across various structures (Wang et al., 16 Feb 2025).
- BlindGuard: Unsupervised, attack-agnostic defense using hierarchical agent encoders and corruption-based contrastive learning. BlindGuard’s hierarchical context encoding detects anomalous agents using only normal agent behavior and generalizes well to unknown attacks and topologies (Miao et al., 11 Aug 2025).
Detection frameworks now incorporate behavioral and psychological cues (e.g., AgentXposed, leveraging HEXACO traits and the Reid technique (Xie et al., 7 Jul 2025)) to flag intention-hiding attackers employing modalities such as suboptimal fixation, reframing misalignment, fake injection, and execution delay, especially across decentralized, centralized, or layered communication topologies.
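The graph-intervention idea behind topology-guided defenses can be sketched in a few lines. This is a deliberately simplified stand-in: the anomaly scores below are given directly, whereas G-Safeguard derives them from a trained GNN detector, and the pruning rule is reduced to dropping outgoing edges of flagged agents.

```python
def prune_edges(edges: list[tuple[str, str]],
                anomaly_scores: dict[str, float],
                threshold: float = 0.5) -> list[tuple[str, str]]:
    """Graph intervention: drop the outgoing edges of agents whose anomaly
    score exceeds the threshold, so their messages stop propagating.
    (Scores stand in for a trained GNN detector's output.)"""
    flagged = {a for a, s in anomaly_scores.items() if s > threshold}
    return [(src, dst) for (src, dst) in edges if src not in flagged]

topology = [("A", "B"), ("B", "C"), ("C", "A")]
scores = {"A": 0.1, "B": 0.9, "C": 0.2}  # B looks compromised
safe = prune_edges(topology, scores)     # B -> C is removed
```

The key design point carried over from the real systems is that the intervention acts on the communication graph, not on individual prompts, so one detector protects any downstream agent regardless of topology.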
5. Learning, Optimization, and Dynamic Coordination
Recent advances exploit reinforcement learning (RL) and multi-agent reinforcement learning (MARL) paradigms for end-to-end policy optimization. Algorithms such as Multi-Agent Heterogeneous Group Policy Optimization (MHGPO) (Chen et al., 3 Jun 2025) and Multi-Agent Group Relative Policy Optimization (MAGRPO) (Liu et al., 6 Aug 2025) address limitations of traditional actor-critic methods (e.g., MAPPO), which suffer from critic instability and computational burden. In these frameworks:
- Relative Reward Advantage: Computed over rollout groups, removing the need for explicit critic networks. For a group of $G$ rollouts with rewards $r_1, \dots, r_G$, advantages are normalized as $\hat{A}_i = \big(r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})\big) \,/\, \mathrm{std}(\{r_j\}_{j=1}^{G})$.
- Group Rollout Strategies: Independent, fork-on-first, and round-robin grouping balance convergence speed, inter-agent interaction modeling, and computational efficiency. MHGPO achieves higher accuracy, F1, and computational efficiency on multi-agent search tasks.
- Centralized Training with Decentralized Execution (CTDE): Policies are optimized using joint return estimates, but executed independently by agents using only their local context.
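The critic-free, group-normalized advantage described in the first bullet is straightforward to sketch (a generic GRPO-style normalization, with a zero-variance guard added here as an implementation assumption):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Critic-free advantage estimate: normalize each rollout's reward
    against its group's mean and standard deviation, replacing a learned
    value baseline with group statistics."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mu) / sigma for r in rewards]

adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean, rollouts are only rewarded for beating their peers on the same prompt, which is what removes the critic network and its instability from the training loop.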
Cross-task experiential learning (MAEL) facilitates few-shot enhancement of agent reasoning by systematically retrieving, rewarding, and recycling previous high-quality decision-step experiences, achieving faster convergence and higher solution quality on tasks ranging from mathematics to software project generation (Li et al., 29 May 2025).
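The retrieve-and-reuse step of cross-task experiential learning can be sketched as reward-aware retrieval. This is a loose illustration, not MAEL's algorithm: the similarity measure here is plain token overlap (MAEL would use learned embeddings), and the experience-store schema is invented for the example.

```python
def retrieve_experiences(store: list[dict], query_task: str,
                         k: int = 2) -> list[dict]:
    """Rank stored decision-step experiences by token overlap with the
    new task, breaking ties by stored reward, and reuse the top-k."""
    def overlap(task: str) -> int:
        return len(set(task.lower().split()) & set(query_task.lower().split()))
    ranked = sorted(store,
                    key=lambda e: (overlap(e["task"]), e["reward"]),
                    reverse=True)
    return ranked[:k]

store = [
    {"task": "solve linear equation", "reward": 0.9, "trace": "isolate x"},
    {"task": "write sorting code", "reward": 0.7, "trace": "use quicksort"},
    {"task": "solve quadratic equation", "reward": 0.8, "trace": "use formula"},
]
hits = retrieve_experiences(store, "solve cubic equation")
```

The retrieved traces would then be injected into the agents' prompts as few-shot exemplars, which is where the reported convergence and quality gains come from.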
6. Heterogeneity, Specialization, and Domain Applications
LLM-MAS built with heterogeneous agent backbones (using different LLMs for different roles) outperform homogeneous counterparts. The X-MAS paradigm (Ye et al., 22 May 2025) demonstrates that system intelligence can be elevated to the "collective intelligence" of its constituent LLMs, resulting in gains up to 8.4% (single-domain chatbot-only) and up to 47% (mixed chatbot-reasoner on AIME) without requiring architectural modification. The X-MAS-Bench testbed systematically maps 27 LLMs to five functional agent classes (QA, revise, aggregation, planning, evaluation) across five domains and 1.7 million evaluations, highlighting the synergetic gains of intelligent model-role assignment.
Domain-specific LLM-MAS have shown promise in:
- Chemical engineering: Agents orchestrate process design, control, simulation, and documentation via modular, tool-aware architectures integrated with domain-specific databases and knowledge graphs, though challenges in architectural flexibility, data integration, domain foundation model building, and sustainability remain (Rupprecht et al., 11 Aug 2025).
- Pest management: Editorial workflows using specialized Editor, Retriever, and Validator agents collaboratively synthesize context-sensitive recommendations, improving decision accuracy from 86.8% to 92.6% via validation agent intervention (Shi et al., 14 Apr 2025).
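The Editor/Retriever/Validator workflow from the pest-management example follows a common grounded-generation pattern, sketched below with stub agents; the function names, the toy knowledge base, and the exact-match grounding check are all assumptions for illustration (the real roles are LLM-backed).

```python
def pipeline(query: str, knowledge: dict[str, str]) -> str:
    """Editor -> Retriever -> Validator workflow: the retriever grounds
    the draft in domain knowledge, and the validator vetoes drafts that
    are not supported by the retrieved evidence."""
    def retriever(q: str) -> str:
        return knowledge.get(q, "")

    def editor(q: str, evidence: str) -> str:
        return f"Recommend: {evidence}" if evidence else "Recommend: generic advice"

    def validator(draft: str, evidence: str) -> str:
        # Reject recommendations not grounded in retrieved evidence.
        return draft if evidence and evidence in draft else "NEEDS REVIEW"

    ev = retriever(query)
    return validator(editor(query, ev), ev)

kb = {"aphids on wheat": "apply targeted biocontrol"}
answer = pipeline("aphids on wheat", kb)  # grounded -> passes validation
fallback = pipeline("unknown pest", kb)   # no evidence -> flagged
```

The validator's veto is the structural analogue of the intervention credited with raising decision accuracy in the cited study: ungrounded drafts are caught before they reach the user rather than being silently emitted.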
7. Open Research Directions and Future Trajectories
LLM-MAS research is rapidly evolving, but several grand challenges persist:
- Scalable Coordination and Governance: Designing dynamic orchestration, robust role assignment, and resource management strategies as the number of agents and complexity of tasks grow (Tran et al., 10 Jan 2025).
- Robust Security and Trust: Building layered, attack-agnostic defense and trust management systems tailored to multi-agent messaging, composite attack strategies, and communication topologies (He et al., 2 Jun 2025, Wang et al., 16 Feb 2025, Miao et al., 11 Aug 2025).
- Transparent, Responsible AI: Ensuring transparency, safety, ethical compliance, and environmental responsibility—especially as LLM-MAS are deployed in domain-critical applications (Rupprecht et al., 11 Aug 2025).
- Evaluation and Benchmarking: Developing advanced benchmarks that rigorously quantify security, scalability, coordination, and emergent behavior in realistic and adversarial settings (Hu et al., 4 Dec 2024, Ye et al., 22 May 2025).
- Automated and Adaptive Agent Composition: Leveraging automated agent assignment, dynamic selection, and learning frameworks (e.g., X-MAS, MAEL) to support robust adaptation to novel task distributions and dynamic environments (Ye et al., 22 May 2025, Li et al., 29 May 2025).
The integration of parallelized planning-acting (Li et al., 5 Mar 2025), dynamic task graph-driven architectures (Yu et al., 10 Mar 2025), and communication-centric designs (Yan et al., 20 Feb 2025) further enables LLM-MAS to approach the requisite levels of coordination, responsiveness, and adaptability for complex, high-value deployment scenarios.
In summary, LLM-MAS now encompass a variety of system architectures, learning paradigms, security frameworks, and domain specializations. Ongoing research seeks to harmonize collaborative intelligence, scalability, robustness, and interpretability, paving the way for collective LLM intelligence that approaches the complexity and reliability required for real-world, multi-agent artificial intelligence applications.