
LLM-Based Multi-Agent Systems (LLM-MAS)

Updated 25 August 2025
  • LLM-MAS are systems where multiple agents powered by large language models collaborate or compete using diverse communication paradigms to tackle multi-faceted problems.
  • They integrate structured architectures, memory augmentation, and reinforcement learning to optimize decision-making and scale efficiently.
  • These systems enhance collective intelligence and robustness by dynamically generating agents, securing inter-agent communications, and benchmarking performance across varied application domains.

LLM-Based Multi-Agent Systems (LLM-MAS) are systems in which multiple agents—each powered by an LLM—cooperate, coordinate, or compete to solve complex, multi-faceted tasks. By leveraging the advanced reasoning, language understanding, and generative capabilities of LLMs within diverse agent architectures, LLM-MAS are enabling new forms of collective intelligence, adaptive decision-making, and large-scale automated workflows across scientific, industrial, and social domains.

1. Architectural Paradigms and Communication Structures

Research in LLM-MAS encompasses a wide array of agent architectures and communication topologies. Canonical designs include flat, hierarchical, team-based, society-inspired, and hybrid models (Yan et al., 20 Feb 2025). Flat architectures foster decentralized peer-to-peer interactions but exhibit poor scalability as agent count increases, whereas hierarchical frameworks delegate planning and coordination to one or more top-level controller agents, as exemplified by ChatDev or HALO (Hou et al., 17 May 2025).
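The scalability contrast between flat and hierarchical designs can be made concrete by counting communication edges. The following is a minimal illustrative sketch (not taken from any cited framework); the function names and topology representation are assumptions for exposition:

```python
# Comparing the communication-edge count of a flat peer-to-peer topology
# with a hierarchical topology that routes everything through a single
# top-level controller agent.

def flat_edges(agents):
    """Every agent talks to every other agent: O(n^2) edges."""
    return [(a, b) for i, a in enumerate(agents) for b in agents[i + 1:]]

def hierarchical_edges(controller, workers):
    """Workers only talk to the controller: O(n) edges."""
    return [(controller, w) for w in workers]

agents = [f"agent{i}" for i in range(6)]
print(len(flat_edges(agents)))                     # 15 edges among 6 peers
print(len(hierarchical_edges("planner", agents)))  # 6 edges for 6 workers
```

The quadratic growth of the flat case is one reason hierarchical frameworks such as ChatDev or HALO centralize coordination in controller agents.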

Communication paradigms within LLM-MAS are diverse:

  • Message Passing (explicit natural language or structured data transmissions between agents)
  • Speech Act mechanisms (where utterances act as commitments, commands, or queries)
  • Blackboard models (shared memory or context repositories accessible by designated agents) (Yan et al., 20 Feb 2025).

Communication strategies are designed for various modes, such as One-by-One (turn-based), Simultaneous-Talk (parallel exchange), or Simultaneous-Talk-with-Summarizer (with intermediate context aggregation). The system-level goals—cooperation, competition, or mixed—are encoded in both communication workflows and agent incentivization schemes.
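The three talk modes above can be sketched with agents modeled as plain functions from a shared context to an utterance. This is a hedged illustration, not an API from any cited system; the function names and the echo-style agents are assumptions:

```python
# Sketch of the three communication modes: turn-based, parallel, and
# parallel with intermediate context aggregation.

def one_by_one(agents, context):
    """Turn-based: each agent sees the utterances produced before it."""
    history = list(context)
    for agent in agents:
        history.append(agent(history))
    return history

def simultaneous_talk(agents, context):
    """Parallel exchange: all agents respond to the same snapshot."""
    return list(context) + [agent(context) for agent in agents]

def simultaneous_with_summarizer(agents, context, summarize):
    """Parallel round followed by aggregation into one summary message."""
    round_utterances = [agent(context) for agent in agents]
    return list(context) + [summarize(round_utterances)]

echo = lambda name: lambda ctx: f"{name} saw {len(ctx)} messages"
agents = [echo("A"), echo("B")]
print(one_by_one(agents, ["task"]))
# ['task', 'A saw 1 messages', 'B saw 2 messages']
print(simultaneous_talk(agents, ["task"]))
# ['task', 'A saw 1 messages', 'B saw 1 messages']
```

Note how the turn-based mode grows each agent's visible context, while the simultaneous modes keep a fixed snapshot, trading interaction depth for parallelism.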

2. Agent Design Principles: Profiling, Memory, and Adaptation

Agent profiling in LLM-MAS can be pre-defined, model-generated, or data-derived (Guo et al., 21 Jan 2024). Agents are assigned specialized roles (e.g., "programmer", "planner", "evaluator") and behavioral parameters, which may capture tendencies toward exploration, exploitation, or even personality traits (as in HEXACO-modeled security frameworks (Xie et al., 7 Jul 2025)). Memory systems typically combine short-term and long-term repositories—recent trajectories, historical actions, or external knowledge vectors—enabling agents to ground decisions on rich situational context (Zhang et al., 2023).
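A profile combining a role, behavioral parameters, and the short-/long-term memory split might look like the sketch below. Field names (`explore_bias`, `short_term`, `long_term`) are illustrative assumptions, not drawn from a specific paper:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    role: str                  # e.g. "programmer", "planner", "evaluator"
    explore_bias: float = 0.5  # tendency toward exploration vs. exploitation
    # Short-term memory keeps only the recent trajectory window.
    short_term: deque = field(default_factory=lambda: deque(maxlen=4))
    # Long-term memory archives the full history (a stand-in for an
    # external vector store).
    long_term: list = field(default_factory=list)

    def observe(self, event: str) -> None:
        self.short_term.append(event)
        self.long_term.append(event)

    def context(self) -> list:
        """Ground decisions on the recent window only."""
        return list(self.short_term)

agent = AgentProfile(role="planner")
for step in range(6):
    agent.observe(f"action-{step}")
print(agent.context())       # only the last 4 events survive
print(len(agent.long_term))  # 6
```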

Capacity growth is realized via three principal mechanisms:

  • Memory: Agents incorporate historical records using in-context attention or external vector databases.
  • Self-Evolution: Agents autonomously update their behavioral parameters and planning strategies through feedback from other agents or the environment (self-collaboration, communication-driven learning).
  • Dynamic Agent Generation: New agents are instantiated on demand to meet changing requirements, enabling elastic scaling in both size and functionality (Guo et al., 21 Jan 2024).
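The dynamic-agent-generation mechanism can be sketched as a pool that spawns a new agent only when a requested role has no live instance. The registry and naming scheme below are illustrative assumptions:

```python
# On-demand agent instantiation: a task requesting a role that no live
# agent covers triggers a spawn; repeated requests reuse the agent.

class AgentPool:
    def __init__(self):
        self.agents = {}  # role -> agent identifier

    def acquire(self, role: str) -> str:
        """Return an existing agent for the role, or instantiate one."""
        if role not in self.agents:
            self.agents[role] = f"{role}-agent-{len(self.agents)}"
        return self.agents[role]

pool = AgentPool()
tasks = ["planner", "coder", "planner", "tester"]
assigned = [pool.acquire(role) for role in tasks]
print(assigned)
print(len(pool.agents))  # 3: the second "planner" task reused the first agent
```

Elastic scaling in a real system would also retire idle agents and vary the role taxonomy, but the reuse-or-spawn decision is the core of the mechanism.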

3. Computational and Decision-Making Frameworks

Several computational abstractions and control strategies have emerged:

  • Actor-Critic and Distributional RL frameworks underpin scalable decision-making by using centralized critics to coordinate decentralized actor agents. The LLaMAC design (Zhang et al., 2023) exemplifies this paradigm: a TripletCritic module (explore-focused critic, exploit-focused critic, and an assessor) guides actors through iterative, distributionally coded value estimation and internal/external feedback, mitigating hallucination and optimizing token usage.
  • Hierarchical Reasoning and Dynamic Task Decomposition: HALO (Hou et al., 17 May 2025) introduces a three-level hierarchy (planner, role-designer, inference agents), leveraging Monte Carlo Tree Search (MCTS) to optimize reasoning trajectories for each subtask, with dynamic prompt refinement by collaborative agents.
  • Parallel/Asynchronous Execution: Dual-thread (planning-acting) architectures and dynamic task graph engines enable concurrent planning and execution, maximizing throughput and enabling real-time, adaptive response to environment changes (Li et al., 5 Mar 2025, Yu et al., 10 Mar 2025).
  • Reinforcement Learning for MAS Optimization: Critic-free, group-based MARL algorithms such as MHGPO (Chen et al., 3 Jun 2025) enable stable, scalable fine-tuning of LLM-MAS, bypassing value estimation overhead and using group-based relative advantage estimation to accelerate convergence.
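The critic-free idea behind group-based algorithms can be sketched as follows: instead of a learned value baseline, each sampled rollout's reward is normalized against the statistics of its own group of rollouts for the same query. This is a hedged illustration in the spirit of such methods, not MHGPO's exact estimator:

```python
# Group-based relative advantage estimation: rollouts above their
# group's mean reward get positive advantage, below-mean rollouts
# get negative advantage, with no learned critic involved.

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero on uniform groups
    return [(r - mean) / std for r in rewards]

advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])  # positive for above-average rollouts
```

Because the baseline is the group itself, the estimator sidesteps the value-estimation overhead the text mentions, at the cost of needing several rollouts per query.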

4. Security, Robustness, and Vulnerability Analysis

LLM-MAS introduce unique attack surfaces due to their distributed nature, inter-agent communication, and integration with external tools or APIs (He et al., 2 Jun 2025). Security challenges include:

  • Adversarial Attacks: Prompt injection, memory poisoning, and tool exploitation can propagate through the network via inter-agent communications. The G-Safeguard framework (Wang et al., 16 Feb 2025) utilizes graph neural networks to monitor multi-agent utterance graphs, detect anomalies, and prune communication edges to contain adversarial propagation.
  • Cascading Vulnerabilities: Compositional effects, where an individual agent's compromise leads to large-scale system dysfunction, demand a rigorous, component-level threat model. The formalization $\arg\max_{s \in \Theta_S} \text{Evaluator}(S_{ma}, Q, G)$ allows for optimization-driven vulnerability quantification and supports realistic attacker scenarios (white-, black-, or gray-box) (He et al., 2 Jun 2025).
  • Intention-Hiding and Covert Disruption: Subtle attacks (suboptimal fixation, reframing misalignment, fake injection, execution delay) can degrade MAS performance while remaining concealed. AgentXposed (Xie et al., 7 Jul 2025) combines HEXACO-based trait profiling with behavior-triggered interrogation, effectively identifying hidden adversarial agents even in high-concealment scenarios.
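The containment-by-pruning idea can be sketched on a toy utterance graph. In the sketch below the per-agent anomaly scores are hand-set (G-Safeguard derives them with a graph neural network), and the threshold is an illustrative assumption:

```python
# Pruning communication edges that touch a flagged agent, so an
# adversarial message cannot propagate further through the graph.

def prune_edges(edges, anomaly_scores, threshold=0.5):
    flagged = {a for a, s in anomaly_scores.items() if s > threshold}
    return [(u, v) for (u, v) in edges if u not in flagged and v not in flagged]

edges = [("A", "B"), ("B", "C"), ("C", "A")]
scores = {"A": 0.1, "B": 0.9, "C": 0.2}  # B looks compromised
print(prune_edges(edges, scores))         # [('C', 'A')]
```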

Chaos engineering frameworks (Owotogbe, 6 May 2025) offer resilience testing by systematically introducing controlled failures (hallucinations, agent unresponsiveness, communication faults), measuring the resulting robustness via quantitative resilience metrics, and integrating utility theory to balance reliability improvements against disruption overhead.
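A minimal chaos-style harness might wrap an agent call so that a controlled fraction of invocations fail, then report a simple resilience metric. The fault type (unresponsiveness), rate, and metric below are illustrative assumptions, not the framework's own design:

```python
import random

def with_faults(agent, fault_rate, rng):
    """Wrap an agent so a fraction of calls fail (return None)."""
    def wrapped(task):
        if rng.random() < fault_rate:
            return None  # injected fault: agent unresponsive
        return agent(task)
    return wrapped

def resilience(agent, tasks):
    """Fraction of tasks that still produce an answer under injected faults."""
    return sum(agent(t) is not None for t in tasks) / len(tasks)

rng = random.Random(0)  # seeded for reproducible fault injection
faulty = with_faults(lambda t: f"done:{t}", fault_rate=0.3, rng=rng)
score = resilience(faulty, [f"task{i}" for i in range(100)])
print(round(score, 2))  # roughly 0.7 for a 30% fault rate
```

A fuller harness would also inject hallucinated outputs and dropped messages, and weigh the reliability gain of each mitigation against its disruption cost, as the utility-theoretic framing suggests.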

5. Optimization, Scalability, and Heterogeneity in LLM-MAS

Systematic MAS optimization frameworks like OMAC (Li et al., 17 May 2025) decompose the design and tuning process into functional and structural dimensions: optimizing prompts, agent construction, team selection, dynamic participation, and communication routing through iterative LLM-based contrastive reasoning. Joint optimization cycles—first across individual functional or structural factors, then in concert—have delivered measurable performance gains (e.g., a 1% absolute Pass@1 improvement in code generation over strong baselines).

Heterogeneous LLM-MAS, as articulated in X-MAS (Ye et al., 22 May 2025), address the limitations of homogeneous model deployment by assigning distinct LLM backbones to different agents according to their roles and local tasks, harnessing their specialized strengths. Empirical evidence (e.g., an 8.4% accuracy improvement on MATH and a 47% boost on AIME in mixed chatbot-reasoner setups) shows that role-wise LLM assignment—benchmarked using systems like X-MAS-Bench—yields substantial performance gains without architectural redesign.

Parallel execution and dynamic task management technologies (e.g., DynTaskMAS (Yu et al., 10 Mar 2025)) have enabled near-linear scalability (up to 3.5X throughput for a 4X agent increase), with advanced task scheduling and adaptive resource allocation strategies, facilitating efficient utilization and generalization across complex, real-time scenarios.
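The task-graph idea behind such engines can be sketched as scheduling tasks into waves of concurrently runnable work, so wall-clock cost tracks the number of waves rather than the number of tasks. The graph below is a made-up example in the spirit of DynTaskMAS, not its actual scheduler:

```python
# Group a dependency graph into waves: every task in a wave has all of
# its prerequisites completed, so the wave can execute in parallel.

def schedule_waves(deps):
    """deps: task -> set of prerequisite tasks. Returns a list of waves."""
    remaining, done, waves = dict(deps), set(), []
    while remaining:
        ready = [t for t, pre in remaining.items() if pre <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        waves.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

deps = {"plan": set(), "codeA": {"plan"}, "codeB": {"plan"}, "test": {"codeA", "codeB"}}
print(schedule_waves(deps))
# [['plan'], ['codeA', 'codeB'], ['test']] -- 3 waves for 4 tasks
```

Adaptive resource allocation would then decide how many agents to devote to each wave, which is where the near-linear throughput scaling reported above comes from.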

6. Application Domains and Empirical Evaluation

LLM-MAS have been operationalized in diverse environments and problem domains (Guo et al., 21 Jan 2024):

  • Software Engineering, where agents assume product manager, programmer, and tester roles along a production pipeline.
  • Robotics and Embodied AI, coordinating multiple robots over simulated or physical platforms.
  • Economics and Social Simulation, modeling crowd dynamics, market interactions, or world-simulation games (e.g., Werewolf, Welfare Diplomacy).
  • Scientific and Engineering Workflows, including digital twin parametrization (Xia et al., 28 May 2024) and chemical engineering process design (Rupprecht et al., 11 Aug 2025), where MAS coordinate simulation, planning, and real-time control.
  • Evaluation and Gaming, via benchmarking platforms such as WiS (“Who is Spy?”) (Hu et al., 4 Dec 2024), which test agent reasoning, deception, and adversarial capabilities, generating role-specific win rates, vote accuracies, and foul rates for model comparison.

Datasets/benchmarks span code (HumanEval, MBPP), reasoning (MMLU, GSM8K), robotics (RoCoBench, HM3D), economic/recommender environments (MovieLens-1M), and custom game/simulation tasks, enabling comprehensive measurement of both individual and emergent MAS competencies.

7. Open Challenges and Future Directions

Open challenges in LLM-MAS research include:

  • Scalability: Communication cost grows quadratically with agent count in fully connected topologies, so large-scale agent systems require more efficient orchestration and communication protocols.
  • Multi-modality: Extending text-centric LLM-MAS to robustly incorporate multimodal data (vision, audio, signals) and engineer inherently multimodal foundation models tailored for specific domains (Rupprecht et al., 11 Aug 2025).
  • Inter-Agent Trust and Security: Designing robust, semantic-aware trust management and monitoring systems to verify and prevent adversarial information propagation remains unsolved (He et al., 2 Jun 2025).
  • Collective Intelligence and Emergence: While current architectures mainly optimize agents individually, methodologies to explicitly design for, harness, and evaluate emergent collective intelligence are underdeveloped (Guo et al., 21 Jan 2024).
  • Transparent/Interpretable Agent Design: Given safety and regulatory requirements, architectures and workflows need to be interpretable and auditable, especially for high-stakes applications in engineering, finance, and healthcare (Rupprecht et al., 11 Aug 2025).
  • Evaluation Benchmarking: There is a pressing need for standardized, system-level benchmarks that measure not only agent accuracy but also emergent properties, robustness, and interactive capabilities (Yan et al., 20 Feb 2025).
  • Environmental and Resource Impact: As large-scale systems become pervasive in industrial settings, quantifying and minimizing their computational and environmental footprints is increasingly important, as discussed in chemical engineering contexts (Rupprecht et al., 11 Aug 2025).

Research trajectories point toward the development of increasingly autonomous, robust, and collaborative LLM-MAS, with frameworks that leverage dynamic, heterogeneous, and multi-modal agent teams; holistic security models; continual adaptation; and domain-specialized foundation models—all underpinned by open-source platforms and comprehensive, multi-granular benchmarking.
