Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies (2502.02533v1)
Abstract: LLMs, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.
Summary
- The paper introduces the Multi-Agent System Search (MASS) framework, a novel three-stage method that optimizes LLM-based multi-agent systems by sequentially refining local agent prompts, system topology based on component influence, and global workflow prompts.
- Experimental results show MASS achieves significant performance gains (10-13% average accuracy increase) across diverse tasks and LLM backbones compared to single-agent methods and prior automated multi-agent design frameworks.
- The study provides key design insights, demonstrating the critical impact of prompt optimization over merely scaling agents and validating the efficiency of influence-guided topology search, leading to practical principles for MAS development.
Designing effective multi-agent systems (MAS) using LLMs involves navigating a complex design space encompassing both agent prompts and the interaction topologies orchestrating their collaboration. The inherent sensitivity of LLMs to prompts and the combinatorial explosion of potential workflow structures make manual design or naive automated search computationally demanding and often suboptimal. The Multi-Agent System Search (MASS) framework (2502.02533) addresses this challenge by proposing a structured, multi-stage optimization process that interleaves prompt and topology optimization, leveraging empirical insights about the relative importance of these design dimensions. MASS aims to automate the discovery of high-performing MAS configurations by efficiently exploring the design space, moving from local component refinement to global system tuning.
MASS Methodology: A Staged Optimization Approach
The MASS framework decomposes the joint optimization problem into three sequential stages, progressively refining the MAS design. This structured approach is motivated by the empirical finding that high-quality prompts for individual agent roles are crucial and that topology optimization benefits significantly from starting with well-performing components.
Stage 1: Block-level (Local) Prompt Optimization
This initial stage focuses on optimizing the prompts for individual "building blocks" or agent types before composing them into complex workflows. The rationale is to establish a baseline of competent individual agents, mitigating the risk of propagating errors or inefficiencies through the system and simplifying the subsequent topology search.
- Initial Predictor Optimization: An initial "predictor" agent $a_0$, representing the basic task-solving unit, is optimized with an automatic prompt optimization (APO) technique, denoted $O$. The paper uses MIPRO, which optimizes both the instruction portion of the prompt and the few-shot demonstrations. This yields an optimized initial predictor $a_0^* = O(a_0)$.
- Minimal Block Configuration Optimization: For each distinct agent type or "block" defined in the search space (e.g., Aggregate, Reflect, Debate, custom agents), a minimal configuration involving that block type is constructed. For instance, a minimal Debate block might consist of two predictor agents and one debater agent. The prompts for the agents within this minimal configuration $a_i$ are then optimized using the APO technique $O$, conditioned on the previously optimized initial predictor $a_0^*$. This yields an optimized block $a_i^* = O(a_i \mid a_0^*)$. Conditioning the optimization helps manage complexity by leveraging the already optimized baseline predictor.
- Incremental Influence Calculation: The performance improvement, or incremental influence $I_{a_i}$, gained by using the optimized block $a_i^*$ over the baseline optimized predictor $a_0^*$ is computed on a validation dataset: $I_{a_i} = \text{Performance}(a_i^*) - \text{Performance}(a_0^*)$. This metric quantifies the value added by each specific block type after local optimization.
This stage ensures that the building blocks entering the topology search are individually effective, significantly reducing the search burden in the next stage.
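A minimal sketch of this stage is given below. It assumes hypothetical helper callables, `optimize_prompts` (standing in for a MIPRO-style APO routine), `build_min_block` (assembling a minimal block configuration around the optimized predictor), and `evaluate` (scoring on a validation split); none of these names come from the paper's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

def stage1_block_optimization(
    predictor,
    block_types: List[str],
    valid_set,
    optimize_prompts: Callable,  # APO routine (e.g. MIPRO-style): (program, data) -> program
    build_min_block: Callable,   # (block_name, optimized_predictor) -> minimal configuration
    evaluate: Callable,          # (program, data) -> validation score
) -> Tuple[object, Dict[str, object], Dict[str, float], float]:
    # Optimize the lone predictor first: a_0* = O(a_0).
    predictor_opt = optimize_prompts(predictor, valid_set)
    base_score = evaluate(predictor_opt, valid_set)

    blocks_opt: Dict[str, object] = {}
    influence: Dict[str, float] = {}
    for block in block_types:
        # Minimal configuration built around the optimized predictor, e.g. two
        # predictors plus one debater for a "debate" block.
        candidate = build_min_block(block, predictor_opt)
        # Optimize the block's new prompts conditioned on a_0*: a_i* = O(a_i | a_0*).
        candidate_opt = optimize_prompts(candidate, valid_set)
        blocks_opt[block] = candidate_opt
        # Incremental influence: I_{a_i} = Performance(a_i*) - Performance(a_0*).
        influence[block] = evaluate(candidate_opt, valid_set) - base_score
    return predictor_opt, blocks_opt, influence, base_score
```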
Stage 2: Workflow Topology Optimization
Leveraging the optimized blocks and their calculated influence scores from Stage 1, this stage searches for the most effective workflow topology. The key idea is to prune the vast space of possible agent arrangements by prioritizing configurations involving block types demonstrated to be influential.
- Influence-based Pruning: The incremental influence scores $I_{a_i}$ guide the topology search. A selection probability $p_{a_i}$ for activating each block type $a_i$ in the topology is computed via a softmax over the influence scores, with a temperature parameter $t$ controlling the sharpness of the selection: $p_{a_i} = \frac{\exp(I_{a_i}/t)}{\sum_j \exp(I_{a_j}/t)}$. This probabilistic pruning focuses the search on topologies with higher demonstrated utility.
- Candidate Workflow Sampling: Candidate workflow configurations $W_c$ are sampled according to the selection probabilities $p_{a_i}$. When composing a candidate workflow, the optimized prompts associated with each selected block type $a_i^*$ (from Stage 1) are reused.
- Structured Composition and Validation: To manage combinatorial complexity, the paper employs a rule-based ordering for composing blocks within a workflow (e.g., [summarize, reflect, debate, aggregate]). Rejection sampling discards invalid workflows and those exceeding predefined constraints (e.g., maximum number of agents, computational budget). Each valid sampled workflow $W_c$ is then evaluated on a validation dataset.
- Best Topology Selection: The sampling and evaluation process iterates for a predefined number of candidates $N$ or until a computational budget is exhausted. The workflow topology $W_c^*$ yielding the highest performance on the validation set is selected.
This stage efficiently navigates the topology search space by using the insights from Stage 1 to intelligently bias the search towards promising architectures composed of high-quality, pre-optimized components.
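The following sketch illustrates the Stage 2 search loop under the same assumptions as the Stage 1 snippet: `influence_softmax` implements the temperature-scaled softmax above, and the influence-derived probabilities are treated as independent activation probabilities for each block, which is a simplification of the paper's sampling scheme. `compose_workflow`, `is_valid`, and `evaluate` are again hypothetical helpers.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple

def influence_softmax(influence: Dict[str, float], temperature: float = 0.2) -> Dict[str, float]:
    # p_{a_i} = exp(I_{a_i}/t) / sum_j exp(I_{a_j}/t)
    exps = {b: math.exp(score / temperature) for b, score in influence.items()}
    total = sum(exps.values())
    return {b: e / total for b, e in exps.items()}

def stage2_topology_search(
    influence: Dict[str, float],
    valid_set,
    compose_workflow: Callable,  # (ordered list of chosen block names) -> workflow
    is_valid: Callable,          # workflow -> bool (agent count, budget constraints)
    evaluate: Callable,          # (workflow, data) -> validation score
    n_candidates: int = 10,
    block_order: Sequence[str] = ("summarize", "reflect", "debate", "aggregate"),
    temperature: float = 0.2,
    max_attempts: int = 1000,
) -> Tuple[object, float]:
    probs = influence_softmax(influence, temperature)
    best_workflow, best_score = None, float("-inf")
    evaluated, attempts = 0, 0
    while evaluated < n_candidates and attempts < max_attempts:
        attempts += 1
        # Activate each block with its influence-derived probability, keeping
        # the fixed rule-based ordering (an assumption of this sketch).
        chosen = [b for b in block_order if random.random() < probs.get(b, 0.0)]
        workflow = compose_workflow(chosen)
        if not is_valid(workflow):
            continue  # rejection sampling: discard invalid or over-budget workflows
        evaluated += 1
        score = evaluate(workflow, valid_set)
        if score > best_score:
            best_workflow, best_score = workflow, score
    return best_workflow, best_score
```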
Stage 3: Workflow-level (Global) Prompt Optimization
The final stage involves fine-tuning the prompts of the entire selected workflow $W_c^*$ as a single, integrated system. This step aims to capture and optimize the interdependencies and collaborative nuances between agents within the specific context of the chosen topology, which might not be fully addressed during local block optimization.
- Holistic Optimization: The complete workflow $W_c^*$, including all its constituent agents and their connections as determined in Stage 2, is treated as the optimization target.
- Global APO Application: The automatic prompt optimization technique $O$ is applied again, this time optimizing the prompts (instructions and/or demonstrations) across the entire workflow simultaneously. This results in the final, globally optimized MAS, $W_{\text{MASS}} = O(W_c^*)$.
This global refinement step allows the prompts to adapt specifically to the interactions mandated by the chosen topology, potentially unlocking further performance gains by enhancing agent collaboration within the final system structure.
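Putting the three stages together, a MASS-style pipeline can be sketched as follows, reusing the hypothetical helpers from the earlier snippets; the final call applies the same APO routine to the whole selected workflow, corresponding to $W_{\text{MASS}} = O(W_c^*)$.

```python
def mass_pipeline(
    predictor,
    block_types,
    valid_set,
    optimize_prompts,   # the APO routine O, reused in Stages 1 and 3
    build_min_block,
    compose_workflow,   # (chosen_blocks, blocks_opt, predictor_opt) -> workflow
    is_valid,
    evaluate,
):
    # Stage 1: block-level (local) prompt optimization and influence estimation.
    predictor_opt, blocks_opt, influence, _ = stage1_block_optimization(
        predictor, block_types, valid_set, optimize_prompts, build_min_block, evaluate
    )
    # Stage 2: influence-guided topology search over the Stage-1-optimized blocks.
    best_workflow, _ = stage2_topology_search(
        influence,
        valid_set,
        compose_workflow=lambda chosen: compose_workflow(chosen, blocks_opt, predictor_opt),
        is_valid=is_valid,
        evaluate=evaluate,
    )
    # Stage 3: workflow-level (global) prompt optimization: W_MASS = O(W_c*).
    return optimize_prompts(best_workflow, valid_set)
```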
Experimental Validation and Key Findings
The MASS framework was evaluated across diverse tasks, including reasoning (GSM8K, MATH), multi-hop long-context understanding (HotpotQA, MuSiQue), and coding (HumanEval, MBPP), using various LLM backbones like Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet.
- Performance Superiority: MASS consistently demonstrated substantial performance improvements over multiple baselines. Compared to single-agent Chain-of-Thought (CoT), MASS achieved average accuracy gains of approximately 10-13% across tasks and models (Table 1). It also significantly outperformed established manually designed MAS like Self-Consistency, Self-Refine, and Multi-Agent Debate (using default prompts), as well as prior automated MAS design frameworks like ADAS and AFlow. For example, on GSM8K with Gemini 1.5 Pro, MASS achieved 94.8% accuracy compared to 84.2% for CoT and 92.1% for the best baseline (ADAS).
- Ablation Study Insights: Ablation studies confirmed the contribution of each stage. Stage 1 (local prompt optimization) provided a significant performance uplift compared to directly optimizing a single agent or using default prompts. Stage 2 (topology optimization) built upon this, finding structures that further enhanced performance. Stage 3 (global prompt optimization) provided additional, consistent improvements, validating the benefit of fine-tuning prompts within the final workflow context (Fig. 5, Table 3). Critically, comparing MASS's topology search (Stage 2) with variants that skipped Stage 1 or the influence-based pruning revealed significantly worse performance, highlighting the efficacy of MASS's structured approach and search space management (Fig. 5, right).
- Cost-Effectiveness & Efficiency: The optimization trajectory of MASS was observed to be more stable and efficient compared to ADAS and AFlow, suggesting its staged approach avoids potentially unstable or inefficient exploration phases (Fig. 6). The framework's emphasis on prompt optimization, particularly in Stage 1, aligns with findings that prompt improvements can yield substantial gains with relatively lower token costs compared to simply scaling the number of agents (Fig. 2, Fig. 9).
- Generalizability: The performance benefits of MASS were consistent across the different task domains and LLM backbones tested, indicating the framework's robustness and general applicability (Table 1, Table 2).
Significance and Contributions
The research presents several notable contributions to the design and optimization of LLM-based multi-agent systems.
- Empirical Design Insights: The initial analysis provides empirical backing for crucial design considerations. It quantifies the significant impact of prompt optimization, often outweighing the benefits of merely increasing agent numbers initially. It also reveals that only a subset of potential topologies typically yields substantial performance gains, justifying the use of targeted search and pruning strategies like those employed in Stage 2.
- Novel Multi-Stage Optimization Framework: MASS introduces a systematic, staged methodology (local prompt optimization → topology optimization → global prompt optimization) that effectively manages the complexity of jointly optimizing prompts and topologies. This structured, iterative refinement approach represents a distinct strategy compared to monolithic search or purely topology-focused methods.
- State-of-the-Art Automated MAS Design: By integrating rigorous prompt optimization at multiple levels with an efficient, influence-guided topology search, MASS demonstrably achieves superior performance compared to existing manual and automated approaches, advancing the capabilities of automatically generated MAS.
- Distilled Design Principles: Based on the framework's success and analysis, the work proposes practical design principles for MAS development: prioritize optimizing individual agent prompts before composition, focus topology search on influential structures, and perform a final global prompt tuning pass to optimize inter-agent dynamics within the chosen workflow. These principles offer actionable guidance for practitioners.
In conclusion, the MASS framework provides a structured and effective approach for automating the design of high-performing LLM-based multi-agent systems. By strategically decomposing the optimization problem and leveraging empirical insights about the design space, it efficiently navigates the complexities of prompt and topology selection, yielding MAS configurations that significantly outperform prior methods across various tasks and models.