X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs (2505.16997v1)

Published 22 May 2025 in cs.AI, cs.CL, and cs.MA

Abstract: LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.

Summary

The paper introduces a comprehensive benchmark (X‑MAS‑Bench) and design methodology (X‑MAS‑Design) to evaluate and build heterogeneous multi-agent systems using diverse LLMs.
X‑MAS‑Bench assesses 27 LLMs across five functions and five domains, revealing that no single model excels universally and that niche models often perform better in specific scenarios.
Experiments demonstrate that replacing homogeneous agents with specialized LLMs can boost performance by up to 8.4% in targeted tasks without altering the overall system architecture.

The paper “X‑MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs” introduces a novel paradigm for multi-agent systems (MAS) that leverages the collective intelligence of diverse LLMs rather than relying on a single, monolithic model. The authors propose two main contributions: a comprehensive benchmark (X‑MAS‑Bench) for evaluating LLMs in MAS contexts and a design methodology (X‑MAS‑Design) that demonstrates how existing MAS frameworks can be adapted into heterogeneous systems with minimal structural changes.

Key Contributions and Findings

X‑MAS‑Bench:
- The authors develop a testbed that assesses 27 different LLMs—spanning both general-purpose chatbots and specialized reasoners—across five critical MAS functions: question‑answering, revise, aggregation, planning, and evaluation.
- These functions are evaluated over five domains (mathematics, coding, science, medicine, and finance) with more than 1.7 million evaluations collected.
- The comprehensive evaluations reveal that no single LLM consistently performs best across all domains and tasks. Some models excel in certain functions or domains while underperforming in others, and even smaller or less resource‑intensive models can outperform larger ones in specific niche scenarios.
X‑MAS‑Design:
- Based on insights from the benchmark, the paper proposes a method to transition conventional homogeneous MAS (where all agents are driven by the same LLM) to heterogeneous systems by assigning specific agents the LLMs best suited for their designated functions.
- The transformation is achieved by replacing the single shared model behind each agent with a top‑performing model that X‑MAS‑Bench identifies for that agent's function and domain. This re‑configuration requires no changes to the system’s underlying interaction logic or prompt templates.
- Empirical experiments demonstrate that heterogeneous MAS configurations yield significant improvements. In a chatbot‑only scenario, the heterogeneous design achieves up to an 8.4% performance gain on the MATH dataset, while a mixed configuration (combining chatbots and reasoners) achieves a 47% performance boost on the AIME dataset.
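The selection step described above can be sketched as a lookup over a benchmark score table. This is a minimal illustration, not the authors' code: the table layout, the `select_models` helper, and the model names and scores are all hypothetical placeholders.

```python
from typing import Dict, Tuple

# Hypothetical benchmark table: (domain, function) -> {model_name: accuracy}.
# Model names and numbers are placeholders, not results from the paper.
Scores = Dict[Tuple[str, str], Dict[str, float]]


def select_models(scores: Scores, domain: str) -> Dict[str, str]:
    """Return the top-scoring model for each function within one domain."""
    selection: Dict[str, str] = {}
    for (d, function), per_model in scores.items():
        if d == domain:
            # Pick the model with the highest score for this function.
            selection[function] = max(per_model, key=per_model.get)
    return selection


example_scores: Scores = {
    ("math", "qa"): {"model-a": 0.71, "model-b": 0.64},
    ("math", "evaluation"): {"model-a": 0.58, "model-b": 0.66},
    ("medicine", "qa"): {"model-a": 0.49, "model-b": 0.55},
}

assignment = select_models(example_scores, "math")
print(assignment)  # {'qa': 'model-a', 'evaluation': 'model-b'}
```

In this toy table no single model wins every function, which is the situation the benchmark reports and the reason a per-function assignment can help.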

Experimental Setup and Methodology

The paper considers both “chatbots” (instruction‑tuned LLMs for conversational tasks) and “reasoners” (models optimized for structured, chain‑of‑thought reasoning).
Each MAS function is defined in a standardized way, with fixed prompts and controlled experimental conditions, so that performance differences stem solely from the LLM under evaluation. For example:
- In the aggregation function, the system feeds a query and several candidate answers (from a fixed set of LLMs) to the model under evaluation, and the aggregated answer is compared with the ground truth.
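A rough sketch of how such an aggregation test could be scored follows. The prompt wording, the exact-match comparison, and the stub model are assumptions made for illustration; the paper's actual prompts and scoring may differ.

```python
from typing import Callable, Dict, List, Sequence


def build_aggregation_prompt(query: str, candidates: Sequence[str]) -> str:
    """Fixed prompt template (hypothetical wording) shared by every model under test."""
    numbered = "\n".join(f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates))
    return f"Question: {query}\n{numbered}\nGive the single best final answer."


def aggregation_accuracy(model: Callable[[str], str], examples: List[Dict]) -> float:
    """Fraction of examples where the aggregated answer exactly matches the ground truth."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for ex in examples:
        prompt = build_aggregation_prompt(ex["query"], ex["candidates"])
        prediction = model(prompt).strip()
        correct += prediction == ex["answer"]  # exact match is an assumption
    return correct / len(examples)


def first_candidate_model(prompt: str) -> str:
    """Stub standing in for an LLM call: always returns the first candidate."""
    prefix = "Candidate 1: "
    for line in prompt.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return ""


examples = [
    {"query": "2+2?", "candidates": ["4", "5"], "answer": "4"},
    {"query": "3*3?", "candidates": ["6", "9"], "answer": "9"},
]
print(aggregation_accuracy(first_candidate_model, examples))  # 0.5
```

Because the candidates and the prompt are held fixed, swapping `first_candidate_model` for a different callable changes only the model under evaluation.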
The framework is applied to several existing MAS methods such as AgentVerse, LLM‑Debate, and DyLAN, as well as a prototype system called X‑MAS‑Proto that integrates all five functions into one pipeline.
Ablation studies reveal that increasing the number of candidate LLM models in the heterogeneous configuration generally leads to improved performance, and using the X‑MAS‑Bench guidance for model selection is critical to achieve the best results.

Practical and Implementation Implications

The proposed approach shows that by carefully selecting and assigning different LLMs based on their domain‑ and function‑specific strengths, one can build more robust and scalable MAS without retraining or redesigning the entire system architecture.
For practitioners, the work provides a roadmap for incremental upgrades to existing MAS frameworks: the single LLM shared by all agents is replaced with a set of specialized models, one per agent function. This process can be executed in minutes and could be automated in future systems.
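One way to picture this upgrade is as a change to agent configuration only. The sketch below assumes a hypothetical `AgentConfig` record and a hypothetical function-to-model mapping; real frameworks such as AgentVerse or DyLAN store their configuration differently.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent record; real MAS frameworks use their own formats."""
    name: str
    function: str          # e.g. "qa", "revise", "aggregation", "planning", "evaluation"
    prompt_template: str
    model: str


def to_heterogeneous(agents: List[AgentConfig],
                     best_model_for: Dict[str, str]) -> List[AgentConfig]:
    """Swap only the model field; prompts and agent order are left untouched."""
    return [
        replace(agent, model=best_model_for.get(agent.function, agent.model))
        for agent in agents
    ]


# Homogeneous starting point: every agent uses the same placeholder model.
agents = [
    AgentConfig("solver", "qa", "Solve: {question}", "shared-model"),
    AgentConfig("critic", "evaluation", "Judge: {answer}", "shared-model"),
]

# Placeholder mapping from function to preferred model (not from the paper).
best_model_for = {"qa": "model-a", "evaluation": "model-b"}

upgraded = to_heterogeneous(agents, best_model_for)
print([a.model for a in upgraded])  # ['model-a', 'model-b']
```

Agents whose function is missing from the mapping keep their original model, so the upgrade can be applied incrementally.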
The extensive evaluations and open‑source release of code and data enable developers and researchers to experiment with heterogeneous MAS configurations and further explore dynamic or automated agent selection strategies.
Overall, this research points toward a promising direction in collaborative AI, where diverse models complement one another’s strengths and mitigate individual limitations, thus enabling more effective solutions on complex real‑world tasks.

Limitations and Future Directions

Despite the extensive evaluation, some LLMs remain untested, and the current process of agent selection is manual; future work could focus on automated or dynamic model selection based on task requirements.
As with any work involving LLMs, there are underlying risks such as issues of hallucination and misuse that persist across applications.
The results also invite further exploration into how the synergy between diverse models can be optimized, potentially through fine‑tuning methods geared toward multi‑agent collaboration.

In summary, “X‑MAS” provides strong empirical evidence and practical guidance for building MAS with heterogeneous LLMs. By leveraging the diverse capabilities of different models, the proposed approach significantly enhances the collective problem‑solving capacity of multi-agent systems.
