X-MAS Paradigm: Heterogeneous AI & Robustness
- X-MAS is defined as an extensible heterogeneous multi-agent system that assigns function-specific optimal LLMs via empirical benchmarking.
- In adversarial robustness, X-MAS ("X minus Moving Averaged Samples") mitigates adversarial perturbations by exploiting local smoothness in natural data.
- Empirical evaluations demonstrate significant accuracy gains over homogeneous systems across diverse benchmark domains.
The X-MAS paradigm encompasses two distinct lines of research converging on principled approaches for enhancing collective intelligence and robustness in artificial systems. In the context of multi-agent systems, X-MAS refers to extensible heterogeneous architectures wherein each agent is powered by a distinct LLM selected for maximal specialization by function and domain (Ye et al., 22 May 2025). Separately in adversarial robustness, X-MAS designates "X minus Moving Averaged Samples," a methodology for mitigating adversarial perturbations in input data via localized smoothing and perturbation estimation (Chun et al., 2019). Both frameworks formalize ensemble-like strategies that harness diversity—of model architecture in the case of LLMs, and of data frequency characteristics in adversarial defense—to achieve superior system performance.
1. Formal Definitions and Conceptual Overview
The multi-agent X-MAS paradigm is defined as an extensible heterogeneous MAS in which agents assigned to canonical functions—question answering (QA), iterative refinement (Revise), aggregation, planning, and evaluation—are driven by the empirically best-performing LLM for their role and domain. Formally, for each agent function $f$ and domain $d$, the driver LLM is $M^*(f, d) = \arg\max_{M \in \mathcal{M}} \mathrm{Acc}(M; f, d)$, chosen by benchmarking accuracy across a pool $\mathcal{M}$ of candidate models.
In adversarial robustness, X-MAS is formalized as follows: for a clean sample $x$ and a moving-average kernel $k$, define the moving-averaged sample $\bar{x} = k * x$ and the residual $r = x - \bar{x}$. This relation exploits local smoothness in data, with $r \approx 0$ over most of a natural image, enabling effective estimation of high-frequency adversarial perturbations.
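The residual construction above can be sketched in NumPy; the kernel width and function names here are illustrative choices, not the paper's implementation:

```python
import numpy as np

def moving_average(x: np.ndarray, k: int = 3) -> np.ndarray:
    """k-tap moving average with edge padding (same-length output)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def estimate_perturbation(x_adv: np.ndarray, k: int = 3) -> np.ndarray:
    """Residual x' - MA(x'): for locally smooth clean data, this
    approximates a high-frequency adversarial perturbation."""
    return x_adv - moving_average(x_adv, k)
```

On a locally smooth signal the residual is near zero, so any remaining high-frequency component can be attributed to the perturbation.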
2. Multi-Agent X-MAS: Principles and Architecture
A traditional LLM-based MAS constrains all agent flows—planning, answering, revising, evaluating, aggregating—to a single underlying LLM. In contrast, X-MAS prescribes:
- Partitioning agent roles (e.g., QA, planning, aggregation) and domains (math, coding, science, medicine, finance).
- Assigning to each (function, domain) pair an optimal LLM identified through empirical benchmarking.
- Retaining all prompts, message formats, and interaction graphs; only the agent driver changes.
- Enabling modular “plug-and-play” substitution: existing MAS can be retrofitted in sub-minute timescales by swapping LLMs per agent.
The architecture is exemplified by "X-MAS-Proto": a planning agent generates plans, QA agents respond according to the plan, an evaluator reviews the responses, and an aggregator synthesizes the final answer. No structural redesign is required to exploit the collective intelligence of heterogeneous LLMs (Ye et al., 22 May 2025).
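The X-MAS-Proto flow can be sketched minimally; the `Agent` structure, the `call` signature, and the model names below are hypothetical placeholders, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str                         # "plan", "qa", "evaluate", "aggregate"
    model: str                        # per-role driver LLM (the only thing X-MAS changes)
    call: Callable[[str, str], str]   # (model, prompt) -> response

def run_proto(agents: dict[str, Agent], question: str) -> str:
    """Plan -> answer -> evaluate -> aggregate, each step possibly
    driven by a different LLM."""
    plan = agents["plan"].call(agents["plan"].model, f"Plan: {question}")
    answers = [agents["qa"].call(agents["qa"].model, f"{plan}\nQ: {question}")]
    review = agents["evaluate"].call(agents["evaluate"].model, "\n".join(answers))
    return agents["aggregate"].call(
        agents["aggregate"].model, f"Answers: {answers}\nReview: {review}"
    )
```

Swapping the `model` field per agent retrofits a homogeneous MAS into a heterogeneous one without touching prompts or the interaction graph.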
3. Benchmarking and Empirical Model Selection
Rigorous evaluation is conducted via X-MAS-Bench, which spans five target domains—mathematics, coding, science, medicine, finance—and five canonical MAS functions. Each function is benchmarked with standardized prompt templates, zero-stochasticity decoding (temperature $0$) for planning, and a fixed sampling temperature elsewhere. Up to 500 random samples per dataset (800 for some) underpin accuracy-based model selection, where, for each function-domain pair $(f, d)$:

$$M^*(f, d) = \arg\max_{M \in \mathcal{M}} \mathrm{Acc}(M; f, d)$$

Top-performing LLMs are recorded, and the single best per function-domain pair is assigned as $M^*(f, d)$. Over 1.7 million LLM evaluations support this empirical selection process, conducted without cross-validated optimization (Ye et al., 22 May 2025).
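The per-(function, domain) selection can be sketched in Python; the `bench` mapping and the model names are invented for illustration:

```python
def select_drivers(
    bench: dict[tuple[str, str, str], float]
) -> dict[tuple[str, str], str]:
    """Pick, for each (function, domain) pair, the candidate model
    with the highest benchmark accuracy."""
    best: dict[tuple[str, str], tuple[float, str]] = {}
    for (func, domain, model), acc in bench.items():
        key = (func, domain)
        if key not in best or acc > best[key][0]:
            best[key] = (acc, model)
    return {key: model for key, (_, model) in best.items()}
```

A real deployment would populate `bench` from the X-MAS-Bench accuracy tables rather than hand-written values.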
Benchmark Domains and Functions (X-MAS-Bench)
| Domain | Example Datasets | Functions |
|---|---|---|
| Mathematics | MATH, AQUA-RAT, GSM-Hard, AIME-2024 | QA, Revise, Aggregate, Plan, Evaluate |
| Coding | HumanEval, MBPP, EvalPlus | QA, Revise, Aggregate, Plan, Evaluate |
| Science | GPQA, SciBench, SciEval | QA, Revise, Aggregate, Plan, Evaluate |
| Medicine | MedMCQA, MedQA, PubMedQA | QA, Revise, Aggregate, Plan, Evaluate |
| Finance | FinanceBench, FinQA, FPB | QA, Revise, Aggregate, Plan, Evaluate |
4. Performance Metrics and Experimental Results
Performance is measured using two key metrics:
- $\mathrm{Acc}_{\mathrm{hom}}$: accuracy of the best homogeneous MAS (all agents share a single LLM)
- $\mathrm{Acc}_{\mathrm{het}}$: accuracy of the optimal heterogeneous MAS (each agent uses its per-function, per-domain best LLM)

Relative improvement is calculated as

$$\Delta = \frac{\mathrm{Acc}_{\mathrm{het}} - \mathrm{Acc}_{\mathrm{hom}}}{\mathrm{Acc}_{\mathrm{hom}}} \times 100\%$$
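As a minimal worked check of the relative-improvement formula (function name chosen here for illustration):

```python
def relative_improvement(acc_hom: float, acc_het: float) -> float:
    """Relative gain (%) of the heterogeneous MAS over the best
    homogeneous MAS."""
    return (acc_het - acc_hom) / acc_hom * 100.0
```

For example, moving from 0.50 to 0.60 accuracy is a 20% relative gain.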
Key findings include:
- Chatbot-only MAS (e.g., Qwen2.5-32B-based agents): heterogeneous per-function assignment raises MATH-500 accuracy markedly over the best homogeneous configuration (Ye et al., 22 May 2025).
- Mixed chatbot–reasoner scenario on AIME-2024: combining chatbot and reasoner drivers heterogeneously outperforms both the best chatbot-only and the best reasoner-only homogeneous setups, with a large relative gain.
- On newly constructed benchmarks (AIME-2025, MATH-MAS), X-MAS again exceeds the homogeneous baseline by a clear margin.
- Even for moderate-sized or specialized LLMs, domain/function-specific selection often outperforms uniform deployment of large models.
5. X-MAS in Adversarial Robustness: Formalism and Mitigation Schemes
In adversarial defense, X-MAS leverages the local-smoothness prior of natural data to estimate and mitigate perturbations:
- For an adversarial example $x' = x + \delta$, the perturbation is estimated as the residual $\hat{\delta} = x' - \bar{x}'$, where $\bar{x}' = k * x'$ is the moving-averaged sample. For typical high-frequency $\delta$, the clean-sample residual $x - \bar{x}$ is small, yielding $\hat{\delta} \approx \delta$ (Chun et al., 2019).
- Mitigation applies this estimate with a tunable coefficient $\lambda > 0$, in one of two signed modes:
  - "Plus" mode: $\tilde{x} = x' + \lambda \hat{\delta}$
  - "Minus" mode: $\tilde{x} = x' - \lambda \hat{\delta}$
Multi-level mitigation iteratively re-applies the scheme, enforcing update boundaries at each step to prevent over-mitigation and keep the mitigated sample within a bounded neighborhood of the input as iterations accumulate. Empirically, on the order of 100 iterations with a small moving-average kernel yields top-tier robustness, raising classification accuracy substantially above the unmitigated baseline; combining the scheme with JPEG-based smoothing improves results further in some configurations.
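A minimal sketch of this multi-level mitigation loop in "minus" mode; the kernel size, coefficient, iteration count, and bounds are assumed values, not the paper's tuned settings:

```python
import numpy as np

def moving_average(x: np.ndarray, k: int = 3) -> np.ndarray:
    """k-tap moving average with edge padding (same-length output)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def mitigate(x_adv: np.ndarray, k: int = 3, coeff: float = 0.1,
             iters: int = 100, bound: float = 0.5) -> np.ndarray:
    """Iteratively subtract a scaled perturbation estimate, clipping
    each update to a bounded neighborhood of the input to prevent
    over-mitigation."""
    x = x_adv.astype(float).copy()
    for _ in range(iters):
        est = x - moving_average(x, k)                # perturbation estimate
        x = x - coeff * est                           # "minus"-mode update
        x = np.clip(x, x_adv - bound, x_adv + bound)  # update boundary
    return np.clip(x, 0.0, 1.0)                       # valid sample range
```

Each iteration damps the high-frequency component while the boundary clip keeps the result anchored to the observed input.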
6. Deployment Guidelines, Limitations, and Future Directions
MAS with Heterogeneous LLMs
Practitioners are advised to:
- Define precise domain-function splits.
- Use X-MAS-Bench results to identify the best-performing LLM per role and domain.
- Substitute agent drivers with corresponding optimal LLMs.
- Monitor accuracy; iterate upon introduction of new models.
The plug-and-play nature enables rapid adaptation, cost-performance tradeoff via mixing sizes/types, and future extension to automated model routing or dynamic agent team formation (Ye et al., 22 May 2025).
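A hypothetical retrofit sketch following the guidelines above; the role names and model identifiers are invented, and a real system would draw the overrides from X-MAS-Bench results:

```python
# An existing homogeneous MAS: every role shares one driver model.
homogeneous = {role: "general-chat-llm"
               for role in ("plan", "qa", "revise", "evaluate", "aggregate")}

# Per-role overrides chosen from benchmarking (illustrative values only).
overrides = {"plan": "reasoner-llm", "qa": "math-specialist-llm"}

# Plug-and-play substitution: prompts and the interaction graph are
# untouched; only the per-agent driver assignment changes.
heterogeneous = {**homogeneous, **overrides}
```

This is the sub-minute retrofit path: a configuration change, not a structural redesign.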
Adversarial X-MAS
Key limitations:
- Robustness depends on the local-smoothness assumption.
- Kernel size and coefficients must be tuned to the data distribution.
- Pixel constraints may require explicit post-processing.
Potential extensions include learning adaptive kernels per sample, integration with denoising priors, application to other modalities/architectures (e.g., ViTs, audio), and combination with adversarial training or randomized smoothing (Chun et al., 2019).
7. Implications and Significance
The X-MAS paradigm, across both multi-agent LLM systems and adversarial robustness, exemplifies the utility of heterogeneous, ensemble-driven strategies. In multi-agent architectures, empirically optimal functional specialization via diverse LLMs yields marked improvements in end-to-end system accuracy with minimal engineering overhead. In adversarial defense, moving-averaged subtraction offers a lightweight, attack-agnostic approach for mitigating even large, high-frequency perturbations by exploiting statistical priors of natural data. This suggests a broader principle: judicious exploitation of heterogeneity—whether in intelligence sources or statistical structure—can deliver substantial amplifications in performance and robustness.