Mixture of Agents (MoA) Architecture

Updated 25 August 2025

MoA is a multi-agent computational architecture that integrates specialized models using dynamic routing and layered aggregation.
It employs mechanisms like top-k selection, token-level switching, and expert role assignment to enhance efficiency and output diversity.
Empirical evaluations report improved performance and reduced computational cost for MoA across applications such as NLP, computer vision, and healthcare.

A Mixture of Agents (MoA) is a collaborative, multi-component computational architecture in which multiple specialized models or agents (typically LLMs, neural network modules, or expert systems) work together on a task through sequential, parallel, or dynamically routed interaction. MoA systems differ from traditional single-model or ensemble methods in that agent interaction is designed explicitly: agents are selected or routed dynamically, their outputs are aggregated hierarchically, and results are refined iteratively or layer by layer. MoA has been applied to language modeling, computer vision, code optimization, healthcare, and remote sensing, with several architectures reporting state-of-the-art results against strong single-model or basic ensemble baselines.

1. Core MoA Architectures and Routing Strategies

MoA covers a broad family of system designs unified by one principle: multiple computational agents (models or modules), each with a distinct or overlapping sub-expertise, interact to produce the final output.

Layered and Feedforward MoA: Many MoA architectures follow a layered structure. Each layer comprises multiple agents that receive the original input and, crucially, the outputs of all agents from the previous layer as auxiliary context. The process typically iterates as follows:

$y_i = \bigoplus_{j=1}^{n} [A_{i,j}(x_i)] + x_1, \quad x_{i+1} = y_i,$

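where $\bigoplus$ denotes aggregation and synthesis of the agent outputs, $+$ denotes concatenation with the original input $x_1$, and $A_{i,j}$ is the $j$-th agent in layer $i$. This design enables iterative answer refinement, as described in (Wang et al., 7 Jun 2024, Ashiga et al., 5 Aug 2025).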
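The layered recursion can be illustrated with a minimal sketch. The names `run_layered_moa`, `make_toy_agent`, and `toy_aggregate` are hypothetical and do not come from the cited papers. In a real system each agent and the aggregator would be LLM calls; here they are deterministic string functions so the control flow is easy to follow.

```python
from typing import Callable, List

Agent = Callable[[str], str]
Aggregator = Callable[[List[str]], str]


def run_layered_moa(prompt: str, layers: List[List[Agent]], aggregate: Aggregator) -> str:
    """Apply y_i = aggregate([A_ij(x_i)]) + x_1 with x_{i+1} = y_i, layer by layer.

    Returns the synthesis produced by the last layer.
    """
    if not layers:
        raise ValueError("at least one layer of agents is required")
    x = prompt  # x_1 is the original input
    synthesis = ""
    for agents in layers:
        # Every agent in the layer sees the same input x_i.
        responses = [agent(x) for agent in agents]
        # The aggregator plays the role of the big-oplus operator.
        synthesis = aggregate(responses)
        # "+ x_1": the original prompt is concatenated back for the next layer.
        x = synthesis + "\n" + prompt
    return synthesis


def make_toy_agent(name: str) -> Agent:
    """Hypothetical stand-in for an LLM agent: reports how much context it saw."""
    def agent(x: str) -> str:
        return f"[{name}] saw {len(x)} chars"
    return agent


def toy_aggregate(responses: List[str]) -> str:
    """Hypothetical stand-in for an aggregate-and-synthesize prompt."""
    return " | ".join(responses)


if __name__ == "__main__":
    layer = [make_toy_agent("a"), make_toy_agent("b")]
    print(run_layered_moa("hi", [layer, layer], toy_aggregate))
```

The second layer receives a longer input than the first because it sees the previous synthesis together with the original prompt, which is what allows later layers to refine earlier answers.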

Mixture-of-Experts Attention (MoA): Within Transformer networks, MoA can be implemented at the attention layer. A pool of $N$ attention experts, each with its own query ($W^q_i$) and output ($W^o_i$) projections but shared key ($W^k$) and value ($W^v$) projections, is combined with a top-$k$ gating network that selects and weights $k$ experts for each token:

$p_{i,t} = \text{Softmax}_i(q_t W_g), \quad G(q_t) = \text{TopK}(\{p_{1,t},...,p_{N,t}\}, k), \quad w_{i,t} = p_{i,t} / \text{Detach}\big(\sum_{j\in G(q_t)} p_{j,t}\big)$

Final output is $y_t = \sum_{i\in G(q_t)} w_{i,t} E_i(q_t, K, V)$ (Zhang et al., 2022).
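As a rough illustration of the gating equations, the sketch below computes the routing weights for a single token and mixes the selected experts' outputs. It is forward-pass only, so `Detach` (which only affects gradients) reduces to an ordinary division. The function names are hypothetical, `gate_logits` stands in for $q_t W_g$, and `expert_outputs[i]` stands in for $E_i(q_t, K, V)$.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the expert dimension."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits: List[float], k: int) -> Dict[int, float]:
    """Return {expert index: weight} for the k most probable experts.

    Weights are the softmax probabilities renormalized over the selected set G(q_t).
    """
    probs = softmax(gate_logits)
    selected = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in selected)
    return {i: probs[i] / norm for i in selected}


def mix_experts(weights: Dict[int, float], expert_outputs: List[List[float]]) -> List[float]:
    """Compute y_t as the weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    y = [0.0] * dim
    for i, w in weights.items():
        for d in range(dim):
            y[d] += w * expert_outputs[i][d]
    return y


if __name__ == "__main__":
    weights = top_k_gate([2.0, 0.0, 1.0, -1.0], k=2)
    outputs = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    print(weights, mix_experts(weights, outputs))
```

Only the experts in the selected set are evaluated in practice; the unselected experts' outputs appear in the demo list purely to keep indexing simple.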

Agent-Based Aggregation: In applications such as code optimization or clinical summarization, multiple specialized agents process inputs in parallel. Their outputs are merged using an aggregator agent or dedicated synthesis LLM, which integrates the best elements from each suggestion (Ashiga et al., 5 Aug 2025, Jang et al., 4 Apr 2025).
Token-level Switching: Advanced MoA variants dynamically select, at each decoding step, the "winning" agent's next-token prediction using a utility function:

$y_t = \underset{z}{\arg\max} \max_{j} J^{\pi_j}_{target}(s_t, z), \quad J^{\pi_j}_{target}(s_t, z) = Q^{\pi_j}_{target}(s_t, z) - \alpha \cdot \text{KL}(\pi_j(\cdot|s_t) \| \pi_{ref}(\cdot|s_t)),$

as in (Chakraborty et al., 27 Mar 2025).
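The selection rule can be sketched for one decoding step as follows. This is a toy under stated assumptions, not the cited authors' implementation: each agent is a hypothetical dictionary holding a next-token distribution (`policy`) and per-token estimates of $Q^{\pi_j}_{target}$ (`q_values`), and the reference policy is assumed to give nonzero probability to every token an agent can emit.

```python
import math
from typing import Dict, List, Tuple


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) over a shared token vocabulary; assumes q[z] > 0 wherever p[z] > 0."""
    return sum(p[z] * math.log(p[z] / q[z]) for z in p if p[z] > 0.0)


def select_token(agents: List[dict], ref_policy: Dict[str, float],
                 alpha: float) -> Tuple[str, int, float]:
    """Pick argmax_z max_j [Q_j(s_t, z) - alpha * KL(pi_j || pi_ref)].

    Returns (token, index of the winning agent, winning score).
    """
    best_token, best_agent, best_score = "", -1, float("-inf")
    for j, agent in enumerate(agents):
        # The KL penalty depends on the agent, not on the candidate token.
        penalty = alpha * kl_divergence(agent["policy"], ref_policy)
        for z, q_value in agent["q_values"].items():
            score = q_value - penalty
            if score > best_score:
                best_token, best_agent, best_score = z, j, score
    return best_token, best_agent, best_score


if __name__ == "__main__":
    ref = {"a": 0.5, "b": 0.5}
    agents = [
        {"policy": {"a": 0.5, "b": 0.5}, "q_values": {"a": 1.0, "b": 0.2}},
        {"policy": {"a": 0.1, "b": 0.9}, "q_values": {"a": 0.0, "b": 1.5}},
    ]
    for alpha in (0.1, 2.0):
        print(alpha, select_token(agents, ref, alpha))
```

Raising $\alpha$ makes agents that drift far from the reference policy less likely to win, so the same step can switch from one agent's token to another's as the penalty grows.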

2. Efficiency, Diversity, and Specialization Mechanisms

MoA architectures often implement mechanisms to balance computational efficiency and diversity.

Sparse Routing and Top-K Selection: Rather than activating every expert or agent, MoA uses sparse top-$k$ routing (per token, per segment, or per agent). For attention-based MoA, this gives a computational cost of

$C_{\mathrm{MoA}} = k T^2 d_h + 2(k+1) T d_h d_m$

where $k$ is the number of selected experts, $T$ the sequence length, $d_h$ the head dimension, and $d_m$ the hidden size (Zhang et al., 2022).
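A short calculation makes the scaling concrete. The function name and the dimensions used below are illustrative assumptions, not values taken from the cited paper.

```python
def moa_attention_cost(k: int, T: int, d_h: int, d_m: int) -> int:
    """Evaluate C_MoA = k*T^2*d_h + 2*(k+1)*T*d_h*d_m for one attention layer."""
    attention_term = k * T ** 2 * d_h
    projection_term = 2 * (k + 1) * T * d_h * d_m
    return attention_term + projection_term


if __name__ == "__main__":
    # Illustrative sizes only: sequence length 512, head dim 64, hidden size 512.
    for k in (1, 2, 4, 8):
        print(k, moa_attention_cost(k, T=512, d_h=64, d_m=512))
```

The cost depends on the number of selected experts $k$ and not on the pool size $N$, which is why the expert pool can grow without a matching increase in per-token computation.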

Diversity Maximization: Diversity among agent outputs is critical for effective aggregation. RMoA (Xie et al., 30 May 2025) implements embedding-based greedy selection, maximizing cosine distance between chosen outputs so that successive layers process maximally informative candidates.
Expert Role Assignment: SMoA (Li et al., 5 Nov 2024) strengthens output diversity and robustness by giving each agent a distinct "role description," prompting divergent perspectives and preventing agent homogenization.
Residual Information Propagation: RMoA introduces residual extraction and aggregation components. Residual difference vectors between layer outputs are explicitly extracted and propagated to mitigate information loss during iterative multi-agent processing.
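Embedding-based greedy selection can be sketched as a farthest-point heuristic over cosine distance. This is one plausible reading of the idea and not RMoA's exact procedure: the seed choice, the max-min criterion, and the function names are all assumptions, and embeddings are assumed to be nonzero vectors.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_diverse_subset(embeddings: List[Sequence[float]], m: int,
                          seed_index: int = 0) -> List[int]:
    """Greedily pick up to m indices, each maximizing its minimum distance to those already chosen."""
    chosen = [seed_index]
    target = min(m, len(embeddings))
    while len(chosen) < target:
        candidates = [i for i in range(len(embeddings)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: min(cosine_distance(embeddings[i], embeddings[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy 2-D "embeddings" of four agent responses; indices 0 and 1 are near-duplicates.
    embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(greedy_diverse_subset(embeddings, m=3))
```

In the toy data the near-duplicate response is the one left out, which is the behavior the diversity criterion is meant to produce before responses are passed to the next layer.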

3. Evaluation Metrics and Empirical Benchmarking

MoA has been evaluated on a broad range of tasks and benchmarks, with the following reported results:

Machine Translation (Zhang et al., 2022):
- MoA base: BLEU 28.4 (WMT14 En-De), surpassing Transformer base (27.3).
- MoA big: Lower MACs cost for equivalent or superior BLEU compared to larger Transformer models.
Instruction Following and Chatbot Evaluation (Wang et al., 7 Jun 2024, Li et al., 2 Feb 2025, Wolf et al., 7 Mar 2025):
- MoA (open-source only): 65.1% LC win rate (AlpacaEval 2.0), vs. GPT-4 Omni 57.5%.
- Self-MoA outperforms Mixed-MoA: +6.6% on AlpacaEval 2.0 by using outputs from a single high-performing model.
Software Development (Sharma, 26 Jul 2024):
- Patched MoA improved the gpt-4o-mini Arena-Hard-Auto score by 15.52%, outperforming gpt-4-turbo at 1/50th of the cost.
Clinical Prediction and Healthcare QA (Gao et al., 7 Aug 2025, Jang et al., 4 Apr 2025):
- MoA/Ensembles of LLMs outperform prior single-agent models (macro-F1: 0.85; micro-F1 > 0.90 for trauma stratification).
Regulated Code Optimization (Ashiga et al., 5 Aug 2025):
- MoA with open-source models: 14.3–22.2% cost savings and 28.6–32.2% faster than an industrial GA-based baseline.

4. Applications Across Domains

MoA methodologies are adaptable across diverse application settings:

NLP and LLM Evaluation: Iterative agent aggregation and synthesis for chat, instruction following, and multi-turn dialogue.
Software Engineering: Automated code refactoring, optimization, vulnerability detection (often in regulated industries) by synthesizing the strengths of domain- or capability-specialized LLM agents (Ashiga et al., 5 Aug 2025, Yarra, 25 Apr 2025).
Healthcare: Integration of multimodal EHR data via specialist and aggregator agents, with superior performance in clinical prediction and summarization tasks (Gao et al., 7 Aug 2025, Jang et al., 4 Apr 2025).
Remote Sensing and PEFT: MoA combined with DFT-based decomposition in Earth-Adapter addresses domain adaptation and artifact separation in satellite imagery (Hu et al., 8 Apr 2025).
Mathematical and Commonsense Reasoning: MoA and its alternatives (e.g., MoO (Chen et al., 26 Feb 2025)) have been evaluated for math and commonsense benchmarks, revealing trade-offs between aggregation quality and diversity.

5. Robustness, Safety, and Limitations

A critical MoA vulnerability is susceptibility to adversarial and deceptive agents. As demonstrated in (Wolf et al., 7 Mar 2025), introducing a single carefully instructed deceptive LLM into a 3-layer MoA (6 agents total) reduced the AlpacaEval 2.0 LC win rate from 49.2% to 37.9%. Defense mechanisms inspired by historical voting protocols (e.g., Dropout Vote, Dropout Cluster) can recover much of the lost performance through majority-based or cluster-based filtering.

There are additional practical and theoretical limitations:

Increased latency and resource demand in multi-layer and distributed MoA systems, motivating research in adaptive control and parallelization (Chen et al., 4 Sep 2024, Mitra et al., 30 Dec 2024).
The trade-off between output diversity and single-model quality: Self-MoA shows that aggregating outputs from a single strong model can exceed mixing multiple (potentially weaker) models (Li et al., 2 Feb 2025).
Bottlenecks in edge or distributed deployments due to queuing stability constraints, with the stability criterion $((k+1)M+1)\lambda < 1/\alpha$ connecting layers, proposer count, and inference time (Mitra et al., 30 Dec 2024).
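The stability criterion can be turned into a simple capacity check. The symbol mapping used here is an assumption inferred from the sentence above and is not confirmed by the source: $k$ is taken as the proposer count, $M$ as the number of layers, $\alpha$ as the average inference time per request, and $\lambda$ as the prompt arrival rate. The function names are hypothetical.

```python
def is_stable(k: int, M: int, lam: float, alpha: float) -> bool:
    """Check the criterion ((k + 1) * M + 1) * lam < 1 / alpha."""
    return ((k + 1) * M + 1) * lam < 1.0 / alpha


def max_stable_arrival_rate(k: int, M: int, alpha: float) -> float:
    """Largest arrival rate (exclusive bound) that still satisfies the criterion."""
    return 1.0 / (alpha * ((k + 1) * M + 1))


if __name__ == "__main__":
    # Illustrative values only: 3 proposers, 2 layers, 0.5 time units per inference.
    k, M, alpha = 3, 2, 0.5
    print("bound on lambda:", max_stable_arrival_rate(k, M, alpha))
    for lam in (0.20, 0.25):
        print(lam, is_stable(k, M, lam, alpha))
```

Under this reading, adding a layer or a proposer increases the load factor $(k+1)M+1$ and therefore lowers the arrival rate the system can sustain, which is the bottleneck the criterion expresses.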

6. Extensions, Variants, and Future Directions

Major MoA variants include:

RMoA (Residual Mixture-of-Agents): Residual learning prevents information loss and enables diversity maximization with dynamic adaptive termination (Xie et al., 30 May 2025).
SMoA (Sparse Mixture-of-Agents): Incorporates response selection (Judge) and early stopping (Moderator) for token and computation efficiency (Li et al., 5 Nov 2024).
MoA: Heterogeneous Mixture of Adapters: Introduces heterogeneous, cooperative adapters for parameter-efficient LLM tuning, surpassing homogeneous MoE-LoRA methods in accuracy and efficiency (Cao et al., 6 Jun 2025).
Self-MoA and Sequential Self-MoA: Focus on maximizing ensemble quality using a single high-performing agent, challenging the utility of diversity from different LLMs except in domain-heterogeneous tasks (Li et al., 2 Feb 2025).
MoMA (Mixture-of-Multimodal-Agents): Extends the MoA principle to specialist agents for multimodal EHR integration and clinical decision support, demonstrating strong performance gains with "translation" agents for non-text modalities (Gao et al., 7 Aug 2025).
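The Judge and Moderator roles described for SMoA can be sketched as a control loop. This is a simplified illustration and not the published algorithm: the selection rule, the stopping rule, and all function names are assumptions, and the role-prompted agents are toy string functions standing in for LLM calls.

```python
from typing import Callable, List, Tuple

RoleAgent = Callable[[str, List[str]], str]


def run_sparse_moa(prompt: str, agents: List[RoleAgent],
                   judge: Callable[[List[str]], List[str]],
                   moderator: Callable[[List[str]], bool],
                   max_rounds: int = 3) -> Tuple[List[str], int]:
    """Iterate rounds; the judge sparsifies responses, the moderator may stop early.

    Returns the selected responses and the number of rounds actually run.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    context: List[str] = []
    rounds_run = 0
    for rounds_run in range(1, max_rounds + 1):
        responses = [agent(prompt, context) for agent in agents]
        # Judge: forward only a subset of responses to the next round.
        context = judge(responses)
        # Moderator: end the process once further rounds look unnecessary.
        if moderator(context):
            break
    return context, rounds_run


def make_role_agent(role: str) -> RoleAgent:
    """Toy agent: answers in its role first, then adopts the top selected response."""
    def agent(prompt: str, context: List[str]) -> str:
        return context[0] if context else f"{role}: {prompt}"
    return agent


def keep_two_longest(responses: List[str]) -> List[str]:
    """Toy judge: keep the two longest responses (ties keep input order)."""
    return sorted(responses, key=len, reverse=True)[:2]


def consensus_reached(selected: List[str]) -> bool:
    """Toy moderator: stop when all selected responses are identical."""
    return len(set(selected)) == 1


if __name__ == "__main__":
    agents = [make_role_agent(r) for r in ("critic", "summarizer", "planner")]
    print(run_sparse_moa("q", agents, keep_two_longest, consensus_reached))
```

Passing only the judged subset forward reduces the tokens each later round must read, and the moderator's early stop avoids spending rounds after the agents have converged.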

Anticipated research trends include dynamic agent selection, advanced routing and gating mechanisms, more scalable and modular collaboration schemas, and context-aware agent specialization, as well as efforts to further reduce latency and cost in distributed or edge deployments. The robustness of MoA systems to adversarial attacks and deceptive agent influence remains a fundamental challenge requiring continued algorithmic and evaluation advances.

7. Theoretical and Mathematical Foundations

MoA formalism often incorporates explicit mathematical models to describe agent interaction, computational complexity, and the balance between routing, diversity, and specialization. Key relationships and equations include:

Feature	Representative Formula	Application
MoA Layer Output	$y_i = \bigoplus_j [A_{i,j}(x_i)] + x_1$	Multi-layer MoA
Top-k Routing (Attention)	$G(q_t) = \text{TopK}(\{p_{1,t},...,p_{N,t}\}, k)$	Attention layer MoA
Weighted Expert Integration	$w_{i,t} = p_{i,t} / \text{Detach}\big(\sum_{j\in G(q_t)} p_{j,t}\big)$	Attention, Adapters
Token-level Utility	$y_t = \underset{z}{\arg\max} \max_{j} J^{\pi_j}_{target}(s_t, z)$	Token-level switching
Queuing Stability	$((k+1)M+1)\lambda < 1/\alpha$	Edge MoA (Mitra et al., 30 Dec 2024)

These structures enable systematic evaluation and optimization across the diversity of MoA implementations.

In summary, Mixture of Agents architectures offer a general, theoretically founded framework for combining the strengths of specialized computational agents to achieve improved accuracy, robustness, interpretability, and scalability across a range of machine learning and artificial intelligence domains. Ongoing research continues to refine MoA methods to address efficiency limitations, establish optimal team composition, and anticipate adversarial scenarios, setting a foundation for future distributed, collaborative, and modular AI systems.