MoMA: Mixture-of-Multimodal Agents
- MoMA is a modular computational architecture integrating LLMs, domain-specific tools, and expert agents to address multi-domain, multimodal inference challenges.
- It employs a dual-stage routing system using a context-aware finite state machine and Pareto-frontier optimization to balance cost and performance.
- MoMA’s plug-and-play design supports specialized aggregation for clinical predictions and task-level gating in open-world applications; its routing variant reports up to 90% cost reduction against the best single-LLM baseline in cost-priority mode.
A Mixture-of-Multimodal-Agents (MoMA) is a computational architecture that orchestrates multiple heterogeneous specialized agents—including but not limited to LLMs, domain-specific tools, and expert models—across diverse input modalities and tasks. Originating as a direct solution to both routing and interference challenges in multi-domain inference and multimodal reasoning, MoMA systems unify agent-level selection, model-level expert selection, and multi-agent coordination within a general, modular inference paradigm. MoMA abstracts across several instantiations: routing and orchestration of LLMs/tools in multi-domain QA and task completion (Guo et al., 9 Sep 2025), specialist agent cascades for clinical prediction from EHR data (Gao et al., 7 Aug 2025), and task-isolated Mixture-of-Experts Transformers in open-world embodied agents (Li et al., 12 Jun 2025).
1. MoMA System Architectures
MoMA systems are typified by an explicit resource pool comprising both LLMs and specialized agents, and a routing+aggregation pipeline whose architecture varies by domain:
- LLM-Agent Routing (Generalized Inference): The MoMA router (Guo et al., 9 Sep 2025) consists of two stages. The first, "agent routing," uses embedded query-category retrieval and a context-aware finite state machine (CA-FSM) to identify and dynamically mask available domain agents (e.g., code generator, travel assistant). The second, "LLM routing," invokes a Mixture-of-Experts head that predicts model-specific utility vectors and applies Pareto-frontier+TOPSIS selection to balance accuracy/cost among LLMs of varying scale.
- Specialist Aggregation (Clinical Prediction): Each non-text modality is projected via a specialist LLM agent to produce structured summaries (e.g., CXR-LLAVA-v2 for chest X-rays, Llama-3 for lab tables) (Gao et al., 7 Aug 2025). These are concatenated with textual clinical notes and compressed with a frozen aggregator LLM; a predictor agent produces final outputs from the aggregator's latent representation.
- Task-Level Gated MoE Transformers (Open-World Agents): In Optimus-3, each MoE layer incorporates K task-specific experts and one shared knowledge expert (Li et al., 12 Jun 2025). A lightweight router classifies the instruction and activates only the corresponding expert plus the shared expert, entirely avoiding cross-task gradient interference.
This modular design generalizes across domains, supporting pluggable agents and scalable resource expansion.
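As a minimal sketch of the task-level gating in the last bullet above, the Python below models one MoE layer driven by a hard task label. The class name `TaskGatedMoELayer`, the toy scaling experts, and the choice to sum the shared and task expert outputs are assumptions made for illustration; they are not details taken from Optimus-3.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy expert that scales every feature (a stand-in for a feed-forward block)."""
    return lambda x: [scale * v for v in x]


class TaskGatedMoELayer:
    """Hypothetical layer: K task experts plus one always-on shared expert."""

    def __init__(self, task_experts: Dict[str, Expert], shared_expert: Expert):
        self.task_experts = task_experts
        self.shared_expert = shared_expert

    def gate(self, task: str) -> Dict[str, int]:
        """Binary gate: 1 for the matching task expert, 0 for every other one."""
        if task not in self.task_experts:
            raise KeyError(f"no expert registered for task {task!r}")
        return {name: int(name == task) for name in self.task_experts}

    def forward(self, x: Vector, task: str) -> Vector:
        out = self.shared_expert(x)  # the shared expert is always active
        for name, g in self.gate(task).items():
            if g:  # gated-off experts are never evaluated (sparse forward pass)
                expert_out = self.task_experts[name](x)
                out = [a + b for a, b in zip(out, expert_out)]  # assumed: summation
        return out


layer = TaskGatedMoELayer(
    task_experts={
        "planning": make_scaling_expert(2.0),
        "captioning": make_scaling_expert(10.0),
    },
    shared_expert=make_scaling_expert(1.0),
)
planning_out = layer.forward([1.0, -1.0], task="planning")
print(planning_out)
```

Because the captioning expert is never called for a planning input, a training step on planning data would leave its parameters untouched, which is the property the bullet attributes to task-level routing.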
2. Mathematical Formulation and Routing Algorithms
MoMA routing and aggregation are formalized as multi-stage, multi-agent decision processes:
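- LLM Routing Optimization: For a query $q$, with LLM set $\mathcal{M} = \{m_1, \dots, m_N\}$ and a cost $c_j$ for each model $m_j$, the router predicts a utility score $s_j(q)$ for each model and normalizes score and cost to $[0, 1]$, giving one point $(\tilde{c}_j, \tilde{s}_j)$ per model. The ideal point $(0, 1)$ and anti-ideal point $(1, 0)$ define closeness via the TOPSIS metric:
$$C_j = \frac{d_j^-}{d_j^+ + d_j^-}, \qquad d_j^+ = \lVert (\tilde{c}_j, \tilde{s}_j) - (0, 1) \rVert_2, \qquad d_j^- = \lVert (\tilde{c}_j, \tilde{s}_j) - (1, 0) \rVert_2.$$
The optimal model is selected as $m^* = \arg\max_{m_j \in \mathcal{M}} C_j$ (Guo et al., 9 Sep 2025).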
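- Task-Level Expert Gating: For instruction-identified task $t \in \{1, \dots, K\}$, the gating vector $g \in \{0, 1\}^{K+1}$ is binary: $g_t = 1$ for the matching task expert, $g_{K+1} = 1$ for the shared expert, and $g_k = 0$ elsewhere; the layer output is
$$y = \sum_{k=1}^{K+1} g_k \, E_k(x) = E_t(x) + E_{K+1}(x),$$
where $E_k$ denotes the $k$-th expert and $x$ the layer input.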
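- Modal Aggregation in Clinical MoMA: The aggregator receives input
$$x_{\text{agg}} = [\, x_{\text{text}};\ s_1;\ \dots;\ s_M \,],$$
the concatenation of the clinical notes $x_{\text{text}}$ with the specialist summaries $s_1, \dots, s_M$, and outputs a representation $h = \mathrm{Agg}(x_{\text{agg}})$; a predictor applies $\hat{y} = f_\theta(h)$, with softmax/sigmoid for classification (Gao et al., 7 Aug 2025).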
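The LLM routing selection in the first bullet can be sketched in a few lines of Python. The sketch assumes min-max normalization of cost and predicted score (the exact normalization is not given here), uses Euclidean distance to the ideal point (0, 1) and anti-ideal point (1, 0), and uses hypothetical model names and numbers throughout.

```python
import math
from typing import Dict, List, Tuple


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros (assumed convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of (cost, score) points not dominated by a cheaper-and-better point."""
    keep = []
    for name, (c, s) in points.items():
        dominated = any(
            c2 <= c and s2 >= s and (c2 < c or s2 > s)
            for other, (c2, s2) in points.items()
            if other != name
        )
        if not dominated:
            keep.append(name)
    return keep


def topsis_select(costs: Dict[str, float], scores: Dict[str, float]) -> str:
    """Pick the Pareto-optimal model with the highest TOPSIS closeness."""
    names = list(costs)
    norm_c = min_max([costs[n] for n in names])
    norm_s = min_max([scores[n] for n in names])
    points = {n: (c, s) for n, c, s in zip(names, norm_c, norm_s)}

    best_name, best_closeness = "", -1.0
    for name in pareto_frontier(points):
        c, s = points[name]
        d_plus = math.hypot(c - 0.0, s - 1.0)   # distance to ideal (0, 1)
        d_minus = math.hypot(c - 1.0, s - 0.0)  # distance to anti-ideal (1, 0)
        closeness = d_minus / (d_plus + d_minus)
        if closeness > best_closeness:
            best_name, best_closeness = name, closeness
    return best_name


# Hypothetical per-query costs and predicted utility scores.
costs = {"small": 1.0, "medium": 5.0, "large": 20.0, "wasteful": 10.0}
scores = {"small": 0.62, "medium": 0.80, "large": 0.85, "wasteful": 0.70}
chosen = topsis_select(costs, scores)
print(chosen)
```

In this toy setting the mid-sized model wins because it captures most of the score gain at a fraction of the largest model's cost, while the dominated `wasteful` entry is dropped before scoring.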
Training is task-conditional and may use categorical cross-entropy, RL surrogates, or fine-tuning of only final predictor modules.
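For the clinical path, the flow from specialist summaries to a fine-tuned predictor can be sketched as below. Every stage is a stub: the `Stage` dataclass, the bracketed prompt layout, and the keyword-matching predictor are hypothetical, and the aggregator's latent representation is replaced by a plain string to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]
    trainable: bool  # False means the stage is used zero-shot / frozen


def build_aggregator_input(notes: str, summaries: Dict[str, str]) -> str:
    """Concatenate clinical notes with per-modality summaries (assumed layout)."""
    parts = [f"[NOTES] {notes}"]
    for modality in sorted(summaries):
        parts.append(f"[{modality.upper()}] {summaries[modality]}")
    return "\n".join(parts)


def predict(notes: str, raw_inputs: Dict[str, str],
            specialists: Dict[str, Stage], aggregator: Stage,
            predictor: Stage) -> str:
    """Specialists summarize, the aggregator compresses, the predictor labels."""
    summaries = {m: specialists[m].run(raw) for m, raw in raw_inputs.items()}
    aggregated = aggregator.run(build_aggregator_input(notes, summaries))
    return predictor.run(aggregated)


# Stub stages standing in for real models.
specialists = {
    "cxr": Stage("cxr_specialist", lambda raw: f"finding: {raw}", trainable=False),
    "labs": Stage("lab_specialist", lambda raw: f"abnormal: {raw}", trainable=False),
}
aggregator = Stage("aggregator", lambda text: text.lower(), trainable=False)
predictor = Stage(
    "predictor",
    lambda text: "positive" if "rib fracture" in text else "negative",
    trainable=True,
)

label = predict(
    "Fall from ladder.",
    {"cxr": "Rib fracture", "labs": "none"},
    specialists, aggregator, predictor,
)
trainable = [s.name for s in [*specialists.values(), aggregator, predictor]
             if s.trainable]
print(label, trainable)
```

Only the predictor is flagged trainable, mirroring the statement that fine-tuning may be restricted to the final predictor module.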
3. Training Data Construction and Optimization
- Scale and Diversity: MoMA for general inference is trained on 7 million instances spanning science, writing, code, and more, automatically labeled for model-pair win/loss/equal outcomes via LLM-based judgment (Guo et al., 9 Sep 2025).
- Knowledge-Enhanced Pipelines: In open-world agents, knowledge graphs are constructed from domain wikis to synthesize planning and action data; expert models (DeepSeek-VL, Grounding DINO, GPT-4) annotate observational trajectories (Li et al., 12 Jun 2025).
- Clinical MoMA: Specialist agents operate zero-shot (no gradient updates); only the final predictor agent is fine-tuned using LoRA on Llama-3, with AdamW optimizer and 8-bit quantization (Gao et al., 7 Aug 2025).
- Data Augmentation: For robust profiling, queries are diversified using BERT-based selection and response generation by multiple LLMs, assigning ordinal comparison labels and optional Elo rankings (Guo et al., 9 Sep 2025).
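The optional Elo rankings mentioned in the last bullet can be derived from win/loss/equal labels with the standard Elo update. The sketch below assumes a K-factor of 32, an initial rating of 1000, and hypothetical model names; none of these values come from the cited work.

```python
from typing import Dict, Iterable, Tuple

# Score awarded to the first model of a pair for each ordinal label.
OUTCOME_SCORE = {"win": 1.0, "equal": 0.5, "loss": 0.0}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(comparisons: Iterable[Tuple[str, str, str]],
                k: float = 32.0, initial: float = 1000.0) -> Dict[str, float]:
    """Sequentially update ratings from (model_a, model_b, outcome_for_a) triples."""
    ratings: Dict[str, float] = {}
    for a, b, outcome in comparisons:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        s_a = OUTCOME_SCORE[outcome]
        e_a = expected_score(r_a, r_b)
        ratings[a] = r_a + k * (s_a - e_a)
        ratings[b] = r_b - k * (s_a - e_a)  # zero-sum: B loses what A gains
    return ratings


# Hypothetical judged pairs: (first model, second model, label for the first).
judged = [
    ("model_a", "model_b", "win"),
    ("model_a", "model_c", "equal"),
    ("model_b", "model_c", "loss"),
]
ratings = elo_ratings(judged)
print(sorted(ratings, key=ratings.get, reverse=True))
```

Sequential Elo depends on the order of comparisons, so a production pipeline would likely shuffle or average over orderings; that refinement is left out here.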
4. Inference-Time Routing and Execution Pipelines
MoMA's inference pipeline is fully modular and dynamically adapts per-query:
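- Generalized Routing: Agent routing runs first (embedded query-category retrieval, then CA-FSM masking of the available domain agents), followed by LLM routing (utility prediction for each model, then Pareto-frontier + TOPSIS selection) (Guo et al., 9 Sep 2025).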
- Multimodal Aggregation: Specialist agents summarize their modalities, followed by aggregation and final prediction. All specialist and aggregation steps are zero-shot; training cost is kept low by restricting fine-tuning to the predictor (Gao et al., 7 Aug 2025).
- Task-Routed MoE: The task classifier activates only the matching expert and the shared expert at each layer, so forward passes are sparse and, during training, updates for one task do not touch the other tasks' experts (Li et al., 12 Jun 2025).
This sequence ensures both cost-efficiency and broad applicability, adaptively leveraging deterministic tools, specialized models, and generalist LLMs as appropriate.
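The generalized routing path can be sketched as a two-stage dispatch. Everything concrete below is an assumption made for illustration: the state names, the keyword lookup that stands in for embedded query-category retrieval, the rule that LLM routing is used when no unmasked agent matches, and the `select_llm` callback that stands in for the MoE head plus TOPSIS selection.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical context-aware FSM: conversation state -> agents allowed in it.
ALLOWED_AGENTS: Dict[str, Set[str]] = {
    "start": {"code_generator", "travel_assistant"},
    "awaiting_payment": {"travel_assistant"},
}

# Stand-in for embedded query-category retrieval: keyword -> domain agent.
CATEGORY_KEYWORDS: Dict[str, str] = {
    "flight": "travel_assistant",
    "python": "code_generator",
}


def route_agent(query: str, state: str) -> Optional[str]:
    """Stage 1: return a matching agent that the current state does not mask."""
    allowed = ALLOWED_AGENTS.get(state, set())
    lowered = query.lower()
    for keyword, agent in CATEGORY_KEYWORDS.items():
        if keyword in lowered and agent in allowed:
            return agent
    return None


def route(query: str, state: str, select_llm: Callable[[str], str]) -> str:
    """Try agent routing first; otherwise defer to LLM routing (stage 2)."""
    agent = route_agent(query, state)
    if agent is not None:
        return f"agent:{agent}"
    return f"llm:{select_llm(query)}"


def cheapest_llm(query: str) -> str:
    """Placeholder for the utility-prediction and selection step."""
    return "small"


print(route("Book a flight to Oslo", "start", cheapest_llm))
print(route("Write python code", "awaiting_payment", cheapest_llm))
```

The second call shows the masking effect: the code generator matches the query but is unavailable in the `awaiting_payment` state, so the request is handed to LLM routing instead.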
5. Empirical Results and Quantitative Comparisons
MoMA systems have demonstrated superior performance-cost tradeoffs and robustness across several domains.
- Generalized Routing MoMA (Guo et al., 9 Sep 2025):
- Achieves up to 90% cost reduction versus best single LLM baseline in cost-priority mode; 31–57% cost reduction in auto-routing and performance-priority settings.
- Pareto-optimal model selection often chooses 8B–13B parameter models for typical queries, reserving 235B models for only the hardest cases.
- Outperforms SFT and contrastive routers by matching or exceeding performance at lower cost.
- Scales to sub-100 ms latency at agent-routing stages with only 10–20 ms router overhead.
- Optimus-3 (Embodied Minecraft MoMA) (Li et al., 12 Jun 2025):
- Outperforms GPT-4o, Qwen2.5-VL, and prior task-MoE architectures across long-horizon, planning, captioning, VQA, grounding, and reflection benchmarks.
- Yields up to a 3.4× gain on the grounding metric and substantially higher scores in planning and embodied QA.
- Ablations confirm that task-level routing eliminates catastrophic interference, preserving prior tasks when new experts are added.
- Clinical MoMA (Gao et al., 7 Aug 2025):
- Macro-F1 for chest trauma prediction: 0.834 (vs. 0.802 best baseline).
- Alcohol screening AUROC: 0.755 (vs. 0.714).
- Remains robust across demographic subgroups.
- Ablations show an F1 drop from 0.834 to 0.778 when a specialist agent is removed, highlighting the necessity of modality-specific summarization.
6. Limitations and Future Extensions
- Generic Limitations:
- Intermediate LLM-generated summaries in the clinical MoMA are not directly validated and pose hallucination risk (Gao et al., 7 Aug 2025).
- Even with sparse gating, MoE Transformers require increased GPU memory and introduce routing overhead (Li et al., 12 Jun 2025).
- No lifelong memory or self-improving meta-controller is present in current open-world MoMA systems; experience is not accumulated across missions (Li et al., 12 Jun 2025).
- Modularity and Adaptability:
- Plug-and-play addition of new agents is supported; more advanced gating, iterative aggregation, or mutual critics among agents (multi-turn conversations) are not yet deployed in production (Gao et al., 7 Aug 2025).
- A plausible implication is that extending MoMA to robotics or other embodied domains would primarily require definition and incorporation of new task experts via the same abstraction (Li et al., 12 Jun 2025).
- Validation and Deployment:
- Real-world deployment in healthcare and open-world environments will demand rigorous external validation and monitoring for input drift and fairness (Gao et al., 7 Aug 2025).
Planned future directions include differentiable memory banks, generative task annotation pipelines, domain-adaptive end-to-end training, and conversational agent aggregation (Li et al., 12 Jun 2025, Gao et al., 7 Aug 2025).
7. Conceptual and Practical Significance
MoMA represents a convergence point for several research threads: Mixture-of-Experts (MoE), tool-augmented LLMs, modular agent orchestration, and task-level gating for catastrophic interference avoidance. Its abstraction layer enables multi-domain, multimodal task execution at scale, at lower cost than the single-model baselines in the cited evaluations.
By leveraging dedicated experts for modality transformation and agent aggregation, MoMA achieves strong cost-performance tradeoffs in both traditional ("text-to-label") and embodied ("perception–action–reflection") settings. The architecture's plug-and-play nature and minimal data-pairing requirements facilitate rapid extension to new domains, contingent on validation and continued methodological refinement (Guo et al., 9 Sep 2025, Gao et al., 7 Aug 2025, Li et al., 12 Jun 2025).