M3LLM: Model Context Protocol-aided Mixture of Vision Experts For Multimodal LLMs in Networks

Published 3 Aug 2025 in cs.NI (arXiv:2508.01805v1)

Abstract: Current Multimodal LLMs (MLLMs) rely on centralized architectures and often suffer from poor alignment between the input task and their fixed visual encoding modules, which limits performance on diverse and dynamic visual tasks. With the increasing deployment of resource-efficient models on edge devices in wireless networks, a new opportunity emerges to dynamically use distributed vision experts for improved MLLM inference quality. To enable this, we propose M3LLM, where the Model Context Protocol (MCP) coordinates a mixture of vision experts to achieve distributed MLLMs. Specifically, MCP is an open protocol that structures the input task context into interpretable representations, enabling wireless network-aware coordination between the central model backbone and edge-hosted vision experts. Based on the MCP representation, M3LLM formulates vision expert routing as a joint optimization problem that balances task-expert semantic compatibility and channel performance. To solve the resulting gradient conflicts, we develop a dual-stream Soft Actor-Critic (SAC) algorithm with decoupled reward signals and introduce an Adaptive Stability Enhancement Module (ASEM) based on hierarchical Bayesian modeling to ensure effective routing. Experiments show that M3LLM improves task accuracy, reduces communication cost, and enhances expert routing adaptability under dynamic wireless network conditions.

Summary

  • The paper presents a novel architecture using MCP to route vision experts for task-specific, wireless-aware multimodal inference.
  • It pioneers a dual-stream DRL framework with CE-SAC and ASEM to optimize both semantic compatibility and network quality.
  • Performance evaluations demonstrate up to 51% accuracy improvements over state-of-the-art MLLMs, validating its adaptive design.

M3LLM: Model Context Protocol-aided Mixture of Vision Experts For Multimodal LLMs in Networks

The paper presents M3LLM, a novel distributed architecture for multimodal LLMs (MLLMs) that incorporates a mixture of vision experts coordinated through wireless networks. The method leverages the Model Context Protocol (MCP) for efficient expert routing and employs a dual-stream deep reinforcement learning scheme to optimize for both task relevance and network conditions.

Design and Architecture

System Model

M3LLM integrates vision experts distributed across edge devices in wireless networks, enabling dynamic, task-specific multimodal inference. The architecture decouples the vision experts from the centralized model backbone, allowing distributed, network-aware optimization. MCP structures contextual information into a unified, wireless-aware representation to facilitate communication among components (Figure 1).

Figure 1: The system model of M3LLM in wireless networks.
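
The abstract frames expert routing as a joint optimization that balances task-expert semantic compatibility against channel performance. One plausible way to write such an objective (an illustrative assumption, not the paper's exact formulation) is

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\pi}\big[\lambda\, r_{\text{task}} + (1-\lambda)\, r_{\text{ch}}\big],$$

where $r_{\text{task}}$ rewards semantic fit between the task and the selected experts, $r_{\text{ch}}$ rewards link quality to the edge devices hosting them, and $\lambda$ trades off the two. Rather than optimizing this scalarized sum directly, M3LLM keeps the two reward streams decoupled during critic learning (see the CE-SAC discussion below) to avoid gradient conflicts.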

Expert Routing

The routing mechanism in M3LLM follows a two-stage hierarchical approach:

  1. Coarse-Grained Expert Selection: The initial stage involves a semantic-based filtering mechanism, assisted by Retrieval-Augmented Generation (RAG), to shortlist vision experts that are semantically relevant to the current task.
  2. Fine-Grained Routing Optimization: The second stage applies a dual-objective DRL framework, Channel-Expert Soft Actor-Critic (CE-SAC), to select the optimal set of vision experts by jointly considering task-expert semantic compatibility and wireless network quality (Figure 2; a toy sketch of the full two-stage pipeline follows the caption below).

    Figure 2: Expert routing scheme of M3LLM. Stage 1 performs coarse-grained expert filtering via MCP-aided RAG. Stage 2 executes network-aware fine-grained expert routing using a decoupled DRL agent, i.e., CE-SAC, which leverages a stability-aware state representation from ASEM.
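
To make the two-stage pipeline concrete, the following Python sketch implements a coarse semantic shortlist followed by a joint semantic-plus-channel selection. All names here are hypothetical, and the fixed convex weight `alpha` merely stands in for the CE-SAC policy that M3LLM actually learns.

```python
# Hypothetical sketch of M3LLM's two-stage expert routing; names and the
# scoring rule are illustrative, not the paper's exact formulation.
from dataclasses import dataclass

import numpy as np


@dataclass
class VisionExpert:
    name: str
    embedding: np.ndarray   # semantic descriptor of the expert's specialty
    channel_quality: float  # e.g., normalized SNR of the hosting edge link


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def route(task_emb: np.ndarray, experts: list[VisionExpert],
          shortlist_k: int = 4, alpha: float = 0.7) -> VisionExpert:
    # Stage 1 (coarse): RAG-style semantic retrieval keeps only the
    # top-k experts whose descriptors best match the task embedding.
    shortlist = sorted(experts,
                       key=lambda e: cosine(task_emb, e.embedding),
                       reverse=True)[:shortlist_k]
    # Stage 2 (fine): jointly score semantic fit and channel quality.
    # In the paper this selection is learned by CE-SAC; a fixed convex
    # combination stands in for the learned policy here.
    return max(shortlist,
               key=lambda e: alpha * cosine(task_emb, e.embedding)
                             + (1 - alpha) * e.channel_quality)
```

In the real system, the Stage 1 retrieval would run against MCP-described expert profiles rather than raw embeddings held in memory.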

Methodology

Context Protocol and Semantic Filtering

MCP enables structured communication and contextual encoding, capturing task semantics, device capabilities, and channel states. By leveraging these rich context encodings, M3LLM dynamically selects vision experts tailored to the current query's requirements.
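
As a rough picture of what such a structured context might look like, the sketch below bundles capability, device, and channel fields into a JSON-serializable record. The schema and field names are assumptions for illustration; the paper does not reproduce the exact MCP message format.

```python
# A plausible shape for the structured context M3LLM exchanges over MCP;
# the field names are illustrative, not the protocol's actual schema.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExpertContext:
    expert_id: str
    capabilities: list[str]                       # tasks the expert handles
    device: dict = field(default_factory=dict)    # compute/memory limits
    channel: dict = field(default_factory=dict)   # link-state measurements


ctx = ExpertContext(
    expert_id="edge-ocr-07",
    capabilities=["ocr", "document-layout"],
    device={"accelerator": "npu", "mem_gb": 8},
    channel={"snr_db": 18.5, "bandwidth_mhz": 20, "loss_rate": 0.01},
)

# MCP messages are JSON-based, so the context serializes directly.
print(json.dumps(asdict(ctx), indent=2))
```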

Dual-Stream DRL Optimization

The CE-SAC algorithm uses two separate critic networks with decoupled reward signals for the task and communication objectives, preventing the gradient conflicts that arise when a single critic must trade off both. The Adaptive Stability Enhancement Module (ASEM), built on hierarchical Bayesian modeling, supplies robust latent state representations that mitigate the adverse effects of wireless channel dynamics.
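
The dual-stream idea can be sketched in a few lines: two critics, each regressing onto its own decoupled reward signal, so the task and channel objectives never share gradients within a single value network. This is an assumed minimal structure; target networks, the entropy temperature, and ASEM's Bayesian state encoder are omitted.

```python
# Minimal sketch of the dual-stream critic idea behind CE-SAC (assumed
# structure; network sizes and training details are illustrative).
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))


state_dim, action_dim = 32, 8                   # illustrative sizes
critic_task = mlp(state_dim + action_dim, 1)    # stream 1: LLM/task reward
critic_chan = mlp(state_dim + action_dim, 1)    # stream 2: channel reward
opt = torch.optim.Adam(list(critic_task.parameters())
                       + list(critic_chan.parameters()), lr=3e-4)


def critic_update(s, a, r_task, r_chan, q_task_next, q_chan_next,
                  gamma=0.99):
    """One decoupled critic step: each critic regresses onto a Bellman
    target built from its own reward stream only (next-state Q-values are
    assumed to come pre-detached from target networks)."""
    sa = torch.cat([s, a], dim=-1)
    loss = (nn.functional.mse_loss(critic_task(sa),
                                   r_task + gamma * q_task_next)
            + nn.functional.mse_loss(critic_chan(sa),
                                     r_chan + gamma * q_chan_next))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The actor would then combine both Q-estimates (plus the usual SAC entropy bonus) when updating the routing policy.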

Experimental Evaluation

The experimental results demonstrate that M3LLM improves multimodal task accuracy by up to 51% compared with state-of-the-art MLLMs such as MoVA. These gains are attributed to M3LLM's ability to adapt its expert routing strategy to real-time wireless network conditions and task demands (Figure 3).

Figure 3: Training dynamics comparison across 1,000 episodes. (a) Total reward evolution showing M3LLM's superior convergence and final performance. (b) LLM reward demonstrating consistent semantic quality improvements. (c) Channel reward highlighting M3LLM's unique ability to optimize network quality while baselines remain static. Shaded areas represent confidence intervals.

Performance Analysis

The integration of MCP and CE-SAC allows M3LLM to achieve a balanced optimization between semantic task performance and channel robustness. The architecture's use of ASEM contributes significantly to managing channel volatility and to ensuring that expert selection is not only semantically effective but also network-aware.

Conclusion

M3LLM represents a significant step toward deploying MLLMs in distributed wireless environments. By integrating MCP with a purpose-built DRL approach, M3LLM effectively coordinates vision experts, maximizing task performance while adapting to dynamic network conditions. Future work could extend the architecture with federated learning and more adaptive compression strategies.
