Awaker2.5-VL: Addressing Multi-Task Conflict in Multimodal LLMs through a Mixture of Experts Architecture
The paper presents Awaker2.5-VL, a Multimodal LLM (MLLM) that employs a Mixture of Experts (MoE) architecture to handle diverse textual and visual tasks. It targets the prevalent "multi-task conflict" problem, in which heterogeneous data representations and distributions cause jointly trained models to degrade across tasks, by strengthening the task-specific capabilities of MLLMs.
Methodology and Model Architecture
Awaker2.5-VL adopts a Mixture of Experts (MoE) architecture composed of multiple sparsely activated experts. A gating network dynamically activates and deactivates these experts, giving the model task-specific capacity, while a global expert remains active on every input so the model retains its general versatility. Each expert is implemented as a low-rank adaptation (LoRA) module, which keeps individual experts lightweight; combined with sparse activation, this holds down both training and inference costs.
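To make the layer design concrete, here is a minimal PyTorch sketch, not the authors' released code: the module names (LoRAExpert, MoELoRALinear), the rank, the number of experts, the top-k selection, and the mean-pooled instance-level gate are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)) with rank r << min(d_in, d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)   # A
        self.up = nn.Linear(rank, d_out, bias=False)    # B
        nn.init.zeros_(self.up.weight)                  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MoELoRALinear(nn.Module):
    """A frozen base projection plus an always-active global LoRA expert
    and a pool of sparsely routed LoRA experts (hypothetical layout)."""
    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, top_k: int = 1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                # backbone stays frozen
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.global_expert = LoRAExpert(d_in, d_out, rank)  # never gated off
        self.experts = nn.ModuleList(
            LoRAExpert(d_in, d_out, rank) for _ in range(num_experts))
        self.gate = nn.Linear(d_in, num_experts)            # routing network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        out = self.base(x) + self.global_expert(x)
        # One routing decision per instance (mean pooling is an assumption);
        # only the top-k experts run, which is what keeps the MoE sparse.
        weights = F.softmax(self.gate(x.mean(dim=1)), dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        routed = []
        for b in range(x.size(0)):
            routed.append(sum(w * self.experts[int(i)](x[b])
                              for w, i in zip(top_w[b], top_i[b])))
        return out + torch.stack(routed)
```

Wrapping, say, each attention or MLP projection of the frozen backbone in such a module leaves only the adapters and the gate as trainable parameters, which is what makes the MoE additions cheap relative to full fine-tuning.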
The training strategy freezes the base model while the MoE and LoRA modules are learned, which significantly reduces training cost. The authors also adopt a routing strategy that simplifies the typical MoE structures found in LLMs: experts are activated at the instance level, with one routing decision per input, rather than at the token level.
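A short sketch of these two training-side choices follows, reusing the hypothetical naming from the layer above; the mean pooling that stands in for the instance representation and the substring-based freeze are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def token_level_gate(gate: nn.Linear, h: torch.Tensor) -> torch.Tensor:
    """Typical LLM MoE routing: a separate decision for every token."""
    return F.softmax(gate(h), dim=-1)              # (batch, seq, num_experts)

def instance_level_gate(gate: nn.Linear, h: torch.Tensor) -> torch.Tensor:
    """Simplified routing: pool the sequence first so the whole input
    instance shares a single routing decision (pooling is an assumption)."""
    return F.softmax(gate(h.mean(dim=1)), dim=-1)  # (batch, num_experts)

def freeze_base(model: nn.Module) -> None:
    """Leave only expert adapters and gates trainable; everything else,
    i.e. the pretrained backbone, is frozen. Matching parameter names by
    substring assumes the naming used in the sketch above."""
    for name, param in model.named_parameters():
        param.requires_grad = ("expert" in name) or ("gate" in name)
```

Because the gate fires once per input rather than once per token, the expert choice stays stable across a whole prompt, matching the intuition that task identity is a property of the instance rather than of individual tokens, and it removes the per-token routing overhead of conventional MoE layers.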
Experimental Evaluation and Results
The experimental results are compelling: Awaker2.5-VL achieves state-of-the-art performance across several recent benchmarks. Specifically, it posts superior results on MME-RealWorld and MMBench, with significant improvements over competing models, including its base model, Qwen2-VL-7B-Instruct. Notably, on Chinese-language benchmarks such as MME-RealWorld-CN, Awaker2.5-VL outperforms other models in both perception and reasoning tasks, underscoring the effectiveness of the proposed MoE strategy for multimodal tasks.
Implications and Future Work
The introduction of Awaker2.5-VL marks an incremental advance in the MLLM domain, addressing multi-task conflict through an MoE architecture. The authors identify the routing process as the main avenue for improvement, suggesting richer prompt representations for better routing decisions. They also envisage integrating the MoE architecture into the Vision Transformer (ViT) component of the model.
These directions may yield broader insights for AI development, particularly into optimizing performance across diverse datasets through efficient parameter use. As research evolves, the proposed methods could serve as a foundation for cost-effective, scalable models that address heterogeneous task requirements in multimodal AI systems.