Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
149 tokens/sec
GPT-4o
9 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency (2506.12724v1)

Published 15 Jun 2025 in cs.CV

Abstract: Multimodal Large Models (MLLMs) have achieved remarkable progress in vision-language understanding and generation tasks. However, existing MLLMs typically rely on static modality fusion strategies, which treat all modalities equally regardless of their instance-level reliability or semantic contribution. This often leads to suboptimal performance, especially in scenarios with noisy, missing, or misaligned modalities. In this paper, we propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level. DMS evaluates each modality based on three key factors: (1) \textit{confidence}, estimated from predictive entropy; (2) \textit{uncertainty}, obtained via Monte Carlo dropout; and (3) \textit{semantic consistency}, computed through inter-modal similarity. These signals are combined through a learnable or rule-based scheduler to generate soft modality weights used in downstream fusion.To ensure stable training, we further introduce a \textit{Modality Weight Consistency Loss}, which regularizes the fused representation to stay close to unimodal embeddings proportionally to their assigned weights. Our method is model-agnostic and can be integrated into existing MLLMs such as BLIP-2 and LLaVA. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance, especially under modality corruption or dropout conditions. This work provides a general and effective mechanism to enable instance-aware and robustness-enhanced multimodal modeling.

Summary

We haven't generated a summary for this paper yet.