
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs (2502.20035v1)

Published 27 Feb 2025 in cs.CV

Abstract: Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal LLM (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md

AsymLoRA introduces a parameter-efficient tuning framework for MLLM instruction fine-tuning, designed to address the challenges of data conflicts and commonalities inherent in diverse image-text datasets.

  • The paper introduces AsymLoRA, an asymmetric LoRA architecture that uses task-specific B matrices to address conflicts and a shared A matrix to capture commonalities, outperforming both vanilla LoRA and LoRA-MoE in MLLM fine-tuning.
  • On single-domain conversation tasks, AsymLoRA achieves a TextVQA score of 55.51% and improved MME scores (Perception: 1327.93, Cognition: 287.14), indicating stronger multimodal reasoning and feature extraction; it also attains the highest accuracy (59.60%) on GQA while minimizing distribution shift (1.50).
  • Across both single-domain and multi-domain settings, experiments show that AsymLoRA integrates textual and visual cues, handles diverse multimodal challenges, and adapts dynamically across domains while preserving effective knowledge transfer, reaching a TextVQA score of 54.25% and a VizWiz average of 38.10% in multi-task settings.
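The asymmetric split described above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch (not the authors' implementation): a single shared low-rank matrix A is trained across all tasks to capture cross-modal commonalities, while each task gets its own B matrix so conflicting objectives update separate pathways. The class name, task labels, and dimensions are assumptions for illustration.

```python
import numpy as np

class AsymLoRALayer:
    """Illustrative sketch of an asymmetric LoRA adapter:
    one shared A projection, one B projection per task."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Shared projection A (rank x d_in): trained on all tasks,
        # consolidating cross-modal commonalities.
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        # Task-specific projections B (d_out x rank): zero-initialized
        # so each adapter starts as a no-op, as in vanilla LoRA.
        self.B = {t: np.zeros((d_out, rank)) for t in tasks}

    def delta(self, x, task):
        # LoRA update for the selected task's pathway:
        # x (batch x d_in) -> shared A -> task-specific B.
        return x @ self.A.T @ self.B[task].T

# Hypothetical usage with two conflicting task domains.
layer = AsymLoRALayer(d_in=8, d_out=8, rank=2, tasks=["vqa", "caption"])
x = np.ones((1, 8))
# With B zero-initialized, every task's initial update is zero.
print(float(layer.delta(x, "vqa").sum()))  # 0.0
```

Because only A is shared, the adapter adds one A plus one small B per task, which is cheaper than the fully separate experts of a LoRA-MoE design while still keeping conflicting objectives apart.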
Authors (3)
  1. Xuyang Wei (1 paper)
  2. Chunlin Tian (16 papers)
  3. Li Li (657 papers)