
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs (2502.20035v1)

Published 27 Feb 2025 in cs.CV

Abstract: Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal LLM (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md

AsymLoRA introduces a parameter-efficient tuning framework for MLLM instruction fine-tuning, designed to address the challenges of data conflicts and commonalities inherent in diverse image-text datasets.

  • The paper introduces AsymLoRA, an asymmetric LoRA architecture that uses task-specific B matrices to address conflicts and a shared A matrix to capture commonalities, outperforming both vanilla LoRA and LoRA-MoE in MLLM fine-tuning.
  • On single-domain conversation tasks, AsymLoRA achieves a TextVQA score of 55.51% and improved MME scores (Perception: 1327.93, Cognition: 287.14), indicating stronger multimodal reasoning and feature extraction; it also attains the highest accuracy (59.60%) on GQA while minimizing distribution shift (1.50).
  • Across both single-domain and multi-domain settings, experiments show that AsymLoRA integrates textual and visual cues, handles diverse multimodal challenges, and adapts dynamically across domains while preserving effective knowledge transfer, reaching a TextVQA score of 54.25% and a VizWiz average of 38.10% in multi-task settings.
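The asymmetric split described above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch (not the authors' implementation): a single shared low-rank matrix A is trained across all tasks to capture cross-modal commonalities, while each task gets its own B matrix so conflicting objectives update separate pathways. The class name, task labels, and dimensions are assumptions for illustration.

```python
import numpy as np

class AsymLoRALayer:
    """Illustrative sketch of an asymmetric LoRA adapter:
    one shared A projection, one B projection per task."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Shared projection A (rank x d_in): trained on all tasks,
        # consolidating cross-modal commonalities.
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        # Task-specific projections B (d_out x rank): zero-initialized
        # so each adapter starts as a no-op, as in vanilla LoRA.
        self.B = {t: np.zeros((d_out, rank)) for t in tasks}

    def delta(self, x, task):
        # LoRA update for the selected task's pathway:
        # x (batch x d_in) -> shared A -> task-specific B.
        return x @ self.A.T @ self.B[task].T

# Hypothetical usage with two conflicting task domains.
layer = AsymLoRALayer(d_in=8, d_out=8, rank=2, tasks=["vqa", "caption"])
x = np.ones((1, 8))
# With B zero-initialized, every task's initial update is zero.
print(float(layer.delta(x, "vqa").sum()))  # 0.0
```

Because only A is shared, the adapter adds one A plus one small B per task, which is cheaper than the fully separate experts of a LoRA-MoE design while still keeping conflicting objectives apart.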
Authors (3)
  1. Xuyang Wei (1 paper)
  2. Chunlin Tian (16 papers)
  3. Li Li (657 papers)