AsymLoRA introduces a parameter-efficient tuning framework for MLLM instruction fine-tuning, designed to handle both the conflicts and the commonalities inherent in diverse image-text datasets.
- The paper introduces AsymLoRA, an asymmetric LoRA architecture that uses task-specific B matrices to address conflicts and a shared A matrix to capture commonalities, outperforming both vanilla LoRA and LoRA-MoE in MLLM fine-tuning.
- On single-domain conversation tasks, AsymLoRA achieves a TextVQA score of 55.51% and MME scores of 1327.93 (Perception) and 287.14 (Cognition), indicating improved multimodal reasoning and feature extraction; it also attains the highest GQA accuracy (59.60%) while minimizing distribution shift (1.50).
- Experiments across single-domain and multi-domain settings show that AsymLoRA integrates textual and visual cues more effectively, handles diverse multimodal challenges, and adapts to different domains while preserving knowledge transfer, reaching a TextVQA score of 54.25% and a VizWiz average of 38.10% in multi-task settings.
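The core mechanism, a shared low-rank A matrix for cross-task commonalities paired with per-task B matrices that isolate conflicting updates, can be sketched as follows. This is a minimal illustration with hypothetical dimensions and a plain linear layer (the class name, sizes, and NumPy setting are assumptions, not the paper's implementation):

```python
import numpy as np

class AsymLoRALinear:
    """Sketch of an asymmetric LoRA layer: one shared A, one B per task."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (stand-in values for illustration).
        self.W = rng.normal(size=(d_out, d_in)) * 0.01
        # Shared down-projection A captures commonalities across all tasks.
        self.A = rng.normal(size=(rank, d_in)) * 0.01
        # Task-specific up-projections B, zero-initialized as in standard
        # LoRA so the adapter contributes nothing before fine-tuning.
        self.B = {t: np.zeros((d_out, rank)) for t in tasks}

    def forward(self, x, task):
        # y = W x + B_task (A x): only B_task differs between tasks,
        # so conflicting updates stay isolated while A is shared.
        return self.W @ x + self.B[task] @ (self.A @ x)
```

Because every B matrix starts at zero, the layer's output initially equals the frozen base output for all tasks; during fine-tuning, gradients for one task's B do not overwrite another task's, while updates to the shared A benefit all tasks.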