Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

Published 24 Feb 2025 in cs.CL | (2502.16894v3)

Abstract: While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for LLMs, its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose Great LoRA Mixture-of-Expert (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.


Authors (7)

Summary

The paper introduces GOAT, a framework that combines adaptive SVD-based priors with a Mixture-of-Experts (MoE) architecture and aligns LoRA's gradient dynamics with those of full fine-tuning.
The method initializes the MoE experts from segments of the pre-trained weights' SVD, so that expert routing selects task-relevant priors, and applies a closed-form scaling factor to align LoRA MoE optimization with that of a fully fine-tuned MoE.
Experiments across 25 datasets demonstrate that this approach significantly closes the performance gap with full fine-tuning while substantially reducing the parameter footprint.

This paper introduces a framework that bolsters traditional LoRA fine-tuning by fusing adaptive SVD-based priors with a Mixture-of-Experts (MoE) architecture to align its gradient dynamics with those of full fine-tuning.

It decomposes the pre-trained weight matrix into segmented SVD components and uses each segment to initialize a distinct MoE expert, so that routing among experts dynamically selects task-relevant priors.
It derives a closed-form scaling strategy that aligns the low-rank equivalent gradients with full tuning gradients, ensuring proper weight alignment across experts.
Extensive experiments across 25 datasets in domains such as image classification, natural language generation, commonsense reasoning, and natural language understanding demonstrate improvements that close the performance gap with full fine-tuning while substantially reducing the parameter footprint.
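The SVD-segmented expert initialization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `svd_segment_experts` and `moe_lora_forward`, the choice of contiguous segments starting from the largest singular values, the softmax top-k router, and the `scale` argument are all assumptions made for illustration. In particular, `scale` is only a placeholder for the closed-form scaling factor the paper derives, whose value is not given in this summary.

```python
import numpy as np


def svd_segment_experts(W, num_experts, rank):
    """Build low-rank expert pairs (B, A) from segments of the SVD of W.

    Assumption: segments are contiguous and start at the largest singular
    values; the paper may select segments differently.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    if num_experts * rank > S.shape[0]:
        raise ValueError("num_experts * rank exceeds the number of singular values")
    experts = []
    for e in range(num_experts):
        seg = slice(e * rank, (e + 1) * rank)
        sqrt_s = np.sqrt(S[seg])
        B = U[:, seg] * sqrt_s           # shape (out_dim, rank)
        A = sqrt_s[:, None] * Vt[seg]    # shape (rank, in_dim)
        experts.append((B, A))           # B @ A equals this segment of W
    return experts


def moe_lora_forward(x, W, experts, router, top_k=1, scale=1.0):
    """Frozen weight output plus a scaled, routed sum of low-rank experts.

    `router` is a hypothetical (num_experts, in_dim) gating matrix and
    `scale` is a placeholder for the paper's derived scaling factor.
    """
    logits = router @ x                          # one score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                     # softmax over selected experts
    y = W @ x
    for g, i in zip(gate, top):
        B, A = experts[i]
        y = y + scale * g * (B @ (A @ x))
    return y
```

When `num_experts * rank` equals the number of singular values, the expert products `B @ A` sum back to `W`, which makes the sketch easy to sanity-check. A real adapter would also need to decide how the frozen weight and the initialized experts are combined so that the model's output is unchanged at the start of training; that detail is omitted here.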