
DoRA: Weight-Decomposed Low-Rank Adaptation (2402.09353v6)

Published 14 Feb 2024 in cs.CL and cs.CV

Abstract: Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.

This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method designed to bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits.

The core idea of DoRA stems from an analysis of how FT and LoRA update model weights. The authors propose decomposing a pre-trained weight matrix $W_0$ into a magnitude component $m$ and a directional component $V$, such that $W_0 = m \frac{V}{\|V\|_c}$, where $\|V\|_c$ denotes the vector-wise norm taken across each column. By analyzing the changes in magnitude ($\Delta M$) and direction ($\Delta D$) during fine-tuning, they observe that FT exhibits a distinct learning pattern compared to LoRA, characterized by a different correlation between magnitude and directional updates (negative for FT, positive for LoRA in their analysis). This difference is hypothesized to contribute to LoRA's lower learning capacity relative to FT.
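
For intuition, the decomposition itself takes only a few lines of PyTorch. This is a minimal sketch with illustrative shapes: split a weight into column-wise magnitudes and a unit-norm direction, then check that their product reconstructs the original weight.

```python
import torch

# Illustrative pre-trained weight W0 of shape (d, k).
d, k = 8, 4
W0 = torch.randn(d, k)

# Magnitude: L2 norm of each column -> shape (1, k).
m = W0.norm(p=2, dim=0, keepdim=True)

# Direction: V = W0; normalizing column-wise gives unit-norm columns.
V_normalized = W0 / W0.norm(p=2, dim=0, keepdim=True)

# Reconstruction: W0 = m * V / ||V||_c (m broadcasts across rows).
assert torch.allclose(m * V_normalized, W0, atol=1e-6)
```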

DoRA leverages this insight by explicitly decomposing the pre-trained weight $W_0$ into its magnitude $m = \|W_0\|_c$ and direction $V = W_0$ at initialization. Instead of directly adding a low-rank update $\Delta W$ to $W_0$ as in standard LoRA, DoRA fine-tunes the magnitude component $m$ directly and applies the low-rank update $\Delta V = BA$ to the directional component $V$. The fine-tuned weight $W'$ is then constructed as:

$$W' = \underline{m} \, \frac{V + \underline{BA}}{\|V + \underline{BA}\|_c}$$

where $m$ and the low-rank matrices $A$ and $B$ (underlined above) are the trainable parameters, and $V$ (initialized as $W_0$) is kept frozen. This allows DoRA to manage magnitude and directional updates separately, aiming to replicate FT's learning behavior more closely.
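
This reparameterization can be written as a drop-in linear module. The following is a minimal PyTorch sketch of the published formulation, not the official NVlabs implementation; class and variable names are our own:

```python
import math
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal DoRA layer: W' = m * (V + B @ A) / ||V + B @ A||_c, with V = W0 frozen."""

    def __init__(self, W0: torch.Tensor, rank: int = 16):
        super().__init__()
        d, k = W0.shape
        # Frozen directional base V, initialized to the pre-trained weight W0.
        self.register_buffer("V", W0.clone())
        # Trainable magnitude vector, initialized to the column norms of W0.
        self.m = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))  # shape (1, k)
        # LoRA factors: B starts at zero so that W' == W0 at initialization.
        self.A = nn.Parameter(torch.empty(rank, k))
        self.B = nn.Parameter(torch.zeros(d, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V_prime = self.V + self.B @ self.A                     # updated direction, (d, k)
        norm = V_prime.norm(p=2, dim=0, keepdim=True)          # column-wise norms, (1, k)
        W_prime = self.m * (V_prime / norm)                    # fine-tuned weight, (d, k)
        return x @ W_prime                                     # y = W'^T x for x of shape (..., d)
```

At initialization $B = 0$, so $W' = W_0$ and the layer behaves exactly like the frozen pre-trained projection; only $m$, $A$, and $B$ receive gradients.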

Practical Implementation:

  1. Weight Decomposition: For each weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in the pre-trained model where PEFT is applied (typically attention and sometimes feed-forward layers in Transformers), calculate the initial magnitude vector $m \in \mathbb{R}^{1 \times k}$, where each element $m_i = \|W_0[:, i]\|_2$ is the L2 norm of the $i$-th column. The initial directional matrix $V$ is simply $W_0$. $m$ is a trainable parameter, while $V$ remains fixed.
  2. LoRA Application: Standard LoRA matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ are introduced. They are initialized as in standard LoRA ($A$ with Kaiming uniform, $B$ with zeros) so that $\Delta V = BA = 0$ initially, making $W'$ equal to $W_0$. Both matrices are trainable.
  3. Forward Pass: To compute the output of a layer with weight $W'$ and input $x$:
    • Compute the updated directional component: $V' = W_0 + BA$.
    • Compute the column-wise norms $\|V'\|_c$.
    • Compute the fine-tuned weight: $W' = m \odot \frac{V'}{\|V'\|_c}$, where $\odot$ denotes element-wise multiplication after broadcasting $m$.
    • Compute the output: $y = W'^\top x$.
    • Note that the magnitude $m$ is a $1 \times k$ vector while $V'/\|V'\|_c$ is $d \times k$; the magnitude scales each column of the normalized directional matrix.
  4. Backward Pass (Gradient Calculation): Standard gradient calculations apply to $m$, $A$, and $B$. To reduce memory overhead during backpropagation, the authors propose treating $\|V + \Delta V\|_c$ as a constant, i.e., detaching it from the gradient graph: the norm is computed from the current values of $A$ and $B$, but no gradient is backpropagated through it. The gradient w.r.t. $V'$ then simplifies to $\nabla_{V'} \mathcal{L} \approx \frac{m}{C} \nabla_{W'} \mathcal{L}$, where $C = \|V'\|_c$ is treated as a constant. This modification reduces memory usage with minimal impact on accuracy, as shown in ablation studies (e.g., a 24.4% memory reduction for LLaMA tuning with a 0.2-point accuracy drop); see the sketch after this list.
  5. Inference: Like LoRA, DoRA allows merging the trained parameters ($m$, $A$, $B$) into the original weight matrix $W_0$ before deployment. The merged weight $W'_{\text{merged}} = m \odot \frac{W_0 + BA}{\|W_0 + BA\|_c}$ can be pre-computed, yielding a standard $d \times k$ matrix. DoRA therefore introduces no additional inference latency compared to the original pre-trained model or merged LoRA.
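
A hedged PyTorch sketch of the memory-saving trick from step 4 (our own function and variable names, not the official implementation): the column norm is evaluated with the current $A$ and $B$ but detached, so no gradient flows through it during backpropagation.

```python
import torch

def dora_forward_detached_norm(x, V, m, A, B):
    """Forward pass with ||V + BA||_c detached, as proposed for memory efficiency.

    x: (..., d) input, V: (d, k) frozen direction, m: (1, k) trainable magnitude,
    A: (r, k) and B: (d, r) trainable LoRA factors.
    """
    V_prime = V + B @ A
    # Detach the norm: it is evaluated with the current A, B but treated as a
    # constant C = ||V'||_c during backpropagation.
    C = V_prime.norm(p=2, dim=0, keepdim=True).detach()
    W_prime = m * (V_prime / C)
    return x @ W_prime  # equivalent to y = W'^T x
```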

Real-world Applications and Performance:

The paper demonstrates DoRA's effectiveness across various tasks and model architectures:

  • Commonsense Reasoning (LLaMA, LLaMA2, LLaMA3): DoRA consistently outperforms LoRA across multiple benchmarks. On LLaMA-7B, DoRA improves average accuracy by 3.7% over LoRA. Even with half the trainable parameters ($\text{DoRA}^{\dagger}$), DoRA still surpasses LoRA, indicating improved learning capacity with comparable or fewer parameters.
  • Image/Video-Text Understanding (VL-BART): DoRA shows better performance than LoRA (nearly 1% average improvement on image-text, 2% on video-text), reaching performance levels comparable to full fine-tuning but with significantly fewer trainable parameters.
  • Visual Instruction Tuning (LLaVA-1.5-7B): DoRA achieves higher average accuracy than both LoRA and FT.
  • Instruction Tuning (LLaMA, LLaMA2): DoRA is compatible with other LoRA variants like VeRA, creating DoVA. DoVA significantly improves over VeRA (which uses very few parameters) and achieves accuracy comparable to or better than standard LoRA with substantially fewer parameters.
  • Robustness: DoRA consistently outperforms LoRA across different numbers of training samples and varying LoRA ranks, demonstrating its stability and improved performance, particularly at lower ranks where LoRA's performance degrades significantly.
  • Tuning Granularity: By selectively applying directional updates (LoRA part) only to certain modules (e.g., QKV layers in attention) while updating magnitude for more modules, DoRA can achieve better accuracy than LoRA with a smaller total number of trainable parameters.

Broader Impact (QDoRA):

DoRA's concept has been extended to quantized models. QDoRA combines 4-bit quantization (from QLoRA) with DoRA. Early results on fine-tuning LLaMA2-7B on Orca-Math (Mitra et al., 16 Feb 2024) show that QDoRA not only outperforms QLoRA significantly but can even surpass the accuracy of the non-quantized full fine-tuning baseline, all while requiring substantially less GPU memory. This suggests QDoRA is a promising approach for efficiently fine-tuning large models on consumer hardware.

Implementation Considerations:

  • Implementing DoRA requires modifying the forward and backward passes for the layers where PEFT is applied to incorporate the magnitude and directional decomposition and updates.
  • The gradient modification (treating $\|V + \Delta V\|_c$ as a constant) is crucial for practical memory efficiency during training.
  • The trainable parameters are the magnitude vector $m$ (size $1 \times k$) and the LoRA matrices $A$ ($r \times k$) and $B$ ($d \times r$), for a total of $k + r \times k + d \times r$ trainable parameters. Compared to LoRA ($r \times k + d \times r$), DoRA adds only $k$ parameters for the magnitude vector per modified weight matrix (see the worked example after this list).
  • For deployment, the trained $m$, $A$, and $B$ are merged into the original $W_0$ to form the final weight matrix $W'$, which can then be used like any standard weight matrix without special DoRA computation, ensuring zero inference overhead.
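
As a worked illustration of the last two points (shapes and names are our own; the merge follows the $W'_{\text{merged}}$ formula from the implementation section above):

```python
import torch

# Parameter count per adapted weight matrix (e.g., a 4096 x 4096 projection, rank 16).
d, k, r = 4096, 4096, 16
lora_params = r * k + d * r      # 131,072
dora_params = k + lora_params    # 135,168 -> only k = 4,096 extra magnitude parameters

def merge_dora(W0, m, A, B):
    """Fold trained (m, A, B) into one plain (d, k) weight for zero-overhead inference."""
    V_prime = W0 + B @ A
    return m * (V_prime / V_prime.norm(p=2, dim=0, keepdim=True))

# Sanity check on a small matrix: a zero LoRA update reproduces W0 exactly.
W0 = torch.randn(64, 32)
m = W0.norm(p=2, dim=0, keepdim=True)
W_merged = merge_dora(W0, m, torch.zeros(4, 32), torch.zeros(64, 4))
assert torch.allclose(W_merged, W0, atol=1e-5)
```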

In summary, DoRA is a practical and effective PEFT method that improves upon LoRA by explicitly handling magnitude and directional updates based on insights from FT. It achieves better performance with similar or fewer trainable parameters than LoRA, is compatible with LoRA variants, is robust to varying ranks and data sizes, and can be merged for inference, incurring no additional latency. Its extension to quantized fine-tuning (QDoRA) further highlights its practical significance for training large models on resource-constrained hardware.

Authors (7)
  1. Chien-Yi Wang (29 papers)
  2. Hongxu Yin (49 papers)
  3. Pavlo Molchanov (70 papers)
  4. Yu-Chiang Frank Wang (88 papers)
  5. Kwang-Ting Cheng (96 papers)
  6. Min-Hung Chen (41 papers)
  7. Shih-yang Liu (10 papers)
Citations (209)