
DoRA: Weight-Decomposed Low-Rank Adaptation (2402.09353v6)

Published 14 Feb 2024 in cs.CL and cs.CV

Abstract: Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.

This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method designed to bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits.

The core idea of DoRA stems from an analysis of how FT and LoRA update model weights. The authors propose decomposing a pre-trained weight matrix $W_0$ into a magnitude component $m$ and a directional component $V$, such that $W_0 = m \frac{V}{\|V\|_c}$, where $\|V\|_c$ denotes the vector-wise norm taken across each column. By analyzing the changes in magnitude ($\Delta M$) and direction ($\Delta D$) during fine-tuning, they observe that FT exhibits a distinct learning pattern compared to LoRA, characterized by a different correlation between magnitude and directional updates (negative for FT, positive for LoRA in their analysis). This difference is hypothesized to contribute to LoRA's lower learning capacity relative to FT.
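
For intuition, the decomposition itself takes only a few lines of PyTorch. This is a minimal sketch with illustrative shapes: split a weight into column-wise magnitudes and a unit-norm direction, then check that their product reconstructs the original weight.

```python
import torch

# Illustrative pre-trained weight W0 of shape (d, k).
d, k = 8, 4
W0 = torch.randn(d, k)

# Magnitude: L2 norm of each column -> shape (1, k).
m = W0.norm(p=2, dim=0, keepdim=True)

# Direction: V = W0; normalizing column-wise gives unit-norm columns.
V_normalized = W0 / W0.norm(p=2, dim=0, keepdim=True)

# Reconstruction: W0 = m * V / ||V||_c (m broadcasts across rows).
assert torch.allclose(m * V_normalized, W0, atol=1e-6)
```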

DoRA leverages this insight by explicitly decomposing the pre-trained weight $W_0$ into its magnitude $m = \|W_0\|_c$ and direction $V = W_0$ at initialization. Instead of directly adding a low-rank update $\Delta W$ to $W_0$ as in standard LoRA, DoRA fine-tunes the magnitude component $m$ directly and applies the low-rank update $\Delta V = BA$ to the directional component $V$. The fine-tuned weight $W'$ is then constructed as:

$$W' = \underline{m} \, \frac{V + \underline{BA}}{\|V + \underline{BA}\|_c}$$

where $m$ and the low-rank matrices $A$ and $B$ (underlined above) are the trainable parameters, and $V$ (initialized as $W_0$) is kept frozen. This allows DoRA to manage magnitude and directional updates separately, aiming to replicate FT's learning behavior more closely.
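
This reparameterization can be written as a drop-in linear module. The following is a minimal PyTorch sketch of the published formulation, not the official NVlabs implementation; class and variable names are our own:

```python
import math
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Minimal DoRA layer: W' = m * (V + B @ A) / ||V + B @ A||_c, with V = W0 frozen."""

    def __init__(self, W0: torch.Tensor, rank: int = 16):
        super().__init__()
        d, k = W0.shape
        # Frozen directional base V, initialized to the pre-trained weight W0.
        self.register_buffer("V", W0.clone())
        # Trainable magnitude vector, initialized to the column norms of W0.
        self.m = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))  # shape (1, k)
        # LoRA factors: B starts at zero so that W' == W0 at initialization.
        self.A = nn.Parameter(torch.empty(rank, k))
        self.B = nn.Parameter(torch.zeros(d, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V_prime = self.V + self.B @ self.A                     # updated direction, (d, k)
        norm = V_prime.norm(p=2, dim=0, keepdim=True)          # column-wise norms, (1, k)
        W_prime = self.m * (V_prime / norm)                    # fine-tuned weight, (d, k)
        return x @ W_prime                                     # y = W'^T x for x of shape (..., d)
```

At initialization $B = 0$, so $W' = W_0$ and the layer behaves exactly like the frozen pre-trained projection; only $m$, $A$, and $B$ receive gradients.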

Practical Implementation:

  1. Weight Decomposition: For each weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in the pre-trained model where PEFT is applied (typically attention and sometimes feed-forward layers in Transformers), calculate the initial magnitude vector $m \in \mathbb{R}^{1 \times k}$, where each element $m_i = \|W_0[:, i]\|_2$ is the L2 norm of the $i$-th column. The initial directional matrix $V$ is simply $W_0$. $m$ is a trainable parameter, while $V$ remains fixed.
  2. LoRA Application: Standard LoRA matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ are introduced. They are initialized as in standard LoRA ($A$ with Kaiming uniform, $B$ with zeros) so that $\Delta V = BA = 0$ initially, making $W'$ equal to $W_0$. Both matrices are trainable.
  3. Forward Pass: To compute the output of a layer with weight $W'$ and input $x$:
    • Compute the updated directional component: $V' = W_0 + BA$.
    • Compute the column-wise norms $\|V'\|_c$.
    • Compute the fine-tuned weight: $W' = m \odot \frac{V'}{\|V'\|_c}$, where $\odot$ denotes element-wise multiplication after broadcasting $m$.
    • Compute the output: $y = W'^\top x$.
    • Note that the magnitude $m$ is a $1 \times k$ vector while $V'/\|V'\|_c$ is $d \times k$; the magnitude scales each column of the normalized directional matrix.
  4. Backward Pass (Gradient Calculation): Standard gradient calculations apply to $m$, $A$, and $B$. To reduce memory overhead during backpropagation, the authors propose treating $\|V + \Delta V\|_c$ as a constant, i.e., detaching it from the gradient graph: the norm is computed from the current values of $A$ and $B$, but no gradient is backpropagated through it. The gradient w.r.t. $V'$ then simplifies to $\nabla_{V'} \mathcal{L} \approx \frac{m}{C} \nabla_{W'} \mathcal{L}$, where $C = \|V'\|_c$ is treated as a constant. This modification reduces memory usage with minimal impact on accuracy, as shown in ablation studies (e.g., a 24.4% memory reduction for LLaMA tuning with a 0.2-point accuracy drop); see the sketch after this list.
  5. Inference: Like LoRA, DoRA allows merging the trained parameters ($m$, $A$, $B$) into the original weight matrix $W_0$ before deployment. The merged weight $W'_{\text{merged}} = m \odot \frac{W_0 + BA}{\|W_0 + BA\|_c}$ can be pre-computed, yielding a standard $d \times k$ matrix. DoRA therefore introduces no additional inference latency compared to the original pre-trained model or merged LoRA.
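
A hedged PyTorch sketch of the memory-saving trick from step 4 (our own function and variable names, not the official implementation): the column norm is evaluated with the current $A$ and $B$ but detached, so no gradient flows through it during backpropagation.

```python
import torch

def dora_forward_detached_norm(x, V, m, A, B):
    """Forward pass with ||V + BA||_c detached, as proposed for memory efficiency.

    x: (..., d) input, V: (d, k) frozen direction, m: (1, k) trainable magnitude,
    A: (r, k) and B: (d, r) trainable LoRA factors.
    """
    V_prime = V + B @ A
    # Detach the norm: it is evaluated with the current A, B but treated as a
    # constant C = ||V'||_c during backpropagation.
    C = V_prime.norm(p=2, dim=0, keepdim=True).detach()
    W_prime = m * (V_prime / C)
    return x @ W_prime  # equivalent to y = W'^T x
```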

Real-world Applications and Performance:

The paper demonstrates DoRA's effectiveness across various tasks and model architectures:

  • Commonsense Reasoning (LLaMA, LLaMA2, LLaMA3): DoRA consistently outperforms LoRA across multiple benchmarks. On LLaMA-7B, DoRA improves average accuracy by 3.7% over LoRA. Even with half the trainable parameters ($\text{DoRA}^{\dagger}$), DoRA still surpasses LoRA, indicating improved learning capacity with comparable or fewer parameters.
  • Image/Video-Text Understanding (VL-BART): DoRA shows better performance than LoRA (nearly 1% average improvement on image-text, 2% on video-text), reaching performance levels comparable to full fine-tuning but with significantly fewer trainable parameters.
  • Visual Instruction Tuning (LLaVA-1.5-7B): DoRA achieves higher average accuracy than both LoRA and FT.
  • Instruction Tuning (LLaMA, LLaMA2): DoRA is compatible with other LoRA variants like VeRA, creating DoVA. DoVA significantly improves over VeRA (which uses very few parameters) and achieves accuracy comparable to or better than standard LoRA with substantially fewer parameters.
  • Robustness: DoRA consistently outperforms LoRA across different numbers of training samples and varying LoRA ranks, demonstrating its stability and improved performance, particularly at lower ranks where LoRA's performance degrades significantly.
  • Tuning Granularity: By selectively applying directional updates (LoRA part) only to certain modules (e.g., QKV layers in attention) while updating magnitude for more modules, DoRA can achieve better accuracy than LoRA with a smaller total number of trainable parameters.

Broader Impact (QDoRA):

DoRA's concept has been extended to quantized models. QDoRA combines 4-bit quantization (from QLoRA) with DoRA. Early results on fine-tuning LLaMA2-7B on Orca-Math (Mitra et al., 16 Feb 2024) show that QDoRA not only outperforms QLoRA significantly but can even surpass the accuracy of the non-quantized full fine-tuning baseline, all while requiring substantially less GPU memory. This suggests QDoRA is a promising approach for efficiently fine-tuning large models on consumer hardware.

Implementation Considerations:

  • Implementing DoRA requires modifying the forward and backward passes for the layers where PEFT is applied to incorporate the magnitude and directional decomposition and updates.
  • The gradient modification (treating $\|V + \Delta V\|_c$ as a constant) is crucial for practical memory efficiency during training.
  • The trainable parameters are the magnitude vector $m$ (size $1 \times k$) and the LoRA matrices $A$ ($r \times k$) and $B$ ($d \times r$), for a total of $k + r \times k + d \times r$ trainable parameters. Compared to LoRA ($r \times k + d \times r$), DoRA adds only $k$ parameters for the magnitude vector per modified weight matrix (see the worked example after this list).
  • For deployment, the trained $m$, $A$, and $B$ are merged into the original $W_0$ to form the final weight matrix $W'$, which can then be used like any standard weight matrix without special DoRA computation, ensuring zero inference overhead.
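
As a worked illustration of the last two points (shapes and names are our own; the merge follows the $W'_{\text{merged}}$ formula from the implementation section above):

```python
import torch

# Parameter count per adapted weight matrix (e.g., a 4096 x 4096 projection, rank 16).
d, k, r = 4096, 4096, 16
lora_params = r * k + d * r      # 131,072
dora_params = k + lora_params    # 135,168 -> only k = 4,096 extra magnitude parameters

def merge_dora(W0, m, A, B):
    """Fold trained (m, A, B) into one plain (d, k) weight for zero-overhead inference."""
    V_prime = W0 + B @ A
    return m * (V_prime / V_prime.norm(p=2, dim=0, keepdim=True))

# Sanity check on a small matrix: a zero LoRA update reproduces W0 exactly.
W0 = torch.randn(64, 32)
m = W0.norm(p=2, dim=0, keepdim=True)
W_merged = merge_dora(W0, m, torch.zeros(4, 32), torch.zeros(64, 4))
assert torch.allclose(W_merged, W0, atol=1e-5)
```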

In summary, DoRA is a practical and effective PEFT method that improves upon LoRA by explicitly handling magnitude and directional updates based on insights from FT. It achieves better performance with similar or fewer trainable parameters than LoRA, is compatible with LoRA variants, is robust to varying ranks and data sizes, and can be merged for inference, incurring no additional latency. Its extension to quantized fine-tuning (QDoRA) further highlights its practical significance for training large models on resource-constrained hardware.

Authors (7)
  1. Chien-Yi Wang (29 papers)
  2. Hongxu Yin (49 papers)
  3. Pavlo Molchanov (70 papers)
  4. Yu-Chiang Frank Wang (88 papers)
  5. Kwang-Ting Cheng (96 papers)
  6. Min-Hung Chen (41 papers)
  7. Shih-yang Liu (10 papers)
Citations (209)