This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method designed to bridge the performance gap between standard LoRA and full fine-tuning (FT) while retaining LoRA's efficiency benefits.
The core idea of DoRA stems from an analysis of how FT and LoRA update model weights. The authors propose decomposing a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ into a magnitude component $m \in \mathbb{R}^{1 \times k}$ and a directional component $V \in \mathbb{R}^{d \times k}$, such that $W_0 = m \frac{V}{\|V\|_c}$, where $\|\cdot\|_c$ is the vector-wise norm across columns. By analyzing the changes in magnitude ($\Delta M$) and direction ($\Delta D$) during fine-tuning, they observe that FT exhibits a distinct learning pattern compared to LoRA, characterized by a different correlation between magnitude and directional updates (negative for FT, positive for LoRA in their analysis). This difference is hypothesized to contribute to LoRA's lower learning capacity relative to FT.
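A minimal PyTorch sketch of this decomposition and of the magnitude/direction change statistics follows; the function names are illustrative, not from the paper's code:

```python
import torch

def decompose(W: torch.Tensor):
    """Split a d x k weight into a 1 x k column-wise magnitude and a
    direction matrix with unit-norm columns."""
    m = W.norm(p=2, dim=0, keepdim=True)  # m_i = ||W[:, i]||_2
    return m, W / m

def magnitude_direction_change(W0: torch.Tensor, Wt: torch.Tensor):
    """Mean absolute magnitude change and mean (1 - cosine similarity)
    between corresponding columns, mirroring the paper's dM/dD metrics."""
    m0, V0 = decompose(W0)
    mt, Vt = decompose(Wt)
    delta_m = (mt - m0).abs().mean().item()
    delta_d = (1.0 - torch.cosine_similarity(Vt, V0, dim=0)).mean().item()
    return delta_m, delta_d
```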
DoRA leverages this insight by explicitly decomposing the pre-trained weight $W_0$ into its magnitude and direction at initialization. Instead of directly adding a low-rank update $\Delta W = BA$ to $W_0$ as in standard LoRA, DoRA fine-tunes the magnitude component $m$ directly and applies the low-rank update to the directional component $V$. The fine-tuned weight $W'$ is then constructed as:

$$W' = m \, \frac{V + \Delta V}{\|V + \Delta V\|_c} = m \, \frac{W_0 + BA}{\|W_0 + BA\|_c}$$

where $m \in \mathbb{R}^{1 \times k}$ and the low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable parameters, and $V$ (initialized as $W_0$) is kept frozen. This allows DoRA to manage magnitude and directional updates separately, aiming to replicate FT's learning behavior more closely.
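As a quick numeric sanity check of this construction (dimensions and values are illustrative), note that with $B$ initialized to zero the merged weight reproduces $W_0$ exactly:

```python
import torch

d, k, r = 64, 32, 4
W0 = torch.randn(d, k)                   # frozen pre-trained weight
m = W0.norm(p=2, dim=0, keepdim=True)    # trainable magnitude, init to ||W0||_c
B = torch.zeros(d, r)                    # LoRA "up" factor, zero-initialized
A = torch.randn(r, k)                    # LoRA "down" factor

V_prime = W0 + B @ A                     # V + dV (V is initialized as W0)
W_prime = m * V_prime / V_prime.norm(p=2, dim=0, keepdim=True)

# Because B starts at zero, the directional update vanishes and W' == W0.
assert torch.allclose(W_prime, W0, atol=1e-5)
```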
Practical Implementation:
- Weight Decomposition: For each weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in the pre-trained model where PEFT is applied (typically attention and sometimes feed-forward layers in Transformers), calculate the initial magnitude vector $m \in \mathbb{R}^{1 \times k}$, where each element $m_i = \|W_0[:, i]\|_2$ (the L2 norm of the i-th column). The initial directional matrix is simply $V = W_0$. $m$ is initialized as a trainable parameter, while $V$ remains fixed.
- LoRA Application: Standard LoRA matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are introduced. They are initialized as in standard LoRA ($A$ with Kaiming uniform, $B$ with zeros) so that $\Delta V = BA = 0$ initially, making $W'$ equal to $W_0$. These matrices are trainable.
- Forward Pass: To compute the output of a layer with weight $W'$ and input $x$ (a full layer sketch follows this list):
  - Compute the updated directional component: $V + \Delta V = W_0 + BA$.
  - Compute the column-wise norms: $\|W_0 + BA\|_c$.
  - Compute the fine-tuned weight: $W' = m \odot \frac{W_0 + BA}{\|W_0 + BA\|_c}$ (where $\odot$ denotes element-wise multiplication after broadcasting $m$ across the rows).
  - Compute the output: $h = W' x$.
  - Note that the magnitude $m$ is a $1 \times k$ vector, while $\frac{W_0 + BA}{\|W_0 + BA\|_c}$ is $d \times k$. The magnitude is applied element-wise to the columns of the normalized directional matrix.
- Backward Pass (Gradient Calculation): Standard gradient calculations apply to $m$, $B$, and $A$. To reduce memory overhead during backpropagation, the authors propose treating the norm $\|V + \Delta V\|_c$ as a constant (detaching it from the gradient graph). The norm is still computed from the current values of $V$ and $\Delta V$, but no gradient is backpropagated through it. Writing $V' = V + \Delta V$ and $C = \|V'\|_c$, the gradient w.r.t. $V'$ simplifies to $\nabla_{V'} \mathcal{L} = \frac{m}{C} \nabla_{W'} \mathcal{L}$, with $C$ held constant. This modification reduces memory usage with minimal impact on accuracy, as shown in ablation studies (e.g., a 24.4% memory reduction for LLaMA tuning at a 0.2-point accuracy drop).
- Inference: Like LoRA, DoRA allows merging the trained parameters ($m$, $B$, $A$) into the original weight matrix before deployment. The merged weight $W' = m \frac{W_0 + BA}{\|W_0 + BA\|_c}$ can be pre-calculated, resulting in a standard $d \times k$ matrix. This means DoRA introduces no additional inference latency compared to the original pre-trained model or merged LoRA.
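Putting these steps together, here is a minimal PyTorch layer sketch; the class name and structure are illustrative rather than the paper's reference implementation, and norms are taken over dim 0 to match the column-wise convention above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch of a DoRA-adapted linear layer (bias omitted for brevity)."""

    def __init__(self, base: nn.Linear, r: int):
        super().__init__()
        d_out, d_in = base.weight.shape
        # Frozen directional component V, initialized as the pre-trained weight.
        self.W0 = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        # Trainable magnitude, initialized to the column-wise norms ||W0||_c.
        self.m = nn.Parameter(self.W0.norm(p=2, dim=0, keepdim=True))
        # Standard LoRA factors: A Kaiming-uniform, B zeros (so W' == W0 at init).
        self.A = nn.Parameter(torch.empty(r, d_in))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)

    def forward(self, x):
        V = self.W0 + self.B @ self.A                     # V + dV
        # Detach the norm so no gradient flows through it (the memory trick).
        norm = V.norm(p=2, dim=0, keepdim=True).detach()
        W = self.m * (V / norm)
        return F.linear(x, W)

    @torch.no_grad()
    def merge(self) -> torch.Tensor:
        """Fold m, B, A into one standard weight matrix for deployment."""
        V = self.W0 + self.B @ self.A
        return self.m * (V / V.norm(p=2, dim=0, keepdim=True))
```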
Real-world Applications and Performance:
The paper demonstrates DoRA's effectiveness across various tasks and model architectures:
- Commonsense Reasoning (LLaMA, LLaMA2, LLaMA3): DoRA consistently outperforms LoRA across multiple benchmarks. On LLaMA-7B, DoRA improves average accuracy by 3.7% over LoRA. Even with roughly half the trainable parameters (obtained by halving the LoRA rank), DoRA still surpasses LoRA, indicating improved learning capacity with comparable or fewer parameters.
- Image/Video-Text Understanding (VL-BART): DoRA shows better performance than LoRA (nearly 1% average improvement on image-text, 2% on video-text), reaching performance levels comparable to full fine-tuning but with significantly fewer trainable parameters.
- Visual Instruction Tuning (LLaVA-1.5-7B): DoRA achieves higher average accuracy than both LoRA and FT.
- Instruction Tuning (LLaMA, LLaMA2): DoRA is compatible with other LoRA variants such as VeRA, yielding DVoRA. DVoRA improves significantly over VeRA (which uses very few parameters) and achieves accuracy comparable to or better than standard LoRA with substantially fewer parameters.
- Robustness: DoRA consistently outperforms LoRA across different numbers of training samples and varying LoRA ranks, demonstrating its stability and improved performance, particularly at lower ranks where LoRA's performance degrades significantly.
- Tuning Granularity: By selectively applying directional (LoRA) updates only to certain modules (e.g., the QKV projections in attention) while updating the magnitude for a larger set of modules, DoRA can achieve better accuracy than LoRA with a smaller total number of trainable parameters (a configuration sketch follows this list).
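One way such a scheme could be expressed; the module names follow common LLaMA conventions and the configuration format is hypothetical:

```python
# Hypothetical granularity configuration: directional (low-rank) updates only on
# the QKV projections, magnitude-only updates on the remaining adapted modules.
dora_granularity = {
    "q_proj":  {"magnitude": True, "direction": True},
    "k_proj":  {"magnitude": True, "direction": True},
    "v_proj":  {"magnitude": True, "direction": True},
    "o_proj":  {"magnitude": True, "direction": False},  # trains only the m vector
    "up_proj": {"magnitude": True, "direction": False},
}
```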
Broader Impact (QDoRA):
DoRA's concept has been extended to quantized models. QDoRA combines 4-bit quantization (from QLoRA) with DoRA. Early results on fine-tuning LLaMA2-7B on Orca-Math (Mitra et al., 2024) show that QDoRA not only outperforms QLoRA significantly but can even surpass the accuracy of the non-quantized full fine-tuning baseline, all while requiring substantially less GPU memory. This suggests QDoRA is a promising approach for efficiently fine-tuning large models on consumer hardware.
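A conceptual sketch of the combination, assuming the frozen base weight is stored quantized and dequantized on the fly; `dequantize` here is a stub standing in for a real 4-bit backend (e.g., the NF4 kernels QLoRA uses), so the sketch runs but is not QDoRA's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dequantize(qweight, quant_state):
    """Stub for a real 4-bit dequantizer; here it just casts so the sketch runs."""
    return qweight.to(torch.float32)

class QDoRALinear(nn.Module):
    """Conceptual QDoRA layer: a frozen quantized base weight with DoRA's
    magnitude and low-rank directional update trained in higher precision."""

    def __init__(self, qweight, quant_state, r: int):
        super().__init__()
        self.qweight, self.quant_state = qweight, quant_state  # frozen, quantized
        W0 = dequantize(qweight, quant_state)
        d_out, d_in = W0.shape
        self.m = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))
        self.A = nn.Parameter(torch.empty(r, d_in))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)

    def forward(self, x):
        W0 = dequantize(self.qweight, self.quant_state)  # materialize base weight
        V = W0 + self.B @ self.A
        norm = V.norm(p=2, dim=0, keepdim=True).detach() # same memory-saving detach
        return F.linear(x, self.m * (V / norm))
```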
Implementation Considerations:
- Implementing DoRA requires modifying the forward and backward passes for the layers where PEFT is applied to incorporate the magnitude and directional decomposition and updates.
- The gradient modification (treating $\|V + \Delta V\|_c$ as a constant) is crucial for practical memory efficiency during training.
- The trainable parameters are the magnitude vector $m$ (size $1 \times k$) and the LoRA matrices $B$ ($d \times r$) and $A$ ($r \times k$). The total number of trainable parameters per adapted matrix is $k + r(d + k)$. Compared to LoRA ($r(d + k)$), DoRA adds $k$ parameters for the magnitude vector per modified weight matrix (a worked count follows this list).
- For deployment, the trained $m$, $B$, and $A$ are merged into the original $W_0$ to form the final weight matrix $W'$, which can then be used like any standard weight matrix without special DoRA computation, ensuring zero inference overhead.
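For scale, a quick count under assumed dimensions:

```python
# Worked count for one adapted matrix, assuming d = k = 4096 (LLaMA-7B hidden
# size) and rank r = 32:
d, k, r = 4096, 4096, 32
lora_params = r * (d + k)        # 262,144 trainable parameters
dora_params = lora_params + k    # 266,240: the extra k = 4,096 is the magnitude vector
```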
In summary, DoRA is a practical and effective PEFT method that improves upon LoRA by explicitly handling magnitude and directional updates based on insights from FT. It achieves better performance with similar or fewer trainable parameters than LoRA, is compatible with LoRA variants, is robust to varying ranks and data sizes, and can be merged for inference, incurring no additional latency. Its extension to quantized fine-tuning (QDoRA) further highlights its practical significance for training large models on resource-constrained hardware.