
Rank-Stabilized LoRA (rsLoRA)

Updated 2 December 2025
  • Rank-Stabilized LoRA (rsLoRA) is a parameter-efficient fine-tuning method that uses a $1/\sqrt{r}$ scaling factor to maintain stable activations and gradients across diverse adapter ranks.
  • It addresses the gradient collapse issue of standard LoRA by ensuring stable learning dynamics, which improves performance as adapter rank increases, as validated on models like Llama 2.
  • The method achieves efficient adaptation without additional inference cost and extends to federated and privacy-preserving settings via adaptations such as FedSVD.

Rank-Stabilized LoRA (rsLoRA) is an improved parameter-efficient fine-tuning (PEFT) methodology for LLMs and other deep neural architectures. It addresses a critical limitation in the canonical Low-Rank Adapter (LoRA) approach, specifically the rank-dependent scaling factor that hinders effective adaptation for higher-rank adapters. By replacing the previously used scaling factor proportional to $1/r$ with a theoretically derived $1/\sqrt{r}$ scaling, rsLoRA enables stable and efficient learning dynamics across a much wider range of adapter ranks, thus facilitating better compute/performance trade-offs without increasing inference costs (Kalajdzievski, 2023).

1. Formulation and Motivation

The standard LoRA method augments a frozen pretrained weight matrix $W \in \mathbb{R}^{d_2 \times d_1}$ with a low-rank correction $\Delta W$, parameterized as

$$\Delta W = \gamma \cdot B A,$$

where $B \in \mathbb{R}^{d_2 \times r}$, $A \in \mathbb{R}^{r \times d_1}$, and $r \ll \min(d_1, d_2)$. The scaling factor $\gamma$ is typically set as $\alpha / r$, with $\alpha$ a constant hyperparameter.
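For concreteness (illustrative sizes, not taken from the source): with $d_1 = d_2 = 4096$ and $r = 64$, the adapter adds $r(d_1 + d_2) = 524{,}288$ trainable parameters, roughly $3\%$ of the $d_1 d_2 \approx 16.8$M entries in the frozen weight it corrects.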

Empirical and theoretical analyses reveal that as $r$ increases, the $1/r$ scaling causes gradients with respect to $B$ and $A$ to collapse as $\mathcal{O}(1/r)$, dramatically slowing adaptation and effectively nullifying the potential benefits of higher adapter ranks. Empirically, increasing $r$ in standard LoRA beyond small values (e.g., $r=16$) does not improve learning, with loss curves saturating and matching the low-rank case. The rsLoRA framework is motivated by the need to stabilize both the magnitude of forward-pass activations and backward-pass gradients as $r$ grows (Kalajdzievski, 2023).

2. Theoretical Foundation for $1/\sqrt{r}$ Scaling

To ensure activations and gradients remain $\mathcal{O}(1)$ as $r \to \infty$, the scaling $\gamma_r = \alpha/\sqrt{r}$ is analytically established:

  • Forward-pass: Under standard initializations ($B$ zeros, $A_{ij} \sim \mathcal{N}(0, \sigma_A^2)$), the variance of output activations due to $\Delta W x$ is proportional to $\gamma_r^2 r$. Ensuring $\mathbb{E}[\|\Delta y\|^2] = \Theta(1)$ requires $\gamma_r^2 r = \Theta(1)$, so $\gamma_r \propto 1/\sqrt{r}$.
  • Backward-pass: Gradient magnitudes for $B$ and $A$ similarly scale with $\gamma_r$, with the relevant norms through $A$ and $B$ growing as $\mathcal{O}(\sqrt{r})$. Stability again requires $\gamma_r \sqrt{r} = \Theta(1)$, enforcing the same scaling.

The main theoretical result (see Appendix, (Kalajdzievski, 2023)) is that only $\gamma_r = \Theta(1/\sqrt{r})$ simultaneously bounds the moments of both activations and gradients for arbitrary rank $r$. Faster decay (such as $1/r$) collapses gradients; slower decay (such as $r^{-1/4}$) causes exploding activations or gradients.

Definition (Rank-Stabilized Adapter): An adapter $\Delta(x) = \gamma_r B A x$ is rank-stabilized if for all orders $m \geq 1$:

  1. $\mathbb{E}[|x_i|^m] = \Theta(1) \implies \mathbb{E}[|\Delta(x)_j|^m] = \Theta(1)$,
  2. $\partial\mathcal{L}/\partial\Delta(x)_j = \Theta(1) \implies \partial\mathcal{L}/\partial x_i = \Theta(1)$.

This is provably satisfied only by $\gamma_r \propto 1/\sqrt{r}$.
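A minimal NumPy check of this variance behavior, with illustrative dimensions and a nonzero $B$ (unlike the usual zero initialization) so the forward-pass scaling is visible:

import numpy as np

d1, d2 = 4096, 4096
sigma = 0.02                              # entry std for A and B, held fixed across ranks
rng = np.random.default_rng(0)

for r in [8, 64, 512, 4096]:
    A = rng.normal(0.0, sigma, (r, d1))
    B = rng.normal(0.0, sigma, (d2, r))   # nonzero so the adapter output is nonzero
    x = rng.normal(0.0, 1.0, d1)
    for name, gamma in [("1/r       (LoRA)  ", 1 / r),
                        ("1/sqrt(r) (rsLoRA)", 1 / np.sqrt(r))]:
        dy = gamma * (B @ (A @ x))
        # rsLoRA keeps E[dy^2] roughly constant in r; the 1/r scaling shrinks it
        print(f"r={r:4d}  gamma={name}  E[dy^2]={np.mean(dy**2):.2e}")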

3. Implementation Details and Pseudocode

The rsLoRA workflow modifies only the scaling of the adapter relative to the canonical LoRA algorithm. Concretely:

# rsLoRA training loop (sketch). W is the frozen pretrained weight; only B and A train.
B = zeros(d2, r)                      # zero init so ΔW = 0 at the start
A = Normal(0, σ_A^2)                  # shape (r, d1), entry variance fixed across ranks
γ = α / sqrt(r)                       # rsLoRA scaling; standard LoRA uses α / r
for minibatch (x, y_true):
    ΔW = γ * (B @ A)                  # low-rank correction
    y_pred = W @ x + ΔW @ x
    loss = L(y_pred, y_true)
    grad_pred = backward(loss, y_pred)            # ∂loss/∂y_pred
    grad_B = γ * (grad_pred @ x.T) @ A.T          # chain rule through γ B A x
    grad_A = γ * B.T @ (grad_pred @ x.T)
    B -= η * grad_B                   # gradient step on the adapter only
    A -= η * grad_A

The critical difference: set $\gamma = \alpha/\sqrt{r}$ rather than $\alpha/r$.

Hyperparameters:

  • Rank $r$: Select to match GPU budget. Effective range: 4–1024; higher ranks (256–2048) unlock better fine-tuning when rsLoRA is used.
  • Scaling $\alpha$: Default $\alpha = 1$. For $r \gg 1024$, tuning in $[0.5, 2]$ is suggested.
  • Learning rate $\eta$: As in standard LoRA (e.g., AdamW with $\eta \approx 5 \times 10^{-5}$).
  • Initialization $\sigma_A^2$: Use $1/r_{\rm init}$ as in standard LoRA, or $\mathcal{N}(0, 0.02^2)$.
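In practice, recent versions of the HuggingFace PEFT library expose this scaling directly via a configuration flag, so the loop above rarely needs to be hand-rolled. A minimal sketch (the model name and target modules are illustrative choices, not prescribed by the source):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model is illustrative; any causal LM supported by PEFT works the same way.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=256,               # higher ranks pay off under rank-stabilized scaling
    lora_alpha=16,
    use_rslora=True,     # scale adapters by lora_alpha/sqrt(r) instead of lora_alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()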

4. Empirical Results and Performance

Experiments with Llama 2 (7B), using the OpenOrca dataset (20k examples, perplexity metric):

  • Standard LoRA ($\gamma = 1/r$): Perplexity saturates at $\approx 1.88$ for all $r = 4, \ldots, 2048$; no improvement beyond $r = 16$.
  • rsLoRA ($\gamma = 1/\sqrt{r}$): Higher ranks progressively improve perplexity: $r=4$ (1.88), $r=32$ (1.87), $r=512$ (1.84), $r=2048$ (1.82).

Gradient-norm diagnostics:

  • Standard LoRA: $\|\partial\mathcal{L}/\partial B\|$ collapses as $1/r$, leading to extremely slow adaptation at larger $r$.
  • rsLoRA: Gradient norms are $\mathcal{O}(1)$ and stable for all $r$.

Additional ablations confirm:

  • Scaling only the initialization by $1/\sqrt{r}$, but not the adapter, does not resolve the collapse.
  • Alternative scaling laws (e.g., $r^{-1/4}$, $r^{-2}$) either explode or collapse activations/gradients more severely.
  • Restricting LoRA adapters only to attention sublayers preserves the rsLoRA qualitative improvement.

This suggests the benefits of rsLoRA generalize across architectures, datasets, and optimizer choices (Kalajdzievski, 2023).

5. Practical Guidelines and Limitations

Adoption and settings:

  • Use rsLoRA whenever a high adapter rank ($r \geq 64$) is desired to exploit available training compute for improved adaptation, incurring no extra inference cost.
  • Recommended rank: $r = 64$–$256$ for most scenarios; increase to $512$–$1024$ if the memory budget allows.
  • Maintain conventional learning rates and optimization schedules used for transformer fine-tuning.
  • No further changes to training paradigms, optimizers, or initialization necessary.

Observed benefits:

  • Fine-tuning loss/perplexity reductions of up to several percentage points as $r$ increases from $8$ to $512$.
  • rsLoRA achieves performance comparable to or better than full fine-tuning on many NLP tasks, with less than $1$–$5\%$ of model parameters trainable.

Limitations:

  • For downstream tasks whose intrinsic dimension is much smaller than $r$, increasing $r$ gives diminishing returns.
  • rsLoRA addresses only the rank-based scaling issue; it does not mitigate challenges such as domain shift or catastrophic forgetting.

6. Relationship to Federated and Private Settings

While rsLoRA resolves rank-scaling issues in local and centralized applications, when deployed in federated learning with differential privacy mechanisms such as DP-SGD, further adaptation is necessary due to noise amplification through matrix multiplications in LoRA updates:

  • Quadratic noise terms ($\xi_B \xi_A$) arise when both $A$ and $B$ are locally adapted and independently perturbed on each client.
  • Freezing one matrix (typically $A$) restricts expressiveness and degrades adaptation.
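To make the problematic term explicit: if a client independently privatizes both factors with noises $\xi_B$ and $\xi_A$, the noisy product expands as

$$(B + \xi_B)(A + \xi_A) = BA + \xi_B A + B \xi_A + \xi_B \xi_A,$$

where the first two noise terms are linear in the injected noise, while the final cross term is quadratic; this is the amplification FedSVD is designed to eliminate.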

The FedSVD method, introduced in "FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA" (Lee et al., 19 May 2025), orthogonalizes one adapter ($A$) via a truncated SVD of the aggregated product $BA$ on the server after each communication round (a minimal sketch follows the list below). This ensures:

  • Only linear noise amplification occurs; the problematic $\xi_B \xi_A$ cross term is eliminated.
  • An orthonormal $A$ preserves gradient norms under DP-SGD clipping and improves the conditioning of client optimization.
  • Global SVD-based adaptation of $A$ recovers the expressiveness lost in fixed-matrix strategies, delivering improved accuracy and stability under DP constraints.
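A minimal sketch of the server-side re-factorization described above, assuming plain NumPy and hypothetical names for the aggregated adapters (the full protocol, including aggregation and privacy accounting, is in Lee et al., 19 May 2025):

import numpy as np

def fedsvd_refactor(B_agg, A_agg, r):
    # Re-factor the aggregated product B·A so the new A has orthonormal rows.
    U, S, Vt = np.linalg.svd(B_agg @ A_agg, full_matrices=False)
    A_new = Vt[:r, :]           # orthonormal rows: norm-preserving under DP-SGD clipping
    B_new = U[:, :r] * S[:r]    # absorb the singular values into B
    return B_new, A_new         # B_new @ A_new = best rank-r approximation of B_agg @ A_agg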

Empirically, FedSVD achieves 86.27% average accuracy in non-private settings and 76.79% under DP-SGD with $\epsilon = 6$, outperforming other PEFT methods by substantial margins on GLUE benchmarks (Lee et al., 19 May 2025).

7. Impact and Significance

rsLoRA establishes a robust scaling prescription for low-rank adapters, correcting the core deficiency limiting the practical use of higher rank in LoRA-based PEFT. This provides researchers and practitioners with a tunable compute/performance trade-off, enabling efficient model adaptation in scenarios ranging from few-shot supervised tasks to large-sample fine-tuning. Its theoretical foundation ensures stable signal propagation and adaptable learning rates for modern deep models. In federated and privacy-preserving contexts, extensions such as FedSVD provide algorithmic solutions to new sources of instability induced by private noise injection, further broadening the impact and applicability of the rsLoRA scaling regime (Kalajdzievski, 2023, Lee et al., 19 May 2025).

References

  • Kalajdzievski, D. (2023). A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. arXiv:2312.03732.
  • Lee et al. (19 May 2025). FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA.
