
Rank-Stabilized LoRA (rsLoRA)

Updated 2 December 2025
  • Rank-Stabilized LoRA (rsLoRA) is a parameter-efficient fine-tuning method that uses a $1/\sqrt{r}$ scaling factor to maintain stable activations and gradients across diverse adapter ranks.
  • It addresses the gradient collapse issue of standard LoRA by ensuring stable learning dynamics, which improves performance as adapter rank increases, as validated on models like Llama 2.
  • The method achieves efficient adaptation without additional inference cost and extends to federated and privacy-preserving settings via adaptations such as FedSVD.

Rank-Stabilized LoRA (rsLoRA) is an improved parameter-efficient fine-tuning (PEFT) methodology for LLMs and other deep neural architectures. It addresses a critical limitation in the canonical Low-Rank Adapter (LoRA) approach, specifically the rank-dependent scaling factor that hinders effective adaptation for higher-rank adapters. By replacing the previously used scaling factor proportional to $1/r$ with a theoretically derived $1/\sqrt{r}$ scaling, rsLoRA enables stable and efficient learning dynamics across a much wider range of adapter ranks, thus facilitating better compute/performance trade-offs without increasing inference costs (Kalajdzievski, 2023).

1. Formulation and Motivation

The standard LoRA method augments a frozen pretrained weight matrix $W \in \mathbb{R}^{d_2 \times d_1}$ with a low-rank correction $\Delta W$, parameterized as

$$\Delta W = \gamma \cdot B A,$$

where $B \in \mathbb{R}^{d_2 \times r}$, $A \in \mathbb{R}^{r \times d_1}$, and $r \ll \min(d_1, d_2)$. The scaling factor $\gamma$ is typically set as $\alpha / r$, with $\alpha$ a constant hyperparameter.
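For concreteness (illustrative sizes, not taken from the source): with $d_1 = d_2 = 4096$ and $r = 64$, the adapter adds $r(d_1 + d_2) = 524{,}288$ trainable parameters, roughly $3\%$ of the $d_1 d_2 \approx 16.8$M entries in the frozen weight it corrects.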

Empirical and theoretical analyses reveal that as $r$ increases, the $1/r$ scaling causes gradients with respect to $B$ and $A$ to collapse as $\mathcal{O}(1/r)$, dramatically slowing adaptation and effectively nullifying the potential benefits of higher adapter ranks. Empirically, increasing $r$ in standard LoRA beyond small values (e.g., $r=16$) does not improve learning, with loss curves saturating and matching the low-rank case. The rsLoRA framework is motivated by the need to stabilize both the magnitude of forward-pass activations and backward-pass gradients as $r$ grows (Kalajdzievski, 2023).

2. Theoretical Foundation for $1/\sqrt{r}$ Scaling

To ensure activations and gradients remain $\mathcal{O}(1)$ as $r \to \infty$, the scaling $\gamma_r = \alpha/\sqrt{r}$ is analytically established:

  • Forward-pass: Under standard initializations ($B$ zeros, $A_{ij} \sim \mathcal{N}(0, \sigma_A^2)$), the variance of output activations due to $\Delta W x$ is proportional to $\gamma_r^2 r$. Ensuring $\mathbb{E}[\|\Delta y\|^2] = \Theta(1)$ requires $\gamma_r^2 r = \Theta(1)$, so $\gamma_r \propto 1/\sqrt{r}$.
  • Backward-pass: Gradient magnitudes for $B$ and $A$ similarly scale with $\gamma_r$, with the relevant norms through $A$ and $B$ growing as $\mathcal{O}(\sqrt{r})$. Stability again requires $\gamma_r \sqrt{r} = \Theta(1)$, enforcing the same scaling.

The main theoretical result (see Appendix, (Kalajdzievski, 2023)) is that only $\gamma_r = \Theta(1/\sqrt{r})$ simultaneously bounds the moments of both activations and gradients for arbitrary rank $r$. Faster decay (such as $1/r$) collapses gradients; slower decay (such as $r^{-1/4}$) causes exploding activations or gradients.

Definition (Rank-Stabilized Adapter): An adapter $\Delta(x) = \gamma_r B A x$ is rank-stabilized if for all orders $m \geq 1$:

  1. $\mathbb{E}[|x_i|^m] = \Theta(1) \implies \mathbb{E}[|\Delta(x)_j|^m] = \Theta(1)$,
  2. $\partial\mathcal{L}/\partial\Delta(x)_j = \Theta(1) \implies \partial\mathcal{L}/\partial x_i = \Theta(1)$.

This is provably satisfied only by $\gamma_r \propto 1/\sqrt{r}$.
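A minimal NumPy check of this variance behavior, with illustrative dimensions and a nonzero $B$ (unlike the usual zero initialization) so the forward-pass scaling is visible:

import numpy as np

d1, d2 = 4096, 4096
sigma = 0.02                              # entry std for A and B, held fixed across ranks
rng = np.random.default_rng(0)

for r in [8, 64, 512, 4096]:
    A = rng.normal(0.0, sigma, (r, d1))
    B = rng.normal(0.0, sigma, (d2, r))   # nonzero so the adapter output is nonzero
    x = rng.normal(0.0, 1.0, d1)
    for name, gamma in [("1/r       (LoRA)  ", 1 / r),
                        ("1/sqrt(r) (rsLoRA)", 1 / np.sqrt(r))]:
        dy = gamma * (B @ (A @ x))
        # rsLoRA keeps E[dy^2] roughly constant in r; the 1/r scaling shrinks it
        print(f"r={r:4d}  gamma={name}  E[dy^2]={np.mean(dy**2):.2e}")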

3. Implementation Details and Pseudocode

The rsLoRA workflow modifies only the scaling of the adapter relative to the canonical LoRA algorithm. Concretely:

# rsLoRA training loop (sketch). W is the frozen pretrained weight; only B and A train.
B = zeros(d2, r)                      # zero init so ΔW = 0 at the start
A = Normal(0, σ_A^2)                  # shape (r, d1), entry variance fixed across ranks
γ = α / sqrt(r)                       # rsLoRA scaling; standard LoRA uses α / r
for minibatch (x, y_true):
    ΔW = γ * (B @ A)                  # low-rank correction
    y_pred = W @ x + ΔW @ x
    loss = L(y_pred, y_true)
    grad_pred = backward(loss, y_pred)            # ∂loss/∂y_pred
    grad_B = γ * (grad_pred @ x.T) @ A.T          # chain rule through γ B A x
    grad_A = γ * B.T @ (grad_pred @ x.T)
    B -= η * grad_B                   # gradient step on the adapter only
    A -= η * grad_A

The critical difference: set $\gamma = \alpha/\sqrt{r}$ rather than $\alpha/r$.

Hyperparameters:

  • Rank $r$: Select to match GPU budget. Effective range: 4–1024; higher ranks (256–2048) unlock better fine-tuning when rsLoRA is used.
  • Scaling $\alpha$: Default $\alpha = 1$. For $r \gg 1024$, tuning in $[0.5, 2]$ is suggested.
  • Learning rate $\eta$: As in standard LoRA (e.g., AdamW with $\eta \approx 5 \times 10^{-5}$).
  • Initialization $\sigma_A^2$: Use $1/r_{\rm init}$ as in standard LoRA, or $\mathcal{N}(0, 0.02^2)$.
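In practice, recent versions of the HuggingFace PEFT library expose this scaling directly via a configuration flag, so the loop above rarely needs to be hand-rolled. A minimal sketch (the model name and target modules are illustrative choices, not prescribed by the source):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model is illustrative; any causal LM supported by PEFT works the same way.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=256,               # higher ranks pay off under rank-stabilized scaling
    lora_alpha=16,
    use_rslora=True,     # scale adapters by lora_alpha/sqrt(r) instead of lora_alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()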

4. Empirical Results and Performance

Experiments with Llama 2 (7B), using the OpenOrca dataset (20k examples, perplexity metric):

  • Standard LoRA ($\gamma = 1/r$): Perplexity saturates at $\approx 1.88$ for all $r = 4, \ldots, 2048$; no improvement beyond $r = 16$.
  • rsLoRA ($\gamma = 1/\sqrt{r}$): Higher ranks progressively improve perplexity: $r=4$ (1.88), $r=32$ (1.87), $r=512$ (1.84), $r=2048$ (1.82).

Gradient-norm diagnostics:

  • Standard LoRA: $\|\partial\mathcal{L}/\partial B\|$ collapses as $1/r$, leading to extremely slow adaptation at larger $r$.
  • rsLoRA: Gradient norms are $\mathcal{O}(1)$ and stable for all $r$.

Additional ablations confirm:

  • Scaling only the initialization by $1/\sqrt{r}$, but not the adapter, does not resolve the collapse.
  • Alternative scaling laws (e.g., $r^{-1/4}$, $r^{-2}$) either explode or collapse activations/gradients more severely.
  • Restricting LoRA adapters only to attention sublayers preserves the rsLoRA qualitative improvement.

This suggests the benefits of rsLoRA generalize across architectures, datasets, and optimizer choices (Kalajdzievski, 2023).

5. Practical Guidelines and Limitations

Adoption and settings:

  • Use rsLoRA whenever a high adapter rank ($r \geq 64$) is desired to exploit available training compute for improved adaptation, incurring no extra inference cost.
  • Recommended rank: $r = 64$–$256$ for most scenarios; increase to $512$–$1024$ if the memory budget allows.
  • Maintain conventional learning rates and optimization schedules used for transformer fine-tuning.
  • No further changes to training paradigms, optimizers, or initialization necessary.

Observed benefits:

  • Fine-tuning loss/perplexity reductions of up to several percentage points as $r$ increases from $8$ to $512$.
  • rsLoRA achieves performance comparable to or better than full fine-tuning on many NLP tasks, with less than $1$–$5\%$ of model parameters trainable.

Limitations:

  • For downstream tasks whose intrinsic dimension is much smaller than $r$, increasing $r$ gives diminishing returns.
  • rsLoRA addresses only the rank-based scaling issue; it does not mitigate challenges such as domain shift or catastrophic forgetting.

6. Relationship to Federated and Private Settings

While rsLoRA resolves rank-scaling issues in local and centralized applications, when deployed in federated learning with differential privacy mechanisms such as DP-SGD, further adaptation is necessary due to noise amplification through matrix multiplications in LoRA updates:

  • Quadratic noise terms ($\xi_B \xi_A$) arise when both $A$ and $B$ are locally adapted and independently perturbed on each client.
  • Freezing one matrix (typically $A$) restricts expressiveness and degrades adaptation.
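To make the problematic term explicit: if a client independently privatizes both factors with noises $\xi_B$ and $\xi_A$, the noisy product expands as

$$(B + \xi_B)(A + \xi_A) = BA + \xi_B A + B \xi_A + \xi_B \xi_A,$$

where the first two noise terms are linear in the injected noise, while the final cross term is quadratic; this is the amplification FedSVD is designed to eliminate.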

The FedSVD method, introduced in "FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA" (Lee et al., 19 May 2025), orthogonalizes one adapter ($A$) via a truncated SVD of the aggregated product $BA$ on the server after each communication round (a minimal sketch follows the list below). This ensures:

  • Only linear noise amplification occurs; the problematic $\xi_B \xi_A$ cross term is eliminated.
  • An orthonormal $A$ preserves gradient norms under DP-SGD clipping and improves the conditioning of client optimization.
  • Global SVD-based adaptation of $A$ recovers the expressiveness lost in fixed-matrix strategies, delivering improved accuracy and stability under DP constraints.
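A minimal sketch of the server-side re-factorization described above, assuming plain NumPy and hypothetical names for the aggregated adapters (the full protocol, including aggregation and privacy accounting, is in Lee et al., 19 May 2025):

import numpy as np

def fedsvd_refactor(B_agg, A_agg, r):
    # Re-factor the aggregated product B·A so the new A has orthonormal rows.
    U, S, Vt = np.linalg.svd(B_agg @ A_agg, full_matrices=False)
    A_new = Vt[:r, :]           # orthonormal rows: norm-preserving under DP-SGD clipping
    B_new = U[:, :r] * S[:r]    # absorb the singular values into B
    return B_new, A_new         # B_new @ A_new = best rank-r approximation of B_agg @ A_agg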

Empirically, FedSVD achieves 86.27% average accuracy in non-private settings and 76.79% under DP-SGD with $\epsilon = 6$, outperforming other PEFT methods by substantial margins on GLUE benchmarks (Lee et al., 19 May 2025).

7. Impact and Significance

rsLoRA establishes a robust scaling prescription for low-rank adapters, correcting the core deficiency limiting the practical use of higher rank in LoRA-based PEFT. This provides researchers and practitioners with a tunable compute/performance trade-off, enabling efficient model adaptation in scenarios ranging from few-shot supervised tasks to large-sample fine-tuning. Its theoretical foundation ensures stable signal propagation and adaptable learning rates for modern deep models. In federated and privacy-preserving contexts, extensions such as FedSVD provide algorithmic solutions to new sources of instability induced by private noise injection, further broadening the impact and applicability of the rsLoRA scaling regime (Kalajdzievski, 2023, Lee et al., 19 May 2025).

References

  • Kalajdzievski, D. (2023). A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. arXiv:2312.03732.
  • Lee et al. (19 May 2025). FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA.
