
Riemannian Preconditioned LoRA Optimization

Updated 24 February 2026
  • Riemannian Preconditioned LoRA is an advanced method that integrates geometry-based preconditioning into low-rank adaptation to improve numerical stability and convergence.
  • It modifies update rules using the fixed-rank manifold structure, addressing ill-conditioning through small, efficient preconditioners.
  • Practical implementations show enhanced performance in language and diffusion models with improved robustness and minimal computational overhead.

Riemannian Preconditioned LoRA refers to the incorporation of Riemannian geometry-derived preconditioning strategies into the update rules of Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning algorithm for large-scale neural models. This approach improves the numerical stability, convergence speed, and robustness of LoRA by exploiting the geometry of the fixed-rank matrix manifold and introducing explicit small-dimensional preconditioners based on the Riemannian metric structure. These methods have demonstrated practical benefits in LLMs, diffusion models, and general low-rank adaptation tasks (Zhang et al., 2024, Park et al., 25 Aug 2025, Bogachev et al., 16 Jul 2025, Almansoori et al., 18 Feb 2026, Bioli et al., 2024).

1. Background: Low-Rank Adaptation and Geometric Ambiguities

LoRA fine-tuning replaces a frozen pretrained parameter matrix $W \in \mathbb{R}^{m \times n}$ with $X = W + AB^\top$, where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$ for small $r \ll \min(m, n)$. Only $A$ and $B$ are updated during training. Yet the parameterization $(A, B)$ is not unique: the transformation $(AO, BO^{-\top})$ for any $O \in \mathrm{GL}(r)$ leaves $X$ unchanged, making the set of all rank-$r$ matrices $X$ a quotient manifold. This gauge symmetry complicates optimization and can result in ill-conditioned updates if the local geometry is ignored (Zhang et al., 2024, Bogachev et al., 16 Jul 2025).

Standard LoRA employs plain Euclidean gradient descent, so update steps may become ill-scaled or converge slowly when $A^\top A$ or $B^\top B$ is ill-conditioned. Such problems motivate a preconditioner that respects the underlying Riemannian geometry of the fixed-rank matrix manifold.

2. Riemannian Metric and Geometry of Fixed-Rank Manifolds

The set of matrices of fixed rank $r$ forms a smooth submanifold $M_r = \{ X \in \mathbb{R}^{m \times n} : \operatorname{rank}(X) = r \}$. On $M_r$, a canonical or quotient Riemannian metric can be defined. For the factorized parameterization, the Mishra-Sepulchre metric evaluates pairs of tangent vectors $(\Delta A, \Delta B)$ and $(\Delta A', \Delta B')$ as

g[A,B]((ΔA,ΔB),(ΔA,ΔB))=Tr(BBΔAΔA)+Tr(AAΔBΔB).g_{[A,B]}( (\Delta A, \Delta B), (\Delta A', \Delta B')) = \operatorname{Tr}(B^\top B\,\, \Delta A^\top \Delta A') + \operatorname{Tr}(A^\top A\,\, \Delta B^\top \Delta B').

This metric is invariant under the gauge transformation and directly reflects ill-conditioning in the parameter space, motivating the corresponding preconditioners (Zhang et al., 2024).

On general $M_r$, the ambient Euclidean metric induces projections onto the tangent space for preconditioning, as in (Bogachev et al., 16 Jul 2025, Bioli et al., 2024). In Stiefel-constrained parameterizations (e.g., $B^\top B = I$), updates are restricted to the tangent space of the Stiefel manifold, further stabilizing optimization (Park et al., 25 Aug 2025).
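The gauge invariance of this metric can be checked numerically. The sketch below, with all shapes and values chosen arbitrarily for illustration, verifies that applying $(A, B) \mapsto (AO, BO^{-\top})$ to both the base point and the tangent vectors leaves the metric value unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3

def metric(A, B, dA1, dB1, dA2, dB2):
    # Mishra-Sepulchre metric on the factorized parameterization
    return (np.trace(B.T @ B @ dA1.T @ dA2)
            + np.trace(A.T @ A @ dB1.T @ dB2))

A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
dA, dB = rng.standard_normal((m, r)), rng.standard_normal((n, r))

# Gauge transformation (A, B) -> (A O, B O^{-T}) leaves X = A B^T unchanged;
# tangent vectors transform the same way.
O = rng.standard_normal((r, r)) + 3 * np.eye(r)  # well-conditioned invertible O
O_invT = np.linalg.inv(O).T

g1 = metric(A, B, dA, dB, dA, dB)
g2 = metric(A @ O, B @ O_invT, dA @ O, dB @ O_invT, dA @ O, dB @ O_invT)
print(abs(g1 - g2))  # numerically zero: the metric is gauge invariant
```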

3. Riemannian Preconditioner Construction and Update Formulas

The Riemannian gradient of the objective $f(A, B) = L(W + AB^\top)$ under the Mishra-Sepulchre metric is given by

$$\operatorname{grad}^R_A f = \nabla_A L \cdot (B^\top B)^{-1}, \qquad \operatorname{grad}^R_B f = \nabla_B L \cdot (A^\top A)^{-1},$$

where $\nabla_A L = (\nabla_X L)\, B$ and $\nabla_B L = (\nabla_X L)^\top A$. The $r \times r$ matrices $B^\top B$ and $A^\top A$ act as coordinate-specific preconditioners.

A preconditioned first-order step thus becomes

$$\Delta A = -\eta\, \nabla_A L\, (B^\top B + \epsilon I)^{-1}, \qquad \Delta B = -\eta\, \nabla_B L\, (A^\top A + \epsilon I)^{-1},$$

with a small $\epsilon > 0$ for numerical stability. This mechanism is computationally efficient, adding negligible overhead for $r \lesssim 64$ (Zhang et al., 2024, Almansoori et al., 18 Feb 2026).
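The preconditioned step can be sketched in a few lines of NumPy. The toy quadratic objective $L(X) = \tfrac{1}{2}\|X - \text{target}\|_F^2$ below is a stand-in for the true training loss, and all hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, eps, lr = 20, 15, 4, 1e-6, 0.1

A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
target = rng.standard_normal((m, n))
I_r = np.eye(r)

losses = []
for _ in range(200):
    R = A @ B.T - target               # dL/dX for L = 0.5 ||A B^T - target||_F^2
    losses.append(0.5 * np.sum(R * R))
    gA, gB = R @ B, R.T @ A            # Euclidean factor gradients
    # Riemannian preconditioning: right-multiply by (B^T B + eps I)^{-1}
    # and (A^T A + eps I)^{-1}; these are r x r solves, so overhead is tiny.
    pA = np.linalg.solve(B.T @ B + eps * I_r, gA.T).T
    pB = np.linalg.solve(A.T @ A + eps * I_r, gB.T).T
    A, B = A - lr * pA, B - lr * pB

print(losses[0], losses[-1])  # loss drops toward the best rank-r error
```

Note that only the two `np.linalg.solve` lines differ from a plain gradient step on the factors.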

In Stiefel-LoRA, orthogonality-constrained variants use tangent-space projections and QR-based retractions to preserve $B^\top B = I$, and apply adaptive moment preconditioning (e.g., AdamW) in the ambient space before projection (Park et al., 25 Aug 2025).
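A minimal sketch of that machinery, assuming the standard Stiefel tangent-space projection and a QR retraction (the step size and the ambient gradient here are placeholders, not the cited algorithm's exact schedule):

```python
import numpy as np

def stiefel_project(B, G):
    """Project an ambient gradient G onto the tangent space of {B : B^T B = I}."""
    BtG = B.T @ G
    return G - B @ (BtG + BtG.T) / 2

def qr_retract(B, xi):
    """QR-based retraction: map B + xi back onto the Stiefel manifold."""
    Q, R = np.linalg.qr(B + xi)
    return Q * np.sign(np.diag(R))  # fix column signs for uniqueness

rng = np.random.default_rng(2)
n, r = 10, 3
B, _ = np.linalg.qr(rng.standard_normal((n, r)))  # start on the manifold
G = rng.standard_normal((n, r))                   # e.g. an AdamW-scaled gradient

xi = stiefel_project(B, G)
B_next = qr_retract(B, -0.1 * xi)

# Orthogonality B^T B = I is preserved exactly after the retraction.
print(np.linalg.norm(B_next.T @ B_next - np.eye(r)))
```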

4. Preconditioning in Riemannian Optimization: Variants and Algorithms

Recent work extends Riemannian preconditioning to LoRA optimization by varying the choice of manifold, metric, and update rule (Zhang et al., 2024, Bogachev et al., 16 Jul 2025, Park et al., 25 Aug 2025, Almansoori et al., 18 Feb 2026, Bioli et al., 2024):

  • Quotient/Embedded Manifold Approaches: Project Euclidean gradients onto the tangent space of $M_r$ using explicit projections that ignore rank-changing directions, yielding geometric scaling intrinsic to the fixed-rank structure (Bogachev et al., 16 Jul 2025, Bioli et al., 2024).
  • Stiefel Optimization: Impose explicit $B^\top B = I$ constraints to prevent basis redundancy or rank collapse. Riemannian gradients are computed and Adam-style adaptive scaling is performed in the tangent space, with manifold retractions enforcing constraints (Park et al., 25 Aug 2025).
  • Proximal and Alternating Least Squares: Proximal subspace iterations (LoRSum, F-LoRSum) treat the LoRA update as a sequence of alternating $r \times r$ linear solves with (potentially structured) diagonal or Kronecker-factored metrics, providing further flexibility in constructing K-FAC or Shampoo-derived preconditioners (Almansoori et al., 18 Feb 2026).
  • Metric Change and Tangent-Space Preconditioning: Generalize preconditioning to use Kronecker-product operators ($\mathcal{P}(X) = E X D$), Sylvester-type operators, or approximate tangent-space solvers (e.g., tangent-space ADI), with weighted projections and retractions matching the induced geometry (Bioli et al., 2024).
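A Kronecker-product preconditioner $\mathcal{P}(X) = EXD$ never needs to be materialized: applying its inverse reduces to two small linear solves. A minimal NumPy sketch, where the SPD factors $E$ and $D$ are random stand-ins for, e.g., K-FAC-style curvature estimates, verified against the explicit Kronecker form:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 5

def random_spd(k):
    # hypothetical SPD Kronecker factor (stand-in for a curvature estimate)
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

E, D = random_spd(m), random_spd(n)
G = rng.standard_normal((m, n))  # a Euclidean gradient

# Solve E X D = G via two small solves: X = E^{-1} G D^{-1}.
# The (m*n) x (m*n) matrix kron(D^T, E) is never formed in practice.
precond_G = np.linalg.solve(E, np.linalg.solve(D.T, G.T).T)

# Sanity check against the explicit form: vec(E X D) = kron(D^T, E) vec(X)
P_full = np.kron(D.T, E)
ref = np.linalg.solve(P_full, G.ravel(order="F")).reshape((m, n), order="F")
print(np.linalg.norm(precond_G - ref))  # numerically zero
```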

A comparative table summarizing preconditioning approaches:

| Approach | Metric/Constraint | Preconditioner |
| --- | --- | --- |
| Quotient/Canonical (Mishra-Sepulchre) | Canonical Riemannian | $(B^\top B)^{-1}$, $(A^\top A)^{-1}$ |
| Stiefel LoRA | $B^\top B = I$ (Stiefel) | Tangent-space projection + Adam |
| Proximal Subspace Iteration (LoRSum, F-LoRSum) | Structured (K-FAC, Shampoo) | $D_U$, $D_V$ (diagonals/Kronecker) |
| Tangent-space ADI/CG | Weighted/structured SPD operator | Sylvester/ADI solve on tangent space |

5. Theoretical Guarantees and Empirical Performance

A key theoretical result (Theorem 5.4 of Zhang et al., 2024) proves that, in a two-layer infinite-width regime with suitable restricted isometry properties and spectral initialization, Riemannian preconditioned (scaled) gradient descent enjoys a linear convergence rate independent of data conditioning:
$$\|F^t - F^*\| \leq (1 - 0.57)^t\, \mathrm{dist}_0,$$
where each eigen-direction contracts uniformly due to the metric scaling.
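The conditioning-independence claim can be illustrated on a toy rank-2 factorization problem (a sketch under simplified assumptions, not the paper's exact setting): plain gradient descent must use a step size limited by the largest singular value, so badly conditioned modes converge slowly, while the scaled update contracts all modes at the same rate:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 12, 10, 2

# Rank-2 target with condition number 100
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
s = np.array([100.0, 1.0])
M = (U * s) @ V.T

def run(precond, lr, steps=300):
    # spectral-style initialization near the target, slightly perturbed
    rng2 = np.random.default_rng(5)
    A = U * np.sqrt(s) + 0.1 * rng2.standard_normal((m, r))
    B = V * np.sqrt(s) + 0.1 * rng2.standard_normal((n, r))
    for _ in range(steps):
        R = A @ B.T - M
        gA, gB = R @ B, R.T @ A
        if precond:  # scaled (Riemannian preconditioned) gradient descent
            gA = np.linalg.solve(B.T @ B + 1e-8 * np.eye(r), gA.T).T
            gB = np.linalg.solve(A.T @ A + 1e-8 * np.eye(r), gB.T).T
        A, B = A - lr * gA, B - lr * gB
    return np.linalg.norm(A @ B.T - M)

err_plain = run(False, lr=5e-3)   # step size capped by the large singular value
err_scaled = run(True, lr=0.3)    # conditioning-independent contraction
print(err_plain, err_scaled)      # scaled GD reaches far lower error
```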

Empirically, Riemannian preconditioned LoRA methods consistently outperform their unconstrained or naively preconditioned counterparts across diverse tasks. For example (Zhang et al., 2024):

  • Language modeling (GPT-2, Mistral 7B, LLaMA-3.2-1B): preconditioned SGD and AdamW yield up to +1.5 ROUGE-L and +2–3 average accuracy points on GLUE benchmarks compared to vanilla LoRA.
  • Commonsense and math tasks: Stiefel-LoRA outperforms AdamW-LoRA by 5–15 accuracy points (Park et al., 25 Aug 2025), with consistent gains in tasks such as SQuAD and GSM8K.
  • Diffusion models: Riemannian preconditioned LoRA maintains stability for large learning rates and produces higher-fidelity generations (Zhang et al., 2024, Bogachev et al., 16 Jul 2025).
  • Robustness and efficiency: These methods demonstrate strong insensitivity to learning-rate/hyperparameter variation, and the per-step computational overhead is negligible (≈1% additional time for $r \le 64$) (Zhang et al., 2024, Almansoori et al., 18 Feb 2026).

6. Practical Implementation and Recommendations

  • Rank selection: $r = 4$–$16$ is typically sufficient. The preconditioner cost of $O(r^3)$ per update is negligible at this scale (Zhang et al., 2024).
  • Initialization: Random Gaussian with orthonormalization for Stiefel methods; locally optimal tangent-aligned initializations (e.g., truncated or randomized SVD) for generic Riemannian variants (Bogachev et al., 16 Jul 2025, Park et al., 25 Aug 2025).
  • Learning rates: Larger step sizes are tolerated, especially for SGD ($\eta$ up to $10^{-1}$), with preconditioned AdamW allowing faster convergence without instability (Zhang et al., 2024).
  • Coding: Implementation requires only small modifications, typically 5–10 lines, to accommodate Riemannian preconditioning in standard deep learning frameworks (Zhang et al., 2024). Ensure correct pairing of the $A$/$B$ factors, add a small regularizer (e.g., $\epsilon = 10^{-6}$), and restrict all inverses/solves to $r \times r$ matrices.
  • Variants: Rank scheduling, momentum buffering, and structured preconditioners (e.g., half-KFAC) can be combined with Riemannian principles for further gains (Almansoori et al., 18 Feb 2026).

7. Extensions and Unified Perspectives

Riemannian Preconditioned LoRA is closely related to a broader landscape of low-rank manifold optimization techniques. Unified frameworks such as RiemannLoRA (Bogachev et al., 16 Jul 2025) and preconditioned Riemannian CG (Bioli et al., 2024) formalize LoRA optimization as generic smooth manifold learning, providing a principled perspective on initialization, overparameterization removal, stability, and adaptive rank control.

Proximal subspace iteration approaches (Almansoori et al., 18 Feb 2026) generalize the Riemannian preconditioning concept to structured, memory-efficient metrics, unifying several recent LoRA enhancements. Tangent-space ADI and metric change strategies further expand the class of effective preconditioners, adapting to the structure of the underlying model, dataset, or loss landscape (Bioli et al., 2024).

A plausible implication is that continued advances in Riemannian preconditioned methods will further increase the reliability and efficiency of large-model fine-tuning, with rapidly diminishing gaps between structured SVD-based updates, parameter-efficient fine-tuning, and full Riemannian optimization.

Key sources: (Zhang et al., 2024, Park et al., 25 Aug 2025, Bogachev et al., 16 Jul 2025, Almansoori et al., 18 Feb 2026, Bioli et al., 2024).
