AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

Published 9 May 2026 in cs.LG, cs.AI, and cs.CL | (2605.08734v1)

Abstract: Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian $J_{\mathcal{G}}$ of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner $J_{\mathcal{G}}^* \mathcal{F}_t J_{\mathcal{G}}$ induced by any $\bm{W}$-space preconditioner $\mathcal{F}_t$ is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned $\bm{W}$-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for $J_{\mathcal{G}}^* \mathcal{F}_t J_{\mathcal{G}}$ to use, and (ii) which $\mathcal{F}_t$ on $\bm{W}$ to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for $J_{\mathcal{G}}^* J_{\mathcal{G}}$, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraints. Within this design space, a gradient-statistics-aware $\mathcal{F}_t$ paired with a closed-form factor-space solve at $\mathcal{O}((m+n)r)$ memory remains underexplored. We propose AdaPreLoRA, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner $\mathcal{H}_t$ on $\bm{W}$ and selecting from the resulting factor-space solution family the element minimizing an $\mathcal{H}_t$-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned $\bm{W}$-space direction under the $\mathcal{H}_t$-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.


Authors (3)

Summary

The paper introduces AdaPreLoRA, a LoRA optimizer that applies Adafactor-style diagonal Kronecker preconditioning in weight space while keeping optimizer memory at the LoRA level.
It employs a closed-form update rule that, among factor updates producing the same weight update, selects the one minimizing a preconditioner-weighted imbalance between the two factor contributions, approaching full-statistics preconditioning at a fraction of the memory cost.
Empirical benchmarks on language models and diffusion personalization show that AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers at matched peak GPU memory.


Introduction and Motivation

Parameter-efficient fine-tuning (PEFT) methods are crucial for adapting large pre-trained networks, especially LLMs and diffusion models, under strict memory and compute constraints. Low-Rank Adaptation (LoRA) is the dominant PEFT paradigm: it decomposes the update to each weight matrix $\bm{W}$ into a product of low-rank factors ($\bm{B} \bm{A}$), reducing trainable-parameter and optimizer-state costs from $\mathcal{O}(mn)$ to $\mathcal{O}((m+n)r)$. However, optimizing in factor space introduces a geometric challenge: the generator Jacobian $J_{\mathcal{G}}$, which maps factor updates to weight space, is always rank-deficient because of the inherent gauge redundancy of the factorization, so the induced preconditioner $J_{\mathcal{G}}^* \mathcal{F}_t J_{\mathcal{G}}$ is singular for any nontrivial $\mathcal{F}_t$ built from weight-space gradient statistics. As a result, it is nontrivial to construct LoRA optimizers that incorporate adaptive, gradient-statistics-aware preconditioning yet stay within LoRA's memory budget.
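The rank deficiency described above can be checked numerically on a toy problem. The sketch below is illustrative only: the sizes `m`, `n`, `r` and the helper `lora_jacobian` are hypothetical names chosen for this example, and the Jacobian is materialized densely, which would never be done at real scale. It assumes the generator is $\mathcal{G}(\bm{B}, \bm{A}) = \bm{B}\bm{A}$, so that $J_{\mathcal{G}}[\Delta_{\bm{B}}, \Delta_{\bm{A}}] = \Delta_{\bm{B}} \bm{A} + \bm{B} \Delta_{\bm{A}}$.

```python
import numpy as np

def lora_jacobian(B, A):
    """Dense matrix of (dB, dA) -> dB @ A + B @ dA under row-major flattening."""
    m, r = B.shape
    n = A.shape[1]
    # vec(dB @ A) = (I_m kron A^T) vec(dB) and vec(B @ dA) = (B kron I_n) vec(dA)
    return np.hstack([np.kron(np.eye(m), A.T), np.kron(B, np.eye(n))])

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2                      # hypothetical toy sizes
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))

J = lora_jacobian(B, A)
rank = np.linalg.matrix_rank(J)
print(J.shape[1], rank)                # 22 factor coordinates, rank 18 = (m + n) * r - r**2

# Gauge direction: B -> B (I - X), A -> (I + X) A leaves B @ A unchanged to first order.
X = rng.standard_normal((r, r))
dB, dA = -B @ X, X @ A
print(np.allclose(dB @ A + B @ dA, 0.0))   # True
```

For full-rank factors the missing rank is exactly `r**2`, one dimension per entry of the `r x r` matrix `X`, which is why any preconditioner pulled back through this Jacobian is singular.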

This work provides a unified framework for classifying LoRA optimizers, parameterized by (i) the choice of weight-space preconditioner and (ii) the rule for selecting a particular update within the affine solution set allowed by the rank-deficiency. Within this framework, previous optimizers either ignore gradient-statistical structure for cheap factor updates (vanilla LoRA, Imbalance-Reg, etc.) or require $\mathcal{O}(mn)$ memory (full Adafactor/Adam, Shampoo, K-FAC, LoRA-Pro AdamW).

The AdaPreLoRA Algorithm

AdaPreLoRA occupies an underexplored point in this design space: it employs the Adafactor diagonal Kronecker preconditioner (the cheapest gradient-statistics-based structure, at $\mathcal{O}(m+n)$ memory) and selects, from the solution set of the preconditioned least-squares system, the factor update minimizing an imbalance criterion in the induced norm. The derivation proceeds as follows:

The desired update solves the system $J_{\mathcal{G}}^* \mathcal{F}_t J_{\mathcal{G}}[\Delta_{\bm{B}}, \Delta_{\bm{A}}] = J_{\mathcal{G}}^*(\bm{G})$, which is highly non-unique due to the $r^2$-dimensional kernel.
For $\mathcal{F}_t$ in diagonal Kronecker (Adafactor) form $\mathcal{H}_t$, the projection onto the expressive LoRA tangent subspace can be computed efficiently; all solutions induce the same weight update but trace different factor trajectories.
AdaPreLoRA selects the update minimizing the $\mathcal{H}_t$-weighted norm of the mismatch between the two factor contributions, yielding a closed-form expression involving the Adafactor preconditioner and appropriate projectors.
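The steps above can be illustrated with a small numerical sketch. It is not the paper's closed-form rule: it assumes single-step Adafactor row and column statistics (no moving average), materializes the preconditioner and Jacobian densely for a toy problem, and uses a generic minimum-norm least-squares solve in place of the imbalance-minimizing selection. The function names `adafactor_scales` and `weighted_lora_solve` are hypothetical.

```python
import numpy as np

def adafactor_scales(G, eps=1e-30):
    """Row scales p (m,) and column scales q (n,) with sqrt(V_hat)[i, j] = p[i] * q[j],
    where V_hat = outer(row_sums, col_sums) / row_sums.sum() as in Adafactor."""
    sq = G * G + eps
    row, col = sq.sum(axis=1), sq.sum(axis=0)
    return np.sqrt(row / row.sum()), np.sqrt(col)

def weighted_lora_solve(B, A, G, p, q):
    """One solution of J* H J [dB, dA] = J*(G), found by weighted least squares."""
    m, r = B.shape
    n = A.shape[1]
    H = np.outer(p, q)                      # diagonal of the preconditioner (demo only)
    J = np.hstack([np.kron(np.eye(m), A.T), np.kron(B, np.eye(n))])
    w = np.sqrt(H).ravel()
    D = G / H                               # preconditioned weight-space direction
    # Minimizes sum(H * (dB @ A + B @ dA - D)**2); lstsq returns the min-norm solution.
    delta, *_ = np.linalg.lstsq(w[:, None] * J, w * D.ravel(), rcond=None)
    return delta[: m * r].reshape(m, r), delta[m * r:].reshape(r, n)

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2                           # hypothetical toy sizes
B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))
G = rng.standard_normal((m, n))             # stand-in for the weight-space gradient

p, q = adafactor_scales(G)
dB, dA = weighted_lora_solve(B, A, G, p, q)

# Shift along the gauge kernel: a different factor update, the same weight update.
X = rng.standard_normal((r, r))
dB2, dA2 = dB - B @ X, dA + X @ A
print(np.allclose(dB @ A + B @ dA, dB2 @ A + B @ dA2))   # True
```

Both `(dB, dA)` and `(dB2, dA2)` satisfy the same singular system; a selection rule such as the weighted imbalance criterion described above is what picks a single member of this family.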

The resulting update matches the direction a full Adafactor step would take, projected into the tangent subspace allowed by LoRA, but avoids materializing or maintaining $\mathcal{O}(mn)$-size statistics. This is in contrast to LoRA-Pro, which either incurs full memory or abandons consistent preconditioning.
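To make the memory scaling concrete, the following back-of-the-envelope count compares optimizer-state sizes for a single weight matrix. The layer shape (4096 x 4096) and rank (8) are hypothetical values chosen for illustration, and the factor of 2 assumes an AdamW-style optimizer that keeps a first and a second moment per trainable entry; none of these numbers are taken from the paper.

```python
def optimizer_state_entries(m, n, r):
    """Number of stored scalars per weight matrix under three bookkeeping schemes."""
    return {
        "full_second_moment": m * n,            # O(mn) statistics on W
        "adafactor_row_col": m + n,             # O(m + n) factored statistics
        "adamw_on_factors": 2 * (m + n) * r,    # O((m + n) r) moments on B and A
    }

sizes = optimizer_state_entries(4096, 4096, 8)
print(sizes["full_second_moment"])   # 16777216
print(sizes["adafactor_row_col"])    # 8192
print(sizes["adamw_on_factors"])     # 131072
```

Under these assumptions the factored row and column statistics add only a small fraction on top of the state already held for the LoRA factors, while a full weight-space second moment is 128 times larger than that state.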

Empirical Analysis

AdaPreLoRA is extensively benchmarked against representative algorithms in both the AdamW and SGD update families on a suite of models and tasks:

LLMs: Fine-tuning GPT-2 (small/medium), Mistral-7B, and Qwen2-7B on E2E, DART, GLUE (RTE, CoLA, MRPC), ARC, and GSM8K.
Diffusion Personalization: Mix-of-Show framework with LoRA for tuning Stable Diffusion backbones on image generation benchmarks, where visual and metric-based quality are assessed.

Across all scenarios, with memory cost strictly matching or undercutting competitor methods:

AdaPreLoRA consistently matches or exceeds the best baselines, often by nontrivial margins in downstream metrics (e.g., BLEU, GLUE, CLIP, FID).
It is the only method in the class of $\mathcal{O}((m+n)r)$-memory optimizers to systematically close the gap to full-statistics (Adafactor/Adam/LoRA-Pro AdamW) variants, which demand $\mathcal{O}(mn)$ GPU memory at 7B-model scale and above.
Visual generations (see below) demonstrate significantly improved prompt and subject fidelity compared to alternative optimizers at fixed compute, supporting the practical utility in both language and vision tasks.

Figure 1: AdaPreLoRA generates visually coherent and correctly grounded images for text prompts, outperforming other AdamW-based LoRA optimizers in fidelity to character, action, and scene descriptions ("Harry Potter is walking near Mount Fuji").

Figure 2: Face and compositional quality for "Hermione Granger on the beach" is enhanced by AdaPreLoRA compared to other AdamW-based factor optimizers.

Figure 3: AdaPreLoRA yields superior output quality even in the SGD-based optimizer setting, both in prompt fidelity and identity preservation ("Harry Potter standing near the lake").

Figure 4: With SGD-based optimization, AdaPreLoRA's image-quality advantage is evident, particularly in facial synthesis and attribute correctness ("Hermione Granger wearing a brown shirt").

Figure 5: AdaPreLoRA with AdamW and a reduced scaling factor still outperforms the alternatives; images correctly capture accessories ("Harry Potter wearing a brown hat") and remain visually plausible.

Figure 6: AdaPreLoRA remains ahead of other AdamW-based optimizers for "Hermione Granger on the beach", consistently across scaling factors and scenes.

Theoretical and Practical Implications

AdaPreLoRA’s construction demonstrates that gradient-statistics-aware LoRA optimization is achievable within strict memory budgets via structured preconditioning and an appropriate affine selection rule. At the theoretical level, the paper’s framework exposes existing LoRA optimizers as special cases of one unifying structure, clarifying the geometric constraints imposed by the LoRA manifold and showing that principled preconditioning is possible even in the presence of factorization-gauge redundancy.

Practically, AdaPreLoRA breaks the empirical trade-off observed in previous work: it delivers the gains of adaptive, statistics-based optimizers without incurring prohibitive memory overhead. This unlocks high-performing PEFT at the 7B scale and above on single-node hardware, and establishes a stronger baseline for downstream tasks, including high-fidelity personalized generative modeling.

Future Directions

Potential extensions include:

Generalization to Mixture-of-Experts and QLoRA/quantized variants, necessitating local or dequantization-aware second-moment statistics.
Adapting the approach to transformer-based or cross-attention heavy diffusion architectures, where structural or conditional statistics may further improve performance.
Systematic analysis of other selection rules within affine solution sets, and further exploitation of geometric properties for manifold-constrained PEFT in other domains.

Conclusion

AdaPreLoRA advances the design of LoRA optimizers by integrating an efficient, consistent, and fully closed-form Adafactor-based preconditioning rule, achieving empirical and theoretical superiority at PEFT memory budgets. Its results across language and vision fine-tuning tasks consistently validate the approach and suggest it as the new standard for scalable LoRA-based adaptation in large generative models.

