- The paper introduces AdaPreLoRA, an innovative low-rank adaptation optimizer that integrates Adafactor-based preconditioning to reduce memory usage.
- It employs a closed-form update rule that minimizes factor imbalance, matching full-statistic preconditioning performance at a fraction of the memory cost.
- Empirical benchmarks on language models and diffusion personalization show that AdaPreLoRA outperforms existing PEFT methods in both output quality and efficiency.
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation
Introduction and Motivation
Parameter-efficient fine-tuning (PEFT) methods are crucial for adapting large pre-trained networks, especially LLMs and diffusion models, under strict memory and compute constraints. Low-Rank Adaptation (LoRA) is the dominant PEFT paradigm, decomposing the update to each weight matrix W into a product of low-rank factors (BA) to reduce trainable parameter and optimizer state costs from O(mn) to O((m+n)r). However, optimizing in this factor-space introduces a nontrivial geometric challenge: the generator Jacobian JG​, mapping factor updates to weight-space, is always rank-deficient due to inherent gauge redundancy, leading to a singular induced preconditioner JG∗​Ft​JG​ for any nontrivial Ft​ built from weight-space gradient statistics. As a result, constructing optimizers for LoRA that incorporate adaptive, gradient-statistics-aware preconditioning but remain within LoRA's memory budget is nontrivial.
This work provides a unified framework for classifying LoRA optimizers, parameterized by (i) the choice of weight-space preconditioner and (ii) the rule for selecting a particular update within the affine solution set allowed by the rank-deficiency. Within this framework, previous optimizers either ignore gradient-statistical structure for cheap factor updates (vanilla LoRA, Imbalance-Reg, etc.) or require O(mn) memory (full Adafactor/Adam, Shampoo, K-FAC, LoRA-Pro AdamW).
The AdaPreLoRA Algorithm
AdaPreLoRA occupies an unexploited point in this design space: it employs the Adafactor diagonal Kronecker preconditioner (i.e., the cheapest possible gradient-statistics-based structure in O(m+n) memory) and selects, within the solution set to the preconditioned least-squares system, the factor update minimizing an imbalance criterion in the induced norm. The derivation proceeds as follows:
- The desired update solves the system JG∗​Ft​JG​[ΔB​,ΔA​]=JG∗​(G), which is highly non-unique due to the BA0-dimensional kernel.
- For BA1 in diagonal Kronecker (Adafactor) form BA2, the projection to the expressive LoRA tangent subspace can be computed efficiently; all solutions induce the same weight update but trace different (factor) trajectories.
- AdaPreLoRA selects the update minimizing the BA3-weighted norm of the mismatch between the two factor contributions, yielding a closed-form expression involving the Adafactor preconditioner and appropriate projectors.
The resulting update matches the direction a full Adafactor step would take, projected into the tangent subspace allowed by LoRA, but avoids materializing or maintaining BA4-size statistics. This is in contrast to LoRA-Pro, which either incurs full memory or abandons consistent preconditioning.
Empirical Analysis
AdaPreLoRA is extensively benchmarked against representative algorithms in both the AdamW and SGD update families on a suite of models and tasks:
- LLMs: Fine-tuning GPT-2 (small/medium), Mistral-7B, and Qwen2-7B on E2E, DART, GLUE (RTE, CoLA, MRPC), ARC, and GSM8K.
- Diffusion Personalization: Mix-of-Show framework with LoRA for tuning Stable Diffusion backbones on image generation benchmarks, where visual and metric-based quality are assessed.
Across all scenarios, with memory cost strictly matching or undercutting competitor methods:
- AdaPreLoRA consistently matches or exceeds the best baselines, often by nontrivial margins in downstream metrics (e.g., BLEU, GLUE, CLIP, FID).
- It is the only method in the class of BA5 optimizers to systematically close the gap to full-statistics (Adafactor/Adam/LoRA-Pro AdamW) variants, which demand BA6 GPU memory at 7B-model scale and above.
- Visual generations (see below) demonstrate significantly improved prompt and subject fidelity compared to alternative optimizers at fixed compute, supporting the practical utility in both language and vision tasks.



















Figure 1: AdaPreLoRA generates visually coherent and correctly grounded images for text prompts, outperforming other AdamW-based LoRA optimizers in fidelity to character, action, and scene descriptions ("Harry Potter is walking near Mount Fuji").


















Figure 2: Face and compositional quality for "Hermione Granger on the beach" is enhanced by AdaPreLoRA compared to other AdamW-based factor optimizers.


















Figure 3: AdaPreLoRA yields superior output quality even in the SGD-based optimizer setting, both in prompt fidelity and identity preservation ("Harry Potter standing near the lake").


















Figure 4: Prioritized image quality in AdaPreLoRA for SGD-based optimization is evident, particularly in facial synthesis and attribute correctness ("Hermione Granger wearing a brown shirt").


















Figure 5: AdaPreLoRA with AdamW and reduced scaling factor (BA7) still outperforms; images correctly capture accessories ("Harry Potter wearing a brown hat") and remain visually plausible.


















Figure 6: Robust superiority over other AdamW-based optimizers by AdaPreLoRA for "Hermione Granger on the beach", consistent across scaling factors and scenes.
Theoretical and Practical Implications
AdaPreLoRA’s construction demonstrates that gradient-statistics-aware LoRA optimization is achievable in strict memory budgets via structured preconditioning and an appropriate affine selection rule. At the theoretical level, the paper’s framework exposes the unifying structure of existing LoRA optimizers as special cases, clarifying the geometric constraints imposed by the LoRA manifold and showing that principled preconditioning is possible even in the presence of factorization-gauge redundancy.
Practically, AdaPreLoRA breaks the empirical trade-off observed in previous work: it delivers the increases of adaptive statistics-based optimizers without incurring prohibitive memory overhead. This unlocks high-performing PEFT at the 7B-scale and above on single-node hardware, and establishes a stronger baseline for downstream tasks, including in high-fidelity personalized generative modeling.
Future Directions
Potential extensions include:
- Generalization to Mixture-of-Experts and QLoRA/quantized variants, necessitating local or dequantization-aware second-moment statistics.
- Adapting the approach to transformer-based or cross-attention heavy diffusion architectures, where structural or conditional statistics may further improve performance.
- Systematic analysis of other selection rules within affine solution sets, and further exploitation of geometric properties for manifold-constrained PEFT in other domains.
Conclusion
AdaPreLoRA advances the design of LoRA optimizers by integrating an efficient, consistent, and fully closed-form Adafactor-based preconditioning rule, achieving empirical and theoretical superiority at PEFT memory budgets. Its results across language and vision fine-tuning tasks consistently validate the approach and suggest it as the new standard for scalable LoRA-based adaptation in large generative models.
(2605.08734)