Bernoulli-LoRA: Randomized Low-Rank Adaptation

Updated 10 August 2025
  • Bernoulli-LoRA is a randomized, parameter-efficient low-rank adaptation framework that leverages Bernoulli-based updates to fine-tune large-scale models.
  • It introduces a probabilistic mechanism where each training step selectively updates one of the low-rank factors, enhancing both analytical tractability and efficiency.
  • Empirical and theoretical analyses confirm that Bernoulli-LoRA achieves competitive convergence rates while optimizing fewer parameters in expectation across diverse settings.

Bernoulli-LoRA is a randomized and theoretically grounded framework for parameter-efficient adaptation of large-scale models via low-rank decomposition. It builds upon the established paradigm of Low-Rank Adaptation (LoRA), in which fine-tuning is accomplished by updating model weights through the addition of a low-rank correction, avoiding the need to store or optimize the full parameter set. The novelty of Bernoulli-LoRA lies in its probabilistic mechanism for factor updates—at each iteration, a Bernoulli random variable determines whether to adapt the left or right low-rank matrix, yielding a unified and analytically tractable family that includes and extends existing LoRA strategies. Explicit convergence guarantees are established for a variety of variants, including deterministic and stochastic settings (Bernoulli-LoRA-GD, -SGD), variance-reduced estimators (Bernoulli-LoRA-PAGE, -MVR), and federated/distributed scenarios (Bernoulli-LoRA-QGD, -MARINA, -EF21). Empirical validation demonstrates that this framework maintains or improves adaptation efficiency and accuracy while providing analytical tools for understanding and optimizing the fine-tuning process (Sokolov et al., 5 Aug 2025).

1. Unified Theoretical Framework

Bernoulli-LoRA generalizes the LoRA framework by introducing a randomization mechanism into the adaptation process. Standard LoRA expresses an updated weight matrix $W$ as a low-rank update over a pre-trained $W^0$:

$$W = W^0 + \frac{\alpha}{r} B A$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the low-rank factor matrices and $\alpha$ is a scaling parameter. In canonical LoRA, either $A$ or $B$ is optimized while the other is fixed, or both are updated in an alternating fashion as in RAC-LoRA (Malinovsky et al., 10 Oct 2024).
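The following minimal NumPy sketch illustrates this reparameterization; the shapes, scaling, and initializations are illustrative assumptions rather than values from the paper.

```python
import numpy as np

# Minimal sketch of the LoRA reparameterization W = W0 + (alpha/r) * B @ A.
# Shapes and initializations are illustrative (not taken from the paper).
d, k, r, alpha = 768, 768, 8, 16.0
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k))          # frozen pre-trained weight matrix
B = np.zeros((d, r))                      # trainable left factor (zero-initialized)
A = 0.01 * rng.standard_normal((r, k))    # trainable right factor

W = W0 + (alpha / r) * (B @ A)            # effective weight used in the forward pass
```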

Bernoulli-LoRA, in contrast, introduces a Bernoulli random variable $c^t \sim \operatorname{Be}(p)$ at each training step $t$, determining whether the "left-sketch" (optimizing $A$ with $B$ fixed) or "right-sketch" (optimizing $B$ with $A$ fixed) update is performed. Concretely:

  • With probability $p$, $B_s^t$ is fixed (sampled from a predetermined distribution) and $A$ is optimized.
  • With probability $1-p$, $A_s^t$ is fixed and $B$ is optimized.

Each update can be written as a projected gradient step:

$$W^{(t+1)} = W^{(t)} - \gamma \hat{G}^{(t)}$$

where

$$\hat{G}^{(t)} = \begin{cases} H^{B^{(t)}} \nabla f(W^{(t)}) & \text{with probability } p \\ \nabla f(W^{(t)}) H^{A^{(t)}} & \text{with probability } 1-p \end{cases}$$

with $H^{B^{(t)}} = B_s^t (B_s^t)^\dagger$ and $H^{A^{(t)}} = (A_s^t)^\dagger A_s^t$ (with $\dagger$ denoting the Moore–Penrose pseudoinverse), i.e., orthogonal projections onto the column space of $B_s^t$ and the row space of $A_s^t$, respectively. In expectation, if $A_s$ and $B_s$ are sampled i.i.d. from an appropriate distribution (e.g., Gaussian), the projections are contractive:

$$\mathbb{E}[H] = \frac{r}{n} I_n$$

where $n$ is the dimension on which the projection acts ($d$ for $H^B$, $k$ for $H^A$), establishing a lower bound on the expected progress made in each descent direction.
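A compact sketch of a single Bernoulli-LoRA-GD-style step under these definitions is shown below. It assumes Gaussian sketches and access to the full matrix gradient; the function name, the toy quadratic objective, and all constants are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bernoulli_lora_gd_step(W, grad_f, p, r, gamma, rng):
    """One Bernoulli-LoRA-GD-style projected gradient step (illustrative sketch).

    With probability p a fresh Gaussian left factor B_s is fixed and the gradient
    is projected onto its column space; otherwise a fresh right factor A_s is
    fixed and the gradient is projected onto its row space.
    """
    d, k = W.shape
    G = grad_f(W)                                   # full gradient df/dW, shape (d, k)
    if rng.random() < p:                            # "left sketch": B_s fixed, A optimized
        B_s = rng.standard_normal((d, r))
        H_B = B_s @ np.linalg.pinv(B_s)             # projection onto col(B_s), shape (d, d)
        G_hat = H_B @ G
    else:                                           # "right sketch": A_s fixed, B optimized
        A_s = rng.standard_normal((r, k))
        H_A = np.linalg.pinv(A_s) @ A_s             # projection onto row(A_s), shape (k, k)
        G_hat = G @ H_A
    return W - gamma * G_hat

# Toy usage: minimize f(W) = 0.5 * ||W - W_star||_F^2, whose gradient is W - W_star.
rng = np.random.default_rng(1)
W_star = rng.standard_normal((20, 12))
W = np.zeros((20, 12))
for _ in range(200):
    W = bernoulli_lora_gd_step(W, lambda M: M - W_star, p=0.5, r=4, gamma=1.0, rng=rng)
print(np.linalg.norm(W - W_star))   # shrinks toward 0 as random projections accumulate
```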

2. Algorithmic Variants and Convergence Guarantees

The Bernoulli-LoRA framework encompasses several algorithmic variants, each defined by the choice of gradient estimator and projection scheme. Principal variants analyzed in (Sokolov et al., 5 Aug 2025) include:

| Variant | Gradient Estimator | Notable Features |
|---|---|---|
| Bernoulli-LoRA-GD | Full-batch gradient | Deterministic; sublinear/linear rate under PL |
| Bernoulli-LoRA-SGD | Unbiased stochastic gradient | Mini-batch, stochastic; error controlled by stepsize |
| Bernoulli-LoRA-PAGE | PAGE estimator | Periodic full batch, variance reduction (sketched below) |
| Bernoulli-LoRA-MVR | Momentum-based variance reduction | MVR on gradient estimators |
| Fed-Bernoulli-LoRA-QGD/MARINA/EF21 | Quantized or compressed federated gradients | Applicable to distributed/federated learning |
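The PAGE row above refers to the generic probabilistic gradient estimator from the variance-reduction literature. The sketch below shows one common form of that recursion; it is an assumption about how such an estimator could feed into the Bernoulli projection, not the paper's exact implementation, and all names are illustrative.

```python
def page_gradient(g_prev, W_new, W_prev, full_grad, batch_grad, batch, q, rng):
    """Generic PAGE-style gradient estimator (illustrative sketch).

    With (typically small) probability q the full-batch gradient is recomputed;
    otherwise the previous estimate is reused and corrected with a cheap
    minibatch gradient difference evaluated at the new and old iterates.
    """
    if rng.random() < q:
        return full_grad(W_new)                      # periodic full-batch refresh
    # The same minibatch is used at both iterates so the correction stays unbiased.
    return g_prev + batch_grad(W_new, batch) - batch_grad(W_prev, batch)
```

In a Bernoulli-LoRA-PAGE-style loop, the output of such an estimator would replace $\nabla f(W^{(t)})$ before the Bernoulli projection is applied.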

The theoretical convergence rates are made explicit. For smooth non-convex $f(W)$, the expected squared gradient norm at a randomly chosen iterate $\tilde{W}^T$ obeys:

$$\mathbb{E}\left[\|\nabla f(\tilde{W}^T)\|^2\right] \leq \frac{2 \Delta^0}{\gamma \lambda_{\min}^p T}$$

with $\lambda_{\min}^p = p\,\lambda_{\min}^{(H^B)} + (1-p)\,\lambda_{\min}^{(H^A)}$ and $\Delta^0 = f(W^0) - f^*$. For stochastic settings using SGD, an additional variance term appears:

$$\mathbb{E}\left[\|\nabla f(\tilde{W}^T)\|^2\right] \leq \frac{6\Delta^0}{\gamma \lambda_{\min}^p T} + \gamma L C_1 \frac{\lambda_{\max}^p}{\lambda_{\min}^p}$$

For functions satisfying the Polyak–Łojasiewicz (PL) condition $\frac{1}{2}\|\nabla f(W)\|^2 \geq \mu \left(f(W) - f^*\right)$, geometric (linear) convergence is established:

$$f(W^T) - f^* \leq \left[1 - \gamma \mu \lambda_{\min}^p\right]^T \left(f(W^0) - f^*\right)$$

Empirical studies confirm that these rates are reflected in practice, and that variance-reduced estimators such as PAGE and MVR deliver improved performance, especially in noisy or federated settings.
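As a concrete illustration of the PL-case rate, the toy calculation below plugs hypothetical constants into the geometric bound above, approximating $\lambda_{\min}^p$ by the expectation-level value $p\,r/d + (1-p)\,r/k$ suggested by $\mathbb{E}[H] = \frac{r}{n} I_n$. All numbers are assumptions chosen for illustration, not values from the paper.

```python
import math

# Hypothetical constants (assumptions for illustration only).
L, mu = 10.0, 0.1                 # smoothness and PL constants
d, k, r, p = 512, 512, 8, 0.5     # weight shape, sketch rank, Bernoulli parameter
gamma = 1.0 / L                   # a conservative stepsize choice
Delta0, eps = 1.0, 1e-4           # initial gap f(W^0) - f* and target accuracy

lam_min_p = p * (r / d) + (1 - p) * (r / k)     # heuristic from E[H] = (r/n) I_n
rho = 1.0 - gamma * mu * lam_min_p              # per-step contraction factor
T = math.ceil(math.log(eps / Delta0) / math.log(rho))
print(f"contraction per step: {rho:.6f}; iterations to reach eps: {T}")
```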

3. Optimization Analysis and Spectral Properties

The analysis in (Sokolov et al., 5 Aug 2025) is based on the spectral properties of the projection matrices $H^{A^{(t)}}$ and $H^{B^{(t)}}$. The efficiency of the update and the contraction property in the descent direction depend on:

  • The distribution used to sample $A_s^t$ and $B_s^t$
  • The sketch rank $r$ (with the minimal eigenvalue scaling as $r/n$; see the numerical check after this list)
  • The Bernoulli parameter $p$
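A quick Monte Carlo check of the expectation property $\mathbb{E}[H] = \frac{r}{n} I_n$ for Gaussian sketches is sketched below; this is an illustrative script, not code from the paper.

```python
import numpy as np

# Monte Carlo check that Gaussian sketches give E[B_s B_s^+] close to (r/n) I_n.
rng = np.random.default_rng(0)
n, r, trials = 64, 8, 2000
avg = np.zeros((n, n))
for _ in range(trials):
    B_s = rng.standard_normal((n, r))
    avg += B_s @ np.linalg.pinv(B_s)     # orthogonal projection onto col(B_s)
avg /= trials
print(np.abs(avg - (r / n) * np.eye(n)).max())   # small deviation, shrinking with trials
```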

Different settings—smooth non-convex, convex nonsmooth, and PL—are covered:

  • For $L$-smooth $f$, descent is controlled by the expected projection.
  • For convex $\ell_0$-Lipschitz $f$, subgradient methods maintain $O(1/\sqrt{T})$ convergence when using constant stepsizes.
  • The optimal stepsize is problem-dependent but can be fixed or chosen adaptively (Polyak-type).

Selection of $p$ and of the sketch distribution impacts both empirical performance and theoretical rates; e.g., balancing updates between factors or emphasizing one factor may be advantageous depending on the model or dataset.

4. Empirical Evaluation

Bernoulli-LoRA is validated in several experimental scenarios:

  • On a synthetic linear regression problem with non-convex regularization, Bernoulli-LoRA-PAGE shows accelerated convergence and escapes stagnation relative to naive SGD.
  • For MNIST, a multilayer perceptron is pre-trained on digits $0$–$4$ and adapted to classify digits $5$–$9$ via LoRA derivatives. Bernoulli-LoRA delivers accuracy comparable to RAC-LoRA while, due to probabilistic factor selection, training fewer parameters in expectation: $p \cdot |A| + (1-p) \cdot |B|$ (illustrated in the sketch after this list).
  • Experimental results are robust across hyperparameter settings, batch sizes, and randomness in sketching, corroborating the theoretical predictions.
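A back-of-the-envelope calculation of the expected number of trained parameters per step follows; the layer shapes are illustrative and do not reproduce the paper's exact MLP architecture.

```python
# Expected trainable parameters per step: p * |A| + (1 - p) * |B|.
d, k, r, p = 784, 10, 4, 0.5          # illustrative layer shape, rank, and probability
params_A = r * k                       # A is r x k, trained with probability p
params_B = d * r                       # B is d x r, trained with probability 1 - p
expected = p * params_A + (1 - p) * params_B
print(expected)                        # 0.5*40 + 0.5*3136 = 1588.0
```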

These findings demonstrate Bernoulli-LoRA's practical viability and its ability to deliver efficient fine-tuning without sacrificing accuracy or adaptability, while also requiring, on average, fewer parameters to be optimized at each iteration.

5. Relation to Existing Low-Rank Adaptation and PEFT Methods

Bernoulli-LoRA is positioned as a unified and flexible generalization of prior LoRA approaches:

  • Deterministic LoRA: Fixes one factor and optimizes the other in all steps; recovered as the special cases $p = 1$ or $p = 0$.
  • Alternating/RAC-LoRA: Alternates the factor to be optimized in a fixed schedule (Malinovsky et al., 10 Oct 2024).
  • PEFT extensions: Other approaches such as COLA use communication-efficient compression or error feedback, and Bernoulli-LoRA has variants (e.g., QGD, EF21) that incorporate these mechanisms, allowing principled application to federated/distributed settings.

Advantages of Bernoulli-LoRA include:

  • Algorithmic flexibility: The Bernoulli parameter $p$ allows interpolation between existing adaptation schemes.
  • Theoretical tractability: Unified analysis admits direct bounds on convergence rate and parameter efficiency.
  • Modular variance reduction and distributed extensions: PAGE, MVR, and federated variants are fully integrated, each with explicit convergence results.

However, randomness introduces an additional hyperparameter and can lead to training variability if not tuned appropriately. The spectral properties of sampled matrices impact learning dynamics, and the approach relies on the efficacy of sketches in capturing model adaptations.

6. Summary and Significance

Bernoulli-LoRA provides a rigorous theoretical framework for randomized low-rank adaptation in large-scale models, offering a unifying perspective on LoRA, RAC-LoRA, and related PEFT techniques. It achieves parameter efficiency by stochastically updating only a subset of low-rank factors at each step, with convergence properties controlled by the spectral characteristics of sketching matrices and the Bernoulli update probability. Analytical results encompass both classical and modern optimization methods (GD, SGD, variance reduction, federated learning), extending LoRA’s applicability and interpretability in practical scenarios. Empirical benchmarks validate the theoretical claims and demonstrate that Bernoulli-LoRA matches or exceeds state-of-the-art PEFT techniques with enhanced flexibility and a lower expected parameter footprint (Sokolov et al., 5 Aug 2025).