
Bayesian LoRA via Mean-Field VI

Updated 10 February 2026
  • The paper's main contribution is introducing a Bayesian framework that applies mean-field variational inference to Low-Rank Adaptation (LoRA) for uncertainty-aware fine-tuning.
  • It employs a low-rank plus diagonal covariance structure with SWAG-inspired updates to achieve efficient calibration and reduced computational overhead.
  • Experimental results on GLUE tasks demonstrate that the method delivers strong predictive performance with energy savings while robustly quantifying uncertainty.

Bayesian LoRA by Mean-Field Variational Inference is an approach that unifies parameter-efficient fine-tuning and uncertainty quantification in LLMs by applying Bayesian principles to Low-Rank Adaptation (LoRA) using mean-field variational inference (VI). This methodology emphasizes low-dimensional parameterization and storage efficiency, enabling strong predictive performance, calibration, and energy savings in downstream adaptation scenarios.

1. Probabilistic Model Formulation

The foundation of Bayesian LoRA is the adaptation of a frozen pre-trained weight matrix $W^0 \in \mathbb{R}^{m\times n}$ through a low-rank update $\Delta W = A R B$, where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are fixed projection matrices constructed from a truncated SVD of $W^0$: $W^0 \approx U_r S_r V_r^\top$ with $A = U_r S_r$ and $B = V_r^\top$. The core trainable (and stochastic) component is the small inner matrix $R \in \mathbb{R}^{r \times r}$ in each LoRA module.
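As a minimal sketch of this construction (NumPy, illustrative shapes only; the function name is an assumption, not from the paper), the fixed projections and the low-rank update can be built as:

```python
import numpy as np

def build_lora_xs_projections(W0: np.ndarray, r: int):
    """Construct fixed projections A = U_r S_r and B = V_r^T from a
    truncated SVD of the frozen weight W0 (illustrative sketch)."""
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    A = U[:, :r] * S[:r]          # (m, r): columns scaled by singular values
    B = Vt[:r, :]                 # (r, n)
    return A, B

rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 12))      # frozen pre-trained weight
A, B = build_lora_xs_projections(W0, r=4)
R = np.zeros((4, 4))                    # trainable inner matrix, zero init
delta_W = A @ R @ B                     # low-rank update, rank <= r
W = W0 + delta_W
assert np.allclose(W, W0)               # zero init leaves W0 unchanged
```

Only `R` is trained; `A` and `B` stay frozen after the SVD, which is what keeps the stochastic parameter count at $r^2$ per module.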

The parameters of interest $\theta$ are defined as the vectorization of all $R$ matrices: $\theta = [\mathrm{vec}(R_1); \mathrm{vec}(R_2); \dots]$. A zero-mean isotropic Gaussian prior is imposed: $p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I)$. The likelihood over data $\mathcal{D} = \{(x_i, y_i)\}_i$ is given by the standard model output, e.g., the softmax likelihood for classification tasks: $p(\mathcal{D} \mid \theta) = \prod_i p(y_i \mid f(x_i; W^0 + \Delta W(\theta)))$, where $f(\cdot\,; W)$ denotes the network with weights $W$ at each adapted layer. The stochasticity in the weight matrix resides strictly in $R$, confining all learning and uncertainty modeling to a small low-dimensional subspace (Marszałek et al., 17 Feb 2025).

2. Variational Family and Posterior Structure

Mean-field VI is employed to approximate the intractable posterior $p(\theta \mid \mathcal{D})$. In the B-LoRA-XS framework, the variational distribution $q(\theta)$ is taken as a Gaussian with a low-rank plus diagonal covariance structure: $q(\theta) = \mathcal{N}(\mu, \Sigma)$ with $\Sigma = UU^\top + \mathrm{diag}(\nu)$, where $U \in \mathbb{R}^{n \times k}$ ($k \ll n$) captures a low-dimensional subspace of correlations and $\nu \in \mathbb{R}_+^n$ provides per-parameter variances. Unlike standard mean-field VI, which ignores all correlations, this construction retains aggregate global dependencies in a rank-$k$ subspace while maintaining storage cost linear in $n$.

The mean $\mu$ is tracked as a running average of $\theta$ during post-burn-in epochs, and $U$ is constructed from the latest $k$ deviation vectors (denoted $D = [d_1, \dots, d_k]$ with $d_i = \theta_i - \mu$), following the SWAG moment-matching prescription. The per-parameter variance $\sigma^2 = \mathbb{E}[(\theta - \mu)^2]$ is estimated in parallel (Marszałek et al., 17 Feb 2025).
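A sketch of these SWAG-style running statistics (class and method names are hypothetical; the 1/sqrt(2(k-1)) scaling follows the usual SWAG convention and is an assumption here):

```python
import numpy as np
from collections import deque

class SwagPosterior:
    """Tracks mu = E[theta], the second moment E[theta^2], and the last k
    deviation vectors used as columns of the low-rank factor (sketch)."""
    def __init__(self, dim: int, k: int):
        self.mu = np.zeros(dim)
        self.sq = np.zeros(dim)
        self.n = 0
        self.dev = deque(maxlen=k)   # last k deviations d_i = theta_i - mu

    def update(self, theta: np.ndarray):
        self.n += 1
        self.mu += (theta - self.mu) / self.n        # running mean
        self.sq += (theta**2 - self.sq) / self.n     # running second moment
        self.dev.append(theta - self.mu)

    def low_rank_factor(self):
        # Columns are deviation vectors; scaled as in SWAG's low-rank term.
        D = np.stack(self.dev, axis=1)
        return D / np.sqrt(2 * max(len(self.dev) - 1, 1))

    def diag_var(self):
        # sigma^2 = E[theta^2] - mu^2, clipped for numerical safety
        return np.clip(self.sq - self.mu**2, 1e-12, None)
```

The diagonal and low-rank pieces together define the $\Sigma = UU^\top + \mathrm{diag}(\nu)$ used at prediction time.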

This approach can be conceptually reconciled with a mean-field variational family whose marginal posterior for each $R$ parameter is (up to low-rank cross-covariances) an independent Gaussian, while cross-layer correlations are captured in $U$.

3. ELBO Objective and Low-Dimensional Projections

Although the empirical procedure in B-LoRA-XS utilizes SWAG, the underlying optimization closely corresponds to implicit maximization of the evidence lower bound (ELBO):

$$\operatorname{ELBO}(\mu, U, \nu) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_k),\, \eta \sim \mathcal{N}(0, I_n)}\Big[\sum_{i} \log p(y_i \mid x_i; \mu + U\epsilon + \mathrm{diag}(\sqrt{\nu})\,\eta)\Big] - \mathrm{KL}\big[\mathcal{N}(\mu, UU^\top + \mathrm{diag}(\nu)) \,\|\, \mathcal{N}(0, \tau^2 I)\big].$$

In practice, no explicit backpropagation is performed through the KL term; instead, the second moments are tracked in SWAG fashion. The fixed SVD projections ($A$, $B$) ensure that all Bayesian inference occurs in the subspace spanned by $R$, and the posterior covariance is constrained to be low rank plus diagonal at scale, minimizing parameter and computational overhead.
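If one did want to evaluate the KL term explicitly, the low-rank plus diagonal structure allows it in $\mathcal{O}(Nk^2)$ via the matrix determinant lemma, without ever forming the $N \times N$ covariance. A sketch (symbols as in the text; the function name is an assumption):

```python
import numpy as np

def kl_lowrank_diag_to_isotropic(mu, U, nu, tau2):
    """KL( N(mu, U U^T + diag(nu)) || N(0, tau^2 I) ), computed without
    materializing the N x N covariance (illustrative sketch)."""
    N, k = U.shape
    tr_Sigma = np.sum(U**2) + np.sum(nu)          # tr(UU^T) = ||U||_F^2
    # logdet(UU^T + diag(nu)) via the matrix determinant lemma:
    # logdet(diag(nu)) + logdet(I_k + U^T diag(1/nu) U)
    cap = np.eye(k) + U.T @ (U / nu[:, None])
    logdet_Sigma = np.sum(np.log(nu)) + np.linalg.slogdet(cap)[1]
    return 0.5 * ((tr_Sigma + mu @ mu) / tau2 - N
                  + N * np.log(tau2) - logdet_Sigma)
```

The small $k \times k$ "capacitance" matrix is the only dense factorization needed, which is what keeps the cost linear in $N$.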

The low-rank structure appears both in the weight update and in the construction of $\Sigma$. These two factors, explicit subspace parameterization of $\Delta W$ and low intrinsic dimensionality of posterior uncertainty, enable the approach to outperform or match alternatives at a fraction of the memory and compute footprint (Marszałek et al., 17 Feb 2025).

4. Training Algorithm and Gradient Estimation

The training protocol, as aligned with mean-field VI, supports unbiased gradient estimation for $q$'s parameters via the reparameterization trick: $\theta(\epsilon, \eta) = \mu + U\epsilon + \mathrm{diag}(\sqrt{\nu})\,\eta$, with $\epsilon \sim \mathcal{N}(0, I_k)$ and $\eta \sim \mathcal{N}(0, I_n)$. Gradients with respect to $\mu, U, \nu$ are in principle estimated by $\nabla_{\mu,U,\nu}\, \mathbb{E}_{\theta \sim q}[\log p(\mathcal{D} \mid \theta)] = \mathbb{E}_{\epsilon, \eta}[\nabla_{\mu,U,\nu} \log p(\mathcal{D} \mid \theta(\epsilon, \eta))]$. However, in B-LoRA-XS, SGD is used to update $\theta$ directly, and after a burn-in period, low-rank covariances for the posterior approximation are accumulated according to SWAG: second moments are constructed from the empirical deviations as model parameters traverse the solution landscape with a fixed learning rate. These statistics define the variational distribution used for Bayesian predictive inference (Marszałek et al., 17 Feb 2025).
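The reparameterized draw itself is a one-liner; a sketch assuming `mu`, `U`, `nu` come from a fitted posterior (the function name is an assumption):

```python
import numpy as np

def sample_theta(mu, U, nu, rng):
    """Draw theta ~ N(mu, U U^T + diag(nu)) via the reparameterization
    theta = mu + U eps + sqrt(nu) * eta (sketch)."""
    eps = rng.standard_normal(U.shape[1])   # eps ~ N(0, I_k)
    eta = rng.standard_normal(mu.shape[0])  # eta ~ N(0, I_n)
    return mu + U @ eps + np.sqrt(nu) * eta
```

Because the draw is a differentiable function of $(\mu, U, \nu)$, the same expression supports pathwise gradients in an autodiff framework, even though B-LoRA-XS sidesteps this by fitting the moments with SWAG.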

A high-level algorithmic loop is as follows:

for epoch in training_epochs:
    for batch in data:
        if past_burn_in:
            # Sample theta from the low-rank + diagonal Gaussian;
            # eps ~ N(0, I_k), eta ~ N(0, I_N)
            theta_sample = mu + U @ eps + sqrt(sq_avg - mu**2) * eta
        else:
            theta_sample = current_theta
        # Compute ΔW_l = A_l R_l B_l per module; forward pass, loss, backprop, SGD step
        # SWAG update: accumulate running mu and sq_avg; buffer the last k deviations

5. Covariance Structure and Parameter Efficiency

For a concatenated $\theta$ vector of dimension $N = \sum_l r_l^2$, a dense covariance would cost $\mathcal{O}(N^2)$ in storage. B-LoRA-XS's low-rank plus diagonal approach requires only $N(k + 2)$ parameters ($k \ll N$), achieving linear scaling. The induced covariance of the actual weight update $\mathrm{vec}(\Delta W)$ is

$$\mathrm{Cov}(\mathrm{vec}(\Delta W)) = (B^\top \otimes A)\, \Sigma_R\, (B^\top \otimes A)^\top,$$

where $\Sigma_R$ is the covariance in $R$-space. Given that $R$ is $r \times r$ and $\Sigma_R$ is rank $k$, the covariance of $\Delta W$ is low rank with at most $kr$ nonvanishing directions. This property ensures not only memory efficiency but also that almost all uncertainty mass is confined to a tractable, easily sampled region of parameter space (Marszałek et al., 17 Feb 2025).
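The identity behind this covariance push-forward, $\mathrm{vec}(ARB) = (B^\top \otimes A)\,\mathrm{vec}(R)$ for column-major vectorization, can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 5, 3, 2
A = rng.standard_normal((m, r))
B = rng.standard_normal((r, n))
R = rng.standard_normal((r, r))

# Column-major (Fortran-order) vec matches vec(ARB) = (B^T kron A) vec(R)
lhs = (A @ R @ B).flatten(order="F")
rhs = np.kron(B.T, A) @ R.flatten(order="F")
assert np.allclose(lhs, rhs)
```

Any Gaussian on $\mathrm{vec}(R)$ with covariance $\Sigma_R$ therefore induces the stated covariance on $\mathrm{vec}(\Delta W)$ through this fixed linear map.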

6. Practical Settings, Empirical Results, and Comparative Context

Empirical validation in (Marszałek et al., 17 Feb 2025) is conducted on four GLUE tasks (RTE, MRPC, CoLA, SST-2) using RoBERTa-large as the backbone, with B-LoRA-XS adapters injected into all major transformer projection submodules (Query, Value, Attention-Output, Output-FC). Typical parameter settings include adaptation ranks $r \in \{2, 8, 16, 25\}$, covariance rank $k \in \{2, 5, 10, \ldots\}$, AdamW optimization, 10–25 burn-in epochs, and inference via $S = 15$ posterior samples.
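Predictive inference with $S$ posterior samples is a straightforward Bayesian model average; a sketch in which `model_logits` is a hypothetical stand-in for the adapted network's forward pass:

```python
import numpy as np

def predict_bma(model_logits, mu, U, nu, x, S=15, rng=None):
    """Average softmax probabilities over S posterior samples of theta
    (sketch; model_logits(x, theta) stands in for the adapted network)."""
    rng = rng or np.random.default_rng()
    k, N = U.shape[1], mu.shape[0]
    probs = 0.0
    for _ in range(S):
        # Reparameterized draw from the low-rank + diagonal posterior
        theta = mu + U @ rng.standard_normal(k) + np.sqrt(nu) * rng.standard_normal(N)
        z = model_logits(x, theta)
        z = z - z.max()                       # numerically stable softmax
        probs = probs + np.exp(z) / np.exp(z).sum()
    return probs / S
```

Averaging probabilities (rather than logits) over samples is what yields the calibration gains reported in ECE and NLL.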

Key observed outcomes:

  • B-LoRA-XS matches or exceeds SWAG-LoRA on calibration (ECE) and NLL, with 5–15× fewer Bayesian parameters.
  • The SVD-based low-rank projection effectively disentangles adapted directions, enabling strong uncertainty quantification with $k$ as low as 2–5.
  • Training is more stable and less sensitive to random seed than SWAG-LoRA.

A related approach, Bayesian-LoRA (Meo et al., 2024), incorporates mean-field VI for both rank and quantization selection via discrete Bayesian gates with Gumbel-sigmoid relaxations, thereby offering flexible per-layer adaptation and further reducing bit-operations required. However, in contrast to B-LoRA-XS, (Meo et al., 2024) does not focus on uncertainty quantification metrics such as ECE or NLL, but rather on adaptive rank/quantization and compute efficiency.

7. Extensions to Quantization and Rank Adaptation

The Bayesian-LoRA framework (Meo et al., 2024) extends mean-field VI in LoRA to include not only Gaussian uncertainty over continuous low-rank factors but also discrete latent variables governing quantization levels and effective ranks. This is achieved by:

  • Factorizing the variational posterior over low-rank matrices ($A$, $B$), quantization gates ($z$), and rank-selection gates ($g$).
  • Employing continuous relaxations (straight-through Gumbel-sigmoid) for the Bernoulli gates, enabling end-to-end training with reparameterized gradients.
  • Penalizing KL divergence of quantization/rank gates in the objective, balancing compression and accuracy.
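The Gumbel-sigmoid relaxation used for the Bernoulli gates can be sketched as follows (forward pass only; the straight-through estimator would route gradients through the soft value in an autodiff framework; the function name is an assumption):

```python
import numpy as np

def gumbel_sigmoid_sample(logits, temperature, rng):
    """Relaxed Bernoulli sample via the Gumbel-sigmoid trick (sketch).
    Logistic noise L = log(u) - log(1-u) makes the soft gate a
    reparameterized sample; hard thresholding gives the value a
    straight-through estimator would use in the forward pass."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log1p(-u)
    soft = 1.0 / (1.0 + np.exp(-(logits + noise) / temperature))
    hard = (soft > 0.5).astype(float)   # forward: hard gate; backward: grad of soft
    return hard, soft
```

Lowering the temperature sharpens `soft` toward 0/1, trading gradient bias against variance when learning per-layer rank and quantization gates.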

Empirical results in (Meo et al., 2024) show that Bayesian-LoRA achieves competitive or superior task performance to baseline LoRA variants at markedly lower memory and operation counts, with learned adaptations tailored per layer and module. A plausible implication is that the Bayesian VI perspective enables simultaneous adaptation, selective precision allocation, and energy-aware deployment in large-scale LLM fine-tuning.


For technical readers seeking full Bayesian fine-tuning recipes, the B-LoRA-XS framework rigorously specifies the probabilistic model, variational approximation, training regimen, and inference algorithm, realizing both uncertainty quantification and efficiency through judicious low-dimensional projection and mean-field modeling (Marszałek et al., 17 Feb 2025).
