
Laplace-LoRA: Bayesian Uncertainty in LoRA

Updated 6 February 2026
  • Laplace-LoRA is a Bayesian procedure that applies Laplace approximation to estimate a Gaussian posterior over low-rank adapter parameters in LLMs.
  • It leverages efficient low-dimensional inference and Kronecker-factored representations to significantly reduce overconfidence and calibration error on small or out-of-distribution datasets.
  • The method combines MAP optimization and structured posterior estimation to deliver calibrated predictive distributions with only modest computational overhead.

Laplace-LoRA is a post-hoc Bayesian procedure for uncertainty quantification in low-rank adaptation (LoRA) of LLMs. It leverages the Laplace approximation to estimate a Gaussian posterior over LoRA adapter parameters, providing robust calibration for fine-tuned LLMs with minimal computational overhead. Laplace-LoRA combines efficient low-dimensional Bayesian inference with Kronecker-factored representations, significantly reducing overconfidence and expected calibration error (ECE), particularly when models are adapted on small or out-of-distribution datasets (Yang et al., 2023).

1. LoRA Parametrization and Motivation for Bayesian Uncertainty

LLMs contain large affine weight matrices $W_0 \in \mathbb{R}^{d_{\rm out} \times d_{\rm in}}$. LoRA introduces an efficient adaptation by learning an additive low-rank update to these matrices:

$$\mathbf{h} = W_0\,a + \Delta W\,a, \qquad \Delta W = B\,A,$$

where $A \in \mathbb{R}^{r \times d_{\rm in}}$, $B \in \mathbb{R}^{d_{\rm out} \times r}$, with $r \ll \min(d_{\rm in}, d_{\rm out})$. This yields a dramatic reduction in the number of trainable parameters—orders of magnitude fewer than full-weight fine-tuning—making large-scale adaptation feasible.
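The parametrization above can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation; the dimensions are illustrative, and $B$ is zero-initialized so that $\Delta W = 0$ at the start of fine-tuning, as is standard for LoRA.

```python
import numpy as np

# Toy sketch of a LoRA-adapted affine layer (dimensions are illustrative).
d_in, d_out, r = 512, 512, 8

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so DeltaW starts at 0

def lora_forward(a):
    # h = W0 a + B (A a): the additive low-rank update, applied without
    # ever materializing the d_out x d_in matrix DeltaW = B A
    return W0 @ a + B @ (A @ a)

full_params = d_out * d_in              # 262,144 entries in W0
lora_params = r * d_in + d_out * r      # 8,192 trainable entries (about 3% here)

a_in = rng.normal(size=d_in)
h = lora_forward(a_in)
```

At realistic LLM dimensions ($d \approx 4096$, $r = 8$) the trainable fraction shrinks further, which is what makes the later Bayesian treatment of only the adapter parameters tractable.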

However, models fine-tuned with standard LoRA on small datasets or OOD inputs exhibit significant overconfidence, impairing reliability and calibration. Bayesian methods are well suited to mitigate this by quantifying posterior uncertainty, though full posterior inference is infeasible in the full network. Focusing the Bayesian treatment on LoRA parameters preserves computational scalability while providing nontrivial uncertainty estimates (Yang et al., 2023).

2. Probabilistic Modeling and Posterior Approximation

Let $\theta \in \mathbb{R}^P$ denote all concatenated LoRA parameters ($P$ is typically a few million). Given data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, the likelihood is categorical, $p(y \mid x, \theta) = \mathrm{Cat}\big(y;\, \operatorname{softmax}(f_\theta(x))\big)$, where $f_\theta(x) \in \mathbb{R}^C$ are the logits. The prior is an isotropic Gaussian $p(\theta) = \mathcal{N}(\theta;\, 0,\, \lambda^{-1} I)$ with tunable prior precision $\lambda$.

The associated posterior,

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta) = \prod_{n=1}^N p(y_n \mid x_n, \theta)\; \mathcal{N}(\theta;\, 0,\, \lambda^{-1} I),$$

is intractable to evaluate directly, motivating a Laplace approximation: a Gaussian centered at the MAP estimate, with precision given by the curvature of the log posterior there (Yang et al., 2023).

3. Derivation of the Laplace Approximation

3.1 MAP Optimization

The MAP estimate $\theta_{\rm MAP}$ is obtained via standard LoRA fine-tuning, maximizing the regularized log-likelihood

$$L(\theta) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) = \sum_{n=1}^N \log p(y_n \mid x_n, \theta) - \tfrac{\lambda}{2}\,\|\theta\|^2 + \text{const}.$$

Optimizers such as AdamW are typically employed.
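The objective above can be written out concretely. This is a minimal sketch on a toy softmax classifier, with `theta` standing in for the concatenated LoRA parameters; in practice AdamW minimizes this quantity over the adapter weights of the full network.

```python
import numpy as np

# Sketch of the (negative) MAP objective: NLL plus the Gaussian log-prior penalty.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neg_map_objective(theta, X, y, lam):
    logits = X @ theta                               # (N, C) logits f_theta(x)
    log_probs = np.log(softmax(logits))
    log_lik = log_probs[np.arange(len(y)), y].sum()  # sum_n log p(y_n | x_n, theta)
    log_prior = -0.5 * lam * (theta ** 2).sum()      # -(lambda/2) ||theta||^2
    return -(log_lik + log_prior)                    # minimized by the optimizer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
theta = rng.normal(size=(4, 3))
loss = neg_map_objective(theta, X, y, lam=1.0)
```

Note that the weight-decay term of AdamW plays the role of the $\tfrac{\lambda}{2}\|\theta\|^2$ prior penalty, so standard fine-tuning already produces a MAP estimate under this model.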

3.2 Local Gaussian Approximation

A second-order Taylor expansion of the log-posterior around $\theta_{\rm MAP}$ yields a quadratic form whose curvature is the negative Hessian $H$. In practice, $H$ is replaced by the generalized Gauss–Newton (GGN) or Fisher information matrix to guarantee positive-definiteness:

$$F = \sum_{n=1}^N \mathbb{E}_{y \sim p(y \mid x_n,\, \theta_{\rm MAP})} \left[ \nabla_\theta \log p(y \mid x_n, \theta)\, \big(\nabla_\theta \log p(y \mid x_n, \theta)\big)^{\top} \right].$$

The approximate Gaussian posterior becomes

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\big(\theta;\, \theta_{\rm MAP},\, H^{-1}\big), \qquad H \approx F + \lambda I.$$

A block-diagonal Kronecker-factored structure for each adapter layer maintains tractability (Yang et al., 2023).
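The Fisher expectation and the resulting posterior precision can be made concrete on a toy softmax classifier, where the per-example gradient has the closed form $\nabla_\theta \log p(y \mid x, \theta) = x\,(e_y - p)^\top$. This is an illustrative dense computation, not the Kronecker-factored one used at LLM scale; `theta_map` stands in for the concatenated LoRA MAP parameters.

```python
import numpy as np

# Sketch of the Fisher-based Laplace step: F = sum_n E_{y~p}[ g g^T ], H = F + lam I.
rng = np.random.default_rng(0)
N, D, C = 100, 4, 3
X = rng.normal(size=(N, D))
theta_map = rng.normal(size=(D, C)) * 0.1
lam = 1.0                                # prior precision lambda

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(X @ theta_map)           # expectation is under the model's own p(y|x)
P = D * C
F = np.zeros((P, P))
for n in range(N):
    for y in range(C):
        # gradient of log p(y | x_n, theta) wrt flattened theta: x_n (e_y - p_n)^T
        g = np.outer(X[n], np.eye(C)[y] - probs[n]).ravel()
        F += probs[n, y] * np.outer(g, g)

H = F + lam * np.eye(P)                  # posterior precision F + lambda I
cov = np.linalg.inv(H)                   # Laplace posterior N(theta_map, H^{-1})
```

Because $F$ is a sum of outer products it is positive semi-definite, so adding $\lambda I$ makes $H$ strictly positive definite, unlike the raw Hessian.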

4. Implementation with Kronecker Low-Rank Factorization

Standard LoRA fine-tuning is performed using frameworks such as PEFT/Transformers, yielding $\theta_{\rm MAP}$. The Fisher for each LoRA layer $\ell$ is partitioned per matrix ($A$, $B$) and expressed as a Kronecker product of activation and gradient covariances:

$$F_\ell = \sum_{n=1}^N \mathbb{E}\big[a_{\ell-1}(x_n)\, a_{\ell-1}(x_n)^{\top}\big] \otimes \mathbb{E}\big[g_\ell(x_n)\, g_\ell(x_n)^{\top}\big],$$

with $a_{\ell-1}$ the layer input and $g_\ell$ the backpropagated gradient at the layer output.

For computational efficiency, only the large-dimension factor ($d \times d$) is stored in low-rank form via truncated SVD: $X \approx B B^\top$ with $B \in \mathbb{R}^{d \times k_{\rm fac}}$, $k_{\rm fac} \ll d$ (here $B$ denotes the SVD factor, not the LoRA matrix). The posterior precision for a LoRA layer is then

$$H_\ell = A^{-1} \otimes (B B^\top) + \lambda\, I \otimes I,$$

and the required determinants are evaluated efficiently via the matrix determinant lemma. The prior precision $\lambda$ may be tuned by maximizing either the model evidence or the validation likelihood on a small holdout set, requiring only a modest one-dimensional search over $\lambda$ (Yang et al., 2023).
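The determinant-lemma trick can be verified numerically. For a low-rank-plus-scaled-identity matrix, $\det(\lambda I_d + B B^\top) = \lambda^{\,d-k} \det(\lambda I_k + B^\top B)$, turning an $O(d^3)$ computation into $O(d k^2)$. The dimensions below are illustrative stand-ins.

```python
import numpy as np

# Log-determinant of (lam I + B B^T) via the matrix determinant lemma.
rng = np.random.default_rng(0)
d, k = 500, 10                  # large dimension vs. truncated-SVD rank k_fac
B = rng.normal(size=(d, k))     # low-rank factor (the SVD factor, not LoRA's B)
lam = 0.5

# Naive O(d^3) reference evaluation:
logdet_naive = np.linalg.slogdet(lam * np.eye(d) + B @ B.T)[1]

# Lemma: det(lam I_d + B B^T) = lam^(d-k) * det(lam I_k + B^T B)  -> O(d k^2)
logdet_lemma = (d - k) * np.log(lam) + np.linalg.slogdet(lam * np.eye(k) + B.T @ B)[1]
```

The two values agree to machine precision, and only the small $k \times k$ system is ever factorized, which is what keeps evidence-based tuning of $\lambda$ cheap.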

5. Posterior Predictive Inference and Calibration

Prediction uses the linearized Laplace predictive posterior:

$$f_\theta(x) \approx f_{\theta_{\rm MAP}}(x) + J\,(\theta - \theta_{\rm MAP}),$$

where $J = \nabla_\theta f_\theta(x)\,\big|_{\theta_{\rm MAP}}$. Integrating over $\theta$ yields

$$f(x) \sim \mathcal{N}\big(f_{\rm MAP}(x),\, \Sigma\big), \qquad \Sigma = J\, H^{-1} J^\top.$$

Monte Carlo sampling from this distribution generates logit samples $\tilde f_m(x)$, which are passed through the softmax and averaged to estimate calibrated predictive distributions:

$$\hat p(y \mid x) \approx \tfrac{1}{M} \sum_{m=1}^M \operatorname{softmax}\big(\tilde f_m(x)\big).$$

Empirical analysis finds that "MC joint" (full-covariance sampling) robustly outperforms diagonal-approximate sampling, probit, or Laplace-bridge approaches (Yang et al., 2023).
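The "MC joint" predictive step can be sketched directly: sample logits from the full Gaussian $\mathcal{N}(f_{\rm MAP}(x), J H^{-1} J^\top)$ and average softmax outputs. The Jacobian and precision here are random stand-ins for the quantities computed by the method.

```python
import numpy as np

# Sketch of joint (full-covariance) linearized-Laplace predictive sampling.
rng = np.random.default_rng(0)
C, P, M = 4, 20, 1000            # classes, toy parameter count, MC samples

f_map = rng.normal(size=C)       # MAP logits f_MAP(x) for one input (stand-in)
J = rng.normal(size=(C, P)) * 0.1  # Jacobian of logits wrt theta at theta_MAP
H = 5.0 * np.eye(P)              # posterior precision (stand-in for F + lam I)

Sigma = J @ np.linalg.inv(H) @ J.T          # full C x C logit covariance
L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(C))
logits = f_map + rng.normal(size=(M, C)) @ L.T  # joint samples, cov = L L^T

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p_hat = softmax(logits).mean(axis=0)        # averaged predictive distribution
```

Sampling with the full Cholesky factor preserves correlations between class logits, which is what distinguishes "MC joint" from the weaker diagonal approximation.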

6. Empirical Evaluation

Laplace-LoRA was evaluated by fine-tuning LLaMA2-7B with LoRA ($r=8$) on six common-sense reasoning datasets (WG-S, WG-M, ARC-C, ARC-E, OBQA, BoolQ), comparing the following calibration methods:

| Method | Calibration Technique | Notes |
| --- | --- | --- |
| MAP | Standard LoRA fine-tuning | Baseline |
| MC Dropout | Dropout at inference | 10 samples |
| Checkpoint Ensemble | Last 3 LoRA checkpoints | Ensemble strategy |
| LoRA Ensemble | 3 independent LoRA models | Model diversity |
| LLLA | Last-layer Laplace (output only) | Partial Bayesian treatment |
| LA | Laplace on all LoRA layers (Laplace-LoRA) | Full Bayesian adapter uncertainty |

Across all tasks, the full Laplace approximation (LA) matched standard MAP accuracy while dramatically reducing ECE (e.g., from 31% to 2% on WG-S) and NLL (from 3.15 to 0.60). Last-layer Laplace provided partial gains, but the full treatment was consistently superior. Under distribution shift (ARC-C/E, MMLU), Laplace-LoRA maintained strong calibration (Tables 3–4), and it outperformed temperature scaling on both ECE and NLL. Diagonal Laplace variants provided limited gains, underscoring the importance of the structured K-FAC covariance.
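The ECE metric reported above can be computed with equal-width confidence bins, as is standard. The data below are synthetic, used only to exercise the metric; the numbers are not from the paper.

```python
import numpy as np

# Expected calibration error (ECE): bin predictions by confidence and take the
# weighted average gap between accuracy and mean confidence per bin.
def ece(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # empirical accuracy in this bin
            conf = confidences[mask].mean()   # mean predicted confidence
            err += mask.sum() / total * abs(acc - conf)
    return err

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.uniform(size=1000) < conf       # synthetic well-calibrated model
score = ece(conf, correct)                    # small for calibrated predictions
```

An overconfident model (high confidence, lower accuracy) drives this score up, which is exactly the failure mode Laplace-LoRA reduces after small-data fine-tuning.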

Resource overhead is modest: memory increases by roughly 1–5%, and wall-clock time by up to ~10% when Kronecker factors are recomputed every 1,000 steps, with typical time costs closer to 1% (Yang et al., 2023).

7. Limitations and Unresolved Challenges

Laplace-LoRA is a local, unimodal approximation and accordingly may fail to capture posterior multimodality or substantial nonlinearity in parameter influence. Accurate K-FAC low-rank approximation is sensitive to the choice of rank; heavy-tailed or high-intrinsic-dimension Fisher blocks may cause underestimation of uncertainty if the factorization rank is too small. Last-layer Laplace underestimates global model uncertainty, as most variance resides in lower adapter layers. Tuning the prior precision λ\lambda for best validation likelihood can yield posteriors differing from those suggested by Bayesian model evidence, indicating a tension between empirical calibration and principled Bayesian priors. More expressive approximate inference methods (e.g., variational fine-tuning) remain unexplored in this context. Nonetheless, the minimal modifications required for deployment constitute a practical strength (Yang et al., 2023).
