
Laplace-LoRA: Bayesian Uncertainty in LoRA

Updated 6 February 2026
  • Laplace-LoRA is a Bayesian procedure that applies Laplace approximation to estimate a Gaussian posterior over low-rank adapter parameters in LLMs.
  • It leverages efficient low-dimensional inference and Kronecker-factored representations to significantly reduce overconfidence and calibration error on small or out-of-distribution datasets.
  • The method combines MAP optimization and structured posterior estimation to deliver calibrated predictive distributions with only modest computational overhead.

Laplace-LoRA is a post-hoc Bayesian procedure for uncertainty quantification in low-rank adaptation (LoRA) of LLMs. It leverages the Laplace approximation to estimate a Gaussian posterior over LoRA adapter parameters, providing robust calibration for fine-tuned LLMs with minimal computational overhead. Laplace-LoRA combines efficient low-dimensional Bayesian inference with Kronecker-factored representations, significantly reducing overconfidence and expected calibration error (ECE), particularly when models are adapted on small or out-of-distribution datasets (Yang et al., 2023).

1. LoRA Parametrization and Motivation for Bayesian Uncertainty

LLMs contain large affine weight matrices $W_0 \in \mathbb{R}^{d_{\rm out} \times d_{\rm in}}$. LoRA introduces an efficient adaptation by learning an additive low-rank update to these matrices:

$$\mathbf{h} = W_0\,a + \Delta W\,a, \qquad \Delta W = B\,A,$$

where $A \in \mathbb{R}^{r \times d_{\rm in}}$, $B \in \mathbb{R}^{d_{\rm out} \times r}$, with $r \ll \min(d_{\rm in}, d_{\rm out})$. This yields a dramatic reduction in the number of trainable parameters—orders of magnitude fewer than full-weight fine-tuning—making large-scale adaptation feasible.
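The parametrization above can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation; the dimensions are illustrative, and $B$ is zero-initialized so that $\Delta W = 0$ at the start of fine-tuning, as is standard for LoRA.

```python
import numpy as np

# Toy sketch of a LoRA-adapted affine layer (dimensions are illustrative).
d_in, d_out, r = 512, 512, 8

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so DeltaW starts at 0

def lora_forward(a):
    # h = W0 a + B (A a): the additive low-rank update, applied without
    # ever materializing the d_out x d_in matrix DeltaW = B A
    return W0 @ a + B @ (A @ a)

full_params = d_out * d_in              # 262,144 entries in W0
lora_params = r * d_in + d_out * r      # 8,192 trainable entries (about 3% here)

a_in = rng.normal(size=d_in)
h = lora_forward(a_in)
```

At realistic LLM dimensions ($d \approx 4096$, $r = 8$) the trainable fraction shrinks further, which is what makes the later Bayesian treatment of only the adapter parameters tractable.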

However, models fine-tuned with standard LoRA on small datasets or OOD inputs exhibit significant overconfidence, impairing reliability and calibration. Bayesian methods are well suited to mitigate this by quantifying posterior uncertainty, though full posterior inference is infeasible in the full network. Focusing the Bayesian treatment on LoRA parameters preserves computational scalability while providing nontrivial uncertainty estimates (Yang et al., 2023).

2. Probabilistic Modeling and Posterior Approximation

Let $\theta \in \mathbb{R}^P$ denote all concatenated LoRA parameters ($P$ is typically a few million). Given data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, the likelihood is categorical, $p(y \mid x, \theta) = \mathrm{Cat}\big(y;\, \operatorname{softmax}(f_\theta(x))\big)$, where $f_\theta(x) \in \mathbb{R}^C$ are the logits. The prior is an isotropic Gaussian $p(\theta) = \mathcal{N}(\theta;\, 0,\, \lambda^{-1} I)$ with tunable prior precision $\lambda$.

The associated posterior,

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta) = \prod_{n=1}^N p(y_n \mid x_n, \theta)\; \mathcal{N}(\theta;\, 0,\, \lambda^{-1} I),$$

is intractable to evaluate directly, motivating a Laplace approximation: a Gaussian centered at the MAP estimate, with precision given by the curvature of the log posterior there (Yang et al., 2023).

3. Derivation of the Laplace Approximation

3.1 MAP Optimization

The MAP estimate $\theta_{\rm MAP}$ is obtained via standard LoRA fine-tuning, maximizing the regularized log-likelihood

$$L(\theta) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) = \sum_{n=1}^N \log p(y_n \mid x_n, \theta) - \tfrac{\lambda}{2}\,\|\theta\|^2 + \text{const}.$$

Optimizers such as AdamW are typically employed.
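The objective above can be written out concretely. This is a minimal sketch on a toy softmax classifier, with `theta` standing in for the concatenated LoRA parameters; in practice AdamW minimizes this quantity over the adapter weights of the full network.

```python
import numpy as np

# Sketch of the (negative) MAP objective: NLL plus the Gaussian log-prior penalty.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neg_map_objective(theta, X, y, lam):
    logits = X @ theta                               # (N, C) logits f_theta(x)
    log_probs = np.log(softmax(logits))
    log_lik = log_probs[np.arange(len(y)), y].sum()  # sum_n log p(y_n | x_n, theta)
    log_prior = -0.5 * lam * (theta ** 2).sum()      # -(lambda/2) ||theta||^2
    return -(log_lik + log_prior)                    # minimized by the optimizer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
theta = rng.normal(size=(4, 3))
loss = neg_map_objective(theta, X, y, lam=1.0)
```

Note that the weight-decay term of AdamW plays the role of the $\tfrac{\lambda}{2}\|\theta\|^2$ prior penalty, so standard fine-tuning already produces a MAP estimate under this model.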

3.2 Local Gaussian Approximation

A second-order Taylor expansion of the log-posterior around $\theta_{\rm MAP}$ yields a quadratic form whose curvature is the negative Hessian $H$. In practice, $H$ is replaced by the generalized Gauss–Newton (GGN) or Fisher information matrix to guarantee positive-definiteness:

$$F = \sum_{n=1}^N \mathbb{E}_{y \sim p(y \mid x_n,\, \theta_{\rm MAP})} \left[ \nabla_\theta \log p(y \mid x_n, \theta)\, \big(\nabla_\theta \log p(y \mid x_n, \theta)\big)^{\top} \right].$$

The approximate Gaussian posterior becomes

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\big(\theta;\, \theta_{\rm MAP},\, H^{-1}\big), \qquad H \approx F + \lambda I.$$

A block-diagonal Kronecker-factored structure for each adapter layer maintains tractability (Yang et al., 2023).
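The Fisher expectation and the resulting posterior precision can be made concrete on a toy softmax classifier, where the per-example gradient has the closed form $\nabla_\theta \log p(y \mid x, \theta) = x\,(e_y - p)^\top$. This is an illustrative dense computation, not the Kronecker-factored one used at LLM scale; `theta_map` stands in for the concatenated LoRA MAP parameters.

```python
import numpy as np

# Sketch of the Fisher-based Laplace step: F = sum_n E_{y~p}[ g g^T ], H = F + lam I.
rng = np.random.default_rng(0)
N, D, C = 100, 4, 3
X = rng.normal(size=(N, D))
theta_map = rng.normal(size=(D, C)) * 0.1
lam = 1.0                                # prior precision lambda

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(X @ theta_map)           # expectation is under the model's own p(y|x)
P = D * C
F = np.zeros((P, P))
for n in range(N):
    for y in range(C):
        # gradient of log p(y | x_n, theta) wrt flattened theta: x_n (e_y - p_n)^T
        g = np.outer(X[n], np.eye(C)[y] - probs[n]).ravel()
        F += probs[n, y] * np.outer(g, g)

H = F + lam * np.eye(P)                  # posterior precision F + lambda I
cov = np.linalg.inv(H)                   # Laplace posterior N(theta_map, H^{-1})
```

Because $F$ is a sum of outer products it is positive semi-definite, so adding $\lambda I$ makes $H$ strictly positive definite, unlike the raw Hessian.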

4. Implementation with Kronecker Low-Rank Factorization

Standard LoRA fine-tuning is performed using frameworks such as PEFT/Transformers, yielding $\theta_{\rm MAP}$. The Fisher for each LoRA layer $\ell$ is partitioned per matrix ($A$, $B$) and expressed as a Kronecker product of activation and gradient covariances:

$$F_\ell = \sum_{n=1}^N \mathbb{E}\big[a_{\ell-1}(x_n)\, a_{\ell-1}(x_n)^{\top}\big] \otimes \mathbb{E}\big[g_\ell(x_n)\, g_\ell(x_n)^{\top}\big],$$

with $a_{\ell-1}$ the layer input and $g_\ell$ the backpropagated gradient at the layer output.

For computational efficiency, only the large-dimension factor ($d \times d$) is stored in low-rank form via truncated SVD: $X \approx B B^\top$ with $B \in \mathbb{R}^{d \times k_{\rm fac}}$, $k_{\rm fac} \ll d$ (here $B$ denotes the SVD factor, not the LoRA matrix). The posterior precision for a LoRA layer is then

$$H_\ell = A^{-1} \otimes (B B^\top) + \lambda\, I \otimes I,$$

and the required determinants are evaluated efficiently via the matrix determinant lemma. The prior precision $\lambda$ may be tuned by maximizing either the model evidence or the validation likelihood on a small holdout set, requiring only a modest one-dimensional search over $\lambda$ (Yang et al., 2023).
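The determinant-lemma trick can be verified numerically. For a low-rank-plus-scaled-identity matrix, $\det(\lambda I_d + B B^\top) = \lambda^{\,d-k} \det(\lambda I_k + B^\top B)$, turning an $O(d^3)$ computation into $O(d k^2)$. The dimensions below are illustrative stand-ins.

```python
import numpy as np

# Log-determinant of (lam I + B B^T) via the matrix determinant lemma.
rng = np.random.default_rng(0)
d, k = 500, 10                  # large dimension vs. truncated-SVD rank k_fac
B = rng.normal(size=(d, k))     # low-rank factor (the SVD factor, not LoRA's B)
lam = 0.5

# Naive O(d^3) reference evaluation:
logdet_naive = np.linalg.slogdet(lam * np.eye(d) + B @ B.T)[1]

# Lemma: det(lam I_d + B B^T) = lam^(d-k) * det(lam I_k + B^T B)  -> O(d k^2)
logdet_lemma = (d - k) * np.log(lam) + np.linalg.slogdet(lam * np.eye(k) + B.T @ B)[1]
```

The two values agree to machine precision, and only the small $k \times k$ system is ever factorized, which is what keeps evidence-based tuning of $\lambda$ cheap.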

5. Posterior Predictive Inference and Calibration

Prediction uses the linearized Laplace predictive posterior:

$$f_\theta(x) \approx f_{\theta_{\rm MAP}}(x) + J\,(\theta - \theta_{\rm MAP}),$$

where $J = \nabla_\theta f_\theta(x)\,\big|_{\theta_{\rm MAP}}$. Integrating over $\theta$ yields

$$f(x) \sim \mathcal{N}\big(f_{\rm MAP}(x),\, \Sigma\big), \qquad \Sigma = J\, H^{-1} J^\top.$$

Monte Carlo sampling from this distribution generates logit samples $\tilde f_m(x)$, which are passed through the softmax and averaged to estimate calibrated predictive distributions:

$$\hat p(y \mid x) \approx \tfrac{1}{M} \sum_{m=1}^M \operatorname{softmax}\big(\tilde f_m(x)\big).$$

Empirical analysis finds that "MC joint" (full-covariance sampling) robustly outperforms diagonal-approximate sampling, probit, or Laplace-bridge approaches (Yang et al., 2023).
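The "MC joint" predictive step can be sketched directly: sample logits from the full Gaussian $\mathcal{N}(f_{\rm MAP}(x), J H^{-1} J^\top)$ and average softmax outputs. The Jacobian and precision here are random stand-ins for the quantities computed by the method.

```python
import numpy as np

# Sketch of joint (full-covariance) linearized-Laplace predictive sampling.
rng = np.random.default_rng(0)
C, P, M = 4, 20, 1000            # classes, toy parameter count, MC samples

f_map = rng.normal(size=C)       # MAP logits f_MAP(x) for one input (stand-in)
J = rng.normal(size=(C, P)) * 0.1  # Jacobian of logits wrt theta at theta_MAP
H = 5.0 * np.eye(P)              # posterior precision (stand-in for F + lam I)

Sigma = J @ np.linalg.inv(H) @ J.T          # full C x C logit covariance
L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(C))
logits = f_map + rng.normal(size=(M, C)) @ L.T  # joint samples, cov = L L^T

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p_hat = softmax(logits).mean(axis=0)        # averaged predictive distribution
```

Sampling with the full Cholesky factor preserves correlations between class logits, which is what distinguishes "MC joint" from the weaker diagonal approximation.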

6. Empirical Evaluation

Laplace-LoRA was evaluated by fine-tuning LLaMA2-7B with LoRA ($r=8$) on six common-sense reasoning datasets (WG-S, WG-M, ARC-C, ARC-E, OBQA, BoolQ), comparing the following calibration methods:

| Method | Calibration Technique | Notes |
| --- | --- | --- |
| MAP | Standard LoRA fine-tuning | Baseline |
| MC Dropout | Dropout at inference | 10 samples |
| Checkpoint Ensemble | Last 3 LoRA checkpoints | Ensemble strategy |
| LoRA Ensemble | 3 independent LoRA models | Model diversity |
| LLLA | Last-layer Laplace (output only) | Partial Bayesian treatment |
| LA | Laplace on all LoRA layers (Laplace-LoRA) | Full Bayesian adapter uncertainty |

Across all tasks, the full Laplace approximation (LA) matched standard MAP accuracy while dramatically reducing ECE (e.g., from 31% to 2% on WG-S) and NLL (from 3.15 to 0.60). Last-layer Laplace provided partial gains, but the full treatment was consistently superior. Under distribution shift (ARC-C/E, MMLU), Laplace-LoRA maintained strong calibration (Tables 3–4), and it outperformed temperature scaling on both ECE and NLL. Diagonal Laplace variants provided limited gains, underscoring the importance of the structured K-FAC covariance.
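The ECE metric reported above can be computed with equal-width confidence bins, as is standard. The data below are synthetic, used only to exercise the metric; the numbers are not from the paper.

```python
import numpy as np

# Expected calibration error (ECE): bin predictions by confidence and take the
# weighted average gap between accuracy and mean confidence per bin.
def ece(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # empirical accuracy in this bin
            conf = confidences[mask].mean()   # mean predicted confidence
            err += mask.sum() / total * abs(acc - conf)
    return err

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.uniform(size=1000) < conf       # synthetic well-calibrated model
score = ece(conf, correct)                    # small for calibrated predictions
```

An overconfident model (high confidence, lower accuracy) drives this score up, which is exactly the failure mode Laplace-LoRA reduces after small-data fine-tuning.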

Resource overhead is modest: memory increases by roughly 1–5%, and wall-clock time by up to ~10% when Kronecker factors are recomputed every 1,000 steps, with typical time costs closer to 1% (Yang et al., 2023).

7. Limitations and Unresolved Challenges

Laplace-LoRA is a local, unimodal approximation and accordingly may fail to capture posterior multimodality or substantial nonlinearity in parameter influence. Accurate K-FAC low-rank approximation is sensitive to the choice of rank; heavy-tailed or high-intrinsic-dimension Fisher blocks may cause underestimation of uncertainty if the factorization rank is too small. Last-layer Laplace underestimates global model uncertainty, as most variance resides in lower adapter layers. Tuning the prior precision λ\lambda for best validation likelihood can yield posteriors differing from those suggested by Bayesian model evidence, indicating a tension between empirical calibration and principled Bayesian priors. More expressive approximate inference methods (e.g., variational fine-tuning) remain unexplored in this context. Nonetheless, the minimal modifications required for deployment constitute a practical strength (Yang et al., 2023).
