Linearized Laplace Approximation (LLA)

Updated 30 March 2026
  • LLA is a method that linearizes complex nonlinear models around the MAP estimate to construct a Gaussian approximation to the posterior for efficient Bayesian inference.
  • It recasts the linearized Bayesian neural network as a degenerate Gaussian process using the neural tangent kernel, enabling analytic predictive distributions.
  • Scalable variants such as ELLA, VaLLA, and ScaLLA address computational challenges through approximations like Nyström eigenfunctions and sparse variational methods.

The linearized Laplace approximation (LLA) is a methodology rooted in the Laplace approximation for Bayesian inference, specifically tailored to high-dimensional nonlinear models such as deep neural networks. By linearizing the model around the maximum a posteriori (MAP) solution, LLA enables efficient Gaussian uncertainty estimation while retaining scalability and interpretability. It has become central to modern approaches in Bayesian deep learning, inverse problems, and uncertainty quantification.

1. Core Principles of the Linearized Laplace Approximation

LLA constructs a Gaussian approximation to the posterior by linearizing the forward map or predictive model around the MAP estimate. For a model parameterized by $u \in \mathbb{R}^d$ (or $\theta \in \mathbb{R}^P$ in deep learning), with data $y$ and a forward operator $G(u)$, the negative log-posterior is typically

$$\Phi(u) = \frac{1}{2\epsilon}\,\|y - G(u)\|^2 + R(u),$$

where $R(u)$ encodes the prior regularization or prior density. The MAP point $u_{\text{map}} = \arg\min_u \Phi(u)$ anchors the local Gaussian approximation.

The Laplace approximation replaces the posterior with the Gaussian

$$\pi_L(u) = \mathcal{N}\big(u_{\text{map}},\, H^{-1}\big), \quad \text{where}\quad H = \nabla^2 \Phi(u_{\text{map}}).$$
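
To make these definitions concrete, the following is a minimal numpy/scipy sketch of the MAP-plus-Hessian construction on a toy nonlinear forward map. The forward map $G$, the quadratic prior $R(u) = \tfrac{1}{2}\|u\|^2$, and the finite-difference Hessian are illustrative assumptions, not drawn from any referenced implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy nonlinear forward map G: R^2 -> R^3 and synthetic data y (illustrative).
def G(u):
    return np.array([u[0] + 0.1 * u[1]**2, np.sin(u[1]), u[0] * u[1]])

rng = np.random.default_rng(0)
eps = 0.01                                  # noise level epsilon
u_true = np.array([0.5, -0.3])
y = G(u_true) + np.sqrt(eps) * rng.standard_normal(3)

# Negative log-posterior Phi(u) with a Gaussian prior R(u) = ||u||^2 / 2.
def Phi(u):
    r = y - G(u)
    return 0.5 / eps * r @ r + 0.5 * u @ u

# MAP estimate anchors the Laplace approximation.
u_map = minimize(Phi, x0=np.zeros(2)).x

# Hessian of Phi at the MAP via central finite differences (illustration only;
# autodiff or a GGN approximation would be used in practice).
def hessian_fd(f, x, h=1e-4):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

H = hessian_fd(Phi, u_map)
cov = np.linalg.inv(H)                      # Laplace posterior: N(u_map, H^{-1})
print("u_map:", u_map, "\nposterior covariance:\n", cov)
```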

LLA, specifically, Taylor-expands $G(u)$ or the network function $f(x, \theta)$ around $u_{\text{map}}$ or $\theta^*$ to first order, yielding a locally linear surrogate. In deep learning, this is

$$f_{\text{lin}}(x, \theta) = f(x, \theta^*) + J(x)(\theta - \theta^*),$$

with $J(x) = \nabla_\theta f(x, \theta^*)$ the Jacobian at the MAP.
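
As a quick numerical check of this expansion, the sketch below compares $f$ and $f_{\text{lin}}$ for a tiny tanh network near a stand-in $\theta^*$; the architecture and perturbation scale are arbitrary illustrations, and the finite-difference Jacobian stands in for autodiff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny model f(x, theta): one hidden tanh layer (3 units, 2 inputs), scalar output.
def f(x, theta):
    W = theta[:6].reshape(3, 2)             # hidden weights
    v = theta[6:]                           # output weights
    return v @ np.tanh(W @ x)

def jacobian_fd(x, theta, h=1e-6):
    # Finite-difference Jacobian df/dtheta (illustration; autodiff in practice).
    J = np.zeros(len(theta))
    for i in range(len(theta)):
        e = np.zeros(len(theta)); e[i] = h
        J[i] = (f(x, theta + e) - f(x, theta - e)) / (2 * h)
    return J

theta_star = rng.standard_normal(9)         # stands in for the MAP weights
x = np.array([0.3, -0.7])
J = jacobian_fd(x, theta_star)

delta = 1e-2 * rng.standard_normal(9)       # small perturbation around theta*
f_lin = f(x, theta_star) + J @ delta        # first-order Taylor surrogate
print("exact:", f(x, theta_star + delta), " linearized:", f_lin)
```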

The Gaussian weight posterior propagates via this linearization, yielding a closed-form predictive distribution and recasting the problem as a Gaussian process with mean $f(x, \theta^*)$ and covariance given by the neural tangent kernel (NTK) under the Laplace covariance (Helin et al., 2020, Deng et al., 2022, Kristiadi et al., 2023).

2. Gaussian Process and Kernel Interpretations

Pivotal to LLA is the equivalence between the linearized Bayesian neural network and a degenerate Gaussian process with mean $f(x, \theta^*)$ and kernel

$$k_{\text{LLA}}(x, x') = J(x)\, H^{-1} J(x')^T,$$

or, under isotropic priors and using generalized Gauss–Newton (GGN) approximations,

$$k_{\text{NTK}}(x, x') = J(x)\, J(x')^T.$$

In Bayesian deep learning, this equivalence allows the use of analytic Gaussian process prediction formulas, such as

$$p(y_* \mid x_*, \mathcal{D}) \approx \mathcal{N}(\mu_*, \Sigma_*), \quad \mu_* = f(x_*, \theta^*), \quad \Sigma_* = J(x_*)\, H^{-1} J(x_*)^T + \text{likelihood noise}.$$

Covariances between predictions at different inputs are similarly tractable.
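
A minimal sketch of this predictive formula under a GGN-plus-isotropic-prior precision, for a scalar-output regression model; the random matrices stand in for Jacobians that would come from autodiff at the MAP, and all sizes and noise values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
P, sigma2, prior_prec = 9, 0.05, 1.0        # param dim, noise, prior precision

# Placeholder Jacobians J(x) in R^{1 x P}; in practice computed at the MAP.
J_train = rng.standard_normal((20, P))      # one row per training input
J_test = rng.standard_normal((5, P))        # one row per test input

# GGN + prior precision for regression: H = J^T J / sigma^2 + prior_prec * I.
H = J_train.T @ J_train / sigma2 + prior_prec * np.eye(P)
H_inv = np.linalg.inv(H)                    # O(P^3); tractable only for small P

# Predictive covariance Sigma_* = J(x_*) H^{-1} J(x_*)^T + likelihood noise,
# with mean mu_* = f(x_*, theta*) taken from the deterministic network.
Sigma_star = J_test @ H_inv @ J_test.T + sigma2 * np.eye(len(J_test))
print("predictive std:", np.sqrt(np.diag(Sigma_star)))
```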

The function-space view justifies the widespread adoption of LLA in settings where explicit Bayesian neural network sampling is computationally prohibitive, and facilitates post hoc uncertainty assessment for pre-trained models (Ortega et al., 2023, Cinquin et al., 2024).

3. Computational Approximations and Scalability

Despite its analytic tractability, LLA faces severe computational bottlenecks for high-dimensional parameter spaces or large datasets. Key issues include:

  • Hessian or GGN inversion: Direct inversion is $\mathcal{O}(P^3)$ in the parameter dimension $P$.
  • Full Jacobian storage: Computing and holding $J(x_i)$ for all $i$ is infeasible for large $N, P$.
  • Kernel matrix inversion in function space: The equivalent GP kernel Gram matrix is $N \times N$ (or $NC \times NC$ in the multi-output case with $C$ outputs), with $\mathcal{O}(N^3)$ inversion.

Several approximation mechanisms have been proposed:

  • Kronecker-factored, diagonal, or last-layer GGN: Reduce storage/computation but can severely degrade uncertainty fidelity (Deng et al., 2022, Ortega et al., 2023).
  • Nyström and spectral low-rank approximations: ELLA uses Nyström eigenfunctions of the NTK to construct low-rank approximations of the LLA covariance, allowing GP-based inference via $K \ll N$ landmark points and reducing inversion to $\mathcal{O}(K^3)$ (Deng et al., 2022).
  • Sparse variational GP (VaLLA): Constructs an RKHS-dual variational posterior over the linearized network's outputs, with training cost independent of $N$, optimized via minibatch SGD (Ortega et al., 2023).
  • Surrogate kernel learning (ScaLLA): Learns a neural surrogate whose inner product matches the NTK, requiring only Jacobian-vector products during training and enabling scalable uncertainty estimation (Ortega et al., 29 Jan 2026).
  • Matrix-free methods: Function-space Laplace approximations using context-point Jacobians and kernel-based inductive bias (e.g., FSP-Laplace) via highly scalable linear algebra (Cinquin et al., 2024).

The table below summarizes key scalable variants:

| Variant | Approximation Method | Complexity | Key Idea |
| --- | --- | --- | --- |
| ELLA | Nyström NTK | $\mathcal{O}(K^3)$ | Landmark eigendecomposition |
| VaLLA | Variational sparse GP | Independent of $N$ | Dual RKHS, minibatch |
| ScaLLA | Surrogate kernel (features) | Linear in $P$ | JVPs, learned features |
| FSP-Laplace | Matrix-free context points | $\mathcal{O}(Dr)$ | Kernel and Jacobian |
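
To illustrate the Nyström construction behind ELLA-style variants, here is a generic sketch: eigendecompose the kernel on $K$ landmark points and extend to rank-$K$ features for all $N$ points. The inner-product "kernel" below mimics an NTK via random features and is purely an assumption for demonstration, not ELLA's reference code.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 500, 20                              # data points, landmark points

# Stand-in features so that kernel(x, x') = phi(x) @ phi(x'), mimicking an NTK.
phi = rng.standard_normal((N, 30))
kernel = lambda A, B: A @ B.T

idx = rng.choice(N, size=K, replace=False)  # landmark subset
K_mm = kernel(phi[idx], phi[idx])           # K x K landmark block
K_nm = kernel(phi, phi[idx])                # N x K cross block

# Nystrom eigenfunctions: eigendecompose the small block, extend to all points.
evals, evecs = np.linalg.eigh(K_mm)
evals = np.maximum(evals, 1e-12)            # clip tiny eigenvalues for stability
features = K_nm @ evecs / np.sqrt(evals)    # rank-K feature map, N x K

K_approx = features @ features.T            # low-rank surrogate for the N x N Gram
K_exact = kernel(phi, phi)
print("relative error:",
      np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))
```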

4. Error Bounds and Theoretical Guarantees

The LLA's fidelity as an approximation has been studied both asymptotically and non-asymptotically. Under suitable regularity conditions on $\Phi$, its curvature, and the strength of the nonlinearity:

  • For forward models $G(u) = Au + f(u)$ with $f$ small and smooth,

    $$d_H(\pi, \pi_L) = \mathcal{O}\big([K\, d^{3/2}]^{1/2}\big),$$

    where $K$ bounds the third derivative of $\Phi$ and $d$ is the dimensionality (Helin et al., 2020).

  • In weakly nonlinear settings, the error is proportional to the nonlinearity: $d_H = \mathcal{O}(K^{1/2})$ or $\mathcal{O}(\tau^{1/2})$ for a small Taylor remainder or weak nonlinear scaling $\tau f(u)$.
  • Dimensionality penalizes the error via factors like $\Gamma(d/2 + 3/2)/\Gamma(d/2) \simeq (d/2)^{3/2}$ (see the short note after this list), so LLA is reliable only when the nonlinearity scales as $K \lesssim d^{-3/2}$.
  • In the small-noise limit ($\epsilon \to 0$), the classical $\mathcal{O}(\epsilon^{1/2})$ convergence is recovered (Helin et al., 2020).
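
The dimension factor in the second bullet follows from the classical asymptotic ratio of gamma functions, $\Gamma(z+a)/\Gamma(z) \sim z^{a}$ as $z \to \infty$; with $z = d/2$ and $a = 3/2$ this gives

$$\frac{\Gamma(d/2 + 3/2)}{\Gamma(d/2)} \simeq \left(\frac{d}{2}\right)^{3/2},$$

the $d^{3/2}$ growth entering the Hellinger bound above.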

Low-rank and spectral approximations introduce additional error, but the corresponding bounds decay rapidly with the number of landmarks or the feature rank under mild kernel assumptions (Deng et al., 2022).

For deep networks, LLA's linearization ignores second-order curvature in the network's own parameterization and can therefore underestimate predictive uncertainty. Quadratic Laplace approximations (QLA) partially recover this by including rank-one Hessian-vector curvature corrections at negligible cost overhead, yielding consistent but modest empirical improvements (Jiménez et al., 3 Feb 2026).

5. Extensions and Algorithmic Innovations

Several functional and algorithmic extensions to LLA have been established:

  • Function-space priors (FSP-Laplace): Instead of restricting to isotropic Gaussian priors in parameter space, LLA can incorporate Gaussian process priors in function space (e.g., using RBF or periodic kernels), enforcing structured inductive biases such as smoothness or periodicity directly on the predictive function; this is implemented via context-point Jacobians and matrix-free linear algebra (Cinquin et al., 2024).
  • Variance reduction and biasing for OOD detection: Surrogate kernels can be biased (block-diagonal structure) to enforce independence between in-domain and out-of-domain inputs, raising posterior variance and improving OOD detection (Ortega et al., 29 Jan 2026).
  • Variational LLA (VaLLA): Exploits dual sparse variational GP representations anchored at the deterministic network output (pretrained DNN), permitting stochastic minibatch optimization and matching the LLA predictive mean with minimal variance cost (Ortega et al., 2023).

A broad range of algorithmic implementations are possible, varying in Jacobian computation method (batch vs. JVP), kernel construction, and inference target (weight vs. function space).
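
As an illustration of the JVP route (used by ScaLLA-style surrogates to avoid materializing full Jacobians), the product $J(x)v$ can be probed one direction at a time. Below it is approximated by central differences for transparency; frameworks expose exact primitives such as torch.func.jvp or jax.jvp. The toy model is the same illustrative tanh network as above.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x, theta):
    # Toy tanh network; stands in for a trained DNN.
    W, v = theta[:6].reshape(3, 2), theta[6:]
    return v @ np.tanh(W @ x)

def jvp(x, theta, v, h=1e-6):
    # Jacobian-vector product J(x) v = d/dt f(x, theta + t v) at t = 0,
    # approximated by central differences; autodiff frameworks give it exactly
    # without ever forming the full P-dimensional Jacobian.
    return (f(x, theta + h * v) - f(x, theta - h * v)) / (2 * h)

theta_star = rng.standard_normal(9)
x = np.array([0.3, -0.7])
v = rng.standard_normal(9)                  # probe direction in parameter space
print("J(x) v ≈", jvp(x, theta_star, v))
```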

6. Applications and Empirical Properties

LLA and its variants are established in diverse computational domains:

  • Bayesian deep learning: LLA provides efficient posterior approximations over trained neural networks, outperforming mean-field variational methods in calibration, negative log-likelihood (NLL), and expected calibration error (ECE) on standard benchmarks including CIFAR-10, ImageNet, and vision transformers (Deng et al., 2022, Ortega et al., 2023, Cinquin et al., 2024, Ortega et al., 29 Jan 2026).
  • Scientific regression: FSP-Laplace enables domain-specific priors, yielding improved test mean squared error (MSE) and evidence lower bound (ELBO) over isotropic Laplace and standard GPs when prior knowledge is abundant (Cinquin et al., 2024).
  • Bayesian optimization: LLA-based surrogates match or outperform standard RBF-kernel GPs in low-dimensional tasks, and exhibit strong scaling in high-dimensional problems where classical GPs collapse. However, in unbounded domains, the LLA mean (for ReLU networks) diverges linearly and variance grows unbounded, requiring bounded search domains or alternative activations for reliable Bayesian optimization (Kristiadi et al., 2023).
  • Uncertainty quantification: Surrogates and variational approximations enable scalable calibration of predictive confidence, robust to dataset size and architecture depth.

Empirical results demonstrate that careful approximate LLA implementations (e.g., ELLA, VaLLA, ScaLLA) match or exceed the performance of standard Laplace or mean-field variational approaches across a suite of metrics, while scaling efficiently to modern architectures.

7. Limitations and Practical Considerations

While the LLA framework is theoretically justified and practically scalable, significant caveats remain:

  • Curvature modeling: Pure first-order LLA neglects second-order effects from the nonlinear model component. QLA remedies this at modest cost (Jiménez et al., 3 Feb 2026).
  • Extrapolation pathologies: For piecewise-linear activations (e.g., ReLU), LLA surrogates extrapolate linearly and can diverge outside the data domain, necessitating careful design of Bayesian optimization loops and prior selection (Kristiadi et al., 2023); a toy sketch follows this list.
  • Bias-variance tradeoff in surrogate approximations: Diagonal or factorized GGN matrices significantly understate uncertainty, while low-rank or surrogate kernel approaches require careful feature selection for stability (Deng et al., 2022, Ortega et al., 29 Jan 2026).
  • Prior selection: The utility of function-space priors is context-dependent; incorporating strong domain knowledge substantially improves epistemic uncertainty and posterior calibration (Cinquin et al., 2024).
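
The extrapolation pathology in the second bullet is visible in closed form for a toy one-dimensional ReLU model: the Jacobian entries with respect to the weights grow linearly in $|x|$, so the LLA variance $J(x) H^{-1} J(x)^T$ grows without bound away from the data. A minimal numeric sketch, with an identity precision assumed purely for illustration:

```python
import numpy as np

# 1-D ReLU model f(x) = a * relu(x) + b * relu(-x) + c, theta = (a, b, c).
# Its Jacobian wrt theta is J(x) = [relu(x), relu(-x), 1], linear in |x|.
def J(x):
    return np.array([max(x, 0.0), max(-x, 0.0), 1.0])

# Stands in for H^{-1}; any fixed positive-definite matrix behaves the same.
H_inv = np.eye(3)

for x in [0.0, 1.0, 10.0, 100.0]:
    var = J(x) @ H_inv @ J(x)               # LLA variance J H^{-1} J^T
    print(f"x = {x:6.1f}   predictive variance = {var:10.1f}")
```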

A plausible implication is that future research will focus on hybrid LLA schemes that blend second-order corrections, functional priors, and scalable kernel surrogates, further improving generalization, calibration, and OOD robustness in large, deep models.
