Recursive Laplace Approximation
- Recursive Laplace approximation is an iterative method that refines the standard Laplace approach by constructing adaptive mixtures of Gaussian densities to capture multimodality and skewness.
- It sequentially corrects approximation errors through residual optimization and updates mixture weights using nonnegative least squares to improve accuracy.
- Empirical studies show that this methodology significantly enhances effective sample size and approximation precision in complex, high-dimensional Bayesian inference tasks.
Recursive Laplace approximation, also termed iterated (or iterLap) Laplace approximation, is a methodology for constructing adaptive mixtures of multivariate normal densities to approximate complex target probability densities, particularly in Bayesian inference contexts. By iteratively correcting the limitations of the classical single-Gaussian Laplace approximation, such as its inability to capture multimodality or skewness, recursive Laplace builds a sequence of local approximations to the residual (“unexplained”) structure in the target, ultimately yielding a tractable and highly adaptive Gaussian mixture fit (Bornkamp, 2011).
1. The Standard Laplace Approximation and Problem Motivation
The Laplace approximation seeks an analytic form for integrals involving a complicated, possibly unnormalized target density $q(\theta)$, $\theta \in \mathbb{R}^d$, typically normalized by $c = \int q(\theta)\,d\theta$. For a mode $\hat{\theta}$ of $\log q$ with negative definite Hessian $H = \nabla^2 \log q(\hat{\theta})$, the local quadratic expansion gives
$$\log q(\theta) \approx \log q(\hat{\theta}) + \tfrac{1}{2}(\theta - \hat{\theta})^\top H (\theta - \hat{\theta}),$$
so exponentiation results in the Gaussian kernel $q(\hat{\theta})\exp\{-\tfrac{1}{2}(\theta - \hat{\theta})^\top \Sigma^{-1} (\theta - \hat{\theta})\}$, with $\Sigma = -H^{-1}$. The normalizing constant is approximated by $c \approx q(\hat{\theta})\,(2\pi)^{d/2}|\Sigma|^{1/2}$. Despite its computational attractiveness, this single Gaussian is inadequate for skewed or multimodal targets, motivating iterative corrections to better fit real-world Bayesian posteriors (Bornkamp, 2011).
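As a concrete illustration of these formulas, the following Python sketch carries out the single-Gaussian Laplace step on a toy two-dimensional banana-shaped target: locate the mode of $\log q$, approximate the Hessian by finite differences, and form $\Sigma = -H^{-1}$ and the normalizing-constant estimate. The target, the helper function, and all names are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of the single-Gaussian Laplace step, assuming a 2-D
# banana-shaped toy target (not the paper's 10-D example); all names here
# are illustrative.
import numpy as np
from scipy.optimize import minimize

def log_q(theta):
    """Unnormalized log target: a 2-D 'banana'-type density (illustrative)."""
    x, y = theta
    return -0.5 * x**2 - 0.5 * (y - 0.5 * x**2) ** 2

def fd_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# 1. Locate the mode of log q.
mode = minimize(lambda t: -log_q(t), x0=np.array([1.0, 1.0]), method="BFGS").x

# 2. Hessian of log q at the mode (must be negative definite); Sigma = -H^{-1}.
H = fd_hessian(log_q, mode)
Sigma = np.linalg.inv(-H)

# 3. Laplace estimate of the normalizing constant c = ∫ q(θ) dθ.
d = mode.size
c_hat = np.exp(log_q(mode)) * (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
print("mode:", mode.round(3), " c_hat:", round(float(c_hat), 3))
```

For this toy target the estimate equals $2\pi$, the exact normalizing constant, since the target is Gaussian in each conditional direction at the mode; for skewed or multimodal targets the single-Gaussian estimate degrades, which is the gap the recursive construction addresses.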
2. Iterative Construction of the Recursive Laplace Mixture
Recursive Laplace approximation constructs a mixture of Gaussians via successive correction of the approximation error. At step $k$, the existing mixture
$$g_{k-1}(\theta) = \sum_{i=1}^{k-1} w_i\, \varphi(\theta; \mu_i, \Sigma_i),$$
where $\varphi(\cdot\,; \mu, \Sigma)$ denotes the multivariate normal density, is compared to the target $q$ by the residual
$$r_k(\theta) = q(\theta) - g_{k-1}(\theta),$$
with regions where $r_k(\theta) > 0$ prioritized. A new component is formed by maximizing $r_k$ (subject to $r_k(\theta) > 0$) to locate a new mode $\mu_k$, computing the Hessian of the residual at $\mu_k$ (assumed negative definite), and defining $\Sigma_k = -\big[\nabla^2 r_k(\mu_k)\big]^{-1}$. The new Gaussian $\varphi(\theta; \mu_k, \Sigma_k)$ is appended, and all weights are recomputed by nonnegative least squares to minimize the discrete $L_2$ norm
$$\sum_{j=1}^{n} \Big( q(\theta_j) - \sum_{i=1}^{k} w_i\, \varphi(\theta_j; \mu_i, \Sigma_i) \Big)^2, \qquad w_i \geq 0,$$
over grid points $\theta_1, \dots, \theta_n$ sampled from the current mixture. This process repeats until no further negative definite residual Hessians can be found, or stopping criteria (either an error tolerance or a small relative change in the normalization) are met. The final mixture is
$$g_K(\theta) = \sum_{i=1}^{K} w_i\, \varphi(\theta; \mu_i, \Sigma_i).$$
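A schematic rendering of one correction step under this notation might look as follows; the toy two-mode target, the helper functions, and the starting point are assumptions for illustration, not the iterLap implementation.

```python
# Schematic sketch of one residual-correction step (illustrative, not the
# iterLap package): maximize r_k = q - g_{k-1}, take the Hessian at the
# maximizer, and set Sigma_k = -(Hessian)^{-1}.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def q(theta):
    """Unnormalized target (illustrative two-mode example)."""
    return (np.exp(-0.5 * np.sum((theta - 2.0) ** 2))
            + 0.7 * np.exp(-0.5 * np.sum((theta + 2.0) ** 2)))

def mixture(theta, means, covs, weights):
    """Current Gaussian-mixture approximation g_{k-1}(theta)."""
    return sum(w * multivariate_normal.pdf(theta, m, C)
               for w, m, C in zip(weights, means, covs))

def fd_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

def add_component(means, covs, weights, start):
    """One correction step: locate a mode of the residual and append a Gaussian."""
    residual = lambda th: q(th) - mixture(th, means, covs, weights)
    mu_new = minimize(lambda th: -residual(th), x0=start, method="Nelder-Mead").x
    H = fd_hessian(residual, mu_new)
    if np.any(np.linalg.eigvalsh(H) >= 0):       # need a negative definite Hessian
        return means, covs, weights               # no usable component found
    Sigma_new = np.linalg.inv(-H)
    # The weights would now be refit by nonnegative least squares (see Section 4);
    # the placeholder 0.0 is overwritten by that refit.
    return means + [mu_new], covs + [Sigma_new], weights + [0.0]

# Start from a single Laplace component at the mode near (+2, +2), then correct
# toward the unexplained mode near (-2, -2).
means, covs, weights = [np.full(2, 2.0)], [np.eye(2)], [1.0]
means, covs, weights = add_component(means, covs, weights, start=np.full(2, -2.0))
print("component means:", [m.round(2) for m in means])
```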
3. Algorithmic Structure and Implementation Details
The method is formalized as follows:
- Initialization: Identify global/local modes of $\log q$ and compute corresponding Hessians and covariance matrices for initial mixture components. Generate a quasi-random grid (e.g., Sobol sequence) from each component and evaluate all densities.
- Iteration:
- (A) Compute the residual $r_k(\theta) = q(\theta) - g_{k-1}(\theta)$.
- (B) Locate promising starting points (e.g., grid points with large positive residual), optimize $r_k$ to get the new component location $\mu_k$, and ensure the Hessian at $\mu_k$ is negative definite.
- (C) Compute the new covariance matrix $\Sigma_k = -\big[\nabla^2 r_k(\mu_k)\big]^{-1}$ and form the new component $\varphi(\theta; \mu_k, \Sigma_k)$.
- (D) Augment the grid with samples from the new component.
- (E) Recompute all weights via nonnegative quadratic programming.
- (F) Apply stopping criteria, based on grid error or relative normalization change.
- Output: The mixture $g_K(\theta) = \sum_{i=1}^{K} w_i\, \varphi(\theta; \mu_i, \Sigma_i)$ (Bornkamp, 2011).
Recommended parameter settings include a moderate grid size per component, keeping the method efficient in dimensions up to roughly $20$; the algorithm leverages nonnegative least squares (a convex QP), vectorized Hessian calculations, and log-scale evaluations to mitigate underflow. A sketch of the quasi-random grid construction used in the initialization and augmentation steps is given below.
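One plausible realization of the quasi-random grid step is to push a scrambled Sobol sequence through the standard normal quantile function and each component's Cholesky factor; the sketch below is an illustrative assumption about how this can be done, not the package's internals, and the sizes and seeds are arbitrary.

```python
# Sketch of quasi-random grid generation from Gaussian mixture components
# (illustrative; parameter choices are assumptions, not the package defaults).
import numpy as np
from scipy.stats import norm, qmc

def component_grid(mean, cov, n_points, seed=0):
    """Quasi-random draws from N(mean, cov) via a scrambled Sobol sequence."""
    d = len(mean)
    sobol = qmc.Sobol(d=d, scramble=True, seed=seed)
    u = sobol.random(n_points)               # points in the unit hypercube
    z = norm.ppf(u)                           # map to standard normal quantiles
    L = np.linalg.cholesky(cov)               # correlate and shift
    return mean + z @ L.T

# Build a grid by pooling points from every current component.
means = [np.zeros(2), np.array([3.0, 1.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
grid = np.vstack([component_grid(m, C, n_points=128, seed=i)
                  for i, (m, C) in enumerate(zip(means, covs))])
print(grid.shape)   # (256, 2)
```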
4. Mixture Structure, Weight Fitting, and Interpretative Remarks
The resulting approximation is an explicit sum of weighted normals,
$$g_K(\theta) = \sum_{i=1}^{K} w_i\, \varphi(\theta; \mu_i, \Sigma_i), \qquad w_i \geq 0,$$
with weights determined by nonnegative least squares minimization over the sampling grid, posed as a convex QP. This can be viewed as a “boosting”-like procedure, where each new component corrects the previous residual, and the fitting coefficients ensure that mass is allocated to the most poorly approximated portions of $q$ (Bornkamp, 2011).
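Because the refit is an ordinary nonnegative least squares problem, it can be handed to an off-the-shelf solver; the sketch below uses SciPy's NNLS routine on a toy two-component setup (the target, grid, and names are illustrative assumptions).

```python
# Sketch of the nonnegative least squares weight refit over a grid
# (toy target and grid; illustrative only).
import numpy as np
from scipy.optimize import nnls
from scipy.stats import multivariate_normal

def q(theta):
    """Unnormalized target at a single point (toy two-mode example)."""
    return (np.exp(-0.5 * np.sum((theta - 2.0) ** 2))
            + 0.7 * np.exp(-0.5 * np.sum((theta + 2.0) ** 2)))

# Current components (means/covariances) and a pooled evaluation grid.
means = [np.full(2, 2.0), np.full(2, -2.0)]
covs = [np.eye(2), np.eye(2)]
rng = np.random.default_rng(0)
grid = np.vstack([rng.multivariate_normal(m, C, size=200) for m, C in zip(means, covs)])

# Design matrix Phi[j, i] = phi(theta_j; mu_i, Sigma_i); response: q on the grid.
Phi = np.column_stack([multivariate_normal.pdf(grid, m, C) for m, C in zip(means, covs)])
y = np.apply_along_axis(q, 1, grid)

# Solve min_w ||y - Phi w||_2 subject to w >= 0.
weights, resid_norm = nnls(Phi, y)
print("weights:", weights.round(3), " residual norm:", round(resid_norm, 4))
```

Since each column of the design matrix is a normalized Gaussian density, the fitted weights here recover the (unnormalized) mass of the two target modes, roughly $2\pi$ and $0.7 \cdot 2\pi$.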
5. Theoretical Properties and Comparative Performance
Empirically, the recursive Laplace algorithm converges in a finite number of steps, capturing major modes and skewness in the target. Once the residual is small or nonpositive everywhere, or no negative definite Hessian can be discovered, iteration halts. While no closed-form error bounds exist, each step reduces the empirical discrepancy on the grid.
Relative to alternatives:
- Single-Gaussian Laplace approximations often yield poor effective sample size when used as importance sampling proposals (see the ESS sketch after this list).
- Adaptive importance sampling or mixture-based MCMC can match performance, but typically at substantially higher cost in terms of target-density ($q$) evaluations.
- Variational methods minimize a divergence criterion rather than a direct error and may misrepresent tail probability mass (Bornkamp, 2011).
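These effective-sample-size comparisons refer to importance sampling with the fitted approximation as the proposal. A minimal sketch of the diagnostic, assuming an illustrative banana-type target and a single-Gaussian Laplace proposal (not the paper's examples), is shown below.

```python
# Sketch of the effective-sample-size (ESS) diagnostic for importance sampling
# with a fitted Gaussian (or Gaussian-mixture) proposal; target and proposal
# here are illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal

def log_q(theta):
    """Unnormalized log target: 2-D banana-type density (illustrative)."""
    x, y = theta[..., 0], theta[..., 1]
    return -0.5 * x**2 - 0.5 * (y - 0.5 * x**2) ** 2

rng = np.random.default_rng(1)
n = 10_000

# Proposal: the single-Gaussian Laplace fit at the mode (mean 0, identity cov).
proposal = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
samples = proposal.rvs(size=n, random_state=rng)

# Self-normalized importance weights, computed on the log scale for stability.
log_w = log_q(samples) - proposal.logpdf(samples)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# ESS = 1 / sum(w_i^2); report as a fraction of the nominal sample size.
ess = 1.0 / np.sum(w**2)
print(f"ESS: {ess:.0f} of {n} ({100 * ess / n:.1f}%)")
```

Swapping the single-Gaussian proposal for the fitted mixture $g_K$ is what drives the ESS improvements reported in Section 7.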
6. Computational Cost, Scalability, and Implementation Advice
Computationally, each iteration incurs costs from grid generation, numerical optimization for the new component mean (typically a modest number of gradient calls), Hessian computation (which scales quadratically in the dimension for finite differences, and less when sparsity is exploited), and the quadratic program for the weight update. Practical recommendations include:
- Employing quasi-Monte Carlo (Sobol sequences) for grid sampling.
- Operating in logarithmic space (see the log-sum-exp sketch after this list).
- Restricting optimization starting points, e.g., via $k$-means clustering on high-residual regions.
- Pre-scanning temperature-tempered residuals when many well-separated modes exist.
- Imposing diagonal or low-rank covariance structure if inversion is a bottleneck (Bornkamp, 2011).
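For the log-space recommendation, mixture densities (and hence importance weights) can be evaluated through a log-sum-exp reduction so that tail values never underflow to zero; the sketch below, with arbitrary illustrative components, shows the idea.

```python
# Sketch of evaluating a Gaussian mixture on the log scale with log-sum-exp,
# which avoids underflow far in the tails (illustrative components and points).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_mixture(theta, weights, means, covs):
    """log g(theta) = logsumexp_i [ log w_i + log phi(theta; mu_i, Sigma_i) ]."""
    log_terms = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(theta, m, C)
        for w, m, C in zip(weights, means, covs) if w > 0.0
    ])
    return logsumexp(log_terms, axis=1)

weights = [0.6, 0.4]
means = [np.zeros(2), np.full(2, 30.0)]
covs = [np.eye(2), np.eye(2)]

theta = np.array([[0.0, 0.0], [30.0, 30.0], [60.0, 60.0]])
print(log_mixture(theta, weights, means, covs))
# A direct (non-log) evaluation of g at [60, 60] would underflow to 0;
# the log value stays finite.
```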
7. Applications and Empirical Diagnostics
Empirical studies demonstrate the strengths of recursive Laplace approximation across several test cases:
- For a bivariate skew-$t$ density, a single Gaussian achieves only a small effective sample size (ESS), while iterLap with 9 components reaches a substantially higher ESS.
- For a three-mode Gaussian mixture, iterLap (5 components) yields a far larger ESS than standard Laplace.
- For a 10-dimensional “banana” distribution, the method attains a high ESS (11 components; 19,000 target-density evaluations), compared with roughly 5% ESS for a single Laplace approximation and 100,000 evaluations for an adaptive importance sampling competitor.
- In a nonlinear regression scenario (NIST ENSO data, 168 observations, 11 parameters), iterLap-IS outperforms both componentwise adaptive Gibbs and optimally tuned Metropolis MCMC in effective sample size, estimation error, and multivariate posterior fit, using roughly half the number of target-density evaluations of the MCMC approaches.
Typical runtimes are tractable even for high-dimensional problems, e.g., $2.5$ seconds in R for a 10D artificial example (Bornkamp, 2011). The method is implemented in the R package iterLap, available on CRAN.
References
- Bornkamp, B. (2011). "Approximating Probability Densities by Iterated Laplace Approximations." Journal of Computational and Graphical Statistics, 20(4), 817–837.