Sparse Variational Bayesian GAMs

Updated 23 May 2026

The paper introduces a scalable framework that uses sparse variational inference with inducing points to efficiently model nonparametric additive functions via GP priors.
Sparse Variational Bayesian GAMs place an independent GP prior on each additive component, keeping inference tractable while preserving interpretability.
The approach achieves competitive predictive performance with well-calibrated uncertainty by leveraging a structured low-rank variational precision and additive decompositions.

Sparse Variational Bayesian Generalized Additive Models (GAMs) are a scalable and interpretable Bayesian framework for modeling nonparametric additive functions, using Gaussian process (GP) priors and sparse variational inference. Each additive component receives its own independent GP prior, which keeps the model flexible while preserving interpretability. Sparse variational approximations keep inference tractable in high dimensions and on large datasets by exploiting the structure of the GAM and the efficiency of inducing-point methods. The framework, detailed by Adam et al. (2018), retains full posterior coupling among additive components and delivers well-calibrated uncertainty, interpretable function decompositions, and favorable scaling properties.

1. Model Formulation

The generative model considers inputs $x_n = (x_{n1},\dots,x_{nD}) \in \mathbb{R}^D$ and outputs $y_n \in \mathbb{R}$ for $n=1,\dots,N$, with an additive predictor $f(x_n) = \sum_{d=1}^D f_d(x_{nd})$, where each $f_d$ is an unknown smooth function. The likelihood factorizes over data points and depends on the inputs only through the predictor: $p(y_n \mid f(x_n)) = p(y_n \mid \rho_n)$ with $\rho_n = \sum_{d=1}^D f_d(x_{nd})$. Each additive component is assigned an independent zero-mean GP prior, $f_d \sim \mathcal{GP}(0, k_d(\cdot, \cdot; \theta_d))$, with $\theta_d$ denoting kernel hyperparameters.
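The generative model can be illustrated by drawing each $f_d$ from its GP prior and summing the draws. The following is a minimal sketch, not the authors' code: the squared-exponential kernel, the lengthscales, the jitter, and the Gaussian noise level are all assumptions chosen for illustration.

```python
import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_additive_prior(X, lengthscales, rng, jitter=1e-6):
    """Draw one prior sample of every component f_d at the inputs X[:, d].

    Returns F with F[n, d] = f_d(x_nd) and the additive predictor rho = F.sum(axis=1).
    """
    N, D = X.shape
    F = np.empty((N, D))
    for d in range(D):
        K = se_kernel(X[:, d], X[:, d], lengthscale=lengthscales[d])
        L = np.linalg.cholesky(K + jitter * np.eye(N))  # jitter for numerical stability
        F[:, d] = L @ rng.standard_normal(N)            # independent draw per component
    return F, F.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(50, 3))
F, rho = sample_additive_prior(X, lengthscales=[0.5, 1.0, 2.0], rng=rng)
y = rho + 0.1 * rng.standard_normal(50)  # assumed Gaussian likelihood, noise std 0.1
```

Each column of `F` is one interpretable component evaluated at its own input coordinate; only their sum enters the likelihood.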

2. Sparse Gaussian Process Representation

To facilitate scalable inference, each $f_d$ is represented via $M$ inducing points $Z_d = \{z_{d,m}\}_{m=1}^M$. The associated inducing variables are

$u_d = f_d(Z_d) \in \mathbb{R}^M,$

and the collection is $u = (u_1, \dots, u_D) \in \mathbb{R}^{MD}$. The joint prior over function values and inducing variables factorizes: $p(f, u) = \prod_{d=1}^D p(f_d \mid u_d)\, p(u_d)$, where $p(u_d) = \mathcal{N}(0, K_d)$ and

$p(f_d(x) \mid u_d) = \mathcal{N}\big(k_d(x, Z_d) K_d^{-1} u_d,\; k_d(x, x) - k_d(x, Z_d) K_d^{-1} k_d(Z_d, x)\big).$

Here, $K_d = k_d(Z_d, Z_d)$ is the $M \times M$ kernel matrix over inducing locations; $k_d(x, Z_d)$ is a row vector of cross-covariances for test input $x$.
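The conditional of one component given its inducing variables is the standard GP conditional: the mean is the cross-covariance row times the inverse inducing kernel matrix times $u_d$, and the variance is the prior variance minus the part explained by the inducing points. A minimal NumPy sketch follows, assuming a squared-exponential kernel and hypothetical function names.

```python
import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def conditional_given_inducing(x, Z, u, jitter=1e-6):
    """Mean and marginal variance of p(f_d(x) | u_d) for 1-D inputs x."""
    Kzz = se_kernel(Z, Z) + jitter * np.eye(len(Z))  # M x M inducing kernel matrix
    Kxz = se_kernel(x, Z)                            # rows of cross-covariances
    L = np.linalg.cholesky(Kzz)
    A = np.linalg.solve(L, Kxz.T)                    # L^{-1} K_zx, shape (M, N)
    mean = A.T @ np.linalg.solve(L, u)               # K_xz Kzz^{-1} u
    var = np.diag(se_kernel(x, x)) - np.sum(A * A, axis=0)
    return mean, var

Z = np.linspace(-2.0, 2.0, 5)   # assumed inducing inputs
u = np.sin(Z)                   # assumed inducing values, for illustration only
x = np.linspace(-3.0, 3.0, 7)
mean, var = conditional_given_inducing(x, Z, u)
```

At the inducing inputs themselves the conditional mean reproduces `u` and the variance collapses to roughly the jitter level, which is a quick sanity check on the implementation.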

3. Variational Posterior Structure

Posterior inference is addressed by variational approximation. The true posterior $p(f, u \mid y)$ is approximated with

$q(f, u) = p(f \mid u)\, q(u),$

so that only the inducing variables $u$ have free variational parameters. The variational distribution is taken as

$q(u) = \mathcal{N}(u \mid m, S).$

Directly optimizing a fully coupled posterior, with a dense covariance $S \in \mathbb{R}^{MD \times MD}$, is computationally infeasible for large $MD$. Instead, Adam et al. parameterize the inverse covariance (precision) $S^{-1}$ in an additive low-rank form. This structure keeps storage tractable while capturing cross-component posterior dependencies.

4. Objective and Predictive Quantities

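Variational parameters are optimized by maximizing the evidence lower bound (ELBO): $\mathcal{L} = \sum_{n=1}^N \mathbb{E}_{q(\rho_n)}[\log p(y_n \mid \rho_n)] - \mathrm{KL}[q(u) \,\|\, p(u)]$, where $q(\rho_n)$ is the Gaussian marginal of the additive predictor under the variational posterior. For Gaussian likelihoods, the expectation can be computed exactly; for non-Gaussian cases, one uses quadrature or stochastic sampling.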
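The two ways of computing the per-point expectation can be compared directly. The sketch below assumes the marginal of the predictor is Gaussian with mean `mu` and variance `var`. It gives the closed form for a Gaussian likelihood and a Gauss-Hermite rule for a general likelihood. The Bernoulli-logistic likelihood and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def expected_loglik_gaussian(y, mu, var, noise_var):
    """Closed form of E_{N(rho | mu, var)}[log N(y | rho, noise_var)]."""
    return (-0.5 * np.log(2.0 * np.pi * noise_var)
            - 0.5 * ((y - mu) ** 2 + var) / noise_var)

def expected_loglik_quadrature(log_lik, y, mu, var, num_points=20):
    """Gauss-Hermite estimate of E_{N(rho | mu, var)}[log_lik(y, rho)]."""
    nodes, weights = hermgauss(num_points)           # rule for weight exp(-t^2)
    # Change of variables rho = mu + sqrt(2 var) * t maps the rule to N(mu, var).
    rho = mu[:, None] + np.sqrt(2.0 * var)[:, None] * nodes[None, :]
    return (log_lik(y[:, None], rho) @ weights) / np.sqrt(np.pi)

def bernoulli_log_lik(y, rho):
    """log p(y | rho) for y in {0, 1} with a logistic link (illustrative choice)."""
    return y * rho - np.logaddexp(0.0, rho)

y = np.array([0.0, 1.0, 1.0])
mu = np.array([-0.5, 0.2, 1.0])
var = np.array([0.3, 0.1, 0.5])
elbo_data_term = expected_loglik_quadrature(bernoulli_log_lik, y, mu, var).sum()
```

Because the Gaussian log-likelihood is quadratic in the predictor, the quadrature rule reproduces the closed form to rounding error, which makes the Gaussian case a convenient correctness check for the quadrature path.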

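The predictive mean and variance of the latent function for input $x_*$ are

$\mu(x_*) = \sum_{d=1}^D k_d(x_{*d}, Z_d) K_d^{-1} m_d,$

$\sigma^2(x_*) = \sum_{d=1}^D \big[k_d(x_{*d}, x_{*d}) - k_d(x_{*d}, Z_d) K_d^{-1} k_d(Z_d, x_{*d})\big] + \sum_{d=1}^D \sum_{d'=1}^D k_d(x_{*d}, Z_d) K_d^{-1} S_{dd'} K_{d'}^{-1} k_{d'}(Z_{d'}, x_{*d'}),$

where $K_d = k_d(Z_d, Z_d)$ is the kernel matrix at the inducing inputs $Z_d$, $m_d$ is the block of the variational mean for component $d$, and $S_{dd'}$ is the $(d, d')$ block of the variational covariance.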
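A compact way to compute these predictive quantities is to stack the per-component weight vectors $a_d = K_d^{-1} k_d(Z_d, x_{*d})$ into one vector $a$. The mean is then $a^\top m$, and the variance is the summed prior conditional variances plus $a^\top S a$. The sketch below assumes a unit-variance squared-exponential kernel and a dense covariance `S` of $q(u)$. It does not implement the structured precision of Adam et al.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential kernel between 1-D arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def predict_additive(x_star, Z, m, S, jitter=1e-6):
    """Predictive mean and variance of f(x*) = sum_d f_d(x*_d).

    x_star: (D,) test input; Z: (D, M) inducing inputs per component;
    m: (D*M,) mean of q(u); S: (D*M, D*M) dense covariance, blocks ordered by d.
    """
    D, M = Z.shape
    a = np.empty(D * M)
    prior_var = 0.0
    for d in range(D):
        Kzz = se_kernel(Z[d], Z[d]) + jitter * np.eye(M)
        kzx = se_kernel(Z[d], x_star[d:d + 1])[:, 0]
        a_d = np.linalg.solve(Kzz, kzx)       # K_d^{-1} k_d(Z_d, x*_d)
        a[d * M:(d + 1) * M] = a_d
        prior_var += 1.0 - kzx @ a_d          # k_d(x, x) = 1 for this kernel
    mean = a @ m
    var = prior_var + a @ S @ a               # a^T S a includes cross-component blocks
    return mean, var

# Sanity check: with q(u) equal to the prior, the prediction reverts to the prior.
D, M = 3, 5
Z = np.tile(np.linspace(-2.0, 2.0, M), (D, 1))
S_prior = np.zeros((D * M, D * M))
for d in range(D):
    S_prior[d * M:(d + 1) * M, d * M:(d + 1) * M] = se_kernel(Z[d], Z[d]) + 1e-6 * np.eye(M)
mean, var = predict_additive(np.array([0.3, -1.0, 1.5]), Z, np.zeros(D * M), S_prior)
```

In the sanity check the variance equals the number of components, since each unit-variance prior contributes one unit and the components are independent a priori.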

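The KL divergence term is

$\mathrm{KL}[q(u) \,\|\, p(u)] = \tfrac{1}{2}\big[\mathrm{tr}(K^{-1} S) + m^\top K^{-1} m - MD + \log|K| - \log|S|\big],$

where $K = \mathrm{blockdiag}(K_1, \dots, K_D)$ is the prior covariance of $u$ and $(m, S)$ are the mean and covariance of $q(u)$. With the structured precision, this term can be evaluated efficiently.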
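The KL term between a Gaussian $q(u) = \mathcal{N}(m, S)$ and a block-diagonal zero-mean Gaussian prior can be written with per-block operations for everything except the log-determinant of $S$. The sketch below is a plain reference implementation with a dense `S`. It is an assumption-laden illustration that does not exploit the structured precision of the paper.

```python
import numpy as np

def kl_q_to_prior(m, S, K_blocks):
    """KL[N(m, S) || N(0, blockdiag(K_1, ..., K_D))] for a dense covariance S."""
    total_dim = m.shape[0]
    trace_term = 0.0
    maha_term = 0.0
    logdet_prior = 0.0
    start = 0
    for K in K_blocks:
        stop = start + K.shape[0]
        L = np.linalg.cholesky(K)
        # tr(K^{-1} S) only needs the diagonal blocks of S when K is block-diagonal.
        trace_term += np.trace(np.linalg.solve(K, S[start:stop, start:stop]))
        alpha = np.linalg.solve(L, m[start:stop])
        maha_term += alpha @ alpha                    # m_d^T K_d^{-1} m_d
        logdet_prior += 2.0 * np.sum(np.log(np.diag(L)))
        start = stop
    _, logdet_q = np.linalg.slogdet(S)
    return 0.5 * (trace_term + maha_term - total_dim + logdet_prior - logdet_q)

# One-dimensional check: KL[N(1, 1) || N(0, 1)] = 0.5.
kl_1d = kl_q_to_prior(np.array([1.0]), np.array([[1.0]]), [np.array([[1.0]])])
```

The prior contributes only block-wise Cholesky factors, so the cost that remains to be controlled is the log-determinant of the coupled variational covariance.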

5. Optimization and Computational Complexity

All variational parameters of $q(u)$ and kernel hyperparameters $\{\theta_d\}_{d=1}^D$ are learned via (stochastic) gradient ascent on the ELBO, typically using automatic differentiation. If the likelihood $p(y \mid \rho) = \prod_{n=1}^N p(y_n \mid \rho_n)$ factorizes over data points, mini-batches of size $B \ll N$ facilitate efficient stochastic optimization.

Computational cost per iteration (full batch) breaks down as follows:

Storage: dominated by the variational parameters.
KL evaluation: dominated by Cholesky factorization.
ELBO expectation: dominated by the predictive variances across all $D$ components and the cross-component terms.

With mini-batching ($B$ data points per step, $B \ll N$), the cost of the likelihood expectation per stochastic gradient update scales with $B$ rather than $N$.
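Mini-batching works because the data term of the ELBO is a sum over data points. A uniformly sampled batch, rescaled by the ratio of dataset size to batch size, gives an unbiased estimate of that sum. The sketch below uses a hypothetical `per_point_term` callable as a stand-in for the expected log-likelihood of each point, and a fixed `kl_value`. Both are placeholders for illustration.

```python
import numpy as np

def minibatch_elbo(per_point_term, kl_value, X, y, batch_size, rng):
    """Unbiased stochastic estimate of sum_n per_point_term(x_n, y_n) - KL."""
    N = X.shape[0]
    idx = rng.choice(N, size=batch_size, replace=False)
    # Rescale the mini-batch sum by N / B so its expectation equals the full sum.
    data_term = (N / batch_size) * np.sum(per_point_term(X[idx], y[idx]))
    return data_term - kl_value

def toy_term(X, y):
    """Stand-in for E_q[log p(y_n | rho_n)]; any per-point function works here."""
    return -0.5 * (y - X.sum(axis=1)) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(1000)
estimate = minibatch_elbo(toy_term, kl_value=2.0, X=X, y=y, batch_size=100, rng=rng)
```

The KL term is not rescaled, since it does not depend on the data and is evaluated once per step.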

6. Calibration, Uncertainty, and Interpretability

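Given variational parameters $(m, S)$, the posterior for each $f_d(x)$ is Gaussian: $q(f_d(x)) = \mathcal{N}(\mu_d(x), \sigma_d^2(x))$, with

$\mu_d(x) = k_d(x, Z_d) K_d^{-1} m_d, \qquad \sigma_d^2(x) = k_d(x, x) - k_d(x, Z_d) K_d^{-1} (K_d - S_{dd}) K_d^{-1} k_d(Z_d, x),$

where $K_d = k_d(Z_d, Z_d)$ is the kernel matrix at the inducing inputs $Z_d$, and $m_d$ and $S_{dd}$ are the mean and covariance blocks of $q(u)$ for component $d$.

Summing the component posteriors gives the marginal over $f(x) = \sum_{d=1}^D f_d(x_d)$, whose variance also includes the cross-component covariances induced by the off-diagonal blocks of $S$. Calibration and uncertainty quantification are assessed via posterior credible-interval coverage, CRPS, and empirical residual analysis. The additive decomposition preserves interpretable component-wise contributions to predictions.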
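Two of the calibration diagnostics named above have simple forms when the predictive distribution of each observation is Gaussian. The sketch below computes the closed-form CRPS of a Gaussian forecast and the empirical coverage of central 95% intervals. It assumes `sigma` is the predictive standard deviation of the observation (including observation noise), and the function names are hypothetical.

```python
import math

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at observation y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def interval_coverage(ys, mus, sigmas, z_crit=1.959964):
    """Fraction of observations inside the central 95% interval mu +/- z_crit * sigma."""
    inside = [abs(y - mu) <= z_crit * s for y, mu, s in zip(ys, mus, sigmas)]
    return sum(inside) / len(inside)

# Toy values for illustration only.
ys = [0.1, -0.4, 2.5, 0.9]
mus = [0.0, 0.0, 0.0, 1.0]
sigmas = [1.0, 0.5, 1.0, 0.2]
mean_crps = sum(gaussian_crps(y, m, s) for y, m, s in zip(ys, mus, sigmas)) / len(ys)
coverage = interval_coverage(ys, mus, sigmas)
```

For a well-calibrated model, the empirical coverage on held-out data should be close to the nominal 95% level.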

7. Empirical Evaluation and Comparative Summary

Adam et al. demonstrate the methodology on a synthetic six-dimensional regression task. The kernel is constructed in ANOVA fashion from centered SE kernels and includes an interaction term in addition to the univariate terms. Each univariate kernel has its own set of inducing points, while the interaction term uses a grid of inducing points.

Empirical findings indicate that the proposed method matches or improves on the predictive log-likelihood and RMSE of mean-field sparse GPs (Saul et al. 2016) and fully-coupled precision-parameter GPs (Adam et al. 2017), yields well-calibrated credible intervals, and scales linearly with $N$ and quadratically with $M$ in the expectation step. The additive structure retains full interpretability.

By leveraging additive model structure, per-component inducing variables, and a structured low-rank variational precision, the approach achieves effective and interpretable Bayesian sparse GP GAM inference, with a tractable ELBO and gradients, while preserving calibrated uncertainty and interpretable component-wise posterior estimates (Adam et al., 2018).


References (1)

Scalable GAM using sparse variational Gaussian processes (2018)
