Sparse Variational Bayesian GAMs

Updated 23 May 2026
  • The paper introduces a scalable framework that uses sparse variational inference with inducing points to efficiently model nonparametric additive functions via GP priors.
  • Each additive component receives its own independent GP prior and set of inducing points, which keeps inference tractable while preserving interpretability.
  • The approach achieves competitive predictive performance with well-calibrated uncertainty by leveraging a structured low-rank variational precision and additive decompositions.

Sparse Variational Bayesian Generalized Additive Models (GAMs) are a scalable and interpretable Bayesian framework for modeling nonparametric additive functions, leveraging Gaussian process (GP) priors and sparse variational inference. Each additive component is given an independent GP prior, enabling flexible modeling while preserving interpretability. Sparse variational approximations keep inference tractable in high dimensions and on large datasets by exploiting the additive structure of the GAM and the efficiency of inducing-point methods. The framework, detailed by Adam et al. (2018), retains posterior coupling among additive components and delivers well-calibrated uncertainty, interpretable function decompositions, and favorable scaling properties.

1. Model Formulation

The generative model considers inputs $x_n = (x_{n1},\dots,x_{nD}) \in \mathbb{R}^D$ and outputs $y_n \in \mathbb{R}$ for $n=1,\dots,N$, with an additive predictor:

$$f(x_n) = \sum_{d=1}^D f_d(x_{nd}),$$

where each $f_d$ represents an unknown smooth function. The conditional likelihood factorizes:

$$p(y_n \mid f(x_n)) = p(y_n \mid \rho_n), \quad \rho_n = \sum_{d=1}^D f_d(x_{nd}).$$

Each additive component is assigned an independent zero-mean GP prior:

$$f_d \sim \mathcal{GP}(0, k_d(\cdot, \cdot; \theta_d)),$$

with $\theta_d$ denoting kernel hyperparameters.
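The additive predictor can be illustrated with a few lines of NumPy. This is a minimal sketch: the three component functions, the noise level, and the helper name `additive_predictor` are hypothetical choices made for illustration, not taken from Adam et al.

```python
import numpy as np

def additive_predictor(X, components):
    """Return rho_n = sum_d f_d(x_nd) for every row of X (shape N x D)."""
    return sum(f_d(X[:, d]) for d, f_d in enumerate(components))

# Hypothetical component functions, chosen only for illustration.
components = [np.sin, np.square, lambda x: 0.5 * x]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 3))    # N = 5 inputs, D = 3 dimensions
rho = additive_predictor(X, components)    # additive predictor, shape (N,)
y = rho + 0.1 * rng.standard_normal(5)     # Gaussian likelihood p(y_n | rho_n)
```

Each component sees only its own input column, which is what makes the individual contributions directly inspectable.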

2. Sparse Gaussian Process Representation

To facilitate scalable inference, each $f_d$ is represented via $M$ inducing points $Z_d = \{z_{d,m}\}_{m=1}^M$. The associated inducing variables are

$$u_d = f_d(Z_d) \in \mathbb{R}^M,$$

and they are collected as $u = (u_1, \dots, u_D) \in \mathbb{R}^{MD}$. The joint prior over function values and inducing variables factorizes: $p(f, u) = \prod_{d=1}^D p(f_d \mid u_d)\, p(u_d)$, where $p(u_d) = \mathcal{N}(u_d \mid 0, K_{Z_d Z_d})$ and

$$p(f_d(x) \mid u_d) = \mathcal{N}\big(k_d(x, Z_d) K_{Z_d Z_d}^{-1} u_d,\; k_d(x, x) - k_d(x, Z_d) K_{Z_d Z_d}^{-1} k_d(Z_d, x)\big).$$

Here, $K_{Z_d Z_d}$ is the $M \times M$ kernel matrix over inducing locations; $k_d(x, Z_d)$ is a row vector of cross-covariances for test input $x$.
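The conditional of one component given its inducing variables can be sketched as follows. The squared-exponential kernel, its lengthscale, the jitter term, and the function names are assumptions made for this illustration; the source does not prescribe them.

```python
import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=0.3):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def conditional(x, Z, u, jitter=1e-8):
    """Mean and variance of p(f_d(x) | u_d) for one additive component."""
    Kzz = se_kernel(Z, Z) + jitter * np.eye(len(Z))   # M x M inducing kernel matrix
    Kxz = se_kernel(x, Z)                             # cross-covariances k(x, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T                 # rows are k(x, Z) Kzz^{-1}
    mean = A @ u
    var = se_kernel(x, x).diagonal() - np.sum(A * Kxz, axis=1)
    return mean, var

Z = np.linspace(-1.0, 1.0, 8)                 # M = 8 inducing locations
u = np.sin(3.0 * Z)                           # example inducing values
mean_at_Z, var_at_Z = conditional(Z, Z, u)    # evaluate at the inducing points
```

Evaluated at the inducing locations themselves, the conditional mean reproduces `u` and the conditional variance collapses to (numerically) zero, which is a quick sanity check on the formulas.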

3. Variational Posterior Structure

Posterior inference is addressed by variational approximation. The true posterior $p(f, u \mid y)$ is approximated with

$$q(f, u) = p(f \mid u)\, q(u),$$

so that only the inducing variables $u$ have free variational parameters. The variational distribution is taken as

$$q(u) = \mathcal{N}(u \mid m, S).$$

Directly parameterizing a fully coupled posterior, with a dense $MD \times MD$ covariance $S$, is computationally infeasible. Instead, Adam et al. parameterize the inverse covariance (precision) $S^{-1}$ in an additive low-rank form. This structure keeps storage tractable while capturing cross-component posterior dependencies.

4. Objective and Predictive Quantities

Variational parameters are optimized by maximizing the evidence lower bound (ELBO):

$$\mathcal{L} = \sum_{n=1}^N \mathbb{E}_{q(\rho_n)}\big[\log p(y_n \mid \rho_n)\big] - \mathrm{KL}\big[q(u) \,\|\, p(u)\big].$$

For Gaussian likelihoods, the expectation can be computed exactly; for non-Gaussian cases, one uses quadrature or stochastic sampling.
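The two ways of computing the expected log-likelihood term can be compared directly. The sketch below assumes the marginal $q(\rho_n)$ is Gaussian with mean `mu` and variance `var`; the function names and the Bernoulli (logistic-link) example are illustrative choices, not part of the source.

```python
import numpy as np

def gaussian_expected_loglik(y, mu, var, noise_var):
    """Closed form of E_{N(rho | mu, var)}[log N(y | rho, noise_var)]."""
    return (-0.5 * np.log(2.0 * np.pi * noise_var)
            - 0.5 * ((y - mu) ** 2 + var) / noise_var)

def quadrature_expected_loglik(log_lik, y, mu, var, num_points=20):
    """Gauss-Hermite estimate of E_{N(rho | mu, var)}[log_lik(y, rho)]."""
    nodes, weights = np.polynomial.hermite.hermgauss(num_points)
    rho = mu + np.sqrt(2.0 * var) * nodes          # change of variables
    return np.sum(weights * log_lik(y, rho)) / np.sqrt(np.pi)

def bernoulli_log_lik(y, rho):
    """Log-likelihood of y in {0, 1} under a logistic link (non-Gaussian case)."""
    return y * rho - np.logaddexp(0.0, rho)

noise_var = 0.25

def gauss_log_lik(y, rho):
    """Log-density of a Gaussian likelihood with fixed noise variance."""
    return -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y - rho) ** 2 / noise_var

exact = gaussian_expected_loglik(1.0, 0.3, 0.5, noise_var)
approx = quadrature_expected_loglik(gauss_log_lik, 1.0, 0.3, 0.5)
bernoulli_term = quadrature_expected_loglik(bernoulli_log_lik, 1.0, 0.3, 0.5)
```

For the Gaussian likelihood the integrand is quadratic in $\rho$, so the quadrature result agrees with the closed form to machine precision; the same quadrature routine then handles the Bernoulli case, where no closed form is available.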

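The predictive mean and variance for input $x_*$ are

$$\mu(x_*) = \sum_{d=1}^D k_d(x_{*d}, Z_d) K_{Z_d Z_d}^{-1} m_d,$$

$$\sigma^2(x_*) = \sum_{d=1}^D \big[k_d(x_{*d}, x_{*d}) - k_d(x_{*d}, Z_d) K_{Z_d Z_d}^{-1} k_d(Z_d, x_{*d})\big] + a_*^\top S\, a_*,$$

where $m_d$ is the block of the variational mean $m$ belonging to component $d$, $S$ is the covariance of $q(u)$, and $a_*$ stacks the vectors $K_{Z_d Z_d}^{-1} k_d(Z_d, x_{*d})$ for $d = 1, \dots, D$.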
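A dense reference implementation of the predictive moments under $q(u) = \mathcal{N}(m, S)$ may help make the role of the coupled covariance concrete. This sketch assumes a unit-variance squared-exponential kernel for every component and stores $S$ as a full matrix; it does not use the structured precision of Adam et al., and all names are hypothetical.

```python
import numpy as np

def se_kernel(a, b, lengthscale=0.3):
    """Unit-variance squared-exponential kernel between 1-D arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def predict(x_star, Zs, m, S, jitter=1e-8):
    """Predictive mean and variance of sum_d f_d(x_star[d]) under q(u) = N(m, S).

    x_star: array of shape (D,); Zs: list of D arrays of inducing locations;
    m: stacked variational mean (D*M,); S: full covariance (D*M, D*M).
    """
    a_blocks, prior_var = [], 0.0
    for d, Z in enumerate(Zs):
        Kzz = se_kernel(Z, Z) + jitter * np.eye(len(Z))
        kzx = se_kernel(Z, x_star[d:d + 1])[:, 0]   # k_d(Z_d, x_*d)
        a_d = np.linalg.solve(Kzz, kzx)             # Kzz^{-1} k_d(Z_d, x_*d)
        prior_var += 1.0 - kzx @ a_d                # k_d(x, x) = 1 for this kernel
        a_blocks.append(a_d)
    a = np.concatenate(a_blocks)
    return a @ m, prior_var + a @ S @ a             # cross-blocks of S enter here

D, M = 2, 6
Zs = [np.linspace(-1.0, 1.0, M) for _ in range(D)]
S_prior = np.zeros((D * M, D * M))
for d, Z in enumerate(Zs):
    block = slice(d * M, (d + 1) * M)
    S_prior[block, block] = se_kernel(Z, Z) + 1e-8 * np.eye(M)
mean, var = predict(np.array([0.2, -0.4]), Zs, np.zeros(D * M), S_prior)
```

Setting $q(u)$ equal to the prior, as in the usage lines, recovers the prior predictive: zero mean and a variance equal to the sum of the component kernel variances (here 2).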

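The KL divergence term is

$$\mathrm{KL}\big[q(u) \,\|\, p(u)\big] = \tfrac{1}{2}\Big[\mathrm{tr}\big(K_{uu}^{-1} S\big) + m^\top K_{uu}^{-1} m - MD + \log\det K_{uu} - \log\det S\Big],$$

where $K_{uu} = \mathrm{blockdiag}(K_{Z_1 Z_1}, \dots, K_{Z_D Z_D})$ is the prior covariance of the stacked inducing variables. With the structured precision, this term is evaluated efficiently.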
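For reference, the KL divergence between a Gaussian $q(u) = \mathcal{N}(m, S)$ and a zero-mean Gaussian prior can be computed with two Cholesky factorizations. This is a generic dense sketch with cubic cost in the size of $u$; it deliberately ignores the structured precision, and the function name is an assumption.

```python
import numpy as np

def gaussian_kl(m, S, K):
    """KL[N(m, S) || N(0, K)] for positive-definite S and K."""
    L_K = np.linalg.cholesky(K)
    L_S = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L_K, m)                  # L_K^{-1} m
    B = np.linalg.solve(L_K, L_S)                    # L_K^{-1} L_S
    trace = np.sum(B ** 2)                           # tr(K^{-1} S)
    mahalanobis = alpha @ alpha                      # m^T K^{-1} m
    logdet_K = 2.0 * np.sum(np.log(np.diag(L_K)))
    logdet_S = 2.0 * np.sum(np.log(np.diag(L_S)))
    return 0.5 * (trace + mahalanobis - len(m) + logdet_K - logdet_S)

K = np.array([[2.0, 0.5], [0.5, 1.0]])
kl_same = gaussian_kl(np.zeros(2), K, K)             # q equals the prior
kl_shift = gaussian_kl(np.array([1.0, 0.0]), K, K)   # only the mean differs
```

When $q$ equals the prior the divergence vanishes, and when only the mean differs it reduces to $\tfrac{1}{2} m^\top K^{-1} m$, which gives two simple checks on an implementation.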

5. Optimization and Computational Complexity

All variational parameters of $q(u)$ and the kernel hyperparameters $\{\theta_d\}_{d=1}^D$ are learned via (stochastic) gradient ascent on the ELBO, typically using automatic differentiation. If the likelihood factorizes over data points, mini-batches facilitate efficient stochastic optimization.

Per iteration (full batch), the computational cost has three parts:

  • Storage: the variational parameters of $q(u)$, whose size is set by the structured precision.
  • KL evaluation: dominated by Cholesky factorization.
  • ELBO expectation: dominated by the predictive variances at all data points and the cross-component terms.

With mini-batching, each step uses only a subset of the $N$ data points, so the cost of the likelihood expectation per stochastic gradient update scales with the mini-batch size rather than with $N$.
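The mini-batch estimate of the ELBO rescales the sampled likelihood terms by the ratio of dataset size to batch size, which keeps the estimator unbiased. The sketch below uses stand-in numbers for the per-point expected log-likelihoods and the KL value; the symbols `N`, `B`, and the function name are illustrative assumptions.

```python
import numpy as np

def minibatch_elbo(expected_loglik, kl, batch_idx, num_data):
    """Unbiased ELBO estimate from a mini-batch of per-point likelihood terms."""
    scale = num_data / len(batch_idx)
    return scale * np.sum(expected_loglik[batch_idx]) - kl

rng = np.random.default_rng(0)
N, B = 1000, 50
terms = -rng.gamma(2.0, size=N)     # stand-in per-point expected log-likelihoods
kl = 3.0                            # stand-in KL value
full = np.sum(terms) - kl           # full-batch ELBO
estimates = [minibatch_elbo(terms, kl, rng.choice(N, B, replace=False), N)
             for _ in range(2000)]
```

Averaged over many random batches, the estimates concentrate around the full-batch value, while each individual estimate touches only `B` of the `N` likelihood terms.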

6. Calibration, Uncertainty, and Interpretability

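Given variational parameters $(m, S)$, the posterior for each $f_d$ is Gaussian:

$$q(f_d(x)) = \mathcal{N}\big(\mu_d(x), \sigma_d^2(x)\big),$$

$$\mu_d(x) = k_d(x, Z_d) K_{Z_d Z_d}^{-1} m_d, \qquad \sigma_d^2(x) = k_d(x, x) - k_d(x, Z_d) K_{Z_d Z_d}^{-1} \big(K_{Z_d Z_d} - S_{dd}\big) K_{Z_d Z_d}^{-1} k_d(Z_d, x),$$

where $m_d$ and $S_{dd}$ are the blocks of $m$ and $S$ belonging to component $d$. Summing the component means, and adding the contributions of the cross-component blocks of $S$ to the summed variances, gives the marginal over $f(x)$. Calibration and uncertainty quantification are conducted via posterior credible-interval coverage, CRPS, and empirical residual analysis. The additive decomposition preserves interpretable component-wise contributions to predictions.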
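Two of the calibration diagnostics named above have simple closed forms when the predictive distribution is Gaussian. The following is a minimal sketch on simulated data: the data-generating setup and function names are assumptions, and the CRPS expression is the standard closed form for a Gaussian predictive.

```python
import math
import numpy as np

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian predictive N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def interval_coverage(y, mu, sigma, z_crit=1.959963984540054):
    """Fraction of observations inside the central 95% credible interval."""
    return float(np.mean(np.abs(y - mu) <= z_crit * sigma))

rng = np.random.default_rng(0)
mu = rng.normal(size=5000)
sigma = np.full(5000, 0.7)
y = mu + sigma * rng.standard_normal(5000)    # data drawn from the predictive itself
coverage = interval_coverage(y, mu, sigma)
mean_crps = np.mean([gaussian_crps(a, b, c) for a, b, c in zip(y, mu, sigma)])
```

Because the simulated observations are drawn from the predictive distribution itself, the empirical coverage should sit close to the nominal 95%; a model whose coverage falls well below or above the nominal level is over- or under-confident, respectively.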

7. Empirical Evaluation and Comparative Summary

Adam et al. demonstrate the methodology on a synthetic six-dimensional regression task. The kernel is constructed in ANOVA fashion from centered squared-exponential (SE) kernels, combining univariate terms with an interaction term. Each univariate kernel has its own set of inducing points, and the interaction term places its inducing points on a grid.

Empirical findings indicate that the proposed method matches or improves on the predictive log-likelihood and RMSE of mean-field sparse GPs (Saul et al. 2016) and fully-coupled precision-parameter GPs (Adam et al. 2017), yields well-calibrated credible intervals, and scales linearly with the number of data points $N$ in the expectation step. The additive structure retains full interpretability.

By leveraging additive model structure, per-component inducing variables, and a structured low-rank variational precision, the approach achieves effective and interpretable Bayesian sparse GP GAM inference, while preserving calibrated uncertainty and interpretable component-wise posterior estimates (Adam et al., 2018).
