Deep Gaussian Processes

Updated 26 December 2025
  • Deep Gaussian Processes are hierarchical, nonparametric models that compose multiple layers of Gaussian processes to capture complex, non-stationary patterns.
  • They employ scalable inference methods such as variational approximations and MCMC to manage intractable integrals and propagate uncertainty across layers.
  • Applications include regression, classification, and surrogate modeling, offering superior uncertainty calibration and predictive performance on complex datasets.

Deep Gaussian Processes (DGPs) are hierarchical, nonparametric models built by composing multiple layers of Gaussian processes, providing a principled Bayesian approach to learning flexible and highly expressive distributions over functions. By stacking GPs, DGPs inherit the uncertainty calibration, implicit capacity control, and robustness of shallow GPs, but can model data with complex, non-stationary, or multimodal structure that shallow GPs cannot capture. The mathematical and algorithmic advances associated with DGPs have yielded a family of tractable inference schemes, scalable surrogates for large data, and analytic connections to both deep neural networks and classical kernel methods.

1. Mathematical Structure of Deep Gaussian Processes

A deep Gaussian process of depth $L$ defines a composition of random functions:

$$f(x) = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}(x)$$

where each layer $f^{(l)}$ is a (vector-valued) Gaussian process, typically with independent draws per output channel:

$$f^{(l)}_j(\cdot) \sim \mathcal{GP}\big(0, k^{l}(\cdot, \cdot)\big), \quad j = 1, \dots, D^l$$

The input to the overall model is $x \in \mathbb{R}^{D^0}$ and the top layer returns the output of interest.

For a set of $N$ data points $X = [x_1, \dots, x_N]$, the Markovian structure implies a joint prior:

$$p\big(\{F^{(l)}\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(F^{(l)} \mid F^{(l-1)}\big)$$

with $F^{(0)} = X$ and $F^{(l)} = [f^{(l)}(F^{(l-1)}_1), \dots, f^{(l)}(F^{(l-1)}_N)]^{T}$. Observed outputs $Y$ are modeled as $p(Y \mid F^{(L)})$, e.g., via a Gaussian or categorical likelihood (Damianou et al., 2012, Salimbeni et al., 2017, Jakkala, 2021).

This hierarchical prior creates strong dependencies and enables highly flexible, non-linear, and non-stationary function mappings, at the expense of introducing intractable integrals for both prior marginals and posteriors.
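
While the posterior is intractable, the prior itself is easy to sample from by pushing draws from each layer through the next. Below is a minimal NumPy sketch, assuming zero-mean layers with squared-exponential kernels; the function names and hyperparameters are illustrative and not tied to any particular library.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between row-stacked inputs A (n x d) and B (m x d)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sample_gp_layer(F_in, out_dim, lengthscale, rng, jitter=1e-6):
    """One joint draw per output channel of a zero-mean GP evaluated at the rows of F_in."""
    K = rbf_kernel(F_in, F_in, lengthscale) + jitter * np.eye(F_in.shape[0])
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((F_in.shape[0], out_dim))

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200)[:, None]                       # F^(0) = X
F1 = sample_gp_layer(X,  out_dim=2, lengthscale=1.0, rng=rng)  # hidden layer F^(1)
F2 = sample_gp_layer(F1, out_dim=1, lengthscale=0.5, rng=rng)  # output layer F^(2)
# F2 is one draw from the two-layer DGP prior at X; its local smoothness varies with x.
```

Stacking more layers, or varying the per-layer lengthscales, produces draws whose smoothness changes across the input space, which is exactly the non-stationary behavior the hierarchy is meant to capture.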

2. Inference Schemes: Variational, MCMC, and Alternatives

Exact Bayesian inference in DGPs is analytically intractable because composing GPs breaks conjugacy and produces non-Gaussian, usually multimodal posteriors over both layer outputs and function-space values (Salimbeni et al., 2017, Havasi et al., 2018, Jakkala, 2021). Leading inference methods include:

Variational Inference (VI)

The dominant practical approach is variational inference using sparse inducing-point approximations in each layer:

$$q\big(\{F^{(l)}, U^{(l)}\}\big) = \prod_{l=1}^{L} p\big(F^{(l)} \mid U^{(l)}, F^{(l-1)}, Z^{(l-1)}\big)\, q\big(U^{(l)}\big)$$

where $q(U^{(l)})$ is a variational Gaussian and $Z^{(l-1)}$ are inducing inputs. The evidence lower bound (ELBO) is

$$\mathcal{L}_{\text{VI}} = \sum_{n=1}^{N} \mathbb{E}_{q(F^{L}_{n})}\big[\log p(y_n \mid F^{L}_{n})\big] - \sum_{l=1}^{L} \mathrm{KL}\big[q(U^{(l)}) \,\|\, p(U^{(l)})\big]$$

Stochastic gradient optimization combined with the “doubly-stochastic” scheme, which samples both over minibatches and through the layer hierarchy via the reparameterization trick, enables scaling to large $N$ (Salimbeni et al., 2017, Damianou et al., 2012).
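
A minimal sketch of the resulting estimator follows, assuming each layer exposes a callable returning the marginal mean and variance of $q(F^{(l)} \mid F^{(l-1)})$ (as the sparse-GP equations with a Gaussian $q(U^{(l)})$ would provide) and a Gaussian likelihood; all names and signatures here are illustrative.

```python
import numpy as np

def doubly_stochastic_elbo_estimate(X_batch, y_batch, layers, kl_terms,
                                    N_total, noise_var, rng):
    """Single-sample, single-minibatch estimate of the DGP ELBO (a sketch).

    Each element of `layers` is assumed to be a callable F_in -> (mean, var)
    giving the marginal mean and variance of q(F^(l) | F^(l-1)) at every row
    of F_in. `kl_terms` holds the values KL[q(U^(l)) || p(U^(l))].
    """
    F = X_batch
    for layer in layers:
        mean, var = layer(F)
        eps = rng.standard_normal(mean.shape)      # reparameterization trick:
        F = mean + np.sqrt(var) * eps              # propagate a sample to the next layer
    # Monte Carlo estimate of the expected Gaussian log-likelihood,
    # rescaled from the minibatch to the full dataset of size N_total.
    log_lik = -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y_batch - F) ** 2 / noise_var
    expected_ll = (N_total / X_batch.shape[0]) * log_lik.sum()
    return expected_ll - sum(kl_terms)
```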

Expectation Propagation (EP) and SEP

Expectation propagation (and its memory-efficient stochastic variant, SEP) iteratively approximates the joint posterior by matching moments, propagating Gaussian beliefs through each GP layer. Probabilistic backpropagation further enables efficient, layer-wise moment matching (Bui et al., 2015, Bui et al., 2016).
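
The core operation is propagating a Gaussian belief about a layer's input through that layer and retaining only the first two moments of the (generally non-Gaussian) output. The sketch below uses plain Monte Carlo in place of the analytic or quadrature moment computations that EP-style implementations employ; `layer_predict` is an assumed callable returning a layer's predictive mean and variance.

```python
import numpy as np

def propagate_moments_mc(mean_in, var_in, layer_predict, rng, n_samples=256):
    """Moment-match one GP layer's output given a Gaussian belief on its (scalar) input.

    `layer_predict(x)` is assumed to return the layer's predictive mean and variance
    at each input in x. The output distribution is non-Gaussian in general, so it is
    summarized by its first two moments.
    """
    x = mean_in + np.sqrt(var_in) * rng.standard_normal(n_samples)
    m, v = layer_predict(x)
    mean_out = m.mean()
    var_out = v.mean() + m.var()       # law of total variance
    return mean_out, var_out
```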

MCMC Methods

Stochastic gradient Hamiltonian Monte Carlo (SGHMC) and elliptical slice sampling provide a means to sample from the true, often multimodal, DGP posterior over inducing values, overcoming the limitations of unimodal variational approximations (Havasi et al., 2018).
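
For concreteness, here is a sketch of a single SGHMC update in the common momentum/friction parameterization, as it could be applied to the stacked inducing values of a DGP; the step size, friction, and gradient estimator are assumptions and would need careful tuning in practice.

```python
import numpy as np

def sghmc_step(theta, momentum, stoch_grad_neg_log_post, lr, friction, rng):
    """One SGHMC update in the common momentum/friction parameterization (a sketch).

    `stoch_grad_neg_log_post(theta)` should return a minibatch estimate of the
    gradient of the negative log posterior over theta (e.g. the stacked inducing
    values of a DGP). The injected noise keeps the dynamics sampling from an
    approximation of the posterior rather than merely optimizing it.
    """
    grad = stoch_grad_neg_log_post(theta)
    noise = np.sqrt(2.0 * friction * lr) * rng.standard_normal(theta.shape)
    momentum = (1.0 - friction) * momentum - lr * grad + noise
    theta = theta + momentum
    return theta, momentum
```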

Expressive Posteriors and Flows

Normalizing flows, especially convolutional normalizing flows, can be used to build richer variational posteriors over the set of all inducing variables and layers, capturing interlayer and within-layer correlations (Yu et al., 2021).

Amortized and Input-Dependent Inference

Amortized inference parameterizes the variational posterior with a function (often a neural network) that maps each observation to layerwise variational parameters, yielding scalable and highly expressive approximations with a significantly reduced total number of inducing points (Meng et al., 18 Sep 2024).
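
A minimal sketch of the amortization idea: a small encoder (here a one-hidden-layer MLP whose weights would be trained jointly with the DGP) maps each observation to a local mean and variance, so the number of free variational parameters no longer grows with $N$. All shapes and names are illustrative.

```python
import numpy as np

def amortized_variational_params(X, W1, b1, W2, b2):
    """Map each observation to local variational parameters via a tiny MLP encoder.

    W1, b1, W2, b2 are encoder weights trained jointly with the DGP; W2 is assumed
    to have two output columns, one for the mean and one for the log-variance of an
    input-dependent Gaussian at a given layer.
    """
    H = np.tanh(X @ W1 + b1)           # shared hidden representation per observation
    out = H @ W2 + b2
    mean, log_var = out[:, :1], out[:, 1:2]
    return mean, np.exp(log_var)       # per-point variational mean and variance
```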

Random-Feature Expansions, Markov Surrogates, and Linked GPs

DGPs can be reformulated with sparse or random basis expansions (e.g., via random Fourier features), Markov-tensor kernel bases, or by stochastically imputing latent layers and converting the hierarchy into a system of jointly trained GPs (Cutajar et al., 2016, Ding et al., 2021, Ming et al., 2021).
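
A sketch of the random-feature building block, assuming an RBF base kernel; replacing each layer's GP with a Bayesian linear model on such features is what turns the hierarchy into a finite, DNN-like computation graph. Feature counts and names are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features, lengthscale, variance, rng):
    """Random Fourier features approximating an RBF kernel (Rahimi-Recht construction).

    phi(X) @ phi(X2).T approximates the kernel matrix k(X, X2); each DGP layer can
    then be written as a linear model on these features.
    """
    d = X.shape[1]
    Omega = rng.standard_normal((d, n_features)) / lengthscale   # spectral frequencies
    tau = rng.uniform(0.0, 2.0 * np.pi, n_features)              # random phases
    return np.sqrt(2.0 * variance / n_features) * np.cos(X @ Omega + tau)
```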

3. Structural Variants and Specialized DGP Architectures

Several extensions of the basic DGP architecture enable increased expressiveness or improved application-specific inductive biases:

  • Convolutional Deep GPs (CDGPs): Integrate convolutional kernels at (typically) input or shallow layers to enforce local spatial structure, which is critical for image modeling. Patch-based convolutional kernels yield significant gains over RBF kernels on vision benchmarks (Kumar et al., 2018); a minimal patch-kernel sketch appears after this list.
  • Inter-domain DGPs: Use feature-functional inducing variables (e.g., integrating against Fourier basis), capturing global structure, improving scalability, and reducing sample complexity for non-stationary data (Rudner et al., 2020).
  • Decoupled Inducing Inputs: Separating the inducing sets used for the mean and covariance computations in each layer yields large computational savings with negligible loss of, or even improved, empirical performance (Havasi et al., 2018).
  • Multi-fidelity Conditional DGPs: Condition first-layer GPs directly on low-fidelity observations, propagating this uncertainty through the hierarchy via analytic (moment-matched) effective kernels (Lu et al., 2020).
  • Gradient-enhanced DGPs: Integrate both observed and inferred derivatives throughout layers. This is implemented with chain-rule propagation of GP and derivative predictions and enables accurate surrogate modeling for nonstationary simulators (Booth, 19 Dec 2025).
  • Random Feature DGPs, Markov Expansions: Reparameterize layers as neural networks with random/structured bases for scalability and approximate inference (Cutajar et al., 2016, Ding et al., 2021).
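
As referenced in the convolutional bullet above, a patch-based convolutional kernel can be sketched as an average of a base kernel over all pairs of image patches (uniform patch weighting, single channel); `base_kernel` could be, e.g., the RBF kernel from the earlier prior-sampling sketch, and the patch size is an arbitrary choice.

```python
import numpy as np

def extract_patches(img, patch):
    """All patch x patch sub-images of a single-channel image, flattened to rows."""
    H, W = img.shape
    return np.array([img[i:i + patch, j:j + patch].ravel()
                     for i in range(H - patch + 1)
                     for j in range(W - patch + 1)])

def conv_kernel(img_a, img_b, base_kernel, patch=3):
    """Patch-response convolutional kernel between two images.

    Averages a base kernel on flattened patches over all pairs of patch locations,
    which is how convolutional GP/DGP layers impose local spatial structure;
    uniform patch weighting is the simplest variant.
    """
    Pa, Pb = extract_patches(img_a, patch), extract_patches(img_b, patch)
    return base_kernel(Pa, Pb).mean()
```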

4. Learning, Scalability, and Computational Aspects

The computational cost for DGP inference is dominated by the number and structure of inducing points/features and the cost of layerwise kernel operations. Key factors include:

  • Inducing-Point Variational DGP: Each layer with $M$ inducing points and $D$ outputs costs $O(DNM^2 + M^3)$ per layer per iteration (Salimbeni et al., 2017, Havasi et al., 2018).
  • Decoupled/Amortized Approaches: Using a larger inducing set of size $M_a$ for the means and a smaller set of size $M_b$ for the variances, or local input-dependent amortization, gives $O(DNM_a + DNM_b^2 + M_b^3)$, or even $O(M^3)$ per mini-batch when amortization is used (Havasi et al., 2018, Meng et al., 18 Sep 2024).
  • Vecchia, Markov, and Sparse Surrogates: For large $N$, approximations based on conditional independence (Vecchia, Markov-tensor, or hierarchical bases) achieve $O(Nm^2)$ or $\mathrm{poly}\log(M)$ cost per sample, enabling fully Bayesian DGP inference at scale (Sauer et al., 2022, Ding et al., 2021).
  • Random Feature DGPs: GPUs and efficient matrix multiplications yield scalability to millions of points and deep (30+) layers (Cutajar et al., 2016).
  • SGHMC/Flow-based Techniques: These methods require careful tuning but can offer superior performance on complex, multimodal posteriors at the cost of increased per-sample time (Havasi et al., 2018, Yu et al., 2021).

5. Expressivity, Uncertainty, and Theoretical Properties

Expressivity Parameters and Non-Gaussianity

The composition of kernels in DGPs induces nonstationary, heavy-tailed, and multi-scale behavior in the effective prior covariance: even “homogeneous” deep kernels (e.g., multiple SE layers) do not collapse to a shallow GP. The effective kernel can be expressed via closed-form integration for small $L$ or recursively for deep hierarchies, admitting a rigorous characterization of marginal prior moments and their relation to neural-network activations (Lu et al., 2019). Deeper DGPs can exhibit non-decaying long-range covariance (heavy tails) and sharp transitions between regularity regimes, governed by layerwise signal-to-lengthscale ratios.
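
The first claim can be checked numerically: for a two-layer, zero-mean DGP with a one-dimensional hidden layer, the effective prior covariance between two inputs is the expectation of the outer kernel evaluated at the jointly Gaussian inner-layer values. The Monte Carlo sketch below estimates this quantity (closed forms exist for SE layers per Lu et al., 2019); the kernel callables are assumptions.

```python
import numpy as np

def effective_kernel_mc(x1, x2, k_inner, k_outer, n_samples=2000, rng=None):
    """Monte Carlo estimate of a two-layer DGP's prior covariance at (x1, x2).

    Assumes zero-mean layers and a one-dimensional hidden layer, so that
    Cov[f2(x1), f2(x2)] = E_{f1}[k_outer(f1(x1), f1(x2))], where the pair
    (f1(x1), f1(x2)) is jointly Gaussian with covariance given by k_inner.
    Both kernel callables map (n x d) arrays to a pairwise kernel matrix.
    """
    rng = rng or np.random.default_rng(0)
    X = np.array([[x1], [x2]], dtype=float)
    K = k_inner(X, X) + 1e-9 * np.eye(2)                 # inner-layer covariance at x1, x2
    L = np.linalg.cholesky(K)
    F1 = L @ rng.standard_normal((2, n_samples))         # joint draws of f1 at (x1, x2)
    a, b = F1[0][:, None], F1[1][:, None]
    # Average k_outer over matched draws (the diagonal of the pairwise matrix).
    return float(np.mean(np.diag(k_outer(a, b))))
```

With squared-exponential inner and outer kernels, this estimate approaches a positive constant rather than zero as |x1 - x2| grows, illustrating the non-decaying long-range covariance noted above.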

Uncertainty Calibration

DGPs propagate predictive uncertainty through every layer; empirical studies on regression, classification, density estimation, and surrogate modeling show that DGPs yield predictive uncertainties that are systematically better calibrated than those of shallow GPs or modern Bayesian neural networks, especially for non-stationary or heteroscedastic targets (Bui et al., 2015, Bui et al., 2016, Ming et al., 2021, Booth, 19 Dec 2025).

Interpretation as Deep NNs and Approximate Single-Layer GPs

Moment-matched DGPs are equivalent to infinite-width, deep Bayesian neural networks with composed activation functions. The effective kernel at any depth corresponds to a single layer with a highly composite activation, revealing connections to classical results for neural tangent kernels and kernel mean embeddings (Lu et al., 2019).

6. Empirical Performance and Application Domains

DGPs are consistently shown to outperform shallow GPs, classical neural networks, and shallow variational approximations on a range of benchmarks:

  • Regression: DGPs achieve lower RMSE and higher test log likelihood than sparse GPs, Bayesian neural nets, or deterministic DNNs on UCI, molecular, and physical process datasets. Multi-scale structure or sharp transitions are accurately captured only by DGPs (Bui et al., 2015, Bui et al., 2016, Lu et al., 2019).
  • Classification: DGPs and convolutional DGPs outperform both GP and CNN baselines on MNIST, CIFAR-10, and Caltech101 (Kumar et al., 2018).
  • Large-Scale “Big Data”: DGPs with stochastic, variational, or random-feature inference scale to $N > 10^6$ and depths $L > 5$, maintaining or improving predictive performance relative to shallow or kernel-learned GPs (Salimbeni et al., 2017, Cutajar et al., 2016, Sauer et al., 2022).
  • Computer Experiments and Surrogates: Fully Bayesian and SI-based DGP surrogates provide accurate uncertainty quantification and efficient emulation for expensive, nonstationary simulators; conditional and gradient-enhanced DGPs further enhance interpretability and practical utility (Ming et al., 2021, Booth, 19 Dec 2025, Sauer et al., 2022, Lu et al., 2020).

7. Open Problems, Frontiers, and Future Directions

DGPs offer a principled, hierarchical, and highly expressive class of models uniting Bayesian nonparametrics with deep representation learning. Open research directions center on tractable, scalable inference and richer prior/posterior constructions, whose ongoing development continues to expand both the practical and theoretical reach of the field.
