Deep Surrogate Models with Gaussian Processes

Updated 16 May 2026

Deep surrogate models with Gaussian Processes are hierarchical frameworks that compose multiple GP layers to capture complex, nonlinear, and nonstationary behaviors.
They employ techniques like variational inference and MCMC, including sparse approximations and Vecchia methods, to efficiently learn from large-scale simulation data.
These models are widely applied in Bayesian optimization, likelihood-free inference, and active design, often outperforming single-layer GPs in handling multimodal and high-dimensional challenges.

Deep surrogate models with Gaussian Processes (GPs) denote a class of surrogate modeling frameworks in which multiple GP layers are composed to capture hierarchical, highly nonlinear, and nonstationary mappings between inputs and outputs in computer experiments and simulation-driven inference tasks. These deep Gaussian process (DGP) surrogates have demonstrated marked advantages in capturing multimodal, heteroscedastic, or abrupt behavior, often with superior uncertainty quantification and active learning potential compared to conventional single-layer GPs. Their development is closely linked to the requirements of Bayesian optimization, likelihood-free inference, large-scale simulation campaigns, and structured prediction in scientific computing.

1. Mathematical Foundations and Model Construction

The core principle of DGP surrogates is the functional composition of GPs across several layers. Consider a model of depth $L$ mapping $x \in \mathbb{R}^d$ to output $y$:

$f(x) = f_L(f_{L-1}(\cdots f_1(x) \cdots))$

where each $f_\ell$ is a GP defined on the output space of the previous layer, typically with its own stationary kernel $k^{(\ell)}(\cdot,\cdot)$. This recursive architecture produces a highly flexible prior over functions whose induced marginal is typically non-Gaussian, capable of modeling nonstationary, heteroscedastic, or multimodal behaviors via the nonlinear interaction of layerwise GPs (Booth et al., 2023).
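The composition above can be illustrated by drawing from a DGP prior. The following is a minimal sketch, assuming one-dimensional hidden layers, zero-mean GPs, and unit-lengthscale RBF kernels; the function names and the jitter value are illustrative choices, not taken from any cited implementation.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between row-wise inputs a (n, d) and b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=-1) / lengthscale**2)

def sample_gp_layer(inputs, rng, lengthscale=1.0, jitter=1e-6):
    """Draw one zero-mean GP sample evaluated at `inputs` (n, d); returns (n, 1)."""
    cov = rbf_kernel(inputs, inputs, lengthscale) + jitter * np.eye(len(inputs))
    chol = np.linalg.cholesky(cov)
    return chol @ rng.standard_normal((len(inputs), 1))

def sample_dgp_prior(x, depth=2, seed=0):
    """Compose `depth` independent GP draws: f_L(... f_1(x) ...)."""
    rng = np.random.default_rng(seed)
    hidden = x
    for _ in range(depth):
        # Each layer is a GP indexed by the previous layer's output.
        hidden = sample_gp_layer(hidden, rng)
    return hidden

x = np.linspace(-2.0, 2.0, 50)[:, None]
y = sample_dgp_prior(x, depth=2)   # one prior draw of the composed function at x
```

Because the second layer's kernel is evaluated on the warped values of the first layer, the draw can vary quickly in some regions of $x$ and slowly in others, even though each individual kernel is stationary.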

Layer architectures vary. The standard approach utilizes independent GPs per output dimension at each layer, combined with RBF or ARD kernels. There are also specialized variants:

Latent variable DGPs: Augment inputs at each layer with stochastic latent variables, e.g., $w \sim \mathcal{N}(0,1)$, to induce additional marginal flexibility and facilitate multimodal surrogates, as in “LV-2GP” for Bayesian optimization in likelihood-free inference (BOLFI) (Aushev et al., 2020).
Locally linear projections: Employed in Deep Jump Gaussian Processes (DJGP) for piecewise continuous functions, where a region-specific local projection $W_j$ is learned under a GP prior to extract local active subspaces, followed by a “jump” GP for regime-sensitive modeling (Xu et al., 24 Oct 2025).
Multi-output hierarchies: The Deep Intrinsic Coregionalization Model GP (deepICMGP) embeds coregionalization matrices within each layer so as to model both nonlinear warpings and output-wise dependencies, critical for vector-structured outputs (Chang et al., 22 Aug 2025).
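As a point of reference for the multi-output variant, the generic intrinsic coregionalization structure can be sketched as a Kronecker product between an output covariance and an input kernel. This shows only the standard single-layer ICM covariance, not the deepICMGP construction itself; the mixing matrix `mixing` and the parameterization $B = A A^\top$ are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Squared-exponential kernel matrix for inputs x of shape (n, d)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / lengthscale**2)

def icm_covariance(x, mixing):
    """Covariance of the stacked outputs under an ICM prior: B kron K, with B = A A^T."""
    coreg = mixing @ mixing.T                 # (p, p) coregionalization matrix B
    return np.kron(coreg, rbf_kernel(x))      # (p*n, p*n), outputs stacked block-wise

x = np.linspace(0.0, 1.0, 5)[:, None]
mixing = np.array([[1.0, 0.0], [0.8, 0.6]])   # hypothetical 2-output mixing matrix A
cov = icm_covariance(x, mixing)
```

The off-diagonal blocks of `cov` carry the cross-output dependence; setting them to zero would recover independent GPs per output.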

2. Variational Inference, MCMC, and Optimization

Exact Bayesian inference in DGPs is analytically intractable owing to the non-Gaussian, high-dimensional latent structure. Two main inference paradigms dominate:

Sparse variational inference (SVI): For each layer $\ell$, $m_\ell$ inducing points $Z^{(\ell)}$, with associated function values $u^{(\ell)} = f_\ell(Z^{(\ell)})$, are introduced; the variational posterior posits $q(u^{(\ell)}) = \mathcal{N}(\mu^{(\ell)}, \Sigma^{(\ell)})$. Doubly-stochastic variational methods (DSVI) sample through the composition using the reparameterization trick, optimizing an ELBO comprising expected log-likelihood and KL divergences per layer (Hebbal et al., 2019, Yazdi et al., 2024, Rudner et al., 2020, Ding et al., 2021).
MCMC with Vecchia/ESS acceleration: For small- to moderate-$n$ problems or where full UQ is essential, MCMC can sample layerwise latent variables and hyperparameters, leveraging the Vecchia approximation to reduce cubic operations by forming block-sparse Cholesky factorizations and univariate conditionals for each row. Elliptical slice sampling (ESS) enables efficient proposals for latent GPs under Gaussian priors (Sauer et al., 2022, Sauer et al., 2020, Booth et al., 2023).
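The elliptical slice sampling step mentioned above can be sketched as follows. This is the generic ESS update for a latent vector with a zero-mean Gaussian prior, shown on a toy two-dimensional problem rather than a DGP layer; the likelihood, the identity prior covariance, and all names are assumptions chosen so that the result can be checked against a closed-form posterior.

```python
import numpy as np

def elliptical_slice_step(current, prior_chol, log_lik, rng):
    """One ESS update for a latent vector with prior N(0, L L^T), L = prior_chol."""
    nu = prior_chol @ rng.standard_normal(current.shape[0])    # auxiliary prior draw
    log_threshold = log_lik(current) + np.log(1.0 - rng.uniform())
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lower, upper = theta - 2.0 * np.pi, theta
    while True:
        # Proposals lie on the ellipse through `current` and `nu`.
        proposal = current * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_threshold:
            return proposal
        # Rejected: shrink the angle bracket toward 0 (the current state) and retry.
        if theta < 0.0:
            lower = theta
        else:
            upper = theta
        theta = rng.uniform(lower, upper)

rng = np.random.default_rng(0)
observation = np.array([1.0, -1.0])

def log_lik(f):
    """Unit-variance Gaussian noise around the latent values."""
    return -0.5 * np.sum((observation - f) ** 2)

state = np.zeros(2)
chain = []
for _ in range(5000):
    state = elliptical_slice_step(state, np.eye(2), log_lik, rng)
    chain.append(state)
posterior_mean = np.mean(chain, axis=0)   # analytic posterior mean is observation / 2
```

The update has no step size to tune and always returns a new state, which is one reason it suits latent Gaussian layers. In a DGP sampler, `log_lik` would instead evaluate the likelihood of the next layer given the proposed latent values.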

Optimization schedules in SVI typically alternate natural-gradient updates (for variational means/variances of the final GP) with Adam or SGD steps on hyperparameters and inducing inputs (Hebbal et al., 2019, Yazdi et al., 2024). For quantile-conditioning in multimodal settings, IWVI (importance-weighted VI) is adopted for tighter bounds (Aushev et al., 2020).
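For reference, the objective optimized in doubly stochastic schemes is commonly written in the following form. The notation is an assumption introduced here for illustration: $h_i^{(L)}$ is the final-layer output at $x_i$, $u^{(\ell)}$ are the inducing values at inducing inputs $Z^{(\ell)}$ in layer $\ell$, and $B$ is a mini-batch drawn from the $n$ training points.

$\mathcal{L} \approx \frac{n}{|B|} \sum_{i \in B} \mathbb{E}_{q(h_i^{(L)})}\left[\log p(y_i \mid h_i^{(L)})\right] - \sum_{\ell=1}^{L} \mathrm{KL}\left(q(u^{(\ell)}) \,\|\, p(u^{(\ell)} \mid Z^{(\ell)})\right)$

The expectation is estimated by propagating reparameterized samples through the layers, which is the second source of stochasticity alongside mini-batching.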

3. Predictive Uncertainty and Hierarchical Propagation

The predictive posterior of a DGP does not remain Gaussian due to nonlinear propagation through random layers. At test input $x_*$:

For SVI/Doubly-stochastic inference: Multiple ($S$) draws are propagated through each layer, approximating the posterior predictive mean and variance as empirical estimates over these draws.
For fully Bayesian (MCMC) approaches: Predictive samples are averaged over MCMC draws, often with latent layer and hyperparameter samples, employing the law of total variance (Sauer et al., 2022, Sauer et al., 2020).
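Combining per-draw predictions with the law of total variance can be sketched as below. The sketch assumes each posterior draw supplies a conditional predictive mean and variance at every test point; the function name and the toy numbers are illustrative.

```python
import numpy as np

def combine_predictions(means, variances):
    """Mixture moments across posterior draws (axis 0) via the law of total variance."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mixture_mean = means.mean(axis=0)
    # Total variance = mean of conditional variances + variance of conditional means.
    mixture_var = variances.mean(axis=0) + means.var(axis=0)
    return mixture_mean, mixture_var

# Two posterior draws, two test points.
draw_means = [[0.0, 1.0], [2.0, 1.0]]
draw_vars = [[1.0, 0.5], [1.0, 0.5]]
mean, var = combine_predictions(draw_means, draw_vars)
# mean -> [1.0, 1.0], var -> [2.0, 0.5]
```

At the first test point the draws disagree about the mean, so the combined variance (2.0) exceeds every per-draw variance (1.0); at the second they agree and nothing is added.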

Through this composition, hierarchical uncertainty accumulates: uncertainty in the latent GPs at lower layers amplifies or transforms through each subsequent GP. In particular, DGP marginals can become strongly non-Gaussian—exhibiting multimodal, heavy-tailed, or skewed distributions. To maintain principled acquisition strategies for active learning or Bayesian Optimization, quantile-conditioning on the predictive samples, rather than relying on mean-variance approximations, has been shown to yield superior exploration-exploitation trade-offs in multimodal or skewed targets (Aushev et al., 2020).
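The difference between a sample-quantile criterion and a mean-variance criterion can be sketched on synthetic predictive samples. This is a simplified reading of the idea, not the acquisition rule of Aushev et al.; the quantile level, the multiplier `kappa`, and the synthetic bimodal samples are all assumptions.

```python
import numpy as np

def quantile_lcb(samples, level=0.1):
    """Empirical lower quantile per candidate; samples has shape (n_draws, n_candidates)."""
    return np.quantile(samples, level, axis=0)

def gaussian_lcb(samples, kappa=1.2816):
    """Mean minus kappa standard deviations; 1.2816 is roughly the 90% z-score."""
    return samples.mean(axis=0) - kappa * samples.std(axis=0)

rng = np.random.default_rng(1)
n_draws = 2000
# Candidate 0 has a bimodal predictive distribution, candidate 1 a narrow unimodal one.
pick_low_mode = rng.uniform(size=n_draws) < 0.5
bimodal = np.where(pick_low_mode, rng.normal(-2.0, 0.2, n_draws), rng.normal(2.0, 0.2, n_draws))
unimodal = rng.normal(0.0, 0.2, n_draws)
samples = np.column_stack([bimodal, unimodal])

q_bound = quantile_lcb(samples)    # reflects where the lower mode actually sits
g_bound = gaussian_lcb(samples)    # treats the two modes as one wide Gaussian
```

For the bimodal candidate the Gaussian bound falls well below the lower mode because the standard deviation is inflated by the gap between modes, whereas the empirical quantile stays within the mass of the samples.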

4. Surrogate Integration for Optimization and Inverse Problems

DGP surrogates are particularly prominent in:

Bayesian Optimization (BO): DGPs replace stationary GPs as the BO surrogate, with tailored acquisition rules such as quantile-conditioned lower confidence bounds (LCB) or expected improvement computed from DGP posterior samples (Aushev et al., 2020, Hebbal et al., 2019).
Likelihood-Free Inference (BOLFI): DGP surrogates capture multimodal posteriors and irregular discrepancy landscapes, reducing simulator calls required for accurate posterior approximation (e.g., Wasserstein distance benchmarks on BDM, TE1–3, NW tasks) (Aushev et al., 2020).
Active design and sequential experimental design: DGPs, via information-based acquisition functions (e.g., ALC), direct simulator evaluation to non-uniformly cover regions of epistemic uncertainty, outperforming stationary GPs especially on regime-changing domains (Sauer et al., 2020, Chang et al., 22 Aug 2025).
High-dimensional, piecewise, and categorical surrogates: DJGP provides reliable UQ and accuracy on discontinuous targets in high-$d$, and “deep” GP surrogates with warping layers extend to categorical or binary outputs (Xu et al., 24 Oct 2025, Cooper et al., 24 Jan 2025).
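Computing expected improvement from posterior samples, as mentioned for BO above, can be sketched with a plain Monte Carlo average. The sketch assumes a minimization problem and a matrix of predictive draws at a set of candidate inputs; how those draws are produced (DSVI or MCMC) is outside its scope, and the numbers are illustrative.

```python
import numpy as np

def expected_improvement_from_samples(samples, best_observed):
    """Monte Carlo EI for minimization; samples has shape (n_draws, n_candidates)."""
    improvement = np.maximum(best_observed - samples, 0.0)
    return improvement.mean(axis=0)

# Three predictive draws at two candidate inputs.
samples = np.array([[0.0, 2.0],
                    [1.0, 2.0],
                    [3.0, 2.0]])
ei = expected_improvement_from_samples(samples, best_observed=1.5)
next_index = int(np.argmax(ei))   # candidate with the largest estimated improvement
# ei -> [0.6667, 0.0], next_index -> 0
```

Because the estimate averages over draws directly, it makes no Gaussian assumption about the predictive distribution at each candidate.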

5. Scalability and Algorithmic Efficiency

Traditional DGPs incur $\mathcal{O}(n^3)$ costs per layer due to covariance inversions. Several strategies have been introduced to enhance computational tractability:

Method	Computational Complexity	Scale Feasible
Inducing-point VI	$\mathcal{O}(n m^2)$ (with $m \ll n$ per layer)	Large $n$
Vecchia MCMC/ESS	$\mathcal{O}(n m^3)$ with $m$ fixed	Large $n$
Hierarchical Expansion (DTMGP)	Sparse forward/backward passes	$n$, $d$ large

Vecchia approximation: Enables MCMC-based inference for large $n$, as in the deepgp R package, with empirical performance matching the full DGP and outperforming inducing-point VI in RMSE, CRPS, and run time (Sauer et al., 2022, Ding et al., 2021).
Sparse tensor Markov expansion: DTMGP leverages Markov structure for hierarchical expansions so that forward and inverse operations stay sparse, permitting rapid training and prediction for high-dimensional computer models (Ding et al., 2021).
Mini-batch SVI and cross-layer parameter sharing: Used in large-scale BO, simulation, or categorical surrogate settings (Yazdi et al., 2024, Cooper et al., 24 Jan 2025).
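The Vecchia idea of replacing one dense Gaussian likelihood with a product of small univariate conditionals can be sketched for a single GP layer. This is a minimal illustration assuming an RBF kernel, a fixed noise variance, the given data ordering, and nearest-neighbor conditioning sets; it is not the deepgp implementation, and all names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / lengthscale**2)

def vecchia_log_likelihood(x, y, m, noise=1e-2):
    """Sum of log p(y_i | y_c(i)), where c(i) holds up to m nearest earlier points."""
    total = 0.0
    for i in range(len(y)):
        # Conditioning set: nearest neighbors among points earlier in the ordering.
        dists = np.linalg.norm(x[:i] - x[i], axis=1)
        cond = np.argsort(dists)[:m]
        prior_var = 1.0 + noise
        if len(cond) == 0:
            mean, var = 0.0, prior_var
        else:
            k_cc = rbf_kernel(x[cond], x[cond]) + noise * np.eye(len(cond))
            k_ic = rbf_kernel(x[i:i + 1], x[cond])[0]
            weights = np.linalg.solve(k_cc, k_ic)    # an m x m solve per point
            mean = weights @ y[cond]
            var = prior_var - weights @ k_ic
        total += -0.5 * (np.log(2.0 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

def exact_log_likelihood(x, y, noise=1e-2):
    """Dense zero-mean GP log-likelihood for comparison."""
    cov = rbf_kernel(x, x) + noise * np.eye(len(y))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

rng = np.random.default_rng(0)
x = rng.uniform(size=(30, 2))
y = np.sin(3.0 * x[:, 0]) + x[:, 1]
approx = vecchia_log_likelihood(x, y, m=5)
full = vecchia_log_likelihood(x, y, m=len(y) - 1)   # conditions on all earlier points
exact = exact_log_likelihood(x, y)
```

When every point conditions on all earlier points the factorization is exact, so `full` agrees with `exact` up to floating-point error; shrinking `m` trades accuracy for cost, since each point then needs only a small linear solve.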

6. Empirical Benchmarks and Case Studies

Reported experiments favor DGP surrogates over stationary single-layer GPs in nonstationary and multimodal regimes:

Likelihood-free inference: DGP surrogates outperform GPs in bimodal (TE2, NW) settings, attaining a lower median scaled Wasserstein distance than the GP baseline (Aushev et al., 2020).
Bayesian optimization tasks: Two- and three-layer DGPs reach lower regret and sub-optimality (e.g., Trid-10d, Hartmann-6d, aerospace booster), typically reaching solutions with about half as many function calls as stationary GP surrogates (Hebbal et al., 2019).
Large-scale surrogates (satellite drag, COMPAS BBH): Vecchia-approximate DGPs scale to large training sets, with UQ and RMSE outperforming variational DGP and local GPs (Sauer et al., 2022, Yazdi et al., 2024).
Piecewise and discontinuous problems: DJGP outperforms global DGP and preprojected JGP in both RMSE and CRPS across real and synthetic high-$d$ benchmarks (Xu et al., 24 Oct 2025).
Multi-output modeling: deepICMGP achieves top-three performance for RMSE and CRPS across 8 synthetic and industrial benchmarks, with superior multivariate uncertainty quantification (Chang et al., 22 Aug 2025).
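The two metrics used throughout these comparisons can be computed from predictive samples as sketched below. The CRPS here is the standard sample-based estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$; the cited papers may use closed-form Gaussian CRPS instead, so treat this as one possible convention rather than their evaluation code.

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean squared error of point predictions."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

def crps_from_samples(samples, observation):
    """Sample-based CRPS for one observation: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - observation))
    term_pair = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - 0.5 * term_pair)

score_spread = crps_from_samples([0.0, 2.0], observation=1.0)   # -> 0.5
score_sharp = crps_from_samples([1.0, 1.0], observation=1.0)    # -> 0.0
error = rmse([1.0, 3.0], [1.0, 1.0])                            # -> sqrt(2)
```

Unlike RMSE, CRPS scores the whole predictive distribution, so it penalizes both a misplaced mean and a spread that is too wide or too narrow.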

7. Application Considerations and Current Limitations

While DGP surrogates offer enhanced flexibility and modeling fidelity, several caveats persist:

Inference complexity: MCMC-based DGPs offer superior UQ at higher computational cost; variational methods may under-quantify uncertainty when multimodal posteriors are present (Booth et al., 2023).
Model selection: Overly deep or flexible DGPs can overfit small data regimes unless regularizing priors or smoothness penalties are used (Yazdi et al., 2024).
Scalability: While inducing-point, Vecchia, and sparse-tensor approaches mitigate cubic bottlenecks, there is still a tradeoff between computational cost, fidelity, and uncertainty propagation, particularly as dataset size and output dimension increase (Ding et al., 2021, Sauer et al., 2022).

Emerging trends include integrating gradient observations into DGP frameworks for enhanced surrogate fidelity (critical in scientific computation), rigorous treatment of high-dimensional and multi-task regimes, and unified modeling for categorical or structured outputs using warping extensions (Booth, 19 Dec 2025, Cooper et al., 24 Jan 2025, Chang et al., 22 Aug 2025).

References

"Likelihood-Free Inference with Deep Gaussian Processes" (Aushev et al., 2020)
"Bayesian Optimization using Deep Gaussian Processes" (Hebbal et al., 2019)
"Vecchia-approximated Deep Gaussian Processes for Computer Experiments" (Sauer et al., 2022)
"Deep Jump Gaussian Processes for Surrogate Modeling of High-Dimensional Piecewise Continuous Functions" (Xu et al., 24 Oct 2025)
"Deep Gaussian Processes with Gradients" (Booth, 19 Dec 2025)
"Modernizing full posterior inference for surrogate modeling of categorical-output simulation experiments" (Cooper et al., 24 Jan 2025)
"Deep Intrinsic Coregionalization Multi-Output Gaussian Process Surrogate with Active Learning" (Chang et al., 22 Aug 2025)
"Nonstationary Gaussian Process Surrogates" (Booth et al., 2023)
"Deep Gaussian Process Emulation and Uncertainty Quantification for Large Computer Experiments" (Yazdi et al., 2024)
"A Sparse Expansion For Deep Gaussian Processes" (Ding et al., 2021)
"Inter-domain Deep Gaussian Processes" (Rudner et al., 2020)
"Active Learning for Deep Gaussian Process Surrogates" (Sauer et al., 2020)
"Gaussian process regression + deep neural network autoencoder for probabilistic surrogate modeling in nonlinear mechanics of solids" (Deshpande et al., 2024)