Non-Stationary Gaussian Processes

Updated 29 October 2025
  • Non-Stationary Gaussian Processes are stochastic models with covariance functions that vary over input space, capturing local changes in variance, correlation, and periodicity.
  • They encompass diverse methodologies such as parametric kernels, deep kernel learning, deep GPs, kernel warping, and partition models to balance flexibility with interpretability.
  • Advances in scalability, including sparse approximations and variational methods, enable efficient application of non-stationary GPs to large-scale scientific and engineering problems.

A non-stationary Gaussian process (GP) is a stochastic process whose covariance structure varies across the input space, thereby capturing phenomena whose statistical properties (e.g., variance, correlation length, or periodicity) shift spatially or temporally. This stands in contrast to stationary GPs, where the covariance depends only on the separation between inputs and is invariant to absolute position. Non-stationary GPs provide a principled, probabilistic modeling approach for functions and fields exhibiting heterogeneous, locally adaptive behavior, and underpin advanced surrogate modeling, forecasting, and uncertainty quantification across a diverse range of scientific and engineering domains.

1. Classes and Methodologies for Non-Stationary Gaussian Processes

Non-stationary GPs are realized through several principal mechanisms, each offering different trade-offs in flexibility, interpretability, and computational cost:

  • Parametric Non-Stationary Kernels: These models augment classical stationary kernels by allowing kernel parameters, such as the signal variance $g(\mathbf{x})$ or the lengthscale, to vary over input space through parametric functions, often basis function expansions (e.g., sums of RBFs). The general formulation is

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{d=1}^{N} g_d(\mathbf{x}_i)\, g_d(\mathbf{x}_j)\, k_\mathrm{stat}(|\mathbf{x}_i - \mathbf{x}_j|),$$

thereby inducing position-dependent amplitude and, optionally, smoothness. These kernels are interpretable and moderately flexible, but hyperparameter selection can become challenging as model complexity grows (Noack et al., 2023). A minimal implementation sketch follows this list.

  • Deep Kernel Learning: Here, a stationary kernel is applied not on the original inputs but on their embedding $\boldsymbol{\phi}(\mathbf{x})$ generated by a trainable neural network:

$$k(\mathbf{x}_i, \mathbf{x}_j) = k_\mathrm{stat}(\|\boldsymbol{\phi}(\mathbf{x}_i) - \boldsymbol{\phi}(\mathbf{x}_j)\|).$$

This approach achieves high flexibility—adapting, for instance, both variance and correlation structure—but at the expense of interpretability and a greater risk of overfitting or model misspecification, especially in moderate data regimes (Noack et al., 2023).

  • Deep Gaussian Processes (DGPs): By stacking GPs such that each layer's outputs serve as the inputs to the next, DGPs realize highly non-stationary and nonparametric representations:

$$f^{(l)}(\mathbf{x}) \sim GP\bigl(m^{(l)}(\cdot),\, k^{(l)}(\cdot, \cdot)\bigr), \quad \text{for layers } l = 1, 2, \dots$$

DGPs are universal function approximators but introduce significant inference complexity and loss of interpretability (Noack et al., 2023, Booth et al., 2023).

  • Kernel Warping: Non-stationarity is injected by transforming the inputs before applying a stationary kernel; for example, in warping-based models, a (potentially nonlinear) mapping $\mathbf{x} \mapsto \phi(\mathbf{x})$ (specified parametrically or via another GP) dictates the local covariance geometry (Booth et al., 2023, Tolpin, 2019).
  • Mixture/Local GPs and Partition Models: The input space is divided (via trees, Voronoi tessellation, or Dirichlet processes), with each region assigned a local stationary GP whose parameters are adapted independently or coupled via hierarchical/Markov structures. Locally coupled GPs with HMM-regularized parameter trajectories (Ambrogioni et al., 2016) and Mixed-Stationary GPs (MSGPs) with Dirichlet process clustering (Duan et al., 2018) exemplify such constructions.
  • Compactly Supported and Sparse Nonstationary Kernels: Using kernels with explicit spatial support (e.g., products of Matérn or Wendland polynomials with spatially adaptive “bump” functions), one can induce local correlation and manageable sparsity for efficient inference in massive data regimes (Risser et al., 7 Nov 2024).
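
The parametric mechanism above can be made concrete with a short sketch. This is a minimal illustration, assuming a 1D input, Gaussian RBF basis functions $g_d$ with hand-picked centers, and a squared-exponential base kernel; the centers, widths, and hyperparameter values are illustrative placeholders rather than settings from any cited work.

```python
import numpy as np

def rbf_basis(x, centers, width):
    """Gaussian basis functions g_d(x) evaluated at 1D inputs x; result is (n, D)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def stationary_se(x1, x2, lengthscale):
    """Stationary squared-exponential kernel of the separation |x_i - x_j|."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def parametric_nonstationary_kernel(x1, x2, centers, width, lengthscale):
    """k(x_i, x_j) = sum_d g_d(x_i) g_d(x_j) k_stat(|x_i - x_j|)."""
    G1 = rbf_basis(x1, centers, width)   # (n1, D)
    G2 = rbf_basis(x2, centers, width)   # (n2, D)
    amplitude = G1 @ G2.T                # sum over basis functions d
    return amplitude * stationary_se(x1, x2, lengthscale)

# Illustrative usage: the prior variance k(x, x) now varies with position,
# so prior samples have location-dependent amplitude, unlike a stationary GP.
x = np.linspace(0.0, 10.0, 200)
centers = np.array([2.0, 5.0, 8.0])      # hypothetical basis centers
K = parametric_nonstationary_kernel(x, x, centers, width=1.5, lengthscale=0.5)
K += 1e-8 * np.eye(len(x))               # jitter for numerical stability
sample = np.random.default_rng(0).multivariate_normal(np.zeros(len(x)), K)
```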

2. Mathematical Formulations and Examples of Non-Stationary Kernels

A representative selection of non-stationary kernel formulations includes:

  • Spatially-Varying Amplitude:

$$k(\mathbf{x}_i, \mathbf{x}_j) = g(\mathbf{x}_i)\, g(\mathbf{x}_j)\, k_\mathrm{stat}(\mathbf{x}_i, \mathbf{x}_j),$$

where $g(\cdot)$ modulates local variance (Noack et al., 2023).

  • Generalized Spectral Mixture (GSM) Kernel:

$$k_\mathrm{GSM}(x_i, x_j) = \sum_{q=1}^{Q} w_q(x_i)\, w_q(x_j)\, k_{\mathrm{Gibbs},q}(x_i, x_j)\, \cos\bigl[2\pi\bigl(\mu_q(x_i)\, x_i - \mu_q(x_j)\, x_j\bigr)\bigr].$$

Each mixture’s weights, means, and lengthscales may be input-dependent latent processes, granting the kernel rich nonstationarity (Ladopoulou et al., 13 May 2025).

  • Non-Stationary Matérn Kernel (Paciorek-Schervish):

$$k(\mathbf{x}, \mathbf{x}') = \sigma(\mathbf{x})\,\sigma(\mathbf{x}')\, \frac{|\Sigma(\mathbf{x})|^{1/4}\, |\Sigma(\mathbf{x}')|^{1/4}}{\bigl|\tfrac{\Sigma(\mathbf{x}) + \Sigma(\mathbf{x}')}{2}\bigr|^{1/2}}\, \mathcal{M}_{\nu}\!\left(\sqrt{Q(\mathbf{x}, \mathbf{x}')}\right),$$

where $Q(\mathbf{x}, \mathbf{x}') = (\mathbf{x} - \mathbf{x}')^\top \bigl(\tfrac{\Sigma(\mathbf{x}) + \Sigma(\mathbf{x}')}{2}\bigr)^{-1} (\mathbf{x} - \mathbf{x}')$ is a Mahalanobis-type distance under the averaged local kernel matrices, and $\Sigma(\mathbf{x})$ captures local anisotropy (Risser et al., 7 Nov 2024, Beckman et al., 2022). A 1D sketch of this kernel follows this list.

  • Attentive Kernel (AK):

$$\mathrm{AK}(\mathbf{x}, \mathbf{x}') = \alpha\, (\bar{\mathbf{z}}^\top \bar{\mathbf{z}}') \sum_{m=1}^{M} \bar{w}_m\, \bar{w}_m'\, k_m(\mathbf{x}, \mathbf{x}'),$$

where $\bar{\mathbf{w}}, \bar{\mathbf{z}}$ are input-dependent, normalized attention weights and $k_m$ are base kernels (Chen et al., 2023).

  • Locally Coupled Kernel (LC-GP):

$$k_\zeta(t, t'; \{\boldsymbol{\vartheta}_i\}) = \sum_{i} w(t; t_i)\, w(t'; t_i)\, k_i(t, t'; \boldsymbol{\vartheta}_i),$$

with $w$ localized basis functions and $k_i$ stationary or nonstationary kernels whose parameters $\boldsymbol{\vartheta}_i$ follow a Markov process (Ambrogioni et al., 2016).
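
To make the Paciorek-Schervish form above concrete, here is a minimal 1D sketch, assuming scalar local kernel matrices $\Sigma(x) = \ell(x)^2$, Matérn order $\nu = 1/2$ so that $\mathcal{M}_\nu(r) = e^{-r}$, and an arbitrary smooth lengthscale profile chosen purely for illustration.

```python
import numpy as np

def lengthscale(x):
    """Hypothetical smoothly varying local lengthscale l(x) > 0."""
    return 0.3 + 0.25 * (1.0 + np.tanh(x - 5.0))

def sigma(x):
    """Hypothetical local standard deviation sigma(x); constant here."""
    return np.ones_like(x)

def ps_matern12_kernel(x1, x2):
    """1D Paciorek-Schervish kernel with Sigma(x) = l(x)^2 and nu = 1/2."""
    l1, l2 = lengthscale(x1)[:, None], lengthscale(x2)[None, :]
    s1, s2 = sigma(x1)[:, None], sigma(x2)[None, :]
    avg = 0.5 * (l1 ** 2 + l2 ** 2)                        # |(Sigma + Sigma') / 2| in 1D
    prefactor = s1 * s2 * np.sqrt(l1 * l2) / np.sqrt(avg)  # determinant terms
    Q = (x1[:, None] - x2[None, :]) ** 2 / avg             # Mahalanobis-type distance
    return prefactor * np.exp(-np.sqrt(Q))                 # M_{1/2}(r) = exp(-r)

# Correlations decay quickly where l(x) is small and slowly where it is large,
# which is exactly the locally adaptive behavior this kernel is designed for.
x = np.linspace(0.0, 10.0, 100)
K = ps_matern12_kernel(x, x)
```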

3. Empirical Performance and Use-Cases

Empirical studies demonstrate that non-stationary GPs substantially improve predictive accuracy and, crucially, uncertainty quantification—especially on data with locally-varying amplitude, smoothness, or dynamics:

  • Time Series with Regime Switches or Varying Frequency: Locally coupled GPs outperform stationary GPs in both state detection (e.g., phase transitions in brain oscillations) and denoising (Ambrogioni et al., 2016).
  • Environmental and Physical Simulation: Compactly supported nonstationary kernels permit exact GP inference for millions of geospatial measurements, yielding superior point and posterior predictive performance in climate data interpolation (Risser et al., 7 Nov 2024, Nychka et al., 2017).
  • Active Learning and Robotic Information Gathering: Non-stationary kernels (e.g., AK) provide well-calibrated, position-sensitive uncertainty, guiding data acquisition to informative or high-variation regions—improving map reconstruction and resource efficiency in autonomous systems (Chen et al., 2023).
  • Scientific Surrogates and Computer Experiments: Methods such as nonstationary latent-augmented GPs (Montagna et al., 2013) and deep Gaussian processes (Booth et al., 2023) have proven effective in resolving sharp local features and complex nonlinear dependencies in surrogate modeling for simulation codes.

The table below organizes core approaches, design principles, and computational profiles:

| Class | Nonstationarity Mechanism | Interpretability | Computational Cost |
|---|---|---|---|
| Parametric | Input-dependent basis expansions | High | Moderate to high (grows with terms) |
| Deep Kernel / DGP | Neural embedding / compositional warping | Low | High to very high |
| Compactly Supported | Local bump functions, data-driven sparsity | Medium | Low (sparse algebra) |
| Mixture/Partition/Local | Piecewise or cluster-specific parameters | Medium-High | Varies (often scalable) |

4. Computational Scalability and Approximation Strategies

Non-stationary GPs historically suffered from scalability bottlenecks due to the $O(N^3)$ cost of dense covariance algebra. Recent advances enable tractable inference at scale:

  • Covariance Sparsity: Compactly supported kernels yield sparse matrices, supporting scalable, parallelizable inference via sparse factorizations and Krylov subspace methods, demonstrated on problems with $N \approx 10^6$ observations (Risser et al., 7 Nov 2024). A sparse-kernel sketch follows this list.
  • Block-Diagonal Plus Low-Rank (BDLR) Approximations: Decompose the global covariance as a sum of local block-diagonal and low-rank (Nyström) terms, enabling fast stochastic estimation of gradients and Hessians, and second-order optimization for high-dimensional nonstationary parameterizations (Beckman et al., 2022).
  • Structured Kernel Interpolation (SKI) and Warping (warpSKI): Use warped, possibly non-equidistant, inducing point grids to exploit Toeplitz/Kronecker structure even for nonstationary phase behavior (Graßhoff et al., 2019).
  • Local/Partitioned Submodels: Partition or locally approximate the process via region- or neighborhood-specific GPs, dramatically reducing complexity and enabling distributed inference (Booth et al., 2023).
  • Variational and Inducing Point Methods: Deep kernel and DGP frameworks commonly rely on sparse variational methodologies and mini-batch stochastic optimization to address large-scale, nonstationary learning (James et al., 16 Jul 2025, Booth et al., 2023).
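
To illustrate the covariance-sparsity strategy from the first bullet above, the following sketch assembles a sparse covariance matrix from a compactly supported Wendland kernel. The specific Wendland function, the fixed support radius, and the k-d tree used to enumerate nearby pairs are generic illustrative choices, not the construction of the cited work; a fully nonstationary variant would additionally let the variance and support vary over space.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.spatial import cKDTree

def wendland_c2(r):
    """Compactly supported Wendland C^2 function: (1 - r)^4 (4r + 1) for r < 1, else 0."""
    r = np.asarray(r)
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)

def sparse_wendland_cov(X, support_radius, variance=1.0):
    """Sparse covariance: only pairs within the support radius are ever stored."""
    tree = cKDTree(X)
    pairs = tree.query_pairs(r=support_radius, output_type="ndarray")  # (m, 2), i < j
    i, j = pairs[:, 0], pairs[:, 1]
    r = np.linalg.norm(X[i] - X[j], axis=1) / support_radius
    vals = variance * wendland_c2(r)
    n = X.shape[0]
    # Symmetrize and add the diagonal (r = 0 gives the full variance).
    rows = np.concatenate([i, j, np.arange(n)])
    cols = np.concatenate([j, i, np.arange(n)])
    data = np.concatenate([vals, vals, np.full(n, variance)])
    return coo_matrix((data, (rows, cols)), shape=(n, n)).tocsc()

# With a small support radius the matrix is highly sparse, so sparse Cholesky
# factorizations or Krylov solvers can replace dense O(N^3) covariance algebra.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(5000, 2))
K = sparse_wendland_cov(X, support_radius=3.0)
print(f"stored entries: {K.nnz} of {K.shape[0] ** 2}")
```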

5. Interpretability, Diagnostics, and Model Selection

Non-stationary kernels enhance model expressivity but introduce parameterization and diagnostic challenges:

  • Interpretability: Parametric nonstationary kernels and modular local models retain interpretable structure, with explicit spatial dependence and regularization. Deep learning-based approaches and DGPs, despite offering flexibility, are generally less transparent.
  • Risk of Overfitting: As model complexity increases, particularly with flexible basis or deep architectures, rigorous cross-validation and regularization are necessary to mitigate overfitting and ensure identifiability.
  • Model Diagnostics: The appropriateness of a nonstationary approach can be evaluated via residual analysis, calibration of predictive intervals, and data-driven hyperparameter diagnostics as described in (Noack et al., 2023); a simple interval-coverage check is sketched after this list.
  • Recommendation: Begin with stationary kernels for exploratory modeling; escalate to parametric nonstationary or deep approaches when evidence of nonstationarity is strong or when uncertainty quantification is consequential.
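
As a concrete instance of the interval-calibration diagnostic mentioned above, this sketch computes the empirical coverage of nominal central predictive intervals on held-out data, assuming the fitted model provides a Gaussian predictive mean and standard deviation at each test point; the commented `model.predict` call is a hypothetical stand-in for whatever API is in use.

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(y_test, pred_mean, pred_std, levels=(0.5, 0.8, 0.9, 0.95)):
    """Fraction of held-out targets inside nominal central predictive intervals.

    A well-calibrated model gives empirical coverage close to each nominal level;
    systematic under-coverage suggests over-confident (e.g., overly stationary)
    uncertainty estimates.
    """
    coverage = {}
    for level in levels:
        z = norm.ppf(0.5 + level / 2.0)      # interval half-width in standard deviations
        inside = (y_test >= pred_mean - z * pred_std) & (y_test <= pred_mean + z * pred_std)
        coverage[level] = float(np.mean(inside))
    return coverage

# Illustrative usage with a hypothetical fitted GP exposing mean/std predictions:
# mu, sd = model.predict(X_test)            # hypothetical stand-in API
# print(empirical_coverage(y_test, mu, sd))
```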

6. Reference Implementations and Software Ecosystem

Multiple open-source libraries facilitate nonstationary GP modeling, with differing focus:

  • hetGP, tgp, laGP (R): Heteroskedastic, treed, and local approximate GPs (Booth et al., 2023).
  • deepgp, dgpsi (R, Python): Fully Bayesian DGPs, scalable via Vecchia approximation or elliptical slice sampling (Booth et al., 2023).
  • GPflux, GPyTorch (Python): Deep kernel and variational deep GP frameworks, supporting flexible kernel composition and auto-differentiation (James et al., 16 Jul 2025); a minimal GPyTorch deep-kernel sketch follows this list.
  • Software for compactly supported and sparse kernels: High-performance codes (often in C++/Python) exploiting distributed and GPU architectures for ultra-large data (Risser et al., 7 Nov 2024).
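
For the deep-kernel route noted in the GPyTorch entry above, here is a minimal sketch of an exact GP whose stationary kernel acts on a learned neural embedding. The network architecture, kernel choice, and synthetic data are illustrative assumptions; training via marginal likelihood optimization is omitted for brevity.

```python
import torch
import gpytorch

class DeepKernelGP(gpytorch.models.ExactGP):
    """Exact GP whose stationary RBF kernel is evaluated on an embedding phi(x)."""

    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        # Hypothetical feature extractor phi: R^d -> R^2; any architecture could be used.
        self.feature_extractor = torch.nn.Sequential(
            torch.nn.Linear(train_x.size(-1), 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 2),
        )
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=2)
        )

    def forward(self, x):
        z = self.feature_extractor(x)   # k_stat acts on ||phi(x_i) - phi(x_j)||
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z)
        )

# Illustrative construction on synthetic data; train by maximizing the exact
# marginal likelihood (gpytorch.mlls.ExactMarginalLogLikelihood) with an optimizer.
train_x = torch.randn(100, 3)
train_y = torch.sin(train_x[:, 0]) + 0.1 * torch.randn(100)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DeepKernelGP(train_x, train_y, likelihood)
```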

7. Future Directions and Open Challenges

Despite considerable advances, several challenges persist:

  • Identifiability and Over-parameterization: As nonstationary models scale in flexibility, identifiability of parameterizations and interpretability of results become more difficult. Hierarchical priors and Bayesian regularization are active areas of research (Risser et al., 7 Nov 2024).
  • Efficient High-Dimensional Nonstationary Inference: Extending current scalable methods to fully nonstationary, high-dimensional settings remains demanding. Research into efficient summary statistics, multi-resolution schemes, and hybrid architectures continues (Nychka et al., 2017, Beckman et al., 2022).
  • Diagnostics and Auto-tuning: There remains a need for robust diagnostics, automatic kernel selection, and scalable hyperparameter tuning workflows for practitioners (Noack et al., 2023).
  • Integration with Active Learning and Decision-making: Formal integration of nonstationary uncertainty quantification into downstream tasks—active learning, adaptive experiment design, robotic exploration—will continue to drive application domains (Chen et al., 2023, Patel et al., 2022).

In conclusion, non-stationary Gaussian processes constitute a powerful and rapidly developing domain at the intersection of probabilistic modeling, computational mathematics, and scientific computing. Ongoing innovation in kernel design, scalable inference, and integration with deep learning will continue to expand their applicability and relevance across scientific disciplines.
