Transport Gaussian Processes

Updated 16 April 2026

Transport Gaussian Processes (TGPs) are statistical models that extend standard Gaussian processes by incorporating explicit transformation maps to capture skewed, non-Gaussian behavior.
They utilize layered transport mechanisms such as marginal warping, covariance mixing, and normalizing flows to achieve flexible dependency structures and heavy-tailed distributions.
These models are applied in regression, spatiotemporal dynamics, and physical advection, offering scalable inference, calibrated uncertainty, and improved expressiveness.

A Transport Gaussian Process (TGP) is any stochastic process constructed by pushing forward a "base" law (typically a standard Gaussian process or Gaussian white-noise process) through a sequence of explicit, often invertible transformations ("transport maps") acting on the finite-dimensional marginals or trajectories. This paradigm unifies standard GPs, warped GPs, Student-t processes, and a broad family of models with non-Gaussian marginals and complex dependency structures. In parallel, "transport" in TGPs also refers to the modeling of physical transport phenomena, such as advection by a latent velocity field, by parametrizing nonstationary or SPDE-constrained covariance structures. The methodologies reviewed below cover both senses: probabilistic transport in function space, and explicit modeling of advection or optimal transport in probability space.

1. Construction Frameworks for Transport Gaussian Processes

Two mathematically rigorous mechanisms have been proposed for defining TGPs:

Push-forward framework for stochastic processes: Given a white-noise process $\xi = \{\xi(t)\}_{t\in T}$ (with finite-dimensional laws $\eta_n = N_n(0, I_n)$), and an $\eta$-consistent collection of measurable transport maps $T_n : \mathbb{R}^n\to\mathbb{R}^n$ (where $n$ is the number of input points), define $f = T(\xi)$ with finite-dimensional law $\pi_n = T_n\#\eta_n$. Because the maps are consistent with the base law, the family $\{\pi_n\}$ satisfies the Kolmogorov consistency conditions, so $f$ is a well-defined stochastic process. The maps $T$ can be composed to impart desired marginal, copula, or dependence structures, e.g., through marginal warping, covariance mixing, radial / copula layers (elliptical, Archimedean), or their combinations (Rios, 2020).
Normalizing flow formulation: A base GP $g(x)\sim GP(\mu(x), k(x,x'))$ is "transported" through an invertible flow $G_\theta$ (possibly input-dependent, e.g., parametrized by a neural network), yielding $f(x) = G_\theta(g(x))$. The marginal prior $p(f)$ is computed by change-of-variables: $p(f) = p(g)\,\left|\det \partial G_\theta(g)/\partial g\right|^{-1}$ with $g = G_\theta^{-1}(f)$, where $f$ and $g$ denote the vectors of function values at the $n$ inputs (2011.01596). This approach permits imposing constraints (boundedness, monotonicity, nonstationarity) while maintaining differentiability and tractability.

Both paradigms allow the induced process to retain key invariance, closure, or support properties provided by the elementary layers. Each layer can be explicitly characterized: marginal transformations (e.g., Box–Cox), linear mixing (i.e., Cholesky or covariance layer), radial transformations for elliptical or Archimedean copulas, and complex flows for normalizing-flow GP variants.
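The layered push-forward construction can be illustrated with a minimal NumPy sketch. It assumes a two-layer transport (a Cholesky covariance layer followed by an exponential marginal warp) and a squared-exponential kernel; the function names `rbf_kernel` and `transport_sample`, the kernel choice, and the warp are illustrative assumptions, not the specific layers used in the cited works.

```python
import numpy as np

def rbf_kernel(t, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel matrix on 1D inputs t (illustrative choice)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def transport_sample(t, rng, jitter=1e-6):
    """Draw f = T_n(xi), where T_n is a marginal warp applied after a covariance layer."""
    n = t.shape[0]
    xi = rng.standard_normal(n)                      # base law eta_n = N_n(0, I_n)
    L = np.linalg.cholesky(rbf_kernel(t) + jitter * np.eye(n))
    g = L @ xi                                       # covariance layer: g ~ N(0, K)
    f = np.exp(g)                                    # marginal layer: positive, right-skewed
    return xi, g, f

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)
xi, g, f = transport_sample(t, rng)
```

Each layer is invertible (the Cholesky factor has a positive diagonal and the exponential is strictly increasing), so the composition remains an invertible map on the finite-dimensional marginals.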

2. Transport GPs for Regression: Expressiveness and Inference

Transport Gaussian Processes for Regression define a modular architecture where each layer in the transport composition modulates a distinct statistical property:

Marginal layers introduce location shift, monotonic warping, or non-Gaussianity for individual one-dimensional marginals.
Covariance layers (linear mixing) effect arbitrary covariance kernel structures.
Elliptical/radial and Archimedean layers introduce heavy tails (Student-t, inverse Gamma) or extreme-value dependence (Clayton, Gumbel copulas), controlling the copula structure of joint laws (Rios, 2020).

The induced process law for $f = T(\xi)$ at locations $t_1,\dots,t_n$ is

$\pi_n(y) = \eta_n\big(T_n^{-1}(y)\big)\,\left|\det \nabla T_n^{-1}(y)\right|$

where $y = (f(t_1),\dots,f(t_n))$ and $\eta_n$ is the standard normal density on $\mathbb{R}^n$.

Learning is performed by maximizing the marginal likelihood or its penalized version, with gradients accumulated layerwise (each layer is invertible with a tractable Jacobian). For fully triangular transports, posterior predictive draws decompose as transformations of GP conditional samples; explicit posterior mean and credible intervals are available for warping/covariance layers, while general compositions allow for straightforward MCMC due to explicit densities (Rios, 2020). The reported experiments show improvements over classical or warped GPs in handling boundedness, heavy tails, and tail dependence.
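As a worked instance of the explicit density that makes likelihood-based learning possible, the sketch below evaluates the log-density of a two-layer transport $y = \exp(m + L\xi)$ by change of variables: the standard normal log-density at $T_n^{-1}(y)$ plus the log-determinant of the inverse Jacobian. The specific map, the helper name `transport_logpdf`, and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def transport_logpdf(y, m, L):
    """log pi_n(y) for y = T_n(xi) = exp(m + L @ xi), with xi ~ N_n(0, I_n).

    L is a lower-triangular Cholesky factor with positive diagonal.
    """
    n = y.shape[0]
    z = np.linalg.solve(L, np.log(y) - m)            # z = T_n^{-1}(y)
    log_eta = -0.5 * (z @ z) - 0.5 * n * np.log(2.0 * np.pi)
    # Jacobian of T_n^{-1} is L^{-1} diag(1 / y), so
    # log|det| = -sum(log diag(L)) - sum(log y).
    log_det = -np.sum(np.log(np.diag(L))) - np.sum(np.log(y))
    return log_eta + log_det

t = np.array([0.0, 0.3, 1.0])
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.5 ** 2) + 1e-6 * np.eye(3)
L = np.linalg.cholesky(K)
m = np.zeros(3)
y = np.array([0.8, 1.1, 2.0])
logp = transport_logpdf(y, m, L)
```

For this particular map the result coincides with the multivariate log-normal density, which gives a convenient correctness check; the same two-term structure (base log-density plus log-Jacobian) applies to any composition of invertible layers.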

3. Transport GPs in Spatiotemporal Modeling and Physical Advection

TGPs have been advanced as a rigorous framework for modeling the movement of scalar (and vector) fields undergoing advection:

Physics-constrained TGPs: Model a scalar field $u(x,t)$ as observed under the action of a time-dependent velocity field $v(x,t)$, with the transport equation

$\partial_t u + v\cdot\nabla_x u = 0$

relaxed to allow for intrinsic temporal evolution. The covariance structure is defined by

$k\big((x,t),(x',t')\big) = k_x\big(\psi(x,t),\psi(x',t')\big)\,k_t(t,t')$

where $k_x$ and $k_t$ are spatial and temporal base kernels and $\psi$ is a learned invertible "backward flow" parameterized via residual neural networks, ensuring bijectivity and tractable differentials. The mean and hyperparameters are fitted via joint maximum likelihood, and the velocity recovered analytically: $v = -(\nabla_x\psi)^{-1}\,\partial_t\psi$ (Fahmy et al., 16 May 2025).

This formulation delivers physically meaningful, coherent velocity fields for large-scale satellite data, outperforming conventional local tracking in terms of smoothness, spatial coverage, and empirical error (Fahmy et al., 16 May 2025).
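The analytic velocity recovery can be checked on a toy case. The sketch assumes the backward flow $\psi(x,t)$ returns the initial position of the particle found at $x$ at time $t$, so that $\psi$ is constant along trajectories and $v = -(\nabla_x\psi)^{-1}\partial_t\psi$. A hand-written rigid rotation stands in for the learned residual network, and derivatives are taken by central differences rather than automatic differentiation; `OMEGA`, `backward_flow`, and `velocity_from_flow` are hypothetical names.

```python
import numpy as np

OMEGA = 0.7  # angular speed of the hypothetical rigid rotation

def backward_flow(x, t):
    """psi(x, t): position at time 0 of the particle located at x at time t."""
    c, s = np.cos(OMEGA * t), np.sin(OMEGA * t)
    rot_back = np.array([[c, s], [-s, c]])           # rotation by angle -OMEGA * t
    return rot_back @ x

def velocity_from_flow(psi, x, t, h=1e-5):
    """Recover v = -(d psi / d x)^{-1} (d psi / d t) with central differences."""
    d = x.shape[0]
    jac = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        jac[:, j] = (psi(x + e, t) - psi(x - e, t)) / (2.0 * h)
    dpsi_dt = (psi(x, t + h) - psi(x, t - h)) / (2.0 * h)
    return -np.linalg.solve(jac, dpsi_dt)

x = np.array([1.0, 2.0])
v = velocity_from_flow(backward_flow, x, t=0.4)
```

For a rigid rotation the true velocity at $x = (x_1, x_2)$ is $\omega\,(-x_2, x_1)$, which the recovered `v` should match up to finite-difference error.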

4. Optimal Transport, Barycenters, and Distributional Inputs

Optimal transport theory supports several TGP-related constructions:

Gaussian process barycenters (ensemble aggregation): By interpreting each local GP expert's predictive distribution as a measure, product-of-expert models can be improved using the 2-Wasserstein barycenter. For 1D Gaussians, the barycenter of $M$ experts (means $\mu_i$, variances $\sigma_i^2$) is the unique minimizer over $\nu$ of

$\sum_{i=1}^{M} \beta_i\, W_2^2\big(\nu, N(\mu_i,\sigma_i^2)\big)$

with closed-form $\mu^\ast = \sum_i \beta_i\mu_i$, $\sigma^\ast = \sum_i \beta_i\sigma_i$, and softmax-based confidence weights $\beta_i$ (nonnegative, summing to one) calibrating contributions (Cohen et al., 2021).
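A short sketch of barycenter aggregation for 1D Gaussian experts follows. It relies on the fact that, for 1D Gaussians, $W_2^2$ equals the squared difference of means plus the squared difference of standard deviations, so the weighted barycenter has the weighted-average mean and weighted-average standard deviation. The softmax scores used here (negative variance scaled by a `temperature`) are an assumption for illustration; the cited work may define the confidence scores differently.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax."""
    shifted = scores - np.max(scores)
    w = np.exp(shifted)
    return w / w.sum()

def gaussian_barycenter_1d(means, stds, weights):
    """2-Wasserstein barycenter of N(means[i], stds[i]^2) under the given weights."""
    return weights @ means, weights @ stds

def objective(m, s, means, stds, weights):
    """sum_i w_i * W_2^2(N(m, s^2), N(means[i], stds[i]^2)) for 1D Gaussians."""
    return np.sum(weights * ((m - means) ** 2 + (s - stds) ** 2))

means = np.array([0.9, 1.2, 3.0])
stds = np.array([0.2, 0.3, 1.5])
temperature = 2.0                                    # hypothetical tuning parameter
weights = softmax(-temperature * stds ** 2)          # assumption: lower variance, higher weight
m_bar, s_bar = gaussian_barycenter_1d(means, stds, weights)
```

With these scores the uncertain third expert receives a small weight, so the aggregate stays close to the two confident experts without collapsing the predictive spread.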

Kernels on probability measures via entropic OT: Entropic Sinkhorn-regularized costs between pairs $(P, \mathcal{U})$ of an input distribution and a fixed reference measure define Hilbertian embeddings $P \mapsto g^P$ (the Sinkhorn dual potential) in $L^2(\mathcal{U})$ (reference probability space), leading to positive definite radial kernels $K(P,Q) = F\big(\|g^P - g^Q\|_{L^2(\mathcal{U})}\big)$ for GPs indexed by (empirical or true) distributions (Bachoc et al., 2022). Such kernels are universal on the weak topology, strictly positive definite under mild conditions, and admit efficient computation via Sinkhorn iterations and automatic differentiation.
Multivariate distributional input kernels: For Gaussian distributions with known covariances, composition with the Wasserstein barycenter and explicit transport maps produces strictly positive definite GP kernels on spaces of probability distributions over $\mathbb{R}^d$ via Hilbert space embeddings (Bachoc et al., 2018). These admit microergodicity of all parametric family parameters in infinite dimension, permitting consistent model selection.
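The entropic-OT kernel construction can be sketched as follows, under several assumptions: each input distribution is an empirical measure, the reference measure is a fixed point cloud `u` with uniform weights, the embedding is the Sinkhorn dual potential evaluated on `u` (centered to fix its additive constant), and the radial function is a squared exponential. The helper names and all numerical settings are illustrative and not taken from the cited paper.

```python
import numpy as np

def logsumexp(a, axis):
    """Stable log-sum-exp along one axis."""
    amax = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(amax, axis=axis) + np.log(np.sum(np.exp(a - amax), axis=axis))

def sinkhorn_potential(x, u, eps=0.5, n_iter=200):
    """Dual potential on the reference points u for entropic OT between
    the empirical measure of x and the empirical measure of u."""
    cost = 0.5 * np.sum((x[:, None, :] - u[None, :, :]) ** 2, axis=-1)   # (n, m)
    log_a = np.full(x.shape[0], -np.log(x.shape[0]))
    log_b = np.full(u.shape[0], -np.log(u.shape[0]))
    g = np.zeros(u.shape[0])
    for _ in range(n_iter):
        f = -eps * logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    return g - g.mean()                              # fix the additive constant

def ot_kernel(samples, u, lengthscale=1.0, eps=0.5):
    """Gram matrix K(P, Q) = exp(-||g^P - g^Q||^2_{L2(U)} / (2 * lengthscale^2))."""
    G = np.stack([sinkhorn_potential(x, u, eps) for x in samples])       # (N, m)
    sq_dist = np.mean((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

rng = np.random.default_rng(0)
u = rng.standard_normal((20, 2))                     # reference point cloud
samples = [rng.normal(loc=mu, scale=1.0, size=(30, 2)) for mu in (0.0, 0.5, 2.0)]
K = ot_kernel(samples, u)
```

Because every distribution is embedded in the same Hilbert space, the resulting Gram matrix is symmetric positive semidefinite and can be used like any other GP kernel matrix.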

5. Optimal Transport Distances Between Gaussian Processes

Computation and geometry of optimal transport distances between Gaussian (process) laws are central in model-based comparison, calibration, and barycenter construction:

Finite/infinite-dimensional Bures-Wasserstein distance: For $d$-dimensional (possibly degenerate) Gaussian laws $N(m_1,\Sigma_1)$, $N(m_2,\Sigma_2)$, the squared $2$-Wasserstein (Bures) distance is

$W_2^2 = \|m_1-m_2\|^2 + \mathrm{tr}\big(\Sigma_1+\Sigma_2-2\,(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$

with optimal transport maps characterized via operator means even in infinite-dimensional / degenerate settings (Yun et al., 25 Dec 2025). Operator-theoretic factorization (Green’s operator, Schur complement) yields closed-form generalized Monge couplings, and interpolated barycenters correspond to convex hulls in operator space.
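In finite dimensions the Bures-Wasserstein distance between $N(m_1,\Sigma_1)$ and $N(m_2,\Sigma_2)$ is $\|m_1-m_2\|^2 + \mathrm{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})$, which can be evaluated directly. The sketch below uses an eigendecomposition for the matrix square roots so that degenerate (rank-deficient) covariances are handled; the function names are illustrative.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def bures_wasserstein_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    root = psd_sqrt(S1)
    cross = psd_sqrt(root @ S2 @ root)
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)

m1, S1 = np.zeros(2), np.diag([1.0, 4.0])
m2, S2 = np.ones(2), np.diag([9.0, 1.0])
d_full = bures_wasserstein_sq(m1, S1, m2, S2)

# Degenerate case: the first law is supported on a line.
d_degenerate = bures_wasserstein_sq(m1, np.diag([1.0, 0.0]), m1, np.diag([4.0, 1.0]))
```

For commuting covariances the trace term reduces to the sum of squared differences of the standard deviations along the shared eigenvectors, which makes diagonal examples easy to verify by hand.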

Adapted (bicausal/causal) Wasserstein distance: For temporal Gaussian processes $N(a,A)$, $N(b,B)$ on $\mathbb{R}^N$ (real-valued, $N$ time steps), the adapted 2-Wasserstein $\mathcal{AW}_2$ distance is

$\mathcal{AW}_2^2 = \|a-b\|^2 + \mathrm{tr}(A) + \mathrm{tr}(B) - 2\sum_{i=1}^{N}\big|(L^\top M)_{ii}\big|$

where $L$, $M$ are Cholesky factors ($A = LL^\top$, $B = MM^\top$). This metric explicitly enforces time-causal coupling restrictions, with efficient $O(N^3)$ computation (Gunasingam et al., 2024). The construction elucidates differences with classical $W_2$ and provides closed-form bicausal OT maps.
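The gap between the adapted and classical distances can be seen on a two-step example. The sketch assumes real-valued discrete-time Gaussian processes with laws $N(a,A)$ and $N(b,B)$ and uses the Cholesky-based expression $\|a-b\|^2 + \mathrm{tr}(A) + \mathrm{tr}(B) - 2\sum_i |(L^\top M)_{ii}|$, obtained by coupling the innovations of the two processes step by step; this form and the function names are stated here as assumptions rather than quoted from the cited paper.

```python
import numpy as np

def adapted_w2_sq(a, A, b, B):
    """Adapted squared 2-Wasserstein distance via Cholesky factors (assumed form)."""
    L = np.linalg.cholesky(A)
    M = np.linalg.cholesky(B)
    diag_cross = np.sum(L * M, axis=0)               # (L^T M)_ii for each column i
    return (np.sum((a - b) ** 2) + np.trace(A) + np.trace(B)
            - 2.0 * np.sum(np.abs(diag_cross)))

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def classical_w2_sq(a, A, b, B):
    """Classical (Bures) squared 2-Wasserstein distance, for comparison."""
    root = psd_sqrt(A)
    return np.sum((a - b) ** 2) + np.trace(A + B - 2.0 * psd_sqrt(root @ B @ root))

rho = 0.5
a = b = np.zeros(2)
A = np.array([[1.0, rho], [rho, 1.0]])               # positively correlated steps
B = np.array([[1.0, -rho], [-rho, 1.0]])             # negatively correlated steps
aw = adapted_w2_sq(a, A, b, B)
w = classical_w2_sq(a, A, b, B)
```

The two processes share the same marginals at each time step but differ in temporal dependence; the adapted distance penalizes this more heavily than the classical one because couplings may not anticipate future values.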

Entropic OT and Sinkhorn divergences for GPs: In Hilbert spaces, the regularized OT between Gaussian measures admits closed-form optimal couplings and costs involving trace and Fredholm determinant terms. Differentiability and unique barycenters hold under broad conditions, with limiting behaviors interpolating between $W_2$ and maximum mean discrepancy (MMD) (Quang, 2020).

6. Computational and Statistical Implications

Transport GP frameworks maintain, or in some cases reduce, computational complexity compared to standard GPs when leveraging structure:

Scalability: Sparse/inducing-point approximations, low-rank kernel expansions, and stochastic mini-batch training techniques are fully compatible with most transport and flow-based TGP formulations (2011.01596, Fahmy et al., 16 May 2025).
Optimization: Layerwise or fully automatic differentiation is available for both flow parameters and GP (hyper)parameters due to explicit density computations, facilitating gradient-based optimization even through Sinkhorn iterations or neural ODEs (Bachoc et al., 2022, Fahmy et al., 16 May 2025).
Uncertainty Quantification: Transport GPs preserve or enhance calibrated uncertainty, especially when using barycenter or copula-based models. In distributed settings, transport-based aggregation provides smoother and more reliable predictive variances than classical PoE-type models (Cohen et al., 2021).
Statistical consistency: Microergodicity of hyperparameters in Hilbert-space-embedded OT kernels ensures parameter identifiability and consistency (Bachoc et al., 2018).

7. Applications and Theoretical Significance

Transport GPs have been applied, with strong reported results, in several domains:

Physical field interpolation and dynamics: Wind field retrieval from satellite imagery, where TGPs offer smooth, physically plausible vector fields even under poor feature contrast (Fahmy et al., 16 May 2025).
Machine learning on distributions: Classification, regression, and structured prediction where inputs are distributions, sets, or textures, leveraging OT-based kernels (Bachoc et al., 2022, Bachoc et al., 2018).
Heavy-tailed and bounded regression: Financial series, environmental phenomena, and biological signals requiring processes that depart from Gaussian assumptions in marginal or dependency structure (Rios, 2020).
Ensemble modeling and federated learning: Efficient, robust product-of-expert models via Wasserstein barycenter aggregation (Cohen et al., 2021).

The general strategy of transporting GPs via analytically or algorithmically tractable maps synthesizes advances in kernel methods, normalizing flows, optimal transport, and machine learning for distributions, achieving models with interpretable structure, improved expressiveness, and scalable inference.

Key Citations:

(Rios, 2020) Transport Gaussian Processes for Regression
(2011.01596) Transforming Gaussian Processes With Normalizing Flows
(Fahmy et al., 16 May 2025) Estimating Velocity Vector Fields of Atmospheric Winds using Transport Gaussian Processes
(Bachoc et al., 2022) Gaussian Processes on Distributions based on Regularized Optimal Transport
(Yun et al., 25 Dec 2025) Gaussian Optimal Transport Beyond Brenier's Theorem
(Gunasingam et al., 2024) Adapted optimal transport between Gaussian processes in discrete time
(Quang, 2020) Entropic regularization of Wasserstein distance between infinite-dimensional Gaussian measures and Gaussian processes
(Cohen et al., 2021) Healing Products of Gaussian Processes
(Bachoc et al., 2018) Gaussian processes with multidimensional distribution inputs via optimal transport and Hilbertian embedding