Gaussian Process Priors
- Gaussian Process Priors are probability measures on function spaces defined by a mean function and a covariance kernel that controls smoothness, scale, and adaptability.
- Rescaling kernels, such as Matérn or Confluent Hypergeometric, enables minimax-optimal posterior contraction rates by aligning the kernel properties with the target function's regularity.
- Hierarchical Bayesian models using hyperpriors on rescaling parameters achieve full adaptation to unknown smoothness, enhancing predictive accuracy and uncertainty quantification.
A Gaussian process prior is a probability measure on a function space, fully characterized by its mean function and positive-definite covariance function (kernel). In nonparametric statistics, machine learning, spatial modeling, and computational sciences, Gaussian process (GP) priors provide a flexible Bayesian framework for expressing distributions over function-valued unknowns. The properties of the prior—including regularity, smoothness, adaptivity, and expressivity—depend critically on the choice of kernel, kernel rescaling, and any hierarchical modeling of kernel hyperparameters. Posterior contraction rates under GP priors are central to understanding their frequentist performance and minimax-optimality in regression and other estimation settings.
1. Fundamentals of Gaussian Process Priors
A zero-mean Gaussian process prior $W = (W_x : x \in \mathcal{X})$ for a real-valued function is defined through its covariance function $K(x, x') = \mathrm{Cov}(W_x, W_{x'})$: for any finite set of points $x_1, \dots, x_m \in \mathcal{X}$, the random vector $(W_{x_1}, \dots, W_{x_m})$ has a multivariate normal distribution with mean zero and covariance matrix $\big(K(x_i, x_j)\big)_{i,j=1}^{m}$. The covariance kernel determines sample path regularity: smooth kernels yield smoother random functions, and its parameters control scale, smoothness, periodicity, and other properties. When used as priors in nonparametric regression, e.g., $Y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$, Bayesian updating under the likelihood yields a posterior distribution on the function space.
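As a concrete illustration of this finite-dimensional definition and of the conjugate Gaussian update in the regression model above, here is a minimal Python sketch; the kernel choice, grid, and noise level are illustrative placeholders rather than settings from the paper.

```python
import numpy as np

def sq_exp_kernel(x, xp, length_scale=0.2):
    """Illustrative stationary kernel; any positive-definite K(x, x') could be used."""
    return np.exp(-0.5 * ((x[:, None] - xp[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)                      # evaluation grid
K = sq_exp_kernel(xs, xs) + 1e-10 * np.eye(len(xs))  # covariance matrix (K(x_i, x_j)) + jitter
prior_paths = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)  # sample paths from the prior

# Nonparametric regression with fixed design: Y_i = f(x_i) + eps_i, eps_i ~ N(0, sigma^2)
x_obs = rng.uniform(0.0, 1.0, 30)
sigma = 0.1
y_obs = np.sin(2 * np.pi * x_obs) + sigma * rng.normal(size=len(x_obs))

# Conjugate Gaussian update: posterior mean and covariance of f on the grid
K_oo = sq_exp_kernel(x_obs, x_obs) + sigma ** 2 * np.eye(len(x_obs))
K_go = sq_exp_kernel(xs, x_obs)
post_mean = K_go @ np.linalg.solve(K_oo, y_obs)
post_cov = K - K_go @ np.linalg.solve(K_oo, K_go.T)
```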
Posterior contraction rates for GP priors are a primary concern: for a true function $f_0$ of regularity $\beta$ (e.g., $\beta$-Hölder or Sobolev smooth), the posterior should concentrate around $f_0$ at the minimax-optimal rate $n^{-\beta/(2\beta + d)}$ for $n$ observations in $d$ dimensions. Without adaptation, correct scaling, or careful kernel selection, GP priors may fail to achieve this optimality.
2. Matérn and Confluent Hypergeometric Covariance Functions
The isotropic Matérn covariance family,
$$K_{\nu,\ell}(x, x') \;=\; \sigma^2\,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - x'\|}{\ell}\right)^{\nu} \mathcal{K}_\nu\!\left(\frac{\sqrt{2\nu}\,\|x - x'\|}{\ell}\right),$$
with smoothness parameter $\nu > 0$, length-scale $\ell > 0$, and modified Bessel function of the second kind $\mathcal{K}_\nu$, permits precise control of sample path mean-square differentiability through $\nu$.
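A minimal implementation of this covariance, assuming the common parametrization above (the paper's exact normalization may differ), using scipy.special.kv for the modified Bessel function:

```python
import numpy as np
from scipy.special import gamma, kv

def matern_kernel(h, nu=1.5, length_scale=1.0, variance=1.0):
    """Isotropic Matérn covariance at distances h >= 0.

    K(h) = variance * 2^(1-nu)/Gamma(nu) * (sqrt(2 nu) h / ell)^nu * K_nu(sqrt(2 nu) h / ell),
    with K(0) = variance taken by continuity; nu controls mean-square differentiability.
    """
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.full(h.shape, variance)                  # limiting value at h = 0
    nz = h > 0
    s = np.sqrt(2.0 * nu) * h[nz] / length_scale
    out[nz] = variance * (2.0 ** (1.0 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return out
```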
The Confluent Hypergeometric (CH) covariance class is a newer family parametrized as
$$K_{\nu,\alpha,\ell}(x, x') \;=\; \sigma^2\,\frac{\Gamma(\nu + \alpha)}{\Gamma(\nu)}\;\mathcal{U}\!\left(\alpha,\; 1 - \nu,\; \frac{\nu\,\|x - x'\|^2}{2\ell^2}\right),$$
where $\mathcal{U}$ is the confluent hypergeometric function of the second kind, $\nu$ controls mean-squared smoothness, $\alpha$ sets the polynomial tail index, and $\ell$ is a length-scale. The CH class supports flexible polynomial tail decay and mean-squared smoothness, offering an additional degree of modeling freedom beyond the Matérn class.
Both classes are crucial for matching the GP prior’s smoothness properties to those of the underlying function being modeled.
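To make the polynomial tail concrete, here is a sketch of the CH covariance under the parametrization displayed above (an assumption about the exact normalization), using scipy.special.hyperu for the confluent hypergeometric function of the second kind:

```python
import numpy as np
from scipy.special import gamma, hyperu

def ch_kernel(h, nu=1.5, alpha=1.0, length_scale=1.0, variance=1.0):
    """Confluent Hypergeometric (CH) covariance at distances h >= 0.

    K(h) = variance * Gamma(nu + alpha)/Gamma(nu) * U(alpha, 1 - nu, nu h^2 / (2 ell^2)),
    where U is the confluent hypergeometric function of the second kind; nu controls
    mean-squared smoothness and alpha the polynomial tail index.
    """
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.full(h.shape, variance)                  # K(0) = variance by continuity
    nz = h > 0
    z = nu * h[nz] ** 2 / (2.0 * length_scale ** 2)
    out[nz] = variance * gamma(nu + alpha) / gamma(nu) * hyperu(alpha, 1.0 - nu, z)
    return out

# Illustrative tail contrast: at large distances, ch_kernel([10.0, 50.0]) decays
# polynomially, far more slowly than a Matérn kernel with the same nu.
```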
3. Posterior Contraction and the Role of Rescaling
The posterior contraction rate for a GP prior $W$ depends on the "concentration function"
$$\varphi_{f_0}(\varepsilon) \;=\; \inf_{h \in \mathbb{H}:\, \|h - f_0\|_\infty \le \varepsilon} \|h\|_{\mathbb{H}}^2 \;-\; \log \Pr\big(\|W\|_\infty \le \varepsilon\big),$$
where $\mathbb{H}$ is the reproducing kernel Hilbert space (RKHS) of the prior. To achieve the minimax-optimal contraction rate $n^{-\beta/(2\beta + d)}$ for a $\beta$-regular $f_0$, the kernel's smoothness and length-scale must be matched or adapted to $\beta$. Without calibration, as with the squared exponential kernel or an unmatched Matérn smoothness, contraction can be suboptimal, with slower rates (possibly only logarithmic).
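For reference, the generic sufficient condition linking this concentration function to a contraction rate, assuming the standard van der Vaart and van Zanten framework on which such results are built, is that any sequence $\varepsilon_n \to 0$ satisfying
$$\varphi_{f_0}(\varepsilon_n) \;\le\; n\,\varepsilon_n^2$$
(together with the usual sieve and entropy conditions) yields posterior contraction around $f_0$ at a rate proportional to $\varepsilon_n$.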
The paper proves that optimal rates are obtained by rescaling the kernel: for the Matérn family, by defining the rescaled process $W^{a}_x = W_{ax}$ (equivalently, the rescaled kernel $K_a(x, x') = K(ax, ax')$) and choosing the rescaling parameter $a_n$ as a function of $n$ and the desired regularity $\beta$,
the GP prior can achieve the rate $n^{-\beta/(2\beta + d)}$ for any $\beta$ in a suitable range. Rescaling for the CH family uses the length-scale analogously. Thus, the key statistical benefit of rescaling is that the minimax rate is attainable even when the prior smoothness parameter (e.g., $\nu$ for Matérn, or $\nu$ and $\alpha$ for CH) does not match the true smoothness $\beta$.
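The rescaling operation itself is simple to express in code. The sketch below uses an exponential base kernel (Matérn with $\nu = 1/2$) purely for illustration and leaves the rescaling parameter $a$ free; the rate-optimal choice $a_n$, as a function of $n$ and $\beta$, is exactly what the contraction theory characterizes and is not reproduced here.

```python
import numpy as np

def base_kernel(h, ell=1.0):
    """Illustrative base kernel: exponential, i.e. Matérn with nu = 1/2."""
    return np.exp(-np.abs(h) / ell)

def rescaled_kernel(x, xp, a, base=base_kernel, **kw):
    """Rescaled stationary kernel K_a(x, x') = K(a x, a x') = K_base(a |x - x'|).

    Larger a shrinks the effective length-scale and roughens prior sample paths;
    smaller a smooths them. Choosing a = a_n from n and the target regularity beta
    is what yields the minimax-optimal contraction rate n^(-beta / (2 beta + d)).
    """
    h = a * np.abs(np.asarray(x, dtype=float)[:, None] - np.asarray(xp, dtype=float)[None, :])
    return base(h, **kw)
```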
4. Hierarchical Bayesian Procedures and Adaptation
The optimal rescaling parameter depends on the unknown regularity $\beta$ of $f_0$; hence, the authors analyze a hierarchical Bayesian model by placing a hyperprior on the rescaling parameter. Specifically, for the rescaling parameter $a$ (or, equivalently, the inverse length-scale), they use a prior density with suitable tail behavior; for example, a Gamma-type distribution is an eligible choice. The results show that the posterior under this hyperprior still contracts at the minimax-optimal rate $n^{-\beta/(2\beta + d)}$ for a whole range of regularities $\beta$, i.e., the procedure adapts to $\beta$ without prior knowledge. Thus, full adaptation is achieved, and no "plug-in" or "oracle" knowledge is required for optimal convergence.
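A minimal sketch of the hierarchical idea in one dimension, assuming a Gamma hyperprior on the rescaling parameter and approximating its posterior on a grid via the Gaussian marginal likelihood; this is a deliberately simplified stand-in for the paper's hierarchical procedure, and the kernel, grid, and hyperparameter values are illustrative.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def exp_kernel(x, xp, a):
    """Rescaled exponential kernel K_a(x, x') = exp(-a |x - x'|) (illustrative base)."""
    return np.exp(-a * np.abs(x[:, None] - xp[None, :]))

def log_marginal_likelihood(y, K, sigma2):
    """Log density of N(y | 0, K + sigma2 I) for the Gaussian regression model."""
    C = K + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(C)
    Cinv_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ Cinv_y - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x_obs = np.sort(rng.uniform(0.0, 1.0, 40))
y_obs = np.sin(4 * np.pi * x_obs) + 0.1 * rng.normal(size=len(x_obs))
sigma2 = 0.01

a_grid = np.linspace(0.5, 80.0, 200)                         # candidate rescaling values
log_prior = gamma_dist(a=2.0, scale=10.0).logpdf(a_grid)     # Gamma hyperprior on a
log_post = np.array([log_marginal_likelihood(y_obs, exp_kernel(x_obs, x_obs, a), sigma2)
                     for a in a_grid]) + log_prior
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()                                     # grid posterior over a
a_posterior_mean = float((weights * a_grid).sum())
```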
5. Applications and Empirical Performance
The theoretical results are motivated and supported by application to nonparametric regression with fixed design. The regression function is modeled as a sample path from the GP prior, and interest centers on predictive performance and uncertainty quantification.
Extensive simulation studies in one and two dimensions establish that rescaled Matérn and CH GP priors outperform standard squared exponential GPs, especially when the true function is rougher (e.g., a Brownian motion sample path). The empirical measures computed include mean squared prediction error, coverage of credible intervals, and interval length. Hierarchical procedures (with hyperpriors on the rescaling parameter) consistently yield coverage near the nominal level and competitive or superior predictive accuracy. In a real-data case involving geospatial prediction of atmospheric NO over latitude-longitude coordinates, the CH prior with hierarchical rescaling produces shorter credible intervals with near-nominal coverage than the Matérn or squared exponential alternatives.
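For completeness, the three empirical measures can be computed from posterior (predictive) draws along the following lines; the array names and the 95% level are placeholders, not outputs of the paper's experiments.

```python
import numpy as np

def prediction_metrics(f_true, post_draws, level=0.95):
    """MSPE, pointwise credible-interval coverage, and average interval length.

    f_true:     (m,) true function values at the test locations
    post_draws: (S, m) posterior (predictive) draws of the function at those locations
    """
    lo, hi = np.quantile(post_draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    post_mean = post_draws.mean(axis=0)
    mspe = float(np.mean((post_mean - f_true) ** 2))
    coverage = float(np.mean((f_true >= lo) & (f_true <= hi)))
    avg_length = float(np.mean(hi - lo))
    return mspe, coverage, avg_length
```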
6. Theoretical and Practical Implications
- Rescaling enables minimax-optimal contraction for GP priors with Matérn or CH kernels over an entire scale of target smoothness classes, decoupling the optimality condition from the prior’s smoothness setting.
- Hierarchical Bayesian construction with a prior on the rescaling parameter yields adaptation to unknown smoothness.
- The minimax-optimality and full adaptivity justify the use of rescaled or hierarchical Matérn/CH GP priors in practical regression and spatial modeling scenarios.
- Compared to previous results for unrescaled GPs, rescaled and hierarchical Matérn/CH GPs avoid extraneous logarithmic factors in contraction rates and deliver improved frequentist guarantees.
Covariance Class | Smoothness Control | Tail Decay | Rescaling Benefits |
---|---|---|---|
Matérn | $\nu$ (differentiability) | Exponential | Minimax-optimal rate via rescaling parameter $a_n$ |
Confluent Hypergeometric | $\nu$, $\alpha$ | Polynomial | Minimax-optimal rate via rescaled length-scale |
7. Summary
Posterior contraction under Gaussian process priors is critically determined by the interaction between the covariance function's smoothness and scaling (length-scale) parameters and the regularity of the true function. Through proper rescaling of the kernel and the use of a hierarchical prior on the rescaling parameter, both the Matérn and CH covariance classes can deliver fully adaptive, minimax-optimal posterior contraction rates in nonparametric regression with fixed design. This approach permits flexible modeling of function smoothness and tail decay while upholding strong frequentist guarantees for both prediction and uncertainty quantification (Fang et al., 2023).