Locally Smoothed Gaussian Process Regression
- The paper introduces LSGPR, a method that restricts training point influence via localization kernels to enhance scalability and adaptivity.
- It leverages techniques like adaptive neighbor selection and local hyperparameter tuning to mitigate issues of global GP regression.
- Empirical evaluations reveal that LSGPR achieves competitive or superior prediction accuracy with order-of-magnitude computational speedups on benchmark datasets.
Locally Smoothed Gaussian Process Regression (LSGPR) refers to a class of Gaussian process regression methodologies that exploit localized structure in the input space to enhance computational tractability, statistical adaptivity, and prediction quality. By restricting or down-weighting the influence of distant training points on the prediction at any query location, LSGPR achieves a balance between the expressive power of GP models and the scalability or nonstationarity requirements encountered in many scientific and engineering applications. This paradigm encompasses a diversity of concrete frameworks, notably including weighting schemes via localization kernels, explicit partitioning, adaptive neighbor selection, and joint estimation of local hyperparameters or boundaries.
1. Theoretical Foundation: Localizing the Gaussian Process Prior
Standard Gaussian process regression places a Gaussian process prior $f \sim \mathcal{GP}(0, k)$ on the regression function, implying that all $n$ training points $(x_i, y_i)_{i=1}^n$ can exert global influence on the prediction at any test location $x_*$. Classically, the predictive mean and variance at $x_*$ are given by
$$\mu(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y, \qquad s^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*,$$
where $K_{ij} = k(x_i, x_j)$ is the kernel Gram matrix, $k_* = (k(x_1, x_*), \dots, k(x_n, x_*))^\top$, and $\sigma^2$ is the noise variance.
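As a concrete reference point, the global predictive equations above can be sketched in a few lines of NumPy. The function names and the `rbf` base kernel are illustrative choices of ours, not taken from the cited papers.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-stacked inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, x_star, kernel, noise_var):
    """Global GP posterior mean/variance at a single test point x_star:
    mu = k_*^T (K + sigma^2 I)^{-1} y,
    s2 = k(x_*, x_*) - k_*^T (K + sigma^2 I)^{-1} k_*."""
    K = kernel(X, X) + noise_var * np.eye(len(X))   # n x n Gram matrix
    k_star = kernel(X, x_star[None, :]).ravel()     # cross-covariances
    mu = k_star @ np.linalg.solve(K, y)             # O(n^3) solve
    s2 = kernel(x_star[None, :], x_star[None, :])[0, 0] \
        - k_star @ np.linalg.solve(K, k_star)
    return mu, s2
```

Note that every prediction touches the full $n \times n$ system; this per-query cost is exactly what localization avoids.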
LSGPR modifies this paradigm by introducing a localization kernel $\kappa$ centered at the query point $x_*$ with localization scale $\lambda > 0$, typically chosen such that $\kappa((x - x_*)/\lambda)$ rapidly decays or is zero when $\lVert x - x_*\rVert > \lambda$. The prior and the data are multiplied by $\sqrt{\kappa((\cdot - x_*)/\lambda)}$, yielding a localized GP prior $f_{x_*}(x) = \sqrt{\kappa((x - x_*)/\lambda)}\, f(x)$ and localized observations $\tilde y_i = \sqrt{\kappa((x_i - x_*)/\lambda)}\, y_i$. Only training points with $\kappa((x_i - x_*)/\lambda) > 0$ are included in the regression at $x_*$, leading to sparsification of the Gram matrix and strictly local influences (Gogolashvili et al., 2022).
2. Localization Kernels and Neighborhood Construction
The choice of localization kernel $\kappa$ is pivotal in LSGPR. Common examples include:
| Kernel Type | Formula | Support |
|---|---|---|
| Rectangular | $\kappa(u) = \frac{1}{V_d}\,\mathbf{1}\{\lVert u\rVert \le 1\}$ | Compact ($\lVert u\rVert \le 1$) |
| Epanechnikov | $\kappa(u) \propto (1 - \lVert u\rVert^2)\,\mathbf{1}\{\lVert u\rVert \le 1\}$ | Compact ($\lVert u\rVert \le 1$) |
| Gaussian | $\kappa(u) \propto \exp(-\lVert u\rVert^2/2)$ | All of $\mathbb{R}^d$ |
| Hilbert (sing.) | $\kappa(u) = \lVert u\rVert^{-\alpha}\,\mathbf{1}\{\lVert u\rVert \le 1\}$, $0 < \alpha < d$ | Compact ($\lVert u\rVert \le 1$) |
where $V_d$ is the volume of the unit $d$-ball. Compactly supported $\kappa$ enforces strict localization; Gaussian weights allow soft decay.
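The table entries translate directly into weight functions. The sketch below drops normalization constants (only relative weights matter once they enter the Gram matrix), and the function names are ours:

```python
import numpy as np

def rectangular(u):
    """Uniform weight on the unit ball, zero outside (compact support)."""
    return (np.linalg.norm(u, axis=-1) <= 1.0).astype(float)

def epanechnikov(u):
    """Parabolic decay, zero outside the unit ball (compact support)."""
    return np.maximum(0.0, 1.0 - (u ** 2).sum(axis=-1))

def gaussian(u):
    """Soft decay, strictly positive everywhere (no sparsification)."""
    return np.exp(-0.5 * (u ** 2).sum(axis=-1))

def localize(kern, x, x_star, lam):
    """Scaled localization weight kappa((x - x_star) / lambda)."""
    return kern((x - x_star) / lam)
```

With a compactly supported choice, `localize(...) > 0` defines the neighborhood directly; with the Gaussian, every point receives a nonzero weight and the Gram matrix stays dense.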
Neighborhoods may also be defined adaptively, e.g., via $k$-nearest neighbors with lengthscale-weighted metrics or by identifying regions from data-driven partitions, as in splitting or partitioned GPs (Terry et al., 2020, Sung et al., 2016).
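A minimal sketch of lengthscale-weighted neighbor selection, assuming per-dimension lengthscales that rescale each axis before computing distances (the helper name is ours):

```python
import numpy as np

def knn_lengthscale(X, x_star, lengthscales, k):
    """Indices of the k nearest neighbors of x_star under an
    axis-rescaled (lengthscale-weighted) squared distance."""
    d2 = (((X - x_star) / lengthscales) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]
```

Dimensions with large lengthscales (i.e., directions along which the function varies slowly) contribute less to the distance, so the neighborhood stretches along them.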
3. Inference and Prediction in the Localized GP Framework
Let $\mathcal{N}(x_*)$ denote the set of indices of training points sufficiently close to $x_*$. The locally modified Gram matrix is
$$\tilde K_{ij} = \sqrt{w_i}\, k(x_i, x_j)\, \sqrt{w_j}, \qquad i, j \in \mathcal{N}(x_*),$$
where $w_i = \kappa((x_i - x_*)/\lambda)$.
Posterior mean and variance for $x_*$ are then
$$\mu(x_*) = \tilde k_*^\top (\tilde K + \sigma^2 I)^{-1} \tilde y, \qquad s^2(x_*) = k(x_*, x_*) - \tilde k_*^\top (\tilde K + \sigma^2 I)^{-1} \tilde k_*,$$
with $\tilde k_{*,i} = \sqrt{w_i}\, k(x_i, x_*)$ and $\tilde y_i = \sqrt{w_i}\, y_i$ for $i \in \mathcal{N}(x_*)$. Hyperparameters (kernel, noise, localization width) are selected by maximizing the local marginal log-likelihood on each neighborhood, with $\lambda$ typically chosen by grid search or by ensuring a minimum neighbor count (Gogolashvili et al., 2022).
The computational cost per test point is $O(m^3)$ for an $m \times m$ local system, with $m = |\mathcal{N}(x_*)| \ll n$. This is in sharp contrast with the $O(n^3)$ scaling of global GPR.
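Putting the pieces together, per-query prediction reduces to one small local solve. The following is a minimal sketch under the notation of this section (square-root weighting of the Gram matrix, rectangular localizer), not the reference implementation of Gogolashvili et al.:

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    """Squared-exponential base kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def rect_weight(U):
    """Rectangular localization weight: 1 inside the unit ball, else 0."""
    return (np.linalg.norm(U, axis=-1) <= 1.0).astype(float)

def lsgpr_predict(X, y, x_star, kernel, noise_var, lam, loc_weight=rect_weight):
    """LSGPR prediction at x_star: keep points with positive localization
    weight, sqrt-scale Gram matrix and data, solve the m x m system."""
    w = loc_weight((X - x_star) / lam)        # localization weights
    idx = np.flatnonzero(w > 0.0)             # neighborhood N(x_*)
    sw = np.sqrt(w[idx])
    K = sw[:, None] * kernel(X[idx], X[idx]) * sw[None, :]
    k_star = sw * kernel(X[idx], x_star[None, :]).ravel()
    A = K + noise_var * np.eye(len(idx))      # m x m system, m << n
    mu = k_star @ np.linalg.solve(A, sw * y[idx])
    s2 = kernel(x_star[None, :], x_star[None, :])[0, 0] \
        - k_star @ np.linalg.solve(A, k_star)
    return mu, s2, len(idx)
```

With the rectangular localizer the square-root weights equal one on the neighborhood, so the method degenerates to a plain GP on the nearby points; softer localizers additionally down-weight the edges of the window.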
4. Extensions: Adaptive Partitioning, Boundary Learning, and Local Hyperparameters
LSGPR gracefully accommodates a spectrum of extensions, each leveraging the local structure for improved adaptivity:
- Joint Local Boundary Estimation: Estimating piecewise continuous functions with discontinuities, as in "Jump GP," involves constructing a local linear separator $w^\top x + b$ and keeping only those local neighbors on the same side as $x_*$, with $(w, b)$ learned jointly with the kernel hyperparameters (Park, 2021).
- Streaming and Partitioned GPs: Partitioning the input space adaptively (via principal direction divisive partitioning or similar) and fitting local GPs to each leaf enables bounded per-update cost and linear memory scaling, suitable for streaming or massive datasets. Smoothing across boundaries is achieved by weighted aggregation of regionwise predictions (Terry et al., 2020).
- Locally Adaptive Weights and Distances: Adaptive neighborhood selection can weight training points by lengthscale-informed Mahalanobis distances or insert explicit weights into the kernel matrix construction, as in variational sparse-spectrum and local feature GP regression (Tan et al., 2013, Meier et al., 2014).
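As an illustration of the Jump-GP-style filtering in the first bullet, the hedged sketch below keeps only neighbors on the query's side of a given hyperplane; in Park (2021) the separator is learned jointly with the GP hyperparameters rather than supplied as here:

```python
import numpy as np

def same_side_neighbors(Xn, x_star, w, b):
    """Keep only the local neighbors lying on the same side of the
    hyperplane {x : w^T x + b = 0} as the query point x_star."""
    side_star = np.sign(w @ x_star + b)
    return Xn[np.sign(Xn @ w + b) == side_star]
```

Fitting the local GP on the filtered set prevents training points from across a discontinuity from biasing the prediction at `x_star`.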
These adaptations yield models that can locally select relevant features, tune smoothness to the local nonstationarity of the function of interest, and more accurately capture discontinuities or regime shifts.
5. Empirical Evaluation and Comparative Performance
A spectrum of benchmarks demonstrates that LSGPR models deliver competitive or superior predictive accuracy compared to both global GPR and alternative local/mixed models, typically at orders of magnitude lower computational cost per prediction (Gogolashvili et al., 2022).
For example, on UCI benchmarks (Yacht, Boston, Concrete, Kin8nm, Powerplant, Protein), LSGPR with the Hilbert localization kernel achieved lower test MSE than both full GP and deep GP baselines in several cases, with local models operating on small neighborhoods per query and observed speedups of orders of magnitude for large $n$ (Gogolashvili et al., 2022). "Jump GP" and partitioned LSGPR both report strong mitigation of boundary bias and improved performance in piecewise-continuous and partitioned-regime settings (Park, 2021, Terry et al., 2020).
6. Practical Guidelines and Trade-Offs
Selecting the localization scale $\lambda$ (or neighbor count $m$) involves a bias-variance and speed-accuracy trade-off:
- Small $\lambda$ (fewer neighbors): high adaptivity, lower computational burden, greater variance.
- Large $\lambda$ (more neighbors): diminished locality, higher computation, predictions converge to global GPR.
Compactly supported localizers (rectangular, Epanechnikov) ensure sparse Gram matrices and computational parsimony. Soft/decaying kernels (Gaussian) may give smoother predictions but reduced sparsity.
Hyperparameters should be cross-validated locally. A kd-tree or similar spatial index is advised for efficient neighbor queries. For stability, inputs should be standardized.
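These guidelines can be combined into a simple per-query selection routine: index the inputs with a kd-tree and score candidate localization radii by the local marginal log-likelihood. The per-point normalization (so that neighborhoods of different sizes are comparable) and all names are heuristic choices of ours, not a prescription from the cited papers:

```python
import numpy as np
from scipy.spatial import cKDTree

def rbf(A, B, lengthscale=0.5):
    """Squared-exponential base kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def local_log_marginal(Xn, yn, kernel, noise_var):
    """Zero-mean GP log marginal likelihood on one neighborhood."""
    K = kernel(Xn, Xn) + noise_var * np.eye(len(Xn))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, yn))
    return (-0.5 * yn @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(yn) * np.log(2.0 * np.pi))

def pick_radius(X, y, x_star, kernel, noise_var, radii, min_neighbors=5):
    """Grid search over candidate localization radii; neighborhoods
    below the minimum neighbor count are skipped."""
    tree = cKDTree(X)                  # spatial index for neighbor queries
    best, best_score = None, -np.inf
    for lam in radii:
        idx = tree.query_ball_point(x_star, lam)
        if len(idx) < min_neighbors:
            continue
        score = local_log_marginal(X[idx], y[idx], kernel, noise_var) / len(idx)
        if score > best_score:
            best, best_score = lam, score
    return best
```

The kd-tree makes each radius query cheap, and the minimum neighbor count guards against the unstable, high-variance regime of very small neighborhoods.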
Local smoothing inherently induces nonstationary behavior in the predictive covariance, even with stationary base kernels. This adaptivity is a primary strength, allowing LSGPR to accommodate heteroskedasticity, varying function smoothness, and regime shifts (Gogolashvili et al., 2022).
7. Limitations and Open Challenges
Limitations of LSGPR methods include:
- Loss of some global structure in favor of local adaptation; very small neighborhoods can lead to high-variance or unstable predictions.
- For discontinuities, the quality of boundary estimation or partitioning is critical.
- Selection of $\lambda$, kernel type, and neighbor count remains problem-dependent and lacks a universal automated prescription.
- In very high-dimensional settings or for highly non-separable kernels, neighbor finding and kernel matrix assembly can still become costly.
Current research directions include extensions to more sophisticated adaptive neighborhood selection, combination with deep learning architectures for feature extraction, batch and online selection strategies, and theoretical guarantees of local model consistency under sparse sampling (Park, 2021, Terry et al., 2020).
For further mathematical and algorithmic details as well as implementation-focused recipes, see (Gogolashvili et al., 2022, Park, 2021, Terry et al., 2020), and (Sung et al., 2016).