- The paper provides theoretical insights into how the number of inducing points must scale with dataset size in Sparse Variational Gaussian Process regression to ensure accurate posterior approximation.
- It derives a priori bounds on the Kullback-Leibler divergence using eigenfunction inducing features and interdomain sparse approximations, guiding the practical choice of the number of inducing points.
- The research proposes a determinant-based sampling method for selecting inducing points and demonstrates how these findings enable efficient GP modeling on large datasets.
Overview of "Rates of Convergence for Sparse Variational Gaussian Process Regression"
This paper by Burt, Rasmussen, and van der Wilk addresses the computational challenges of Gaussian Processes (GPs) on large datasets. The authors focus on Sparse Variational Gaussian Process (SVGP) regression, a technique that reduces the computational cost from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$ by introducing $M$ inducing variables, where $M \ll N$. While previous research established computational efficiency in terms of $M$, this paper analyzes how $M$ should scale with $N$ to ensure a well-approximated GP posterior.
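To make the cost comparison concrete, here is a minimal NumPy sketch (illustrative code, not from the paper) of the predictive mean under the optimal Titsias variational posterior; the only operations that touch all $N$ data points are the $M \times N$ cross-covariance products, which is where the $\mathcal{O}(NM^2)$ scaling comes from.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def svgp_predictive_mean(X, y, Z, Xs, noise=0.1):
    """Predictive mean of the optimal (Titsias) variational posterior.

    The N-dependent work is forming the M x N matrix Kuf and the products
    Kuf @ Kuf.T and Kuf @ y, i.e. O(N M^2), versus the O(N^3) needed to
    factorize the full N x N kernel matrix in exact GP regression.
    """
    M = len(Z)
    Kuu = rbf(Z, Z) + 1e-8 * np.eye(M)           # M x M
    Kuf = rbf(Z, X)                               # M x N
    Kus = rbf(Z, Xs)                              # M x n_test
    Sigma = Kuu + Kuf @ Kuf.T / noise             # M x M, built in O(N M^2)
    m_u = Kuu @ np.linalg.solve(Sigma, Kuf @ y) / noise   # optimal q(u) mean
    return Kus.T @ np.linalg.solve(Kuu, m_u)

# Toy usage: N = 5000 observations summarized by M = 20 inducing points.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5000)
Z = np.linspace(-3, 3, 20)[:, None]
Xs = np.linspace(-3, 3, 5)[:, None]
print(svgp_predictive_mean(X, y, Z, Xs))
```

In practice one would rely on a library implementation (e.g. GPflow's SGPR or SVGP models) rather than this hand-rolled version; the sketch is only meant to show where the cost saving comes from.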
Key Contributions
- Scaling Laws and KL Divergence: The authors provide theoretical insight into how M must scale with N to keep the Kullback-Leibler (KL) divergence between the approximate and true posterior small. They show that, under certain conditions, M can grow sublinearly with N; for one-dimensional Gaussian inputs with a Squared Exponential kernel, $M = \mathcal{O}(\log N)$ is sufficient.
- A Priori Bounds: The paper derives a priori bounds on the KL divergence using eigenfunction inducing features and interdomain sparse approximations, offering practical guidance for selecting M before seeing the data.
- Sampling Methods for Inducing Points: The authors propose a determinant-based sampling method for inducing point selection using a discrete k-Determinantal Point Process (k-DPP), and show that this sampling scheme yields a high-quality approximation with small KL divergence (see the sketch after this list).
- Multidimensional Extensions: Extending the one-dimensional analysis, the research indicates that for D-dimensional inputs with a separable kernel and Gaussian input distribution, taking $M = \mathcal{O}(\log^D N)$ results in effective GP approximations.
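The following sketch (illustrative code under simplifying assumptions, not taken from the paper) makes two of these quantities concrete: it computes the gap between the exact log marginal likelihood and the collapsed Titsias ELBO, which for the optimal variational distribution equals the KL divergence the bounds control, and it selects inducing points with a greedy determinant-maximizing heuristic as a simple stand-in for exact k-DPP sampling.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(A, B, lengthscale=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def exact_lml(X, y, noise):
    """Exact log marginal likelihood log N(y | 0, Kff + noise*I), O(N^3)."""
    N = len(X)
    return multivariate_normal.logpdf(y, mean=np.zeros(N),
                                      cov=rbf(X, X) + noise * np.eye(N))

def titsias_elbo(X, y, Z, noise):
    """Collapsed Titsias bound: log N(y | 0, Qff + noise*I) - t / (2*noise),
    with Qff = Kfu Kuu^{-1} Kuf and t = tr(Kff - Qff)."""
    N = len(X)
    Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
    Kuf = rbf(Z, X)
    Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)
    t = N - np.trace(Qff)                 # diag(Kff) = 1 for this kernel
    fit = multivariate_normal.logpdf(y, mean=np.zeros(N),
                                     cov=Qff + noise * np.eye(N))
    return fit - t / (2 * noise)

def greedy_determinant_selection(X, M):
    """Greedy stand-in for k-DPP sampling: starting from an arbitrary point,
    repeatedly add the point with the largest Nystrom residual variance,
    which maximally increases the determinant of the selected submatrix."""
    chosen = [0]
    for _ in range(M - 1):
        Kzz = rbf(X[chosen], X[chosen]) + 1e-8 * np.eye(len(chosen))
        Kzx = rbf(X[chosen], X)
        resid = 1.0 - np.sum(Kzx * np.linalg.solve(Kzz, Kzx), axis=0)
        chosen.append(int(np.argmax(resid)))
    return X[chosen]

rng = np.random.default_rng(1)
N, noise = 1000, 0.05
X = rng.standard_normal((N, 1))                       # Gaussian inputs
y = np.sin(2 * X[:, 0]) + np.sqrt(noise) * rng.standard_normal(N)

lml = exact_lml(X, y, noise)
for M in (5, 10, 20, 40):
    Z = greedy_determinant_selection(X, M)
    # For the optimal q(u), KL(Q || true posterior) = exact LML - ELBO.
    print(M, lml - titsias_elbo(X, y, Z, noise))
```

Increasing M in the loop should drive the printed KL gap toward zero even though M remains far smaller than N, mirroring the paper's message that a slowly growing number of well-placed inducing points suffices.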
Implications
The paper's findings have significant implications for using GPs in large-scale machine learning tasks. By detailing how the number of inducing points needs to scale with dataset size, this work enables efficient GP modeling with limited computational resources. Practically, it also guides the implementation of SVGP methods in continual-learning settings, where data is observed incrementally.
Future Directions
The paper opens several avenues for further exploration:
- Non-Gaussian Likelihoods: Extending the bounds to likelihood functions beyond the Gaussian, particularly for classification models.
- Alternative Kernels and Input Distributions: Investigating the bounds for various kernels and considering real-world data distributions that diverge from common theoretical assumptions.
- Computational Techniques: Development of faster algorithms for sampling k-DPPs could further reduce computational overhead in initializing inducing points.
Conclusion
The paper provides robust theoretical results supporting the scalable application of SVGP methods in regression tasks. By focusing on the KL divergence and offering comprehensive bounds for inducing point selection, the authors equip researchers with tools to effectively manage large datasets while preserving the integrity of GP models. This work sets a foundation for continual improvements in sparse GP modeling, promoting broader use in areas requiring efficient uncertainty quantification and prediction.