On the Nyström Approximation for Preconditioning in Kernel Machines (2312.03311v4)
Abstract: Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for training kernel models must be iterative, but their convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool for speeding up the convergence of such iterative algorithms. However, computing and storing an exact spectral preconditioner is itself expensive, and the resulting computational and storage overheads can preclude applying kernel methods to large datasets. A Nyström approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximate preconditioner. Specifically, we show that a subsample of size logarithmic in the size of the dataset allows the Nyström-based preconditioner to accelerate gradient descent nearly as well as the exact spectral preconditioner, while also reducing the computational and storage overheads.
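To make the idea concrete, here is a minimal NumPy sketch (not the paper's exact algorithm) of a Nyström-approximated spectral preconditioner used inside gradient descent for kernel least squares: the top eigendirections of the kernel matrix are estimated from a small subsample and damped, so that larger step sizes become stable. The kernel choice, the subsample size `s`, the number of damped directions `q`, and the step-size rule below are illustrative assumptions of ours, not quantities prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=5.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-sq / (2.0 * bandwidth**2))

def nystrom_preconditioner(X, kernel, s, q, rng):
    """Estimate the top-(q+1) eigenpairs of K = kernel(X, X) from a
    subsample of s points and return a function that applies the
    resulting spectral preconditioner to a gradient vector."""
    n = X.shape[0]
    idx = rng.choice(n, size=s, replace=False)
    K_ns = kernel(X, X[idx])                 # n x s cross-kernel block
    K_ss = K_ns[idx]                         # s x s subsampled kernel matrix
    lam, U = np.linalg.eigh(K_ss)            # ascending eigenvalues
    lam, U = lam[::-1][:q + 1], U[:, ::-1][:, :q + 1]
    lam_est = (n / s) * lam                  # Nystrom estimates of K's top eigenvalues
    V = K_ns @ U / lam                       # Nystrom extension of eigenvectors to all n points
    V /= np.linalg.norm(V, axis=0)
    tail = lam_est[q]                        # estimated (q+1)-th eigenvalue of K

    def apply(g):
        # P g = g - V_q diag(1 - tail/lam_i) V_q^T g: shrink the top-q
        # eigendirections so the effective spectrum of K is flatter.
        coeff = (1.0 - tail / lam_est[:q]) * (V[:, :q].T @ g)
        return g - V[:, :q] @ coeff

    return apply, tail

# Usage: preconditioned gradient descent on the kernel least-squares
# objective 0.5 * a^T K a - y^T a (whose minimizer is a = K^{-1} y).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 10)), rng.normal(size=2000)
K = gaussian_kernel(X, X)
precond, tail = nystrom_preconditioner(X, gaussian_kernel, s=200, q=20, rng=rng)

alpha = np.zeros_like(y)
lr = 1.0 / (2.0 * tail)   # conservative step; the safe value depends on the estimate's accuracy
for _ in range(100):
    grad = K @ alpha - y
    alpha -= lr * precond(grad)
print("residual:", np.linalg.norm(K @ alpha - y))
```

Note that the preconditioner only touches the n-by-s cross-kernel block and an s-by-s eigendecomposition rather than the full n-by-n kernel matrix, which is the source of the computational and storage savings over the exact spectral preconditioner discussed in the abstract.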