Multi-Layer Kernel Machines: Fast and Optimal Nonparametric Regression with Uncertainty Quantification (2403.09907v1)
Abstract: Kernel ridge regression (KRR) is widely used for nonparametric regression over reproducing kernel Hilbert spaces. It offers powerful modeling capabilities at the cost of significant computation, typically requiring $O(n^3)$ time and $O(n^2)$ storage for a sample of size $n$. We introduce a novel framework of multi-layer kernel machines that approximate KRR by employing a multi-layer structure and random features, and study how the number of random features and the layer sizes can be chosen optimally while still preserving the minimax optimality of the approximate KRR estimate. For various classes of random features, including those corresponding to Gaussian and Matérn kernels, we prove that multi-layer kernel machines can achieve $O(n^2\log^2 n)$ computational time and $O(n\log^2 n)$ storage space, and yield fast and minimax optimal approximations to the KRR estimate for nonparametric regression. Moreover, we construct uncertainty quantification for multi-layer kernel machines by using conformal prediction techniques with robust coverage properties. The analysis and theoretical predictions are supported by simulations and real data examples.
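The two ingredients named in the abstract can be illustrated with a minimal sketch: (i) approximating Gaussian-kernel ridge regression with random Fourier features, which replaces the $O(n^3)$ kernel solve by a ridge problem in $m \ll n$ features, and (ii) wrapping the fitted regression in split conformal prediction intervals. This is not the paper's multi-layer architecture, only a single-layer illustration of the underlying ideas; names such as `n_features`, `bandwidth`, and `alpha` are illustrative choices, not notation from the paper.

```python
# Sketch: random-feature approximation to Gaussian KRR + split conformal intervals.
# Assumptions (not from the paper): single layer of random Fourier features,
# Gaussian kernel with bandwidth 1.0, ridge penalty 1e-2, miscoverage alpha = 0.1.
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, n_features=200, bandwidth=1.0, rng=rng):
    """Map X (n x d) to random Fourier features approximating a Gaussian kernel."""
    n, d = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b), (W, b)

def ridge_fit(Z, y, lam=1e-2):
    """Solve the m x m ridge system instead of the n x n kernel system."""
    m = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)

# Toy data: y = sin(4x) + noise.
n = 500
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Split the data: one half to fit, one half to calibrate the conformal intervals.
idx = rng.permutation(n)
train, calib = idx[: n // 2], idx[n // 2:]

Z_train, (W, b) = random_fourier_features(X[train])
theta = ridge_fit(Z_train, y[train])

def predict(X_new):
    Z_new = np.sqrt(2.0 / W.shape[1]) * np.cos(X_new @ W + b)
    return Z_new @ theta

# Split conformal: the adjusted (1 - alpha) quantile of the calibration residuals
# gives symmetric intervals with finite-sample marginal coverage under exchangeability.
alpha = 0.1
resid = np.abs(y[calib] - predict(X[calib]))
q = np.quantile(resid, np.ceil((1 - alpha) * (len(calib) + 1)) / len(calib))

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
for x, p in zip(X_test[:, 0], predict(X_test)):
    print(f"x = {x:+.2f}: prediction {p:+.3f}, 90% interval [{p - q:+.3f}, {p + q:+.3f}]")
```

The design point this sketch conveys is the one the abstract emphasizes: the expensive object is the $n \times n$ kernel matrix, and once the kernel is replaced by a finite random-feature map, both fitting and prediction scale with the number of features rather than with $n^2$ or $n^3$, while the conformal wrapper supplies distribution-free uncertainty quantification on top of any such approximate fit.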