Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning (2310.04357v3)
Abstract: We employ random matrix theory to establish consistency of generalized cross-validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by tuning sketch size alone in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in the Wasserstein-2 metric. In particular, this allows the construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an "ensemble trick" whereby the risk of unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results on both synthetic and real large-scale datasets with practical sketches, including CountSketch and subsampled randomized discrete cosine transforms.
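To make the objects in the abstract concrete, here is a minimal numerical sketch of a sketched ridge ensemble with a classical GCV score. All names, sizes, and the CountSketch construction are illustrative assumptions; the averaged per-member GCV below is the textbook formula, not the paper's corrected ensemble estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def countsketch(p, m, rng):
    # Dense representation of a p x m CountSketch: each of the p input
    # coordinates is hashed to one of m buckets with a random +/-1 sign.
    S = np.zeros((p, m))
    S[np.arange(p), rng.integers(0, m, size=p)] = rng.choice([-1.0, 1.0], size=p)
    return S

def sketched_ridge_gcv(X, y, S, lam):
    # Ridge regression on the sketched features XS, plus the classical
    # GCV score mean((y - yhat)^2) / (1 - tr(L)/n)^2, where L is the
    # hat matrix of the sketched fit.
    n = X.shape[0]
    Xs = X @ S
    G = Xs.T @ Xs / n + lam * np.eye(S.shape[1])
    coef = S @ np.linalg.solve(G, Xs.T @ y / n)   # coefficients in the original space
    L = Xs @ np.linalg.solve(G, Xs.T) / n          # n x n hat matrix
    yhat = L @ y
    gcv = np.mean((y - yhat) ** 2) / (1.0 - np.trace(L) / n) ** 2
    return coef, gcv

# Toy data (dimensions chosen only for illustration).
n, p, m, lam, K = 300, 150, 60, 0.1, 20
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

# Ensemble: average K independently sketched ridge fits.
fits = [sketched_ridge_gcv(X, y, countsketch(p, m, rng), lam) for _ in range(K)]
beta_ens = np.mean([c for c, _ in fits], axis=0)
gcv_avg = np.mean([g for _, g in fits])
```

Tuning in this setup amounts to scanning `lam` and the sketch size `m` and picking the pair minimizing the GCV-based risk estimate; the paper's results justify this kind of tuning asymptotically for a broad class of free sketches.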