Sketched Ridgeless Linear Regression: The Role of Downsampling (2302.01088v2)
Abstract: Overparametrization often helps improve generalization performance. This paper presents a dual view of overparametrization, suggesting that downsampling may also help generalize. Focusing on the proportional regime $m\asymp n \asymp p$, where $m$ is the sketching size, $n$ is the sample size, and $p$ is the feature dimensionality, we investigate two out-of-sample prediction risks of the sketched ridgeless least squares estimator. Our findings challenge conventional beliefs by showing that downsampling does not always harm generalization and can actually improve it in certain cases. We identify the optimal sketching size that minimizes the out-of-sample prediction risks and demonstrate that the optimally sketched estimator has more stable risk curves, eliminating the peaks that appear in the risk curves of the full-sample estimator. To facilitate practical implementation, we propose an empirical procedure for determining the optimal sketching size. Finally, we extend our analysis to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
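To make the setup concrete, the following is a minimal numerical sketch of a sketched ridgeless (minimum-norm) least squares fit in the proportional regime $m\asymp n\asymp p$. The Gaussian sketching matrix, the data-generating model, and the Monte Carlo risk estimate are illustrative assumptions, not the paper's exact construction or its empirical procedure for choosing the sketching size.

```python
import numpy as np

def sketched_ridgeless(X, y, m, seed=None):
    """Minimum-norm least squares on a sketched (downsampled) problem.

    Applies an m x n sketching matrix S to (X, y) and returns the
    ridgeless solution of the sketched system S X b = S y.
    A Gaussian sketch is assumed here purely for illustration.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # m x n Gaussian sketch
    # pinv gives the minimum-norm solution when S X is rank-deficient.
    return np.linalg.pinv(S @ X) @ (S @ y)

# Example in the proportional regime m ~ n ~ p (illustrative sizes).
n, p, m = 500, 400, 300
rng = np.random.default_rng(0)
beta = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_hat = sketched_ridgeless(X, y, m, seed=1)

# Monte Carlo estimate of the out-of-sample prediction risk E[(x^T(beta_hat - beta))^2].
X_test = rng.standard_normal((2000, p))
risk = np.mean((X_test @ (beta_hat - beta)) ** 2)
print(f"estimated out-of-sample risk at m={m}: {risk:.3f}")
```

Sweeping `m` over a grid and recomputing the estimated risk is one simple way to visualize how the risk curve of the sketched estimator behaves relative to the full-sample ($m = n$) estimator.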