Improving Implicit Regularization of SGD with Preconditioning for Least Square Problems (2403.08585v3)
Abstract: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than that of ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet, the extent to which preconditioning can enhance the generalization performance of SGD, and whether it can bridge the existing gap with ridge regression, remains uncertain. In this paper, we study the generalization performance of preconditioned SGD for the least squares problem. We make a comprehensive comparison between preconditioned SGD and (standard & preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditioning matrix. Second, leveraging the excess risk characterization of preconditioned SGD and ridge regression, we show by construction that there exists a simple preconditioning matrix that makes SGD comparable to (standard & preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is simple enough to allow robust estimation from finite samples while maintaining its theoretical improvement. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.
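As a concrete illustration of the setting above, the sketch below runs SGD on a synthetic least squares problem with and without a gradient preconditioner. The particular choice P = (Σ̂ + λI)⁻¹ built from the regularized sample covariance, the step size, and the excess-risk proxy are illustrative assumptions for this sketch only, not the specific construction or analysis from the paper.

```python
# Minimal sketch (assumed setup, not the paper's construction): preconditioned SGD
# for least squares, where gradients are rescaled by P = (Sigma_hat + lam*I)^{-1}
# to rebalance optimization across directions with very different curvature.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least squares problem with an ill-conditioned covariance spectrum.
n, d = 500, 20
eigs = np.logspace(0, -3, d)                      # strongly decaying eigenvalues
X = rng.standard_normal((n, d)) * np.sqrt(eigs)   # features with covariance diag(eigs)
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Preconditioner from the regularized sample covariance (illustrative choice).
lam = 1e-2
Sigma_hat = X.T @ X / n
P = np.linalg.inv(Sigma_hat + lam * np.eye(d))

def sgd(X, y, P, lr=0.02, epochs=5):
    """SGD on the squared loss with the stochastic gradient preconditioned by P."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]       # gradient of 0.5*(x'w - y)^2 at sample i
            w -= lr * P @ grad                    # preconditioned update
    return w

w_plain = sgd(X, y, np.eye(d))                    # standard SGD (identity preconditioner)
w_pre = sgd(X, y, P)                              # preconditioned SGD

def excess_risk(w):
    """Population excess risk E[(x'(w - w*))^2] under the known covariance diag(eigs)."""
    diff = w - w_star
    return float(diff @ np.diag(eigs) @ diff)

print("plain SGD    excess risk:", excess_risk(w_plain))
print("precond. SGD excess risk:", excess_risk(w_pre))
```

On a spectrum like this, rescaling the gradient by P speeds up progress along low-curvature directions; whether that also improves the excess risk depends on the spectrum, the noise level, and the choice of λ, which is precisely the trade-off the paper's excess risk bounds quantify.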