
Online stochastic gradient descent on non-convex losses from high-dimensional inference (2003.10409v4)

Published 23 Mar 2020 in stat.ML, cs.LG, math.PR, math.ST, and stat.TH

Abstract: Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law of large numbers type behavior. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, and parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.
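
To make the online setting concrete, below is a minimal sketch (not taken from the paper) of one-pass SGD for a noiseless phase-retrieval model, one of the inference tasks named in the abstract. The loss, step size, and problem sizes are illustrative assumptions; the sketch simply tracks the overlap with the ground truth to illustrate the two-phase behavior described above: a long search phase from a random (uninformative) start, followed by rapid descent once non-trivial correlation is attained.

```python
# Minimal sketch of online (one-pass) SGD for a phase-retrieval-style model.
# Hypothetical illustration only: the loss, step size, and dimensions are
# illustrative choices, not the paper's exact setup or guarantees.
import numpy as np

rng = np.random.default_rng(0)

d = 500            # dimension of the parameter space
n = 50 * d         # number of samples; one fresh sample is used per step
step = 1.0 / d     # small step size, scaled with the dimension

# Ground-truth direction on the unit sphere.
v_star = rng.standard_normal(d)
v_star /= np.linalg.norm(v_star)

# Random start on the sphere: overlap with v_star is of order 1/sqrt(d).
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

overlaps = []
for _ in range(n):
    x = rng.standard_normal(d)               # fresh, independent sample
    y = (v_star @ x) ** 2                    # noiseless phase-retrieval label
    # Per-sample squared loss: 0.5 * ((v @ x)^2 - y)^2.
    pred = v @ x
    grad = 2.0 * (pred ** 2 - y) * pred * x  # gradient of the loss w.r.t. v
    v -= step * grad                         # one online SGD step
    v /= np.linalg.norm(v)                   # retract to the unit sphere
    overlaps.append(abs(v_star @ v))

# Typically the overlap lingers near 1/sqrt(d) during the search phase and
# then climbs rapidly towards 1 once non-trivial correlation is reached.
print(f"initial overlap ~ {1 / np.sqrt(d):.3f}, final overlap = {overlaps[-1]:.3f}")
```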

Authors (3)
  1. Reza Gheissari (38 papers)
  2. Aukosh Jagannath (37 papers)
  3. Gerard Ben Arous (18 papers)
Citations (70)
