
From Zero to Hero: How local curvature at artless initial conditions leads away from bad minima (2403.02418v2)

Published 4 Mar 2024 in cs.LG, cond-mat.dis-nn, and cond-mat.stat-mech

Abstract: We provide an analytical study of the evolution of the Hessian during gradient descent dynamics, and relate a transition in its spectral properties to the ability to find good minima. We focus on the phase retrieval problem as a case study for complex loss landscapes. We first characterize the high-dimensional limit where both the number $M$ and the dimension $N$ of the data go to infinity at fixed signal-to-noise ratio $\alpha = M/N$. For small $\alpha$, the Hessian is uninformative with respect to the signal. For $\alpha$ larger than a critical value, the Hessian displays, at short times, a downward direction pointing towards good minima. While descending, a transition in the spectrum takes place: the direction is lost and the system gets trapped in bad minima. Hence, the local landscape is benign and informative at first, before gradient descent brings the system into an uninformative maze. Through both theoretical analysis and numerical experiments, we show that this dynamical transition plays a crucial role at finite (even very large) $N$: it allows the system to recover the signal well before the algorithmic threshold corresponding to the $N\rightarrow\infty$ limit. Our analysis sheds light on this new mechanism that facilitates gradient descent dynamics in finite dimensions, and highlights the importance of a good initialization based on spectral properties for optimization in complex high-dimensional landscapes.
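To make the setting concrete, here is a minimal sketch (not the authors' code) of gradient descent on the standard real-valued phase-retrieval loss, tracking the smallest Hessian eigenvalue and the overlap with the hidden signal along the trajectory. All parameter choices (N, alpha, step size, iteration count) are illustrative assumptions, not values from the paper:

```python
# Minimal sketch (not the authors' code): gradient descent on the real-valued
# phase-retrieval loss  L(w) = (1/4M) * sum_mu ((a_mu . w)^2 - y_mu)^2,
# monitoring the smallest Hessian eigenvalue and the overlap with the signal.
import numpy as np

rng = np.random.default_rng(0)
N = 100                       # dimension (illustrative)
alpha = 6.0                   # ratio alpha = M / N (illustrative)
M = int(alpha * N)

x_star = rng.standard_normal(N)
x_star /= np.linalg.norm(x_star)          # hidden unit-norm signal
A = rng.standard_normal((M, N))           # sensing vectors a_mu as rows
y = (A @ x_star) ** 2                     # noiseless intensity measurements

def loss(w):
    return np.sum(((A @ w) ** 2 - y) ** 2) / (4 * M)

def grad(w):
    z = A @ w
    return A.T @ ((z ** 2 - y) * z) / M

def hessian(w):
    z = A @ w
    # H(w) = (1/M) * sum_mu (3 z_mu^2 - y_mu) a_mu a_mu^T
    return (A.T * (3 * z ** 2 - y)) @ A / M

w = rng.standard_normal(N)
w /= np.linalg.norm(w)                    # random ("artless") initialization
eta = 0.05                                # step size (illustrative)
for t in range(3001):
    if t % 500 == 0:
        lam_min = np.linalg.eigvalsh(hessian(w))[0]
        overlap = abs(w @ x_star) / np.linalg.norm(w)
        print(f"t={t:5d}  loss={loss(w):.4f}  "
              f"lambda_min={lam_min:+.3f}  overlap={overlap:.3f}")
    w -= eta * grad(w)
```

Monitoring lambda_min along the trajectory is how the spectral transition described in the abstract would show up numerically: an early negative eigenvalue (a descent direction correlated with the signal) that may be lost as the dynamics descends, with the printed overlap indicating whether the run escaped the bad minima.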

