Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

Published 12 Apr 2024 in stat.ML, cond-mat.stat-mech, cs.LG, math.PR, math.ST, and stat.TH (arXiv:2404.08602v2)

Abstract: Neural networks extract features from data using stochastic gradient descent (SGD). In particular, higher-order input cumulants (HOCs) are crucial for their performance. However, extracting information from the $p$th cumulant of $d$-dimensional inputs is computationally hard: the number of samples required to recover a single direction from an order-$p$ tensor (tensor PCA) using online SGD grows as $d^{p-1}$, which is prohibitive for high-dimensional inputs. This result raises the question of how neural networks extract relevant directions from the HOCs of their inputs efficiently. Here, we show that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order correlations. We show this effect analytically by deriving nearly sharp thresholds for the number of samples required by a single neuron to weakly recover these directions using online SGD from a random start in high dimensions. Our analytical results are confirmed in simulations of two-layer neural networks and unveil a new mechanism for hierarchical learning in neural networks.
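
As a rough illustration of the setting described in the abstract (not the authors' code), the following minimal sketch runs online SGD on a single neuron with a random start, on data whose hidden direction is carried by a non-Gaussian latent variable and therefore appears in the second- as well as higher-order cumulants. The data model in sample_input, the teacher non-linearity, the learning rate, and all other hyper-parameters are illustrative assumptions; the overlap printed at the end is the weak-recovery quantity the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                       # input dimension
u = rng.standard_normal(d)
u /= np.linalg.norm(u)        # hidden direction to be recovered

def sample_input(beta=2.0, rho=0.7):
    """One input x = beta*z*u + noise; the latent z is non-Gaussian,
    so u is encoded in cumulants of order 2 and higher (assumed model)."""
    z = rho * rng.choice([-1.0, 1.0]) + np.sqrt(1 - rho**2) * rng.standard_normal()
    return beta * z * u + rng.standard_normal(d)

def act(h):                   # neuron non-linearity (illustrative choice)
    return np.tanh(h)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)        # random start: overlap with u is O(1/sqrt(d))
lr = 0.05 / d                 # online-SGD learning rate (assumed scaling)

for t in range(200_000):      # one fresh sample per step (online / one-pass SGD)
    x = sample_input()
    y = act(x @ u)            # teacher label depends only on the hidden direction
    pred = act(x @ w)
    grad = (pred - y) * (1 - pred**2) * x   # gradient of 0.5*(pred - y)**2 w.r.t. w
    w -= lr * grad
    w /= np.linalg.norm(w)    # keep the weight on the sphere (spherical SGD)

print("overlap |w.u| =", abs(w @ u))        # weak recovery once this is order one
```

Under this toy model the overlap grows from its initial O(1/sqrt(d)) value to order one within the run; how the required number of samples scales with d and with the order of the cumulant carrying the signal is the question the paper analyses.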
