
Spatially heterogeneous learning by a deep student machine (2302.07419v4)

Published 15 Feb 2023 in cond-mat.dis-nn, cond-mat.stat-mech, cs.LG, and stat.ML

Abstract: Deep neural networks (DNNs) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNNs, we study supervised learning by a DNN of width $N$ and depth $L$, consisting of $NL$ perceptrons with $c$ inputs each, using a statistical-mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$-dimensional input/output relations provided by a teacher machine. We show that the problem becomes exactly solvable in what we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$, using the replica method developed in (H. Yoshino, 2020). We also study the model numerically by performing simple greedy Monte Carlo (MC) simulations. The simulations reveal that learning by the DNN is quite heterogeneous in the network space: configurations of the teacher and student machines are more strongly correlated within the layers closer to the input/output boundaries, while the central region remains much less correlated due to the over-parametrization, in qualitative agreement with the theoretical prediction. We evaluate the generalization error of the DNN for various depths $L$ both theoretically and numerically. Remarkably, both the theory and the simulations suggest that the generalization ability of the student machines, which are only weakly correlated with the teacher in the center, does not vanish even in the deep limit $L \gg 1$, where the system becomes heavily over-parametrized. We also consider the impact of the effective dimension $D (\leq N)$ of the data by incorporating the hidden manifold model (S. Goldt et al., 2020) into our model. The theory implies that the loop corrections to the dense limit become enhanced by decreasing either the width $N$ or the effective dimension $D$ of the data. Simulations suggest that both lead to significant improvements in generalization ability.
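To make the simulation protocol more concrete, here is a minimal sketch, not the authors' exact code, of a greedy zero-temperature MC in a teacher-student setting of the kind described above: $L$ layers of $N$ sign perceptrons with binary ($\pm 1$) couplings, each node reading $c$ inputs from the previous layer, with single-coupling flips accepted only when the training error does not increase. The random sparse wiring, the system sizes, and the per-layer coupling overlap used as a crude probe of teacher-student correlation are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' exact protocol): greedy
# zero-temperature MC for a teacher-student DNN of sign perceptrons with
# binary (+/-1) couplings; each node of layer l+1 reads c random nodes of layer l.
import numpy as np

rng = np.random.default_rng(0)
N, L, c, M = 32, 4, 8, 128             # width, depth, fan-in, number of training patterns
alpha = M / c                          # load per perceptron, alpha = M/c as in the abstract

# Fixed random wiring: fan_in[l][i] lists the c inputs feeding node i of layer l
fan_in = [rng.choice(N, size=(N, c)) for _ in range(L)]

def forward(J, x):
    """Propagate +/-1 patterns x (shape M x N) through the L layers of couplings J."""
    s = x
    for l in range(L):
        pre = np.einsum('mic,ic->mi', s[:, fan_in[l]], J[l]) / np.sqrt(c)
        s = np.where(pre >= 0, 1, -1)  # sign activation with deterministic tie-breaking
    return s

def train_error(J, x, y):
    """Fraction of output bits on which the student disagrees with the teacher."""
    return np.mean(forward(J, x) != y)

# Teacher machine and training data
J_teacher = [rng.choice([-1, 1], size=(N, c)) for _ in range(L)]
X = rng.choice([-1, 1], size=(M, N))
Y = forward(J_teacher, X)

# Greedy MC on the student: flip one coupling, keep it unless the training error grows
J_student = [rng.choice([-1, 1], size=(N, c)) for _ in range(L)]
err = train_error(J_student, X, Y)
for step in range(20000):
    l, i, k = rng.integers(L), rng.integers(N), rng.integers(c)
    J_student[l][i, k] *= -1
    new_err = train_error(J_student, X, Y)
    if new_err <= err:
        err = new_err                  # accept the flip
    else:
        J_student[l][i, k] *= -1       # reject: restore the coupling
    if err == 0:
        break

# Layer-resolved teacher-student coupling overlap, a crude probe of the heterogeneity
overlaps = [float(np.mean(J_teacher[l] * J_student[l])) for l in range(L)]
print(f"training error = {err:.3f}, layer overlaps = {overlaps}")
```

Under these assumptions one can inspect the overlap profile layer by layer; the abstract reports that teacher-student correlations are stronger near the input/output boundaries than in the central layers.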

References (41)
1. Hajime Yoshino, “From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks,” SciPost Physics Core 2, 005 (2020).
2. Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová, “Modeling the influence of data structure on learning in neural networks: The hidden manifold model,” Physical Review X 10, 041044 (2020).
3. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature 521, 436 (2015).
4. Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart, “Scaling description of generalization with number of parameters in deep learning,” Journal of Statistical Mechanics: Theory and Experiment 2020, 023401 (2020).
5. Song Mei and Andrea Montanari, “The generalization error of random features regression: Precise asymptotics and the double descent curve,” Communications on Pure and Applied Mathematics 75, 667–766 (2022).
6. Bruno Loureiro, Cédric Gerbelot, Maria Refinetti, Gabriele Sicuro, and Florent Krzakala, “Fluctuations, bias, variance & ensemble of learners: Exact asymptotics for convex losses in high-dimension,” in International Conference on Machine Learning (PMLR, 2022) pp. 14283–14314.
7. Stéphane d’Ascoli, Levent Sagun, and Giulio Biroli, “Triple descent and the two kinds of overfitting: where and why do they appear?” Journal of Statistical Mechanics: Theory and Experiment 2021, 124002 (2021).
8. Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky, “Spin-glass models of neural networks,” Physical Review A 32, 1007 (1985).
9. Elizabeth Gardner, “The space of interactions in neural network models,” Journal of Physics A: Mathematical and General 21, 257 (1988).
10. Elizabeth Gardner and Bernard Derrida, “Three unfinished works on the optimal storage capacity of networks,” Journal of Physics A: Mathematical and General 22, 1983 (1989).
11. Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, “Generalisation error in learning with random features and the hidden manifold model,” in International Conference on Machine Learning (PMLR, 2020) pp. 3452–3462.
12. Benjamin Aubin, Bruno Loureiro, Antoine Maillard, Florent Krzakala, and Lenka Zdeborová, “The spiked matrix model with generative priors,” Advances in Neural Information Processing Systems 32 (2019).
13. Dominik Schröder, Hugo Cui, Daniil Dmitriev, and Bruno Loureiro, “Deterministic equivalent and error universality of deep random features learning,” arXiv preprint arXiv:2302.00401 (2023), 10.48550/arXiv.2302.00401.
14. Hugo Cui, Florent Krzakala, and Lenka Zdeborová, “Optimal learning of deep random networks of extensive-width,” arXiv preprint arXiv:2302.00375 (2023), 10.48550/arXiv.2302.00375.
15. Andreas Engel and Christian Van den Broeck, Statistical mechanics of learning (Cambridge University Press, 2001).
16. Lenka Zdeborová and Florent Krzakala, “Statistical physics of inference: Thresholds and algorithms,” Advances in Physics 65, 453–552 (2016).
17. Arthur Jacot, Franck Gabriel, and Clément Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” Advances in Neural Information Processing Systems 31 (2018).
18. Song Mei, Andrea Montanari, and Phan-Minh Nguyen, “A mean field view of the landscape of two-layer neural networks,” Proceedings of the National Academy of Sciences 115, E7665–E7671 (2018).
19. Lenaic Chizat and Francis Bach, “On the global convergence of gradient descent for over-parameterized models using optimal transport,” Advances in Neural Information Processing Systems 31 (2018).
20. Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, “The Gaussian equivalence of generative models for learning with shallow neural networks,” in Mathematical and Scientific Machine Learning (PMLR, 2022) pp. 426–471.
21. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive Science 9, 147–169 (1985).
22. Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli, “Exponential expressivity in deep neural networks through transient chaos,” Advances in Neural Information Processing Systems 29, 3360–3368 (2016).
23. Alan J Bray and Michael A Moore, “Chaotic nature of the spin-glass phase,” Physical Review Letters 58, 57 (1987).
24. Ch M Newman and DL Stein, “Multiple states and thermodynamic limits in short-ranged Ising spin-glass models,” Physical Review B 46, 973 (1992).
25. Yukito Iba, “The Nishimori line and Bayesian statistics,” Journal of Physics A: Mathematical and General 32, 3875 (1999).
26. Rémi Monasson and Riccardo Zecchina, “Weight space structure and internal representations: a direct approach to learning and generalization in multilayer neural networks,” Physical Review Letters 75, 2432 (1995).
27. Esther Levin, Naftali Tishby, and Sara A Solla, “A statistical approach to learning and generalization in layered neural networks,” Proceedings of the IEEE 78, 1568–1574 (1990).
28. Manfred Opper and Wolfgang Kinzel, “Statistical mechanics of generalization,” in Models of neural networks III (Springer, 1996) pp. 151–209.
29. Hidetoshi Nishimori, Statistical physics of spin glasses and information processing: an introduction, Vol. 111 (Clarendon Press, 2001).
30. S. Franz and G. Parisi, “Recipes for metastable states in spin glasses,” Journal de Physique I 5, 1401–1415 (1995).
31. Pierre-Gilles De Gennes, “Wetting: statics and dynamics,” Reviews of Modern Physics 57, 827 (1985).
32. Florent Krzakala and Lenka Zdeborová, “On melting dynamics and the glass transition. I. Glassy aspects of melting dynamics,” The Journal of Chemical Physics 134, 034512 (2011).
33. Florent Krzakala and Lenka Zdeborová, “On melting dynamics and the glass transition. II. Glassy dynamics as a melting process,” The Journal of Chemical Physics 134, 034513 (2011).
34. Géza Györgyi, “First-order transition to perfect generalization in a neural network with binary synapses,” Physical Review A 41, 7097 (1990).
35. K. Hukushima and H. Kawamura, “Chiral-glass transition and replica symmetry breaking of a three-dimensional Heisenberg spin glass,” Physical Review E 61, R1008–R1011 (2000).
36. Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton, “Similarity of neural network representations revisited,” in International Conference on Machine Learning (PMLR, 2019) pp. 3519–3529.
37. Wenxuan Zou and Haiping Huang, “Data-driven effective model shows a liquid-like deep learning,” Physical Review Research 3, 033290 (2021).
38. Giorgio Parisi and Miguel Angel Virasoro, “On a mechanism for explicit replica symmetry breaking,” Journal de Physique 50, 3317–3329 (1989).
39. Timm Plefka, “Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model,” Journal of Physics A: Mathematical and General 15, 1971 (1982).
40. Jean-Pierre Hansen and Ian R McDonald, Theory of simple liquids (Elsevier, 1990).
41. Jean Zinn-Justin, Quantum field theory and critical phenomena, Vol. 171 (Oxford University Press, 2021).
