Critical feature learning in deep neural networks (2405.10761v1)

Published 17 May 2024 in cond-mat.dis-nn

Abstract: A key property of neural networks driving their success is their ability to learn features from data. Understanding feature learning from a theoretical viewpoint is an emerging field with many open questions. In this work we capture finite-width effects with a systematic theory of network kernels in deep non-linear neural networks. We show that the Bayesian prior of the network can be written in closed form as a superposition of Gaussian processes, whose kernels are distributed with a variance that depends inversely on the network width $N$. A large deviation approach, which is exact in the proportional limit for the number of data points $P = \alpha N \rightarrow \infty$, yields a pair of forward-backward equations for the maximum a posteriori kernels in all layers at once. We study their solutions perturbatively to demonstrate how the backward propagation across layers aligns kernels with the target. An alternative field-theoretic formulation shows that kernel adaptation of the Bayesian posterior at finite width results from fluctuations in the prior: larger fluctuations correspond to a more flexible network prior and thus enable stronger adaptation to data. We thus find a bridge between the classical edge-of-chaos NNGP theory and feature learning, exposing an intricate interplay between criticality, response functions, and feature scale.
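The abstract's forward-backward kernel equations reduce, at infinite width, to the classical NNGP forward recursion $K^{(l+1)}(x,x') = \sigma_w^2\,\mathbb{E}_{f\sim\mathcal{GP}(0,K^{(l)})}[\phi(f(x))\,\phi(f(x'))] + \sigma_b^2$ that the finite-width theory perturbs around. The paper's own equations are not reproduced in the abstract, so the sketch below only illustrates this standard forward recursion for a fully connected ReLU network, using the closed-form arc-cosine expectation; the choice of $\sigma_w^2$, $\sigma_b^2$, the toy data, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relu_expectation(k_xx, k_xy, k_yy):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k_xx, k_xy], [k_xy, k_yy]]),
    i.e. the arc-cosine kernel of degree 1 (Cho & Saul)."""
    norm = np.sqrt(k_xx * k_yy)
    cos_theta = np.clip(k_xy / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return norm / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_theta)

def nngp_kernels(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Forward NNGP kernel recursion
        K^{l+1} = sigma_w2 * E_{f ~ GP(0, K^l)}[phi(f(x)) phi(f(x'))] + sigma_b2
    for a ReLU network in the infinite-width limit.
    Returns one P x P kernel matrix per layer."""
    P, d = X.shape
    K = sigma_w2 * (X @ X.T) / d + sigma_b2      # input-layer kernel
    kernels = [K]
    for _ in range(depth):
        diag = np.diag(K)
        # Gaussian expectation of the activation, evaluated pairwise
        E = relu_expectation(diag[:, None], K, diag[None, :])
        K = sigma_w2 * E + sigma_b2
        kernels.append(K)
    return kernels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 10))            # toy data: P=16 points in d=10
    for l, K in enumerate(nngp_kernels(X, depth=4)):
        print(f"layer {l}: mean diagonal {np.mean(np.diag(K)):.3f}")
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$ (the critical, edge-of-chaos choice for ReLU), the diagonal of the kernel is preserved from layer to layer, which is the infinite-width criticality the abstract connects to feature learning at finite $N$.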
