
Wilsonian Renormalization of Neural Network Gaussian Processes (2405.06008v2)

Published 9 May 2024 in cs.LG, cond-mat.dis-nn, hep-th, and stat.ML

Abstract: Separating relevant and irrelevant information is key to any modeling process or scientific inquiry. Theoretical physics offers a powerful tool for achieving this in the form of the renormalization group (RG). Here we demonstrate a practical approach to performing Wilsonian RG in the context of Gaussian Process (GP) Regression. We systematically integrate out the unlearnable modes of the GP kernel, thereby obtaining an RG flow of the GP in which the data sets the IR scale. In simple cases, this results in a universal flow of the ridge parameter, which becomes input-dependent in the richer scenario in which non-Gaussianities are included. In addition to being analytically tractable, this approach goes beyond structural analogies between RG and neural networks by providing a natural connection between RG flow and learnable vs. unlearnable modes. Studying such flows may improve our understanding of feature learning in deep neural networks, and enable us to identify potential universality classes in these models.


Summary

  • The paper introduces a Wilsonian renormalization framework to simplify deep neural networks modeled as Gaussian Processes by integrating out high-energy modes.
  • It leverages eigenfunction analysis of the kernel to adjust regression parameters, revealing distinct learning modes and computational benefits.
  • The study bridges theoretical physics and machine learning, suggesting new training algorithms and universal principles in learning dynamics.

Exploring the Renormalization Group Approach in Deep Neural Networks for Gaussian Processes

Introduction to Renormalization Group and Gaussian Processes

The renormalization group (RG) is a powerful tool in theoretical physics for relating the behavior of a system across different scales. This paper extends the RG framework to the behavior and training dynamics of Deep Neural Networks (DNNs) modeled as Gaussian Processes (GPs). Bringing RG principles into deep learning offers a new perspective on which parts of a model the data can actually constrain, drawing a concrete parallel with coarse-graining in complex physical systems.

Understanding the Gaussian Process Model for DNNs

Gaussian Processes (GPs) provide a robust statistical framework widely used for regression and classification. Modeling DNNs as GPs corresponds to the infinite-width, strongly overparametrized limit, in which the network prior becomes a GP with zero mean and a covariance function known as the kernel. This viewpoint simplifies the analysis considerably: the complicated network is replaced by a distribution over functions characterized entirely by its kernel.

Key insights from the GP perspective:

  1. Quadratic Nature of the GP: Because the prior is Gaussian, the GP description assigns the network outputs a quadratic action, which keeps the statistics under study analytically tractable.
  2. Eigenfunctions and Modes: The network's predictions are decomposed along the eigenfunctions of the kernel; each mode's contribution to the prediction is set by its eigenvalue relative to the ridge (noise) scale.
  3. High-energy Modes: Fluctuations along modes with small kernel eigenvalues are strongly suppressed, in close analogy with the high-energy modes of a physical system that are discarded when building a low-energy effective theory (a minimal numerical sketch follows this list).
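
To make the mode picture concrete, here is a minimal numerical sketch (not code from the paper) of GP ridge regression viewed through the eigendecomposition of the kernel Gram matrix. The RBF kernel, the toy data, and the ridge value are illustrative assumptions; the point is that each eigenmode of the target is shrunk by a factor λ/(λ + ridge), so modes whose eigenvalues sit far below the ridge scale are effectively unlearnable.

```python
import numpy as np

# Minimal sketch (not from the paper): GP ridge regression through the
# eigenmodes of the kernel Gram matrix. Each mode of the target is shrunk by
# lam_i / (lam_i + ridge); modes far below the ridge scale are effectively
# unlearnable.

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

def rbf_gram(A, B, length_scale=0.3):
    """RBF (squared-exponential) Gram matrix for 1-D inputs."""
    d2 = (A[:, 0][:, None] - B[:, 0][None, :]) ** 2
    return np.exp(-d2 / (2.0 * length_scale ** 2))

K = rbf_gram(X, X)
ridge = 1e-2                          # noise / ridge parameter sigma^2

# Eigendecomposition of the Gram matrix: K = U diag(lam) U^T
lam, U = np.linalg.eigh(K)
y_modes = U.T @ y                     # target projected onto kernel eigenmodes

# GP posterior mean on the training inputs, assembled mode by mode:
# f = K (K + ridge * I)^{-1} y = U [lam / (lam + ridge)] U^T y
shrinkage = lam / (lam + ridge)
f_mean = U @ (shrinkage * y_modes)

learnable = int(np.sum(lam > ridge))  # rough count of modes the data resolves
print(f"{learnable} of {n} modes sit above the ridge scale")
print("training MSE:", float(np.mean((f_mean - y) ** 2)))
```

In this picture the "energy" of a mode is set by the inverse of its kernel eigenvalue, which is why the small-eigenvalue modes play the role of high-energy degrees of freedom.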

Application of the Wilsonian Renormalization Group to Gaussian Processes

The crux of the paper is the application of Wilsonian RG to GPs. The idea is to systematically "integrate out" the high-energy, unlearnable modes of the kernel, leaving an effective GP that retains only the contributions the data can resolve. This coarse-graining yields a renormalized GP in which irrelevant details are smoothed out, improving analytical and computational tractability and potentially revealing scaling laws in the learning dynamics.

Process and Implications:

  • Integrating Out High-energy Modes: The procedure examines the kernel's eigensystem, identifies the high-energy (small-eigenvalue) modes, and mathematically eliminates their effect from the GP representation (a schematic code illustration follows this list).
  • Renormalization of Parameters: Integrating out these modes adjusts the remaining parameters, most notably the ridge parameter of the regression. In the simplest setting this produces a universal flow of the ridge parameter; once non-Gaussian corrections are included, the flow becomes input-dependent.
  • Emergence of New Learning Dynamics: After renormalization, the GP exhibits altered learning dynamics that can be more robust or efficient, suggesting training algorithms inspired by theoretical physics.
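
As a rough illustration of this coarse-graining (a schematic sketch under simplifying assumptions, not the paper's derivation), the snippet below discards all kernel modes whose eigenvalues fall below a chosen cutoff and compares the resulting GP predictor with the full one on the training inputs. The cutoff, kernel, and data are arbitrary choices made for illustration; in particular, the paper derives how the ridge parameter itself flows as modes are integrated out, which this sketch does not attempt to reproduce.

```python
import numpy as np

# Schematic illustration only: drop "unlearnable" kernel modes (eigenvalues
# below a cutoff) and verify that the GP predictor on the training inputs is
# barely affected. The paper's actual result, the flow of the ridge parameter
# under this coarse-graining, is not reproduced here.

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(4.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = np.exp(-(X - X.T) ** 2 / (2.0 * 0.3 ** 2))    # RBF Gram matrix (n x n)
ridge = 1e-2

lam, U = np.linalg.eigh(K)
y_modes = U.T @ y

# Full GP posterior mean on the training inputs.
f_full = U @ (lam / (lam + ridge) * y_modes)

# "Integrate out" the unlearnable modes: keep only eigenvalues above a cutoff.
cutoff = 0.1 * ridge                              # arbitrary illustrative scale
keep = lam > cutoff
f_kept = U[:, keep] @ (lam[keep] / (lam[keep] + ridge) * y_modes[keep])

rel_change = np.linalg.norm(f_full - f_kept) / np.linalg.norm(f_full)
print(f"kept {int(keep.sum())}/{n} modes; "
      f"relative change in predictor: {rel_change:.2e}")
```

Because each discarded mode is already suppressed by a factor λ/(λ + ridge) ≪ 1, removing it changes the predictor only marginally; the non-trivial content of the paper lies in how the parameters of the remaining effective GP must be adjusted to absorb what was removed.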

Future Directions and Theoretical Implications

Incorporating RG into the study of GPs opens several avenues for future research. Theoretically, it nudges us toward a unified treatment of learning systems, drawing on field theory and statistical physics. Practically, it offers a principled way to identify which parts of a model the data cannot constrain and to simplify the model accordingly.

  • Potential Universality in Learning Systems: The paper hints at the intriguing possibility of discovering universal behaviors in learning systems, akin to universality in physical systems near critical points.
  • Enhanced Training Algorithms: Knowing which modes of a network are learnable and which are redundant could let training procedures concentrate on the significant directions, potentially leading to faster convergence and better generalization.

Conclusion

The exploration of RG techniques in the context of GPs and DNNs isn't just a theoretical exercise but a prospective framework for developing more sophisticated and principled machine learning models. This paper elegantly bridges the gap between abstract physical concepts and practical learning algorithms, promising exciting developments in the AI research landscape.