Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective (2403.14917v2)
Abstract: In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods. To focus on the dynamics of the kernel induced by the first layer, we utilize a two-timescale limit in which the second layer moves much faster than the first. In this limit, the learning problem reduces to a minimization problem over the intrinsic kernel. We then show global convergence of the mean-field Langevin dynamics and derive the time and particle discretization errors. We also demonstrate that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel method, and that neural networks acquire a data-dependent kernel that aligns with the target function. In addition, we develop a label noise procedure that converges to the global optimum, and we show that the degrees of freedom appear as an implicit regularization.
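To make the two-timescale picture concrete, below is a minimal, illustrative sketch (not the paper's exact algorithm or scaling) of a particle discretization of mean-field Langevin dynamics for a two-layer network: the second layer is re-fit in closed form by ridge regression at each step (fast timescale), reducing the problem to learning the kernel induced by the first-layer particles, which then take one noisy gradient step (slow timescale). All hyperparameter names, values, and the synthetic data are assumptions made for illustration.

```python
# Illustrative sketch of mean-field Langevin dynamics in a two-timescale limit.
# Assumed setup: squared loss, tanh activation, ridge-regressed second layer.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed).
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# First-layer particles: N neurons, each a weight vector in R^d.
N = 100
W = rng.standard_normal((N, d))

eta = 0.05      # step size for the first-layer (slow) dynamics
lam = 1e-3      # ridge / weight-decay strength
lam_ent = 1e-3  # entropic regularization -> Gaussian noise scale
T = 500         # number of outer steps

def features(W, X):
    """Hidden-layer features sigma(<w_i, x>) for all particles and samples."""
    return np.tanh(X @ W.T)          # shape (n, N)

for t in range(T):
    Phi = features(W, X)             # (n, N)

    # Fast timescale: second layer solved in closed form (ridge regression),
    # i.e. the objective is minimized over the kernel induced by the particles.
    a = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(N), Phi.T @ y / n)

    # Slow timescale: one noisy (Langevin) gradient step for each particle.
    resid = Phi @ a - y                                        # (n,)
    grad = ((1 - Phi ** 2) * resid[:, None] * a[None, :]).T @ X / n  # (N, d)
    grad += lam * W                                            # weight decay
    noise = np.sqrt(2 * eta * lam_ent) * rng.standard_normal(W.shape)
    W = W - eta * grad + noise

# Re-fit the second layer once more and report the training error.
Phi = features(W, X)
a = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(N), Phi.T @ y / n)
print("final training MSE:", np.mean((Phi @ a - y) ** 2))
```

The ridge solve plays the role of the inner (fast) minimization over the second layer, while the Gaussian perturbation of the particles corresponds to the entropic regularization term in the mean-field objective; the label noise procedure discussed in the abstract is a separate mechanism and is not sketched here.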