Proximal Mean Field Learning in Shallow Neural Networks

Published 25 Oct 2022 in cs.LG and stat.ML | arXiv:2210.13879v3

Abstract: We propose a custom learning algorithm for shallow over-parameterized neural networks, i.e., networks with a single hidden layer of infinite width. The infinite width of the hidden layer serves as an abstraction for the over-parameterization. Building on recent mean field interpretations of learning dynamics in shallow neural networks, we realize mean field learning as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn-regularized proximal algorithm to approximate the distributional flow of the learning dynamics over weighted point clouds. In this setting, a contractive fixed-point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent on the free energy associated with the risk functional.
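
The abstract's central computational step is a Sinkhorn-regularized proximal (JKO-type) update of the weights attached to a particle cloud of neuron parameters. The following is a minimal NumPy sketch of one such weight update, written in the style of the Wasserstein-proximal fixed-point recursions this line of work builds on; the step size h, inverse temperature beta, entropic regularizer eps, potential psi, and the exact exponent in the z-update are illustrative assumptions, not reproduced from the paper.

    import numpy as np

    def proximal_weight_update(w_prev, X_prev, X_curr, psi_curr,
                               h=1e-3, beta=1.0, eps=1e-2,
                               max_iter=500, tol=1e-10):
        """One Sinkhorn-regularized proximal step on particle weights (sketch).

        w_prev   : (N,) weights of the particle cloud at step k-1 (sums to 1)
        X_prev   : (N, d) particle (neuron-parameter) locations at step k-1
        X_curr   : (N, d) locations at step k (e.g., after a noisy gradient step)
        psi_curr : (N,) potential / risk-gradient term evaluated at X_curr
        h        : proximal step size; beta : inverse temperature;
        eps      : entropic (Sinkhorn) regularization strength
        """
        # Squared-distance matrix (rows: previous cloud, cols: current cloud)
        C = np.sum((X_prev[:, None, :] - X_curr[None, :, :]) ** 2, axis=-1)
        Gamma = np.exp(-C / (2.0 * eps))          # Gibbs kernel
        xi = np.exp(-beta * psi_curr - 1.0)       # potential-dependent scaling

        # Contractive fixed-point recursion for the scaling vectors (y, z)
        y = np.ones_like(w_prev)
        z = np.ones_like(w_prev)
        for _ in range(max_iter):
            y_new = w_prev / (Gamma @ z)          # enforce the step k-1 marginal
            z_new = (xi / (Gamma.T @ y_new)) ** (1.0 / (1.0 + beta * eps / h))
            if (np.max(np.abs(y_new - y)) < tol and
                    np.max(np.abs(z_new - z)) < tol):
                y, z = y_new, z_new
                break
            y, z = y_new, z_new

        w_curr = z * (Gamma.T @ y)                # updated weights at step k
        return w_curr / w_curr.sum()

In the full scheme described in the abstract, a weight recursion of this kind would alternate with an update of the particle locations themselves, so that the weighted point cloud numerically tracks the interacting Wasserstein gradient flow; that location update is omitted from this sketch.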
