A mathematical perspective on Transformers (2312.10794v4)
Published 17 Dec 2023 in cs.LG, math.AP, and math.DS
Abstract: Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge over long time horizons. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.
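The particle-system interpretation mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' code; it assumes the simplified self-attention dynamics the paper studies (tokens as particles on the unit sphere, identity query/key/value matrices, inverse temperature `beta`), with illustrative choices for the number of tokens, dimension, and Euler step size. Running it shows the pairwise inner products approaching 1, i.e. the particles collapsing into a cluster in long time.

```python
import numpy as np

# Minimal sketch (assumed model, not the authors' code): Euler simulation of
# simplified self-attention dynamics on the unit sphere. Tokens are particles
# driven by a softmax-weighted attention field and projected back onto the
# sphere; with identity query/key/value matrices they cluster as t grows.

rng = np.random.default_rng(0)
n, d = 32, 3            # number of tokens (particles), ambient dimension (assumed)
beta = 4.0              # inverse temperature controlling attention sharpness (assumed)
dt, steps = 0.05, 4000  # Euler step size and number of steps (assumed)

# Initialize particles uniformly at random on the sphere S^{d-1}.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

for _ in range(steps):
    # Attention weights: row-wise softmax of beta * <x_i, x_j>.
    logits = beta * (x @ x.T)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Value aggregation: softmax-weighted average of the particles.
    v = w @ x
    # Project onto the tangent space at each x_i so particles stay on the sphere.
    v -= np.sum(v * x, axis=1, keepdims=True) * x
    x += dt * v
    x /= np.linalg.norm(x, axis=1, keepdims=True)

# Pairwise inner products near 1 indicate the particles have clustered.
print("min pairwise inner product:", float((x @ x.T).min()))
```

Increasing `beta` sharpens the attention weights and speeds up clustering in this toy setting, while very small `beta` makes the dynamics closer to plain averaging.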