The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (2306.17759v2)
Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
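To make the architectural modification concrete, below is a minimal NumPy sketch of the "shaped" attention described in the abstract: the Softmax logits are divided by a width-dependent temperature, and the Softmax output is centered at the identity. The function name `shaped_attention`, the specific temperature scaling `tau0 * n * sqrt(n_k)`, and the centering via subtraction of the uniform matrix (1/T) 11^T are assumptions made for illustration only; the paper's exact constants and residual-branch scalings are not reproduced here.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(X, W_Q, W_K, W_V, tau0=1.0):
    """Sketch of a 'shaped' attention layer (assumed form, following the abstract).

    X: (T, n) array of T tokens with width-n representations.
    The Softmax logits are scaled by a width-dependent temperature, and the
    Softmax output is centered at the identity, so that at initialization the
    attention matrix is a small perturbation of the identity.
    """
    T, n = X.shape
    n_k = W_Q.shape[1]
    # Width-dependent temperature (assumed scaling; the paper's exact constant may differ).
    tau = tau0 * n * np.sqrt(n_k)
    logits = (X @ W_Q) @ (X @ W_K).T / tau
    A = softmax(logits, axis=-1)
    # Center at identity: for large tau the Softmax tends to the uniform
    # matrix (1/T) 11^T, so subtracting it and adding I keeps A close to I.
    A_shaped = np.eye(T) + A - np.full((T, T), 1.0 / T)
    return A_shaped @ (X @ W_V)

# Example usage: at a large, width-dependent temperature the shaped attention
# matrix stays near the identity at initialization.
T, n, n_k = 8, 64, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((T, n))
W_Q = rng.standard_normal((n, n_k)) / np.sqrt(n)
W_K = rng.standard_normal((n, n_k)) / np.sqrt(n)
W_V = rng.standard_normal((n, n)) / np.sqrt(n)
out = shaped_attention(X, W_Q, W_K, W_V)
```

Under these assumptions, the Softmax is nearly uniform at initialization, so the centered attention matrix remains a small perturbation of the identity; this is the mechanism the abstract credits with keeping the covariance structure well-behaved and avoiding rank degeneracy at large depth.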