The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (2306.17759v2)

Published 30 Jun 2023 in stat.ML and cs.LG

Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
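
The abstract names two architectural changes that define the shaped Transformer: centering the Softmax output at the identity and scaling the Softmax logits by a width-dependent temperature. The sketch below is one plausible reading of that modification in PyTorch; the single-head, unbatched setup and the assumption that the temperature grows linearly with width are illustrative choices, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def shaped_attention(x, w_q, w_k, w_v, tau0=1.0):
    """Single 'shaped' attention head (illustrative sketch, not the paper's exact spec).

    x:             (seq_len, width) token representations
    w_q, w_k, w_v: (width, width) query/key/value projection matrices
    tau0:          base temperature; the effective temperature is assumed to scale with width
    """
    seq_len, width = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Width-dependent temperature: dividing the logits by tau0 * width keeps them
    # small as width grows, so the Softmax stays close to its value at zero logits.
    attn = F.softmax((q @ k.T) / (tau0 * width), dim=-1)

    # Center the Softmax at the identity: subtract its zero-logit value (the uniform
    # matrix with entries 1/seq_len) and add the identity, so at initialization each
    # token attends mostly to itself.
    shaped = torch.eye(seq_len) + attn - torch.full_like(attn, 1.0 / seq_len)

    return shaped @ v
```

Under this centering, zero logits make the attention matrix exactly the identity, which is the mechanism the abstract credits for keeping the covariance structure well-behaved and avoiding rank degeneracy as depth and width grow together.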
