Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers (2405.15926v2)
Abstract: Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics in the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$ with $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing for a qualitative transfer of its insights to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network by pruning those attention heads that are deemed less relevant by our theory.
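To make the 'attention path' picture concrete, below is a minimal, hypothetical sketch (not the paper's code): with $H$ heads in each of $L$ layers, a path picks one head per layer, giving $H^L$ paths in total. The sketch builds a toy per-path kernel, weights the kernels by their alignment with the label vector as a simple stand-in for the task-relevant kernel combination, forms the combined predictor via kernel ridge regression, and scores heads by the total weight of the paths they participate in, mimicking the pruning experiment. All sizes, the form of the per-path kernel, and the alignment-based weighting are illustrative assumptions, not the exact order-parameter equations derived in the paper.

```python
# Minimal illustrative sketch (assumed toy setup, not the paper's implementation).
import itertools
import numpy as np

rng = np.random.default_rng(0)

L, H = 2, 3           # layers and heads per layer (hypothetical sizes)
P, T, D = 40, 6, 8    # training examples, sequence length, token dimension
X = rng.normal(size=(P, T, D))       # toy input sequences
y = np.sign(rng.normal(size=P))      # toy binary labels

# An attention path picks one head per layer: H**L paths in total.
paths = list(itertools.product(range(H), repeat=L))

def path_kernel(X, path):
    """Toy per-path kernel: pool tokens with a fixed, head-specific attention
    profile at each layer, then take inner products of the pooled features."""
    seq_len = X.shape[1]
    weights = np.ones(seq_len)
    for layer, head in enumerate(path):
        head_rng = np.random.default_rng(1000 * layer + head)
        weights = weights * head_rng.dirichlet(np.ones(seq_len)) * seq_len
    pooled = np.einsum('ptd,t->pd', X, weights) / seq_len
    return pooled @ pooled.T         # (P, P), positive semidefinite

Ks = np.stack([path_kernel(X, p) for p in paths])   # all per-path kernels

# Task-relevant kernel combination (proxy): weight each path kernel by its
# normalized alignment with the label outer product y y^T.
align = np.array([y @ K @ y / np.linalg.norm(K) for K in Ks])
u = align / align.sum()
K_total = np.einsum('k,kij->ij', u, Ks)

# Predictor from the combined kernel (kernel ridge regression on the toy task).
coef = np.linalg.solve(K_total + 1e-3 * np.eye(P), y)
train_acc = np.mean(np.sign(K_total @ coef) == y)

# Head relevance: total weight of the paths each head participates in,
# suggesting which heads could be pruned with little loss.
head_score = np.zeros((L, H))
for w, p in zip(u, paths):
    for layer, head in enumerate(p):
        head_score[layer, head] += w

print("train accuracy:", train_acc)
print("per-head relevance:\n", head_score)
```

In this toy version, heads whose paths receive small combination weights contribute little to the total kernel, which is the qualitative signal used above to rank heads for pruning.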
Authors: Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, Haim Sompolinsky