Mapping of attention mechanisms to a generalized Potts model (2304.07235v4)
Abstract: Transformers are neural networks that have revolutionized natural language processing and machine learning. They process sequences of inputs, such as words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear which types of data distributions self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalized Potts model with interactions between sites and Potts colors. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute analytically the generalization error of self-attention in a model scenario using the replica method.
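The two ingredients named in the abstract, a self-attention layer in which positions and embeddings are treated separately, and a masked-word training objective that coincides with the Potts pseudo-likelihood, can be made concrete with a short sketch. The PyTorch code below is a minimal, hypothetical illustration rather than the authors' implementation; the class `FactoredSelfAttention`, the function `masked_lm_step`, and all parameter names are assumptions made for this example. The positional attention matrix is content-independent and its diagonal is masked, so the logits at site i depend only on the other sites, mirroring the Potts conditionals P(x_i | x_{rest}).

```python
# Minimal sketch (assumed names, not the paper's code) of a single-layer
# "factored" self-attention trained by masked language modelling on
# sequences of discrete Potts colors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactoredSelfAttention(nn.Module):
    """Attention weights depend only on positions; tokens enter via a value map."""

    def __init__(self, seq_len: int, vocab_size: int, embed_dim: int):
        super().__init__()
        self.pos_logits = nn.Parameter(torch.zeros(seq_len, seq_len))  # positional attention logits
        self.value = nn.Linear(vocab_size, embed_dim, bias=False)      # acts on one-hot colors
        self.readout = nn.Linear(embed_dim, vocab_size, bias=False)    # maps back to color logits

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integers in {0, ..., vocab_size - 1}
        x = F.one_hot(tokens, num_classes=self.value.in_features).float()
        diag = torch.eye(self.pos_logits.size(0), dtype=torch.bool)
        # Content-independent attention with a masked diagonal: site i never
        # attends to itself, so its output depends only on the other sites.
        attn = torch.softmax(self.pos_logits.masked_fill(diag, float("-inf")), dim=-1)
        mixed = torch.einsum("ij,bjc->bic", attn, x)
        return self.readout(self.value(mixed))  # logits of P(x_i | rest), up to normalization


def masked_lm_step(model, tokens, optimizer):
    """One MLM step: pick a random site and predict its color from the other sites."""
    site = torch.randint(tokens.size(1), (1,)).item()
    logits = model(tokens)[:, site, :]
    loss = F.cross_entropy(logits, tokens[:, site])  # one term of the negative pseudo-log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Averaging the masked cross-entropy over uniformly sampled masked sites sums the per-site conditional log-likelihoods, which is, in this sketch, exactly the pseudo-likelihood objective the abstract refers to.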
- Riccardo Rende
- Federica Gerace
- Alessandro Laio
- Sebastian Goldt