Approximation of relation functions and attention mechanisms (2402.08856v2)
Abstract: Inner products of neural network feature maps arise in a wide variety of machine learning frameworks as a method of modeling relations between inputs. This work studies the approximation properties of inner products of neural networks. It is shown that the inner product of a multi-layer perceptron with itself is a universal approximator for symmetric positive-definite relation functions. In the case of asymmetric relation functions, it is shown that the inner product of two different multi-layer perceptrons is a universal approximator. In both cases, a bound is obtained on the number of neurons required to achieve a given approximation accuracy. In the symmetric case, the function class can be identified with kernels of reproducing kernel Hilbert spaces, whereas in the asymmetric case it can be identified with kernels of reproducing kernel Banach spaces. Finally, these approximation results are applied to the attention mechanism underlying Transformers, showing that any retrieval mechanism defined by an abstract preorder can be approximated by attention through its inner-product relations. This result uses the Debreu representation theorem from economics, which represents a preference relation in terms of a utility function.
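The constructions in the abstract have a direct computational reading: a symmetric relation is modeled as the inner product of one MLP feature map with itself, an asymmetric relation as the inner product of two different MLP feature maps, and attention uses such inner products as retrieval scores. The following PyTorch sketch illustrates these constructions; it is not the paper's code, and all module and variable names (`SymmetricRelation`, `AsymmetricRelation`, `attention_retrieval`, etc.) are illustrative assumptions.

```python
# A minimal sketch of the constructions described in the abstract, using
# PyTorch. This is NOT the paper's code; module and variable names are
# illustrative assumptions.

import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out):
    # A small multi-layer perceptron used as a feature map R^d_in -> R^d_out.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class SymmetricRelation(nn.Module):
    # r(x, y) = <phi(x), phi(y)>: one MLP paired with itself, which the paper
    # shows is a universal approximator for symmetric positive-definite
    # relation functions (kernels of RKHSs).
    def __init__(self, d_in, d_hidden=64, d_feat=32):
        super().__init__()
        self.phi = mlp(d_in, d_hidden, d_feat)

    def forward(self, x, y):
        return (self.phi(x) * self.phi(y)).sum(-1)

class AsymmetricRelation(nn.Module):
    # r(x, y) = <phi(x), psi(y)>: two different MLPs, universal for
    # asymmetric relation functions (kernels of RKBSs).
    def __init__(self, d_in, d_hidden=64, d_feat=32):
        super().__init__()
        self.phi = mlp(d_in, d_hidden, d_feat)  # "query" feature map
        self.psi = mlp(d_in, d_hidden, d_feat)  # "key" feature map

    def forward(self, x, y):
        return (self.phi(x) * self.psi(y)).sum(-1)

def attention_retrieval(rel, query, keys, values):
    # Attention as retrieval: score every key against the query with the
    # learned inner-product relation, then return a softmax-weighted average
    # of the values -- a soft relaxation of "return the value attached to
    # the most preferred key".
    scores = rel(query.expand_as(keys), keys)  # shape (n,)
    weights = torch.softmax(scores, dim=-1)    # shape (n,)
    return weights @ values                    # shape (d_v,)

# Usage: one query attends over n key/value pairs.
d, n, d_v = 8, 5, 4
query = torch.randn(d)
keys, values = torch.randn(n, d), torch.randn(n, d_v)
out = attention_retrieval(AsymmetricRelation(d), query, keys, values)
print(out.shape)  # torch.Size([4])

# Symmetric case: r(x, x) = ||phi(x)||^2 >= 0, as required of a
# positive-definite relation.
sym = SymmetricRelation(d)
print(sym(query, query) >= 0)  # tensor(True)
```

The softmax here is a smooth stand-in for selecting the key that maximizes the learned score; the abstract's Debreu-based argument is that any retrieval rule defined by a suitable preorder over keys admits such a score (utility) function, which the inner-product relation can then approximate.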
- Awni Altabaa and John Lafferty “Relational Convolutional Networks: A Framework for Learning Representations of Hierarchical Relations” arXiv, 2023 arXiv:2310.03240 [cs]
- Awni Altabaa, Taylor Webb, Jonathan Cohen and John Lafferty “Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers” In 12th International Conference on Learning Representations, 2024
- Nachman Aronszajn “Theory of reproducing kernels” In Transactions of the American Mathematical Society 68.3, 1950, pp. 337–404
- Francis Bach “Breaking the Curse of Dimensionality with Convex Neural Networks” arXiv, 2016 arXiv:1412.8690 [cs, math, stat]
- Pierre Baldi and Yves Chauvin “Neural Networks for Fingerprint Recognition” In Neural Computation 5.3 MIT Press, 1993, pp. 402–418
- A.R. Barron “Universal Approximation Bounds for Superpositions of a Sigmoidal Function” In IEEE Transactions on Information Theory 39.3, 1993, pp. 930–945 DOI: 10.1109/18.256500
- Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger and Roopak Shah “Signature Verification Using a ‘Siamese’ Time Delay Neural Network” In Advances in Neural Information Processing Systems 6, 1993
- Martin Burger and Andreas Neubauer “Error bounds for approximation with neural networks” In Journal of Approximation Theory 112.2, 2001, pp. 235–250
- S. Chopra, R. Hadsell and Y. LeCun “Learning a similarity metric discriminatively, with application to face verification” In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 2005, pp. 539–546 DOI: 10.1109/CVPR.2005.202
- Hyung Won Chung et al. “Scaling Instruction-Finetuned Language Models” arXiv, 2022 arXiv:2210.11416 [cs]
- G. Cybenko “Approximation by Superpositions of a Sigmoidal Function” In Mathematics of Control, Signals, and Systems 2.4, 1989, pp. 303–314 DOI: 10.1007/BF02551274
- Gerard Debreu “Representation of a Preference Ordering by a Numerical Function” In Decision Processes New York: Wiley, 1954, pp. 159–165
- Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018 arXiv:1810.04805
- Linhao Dong, Shuang Xu and Bo Xu “Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, 2018, pp. 5884–5888
- Alexey Dosovitskiy et al. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”, 2020 arXiv:2010.11929
- Alex Graves, Greg Wayne and Ivo Danihelka “Neural Turing Machines” arXiv, 2014 DOI: 10.48550/arXiv.1410.5401
- Alex Graves et al. “Hybrid Computing Using a Neural Network with Dynamic External Memory” In Nature 538.7626 Nature Publishing Group UK London, 2016, pp. 471–476
- Jean-Yves Jaffray “Existence of a Continuous Utility Function: An Elementary Proof” In Econometrica 43.5/6, 1975, pp. 981–983 DOI: 10.2307/1911340
- Giancarlo Kerg et al. “On Neural Architecture Inductive Biases for Relational Tasks”, 2022 DOI: 10.48550/arXiv.2206.05056
- Gregory Koch, Richard Zemel and Ruslan Salakhutdinov “Siamese Neural Networks for One-Shot Image Recognition” In ICML Deep Learning Workshop 2 Lille, 2015
- Kevin J. Lang “A Time-Delay Neural Network Architecture for Speech Recognition” Technical Report, Carnegie-Mellon University, 1988
- Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022
- Francesco Locatello et al. “Object-Centric Learning with Slot Attention” arXiv, 2020 DOI: 10.48550/arXiv.2006.15055
- V. Maiorov “Approximation by neural networks and learning theory” In Journal of Complexity 22.1, 2006, pp. 102–117
- Y. Makovoz “Uniform approximation by neural networks” In Journal of Approximation Theory 95.2, 1998, pp. 215–228
- J. Mercer “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations” In Philosophical Transactions of the Royal Society of London. Series A 209 The Royal Society, 1909, pp. 415–446 JSTOR: https://www.jstor.org/stable/91043
- Charles A. Micchelli, Yuesheng Xu and Haizhang Zhang “Universal Kernels” In Journal of Machine Learning Research 7.95, 2006, pp. 2651–2667 URL: http://jmlr.org/papers/v7/micchelli06a.html
- OpenAI “GPT-4 Technical Report” arXiv, 2023 arXiv:2303.08774 [cs]
- P. P. Petrushev “Approximation by ridge functions and neural networks” In SIAM Journal on Mathematical Analysis 30.1, 1998, pp. 155–189
- A. Pinkus “Approximation theory of the MLP model in neural networks” In Acta Numerica 8, 1999, pp. 143–195
- Tomaso Poggio et al. “Why and When Can Deep-but Not Shallow-Networks Avoid the Curse of Dimensionality: A Review” In International Journal of Automation and Computing 14.5, 2017, pp. 503–519 DOI: 10.1007/s11633-017-1054-2
- Alexander Pritzel et al. “Neural Episodic Control” In International Conference on Machine Learning PMLR, 2017, pp. 2827–2836
- Colin Raffel et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” In The Journal of Machine Learning Research 21.1, 2020, pp. 5485–5551
- David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams “Learning Representations by Back-Propagating Errors” In Nature 323.6088 Nature Publishing Group UK London, 1986, pp. 533–536
- Adam Santoro et al. “Relational Recurrent Neural Networks” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018
- Caroline E. Seely “Non-Symmetric Kernels of Positive Type” In The Annals of Mathematics 20.3, 1919, pp. 172–176 DOI: 10.2307/1967866
- Hongwei Sun “Mercer Theorem for RKHS on Noncompact Sets” In Journal of Complexity 21.3, 2005, pp. 337–349 DOI: 10.1016/j.jco.2004.09.002
- Rohan Taori et al. “Alpaca: A Strong, Replicable Instruction-Following Model” Stanford Center for Research on Foundation Models, 2023
- Hugo Touvron et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models” arXiv, 2023 DOI: 10.48550/arXiv.2307.09288
- Ashish Vaswani et al. “Attention Is All You Need” In Advances in Neural Information Processing Systems 30, 2017
- Petar Veličković et al. “Graph Attention Networks” In arXiv preprint, 2017 arXiv:1710.10903
- Taylor W. Webb, Ishan Sinha and Jonathan D. Cohen “Emergent Symbols through Binding in External Memory”, 2021 DOI: 10.48550/arXiv.2012.14601
- Matthew A. Wright and Joseph E. Gonzalez “Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines”, 2021 arXiv:2106.01506 [cs.LG]
- “An Explicit Neural Network Construction for Piecewise Constant Function Approximation” In arXiv preprint, 2018 arXiv:1808.07390
- Vinicius Zambaldi et al. “Deep Reinforcement Learning with Relational Inductive Biases” In International Conference on Learning Representations, 2019
- Haizhang Zhang, Yuesheng Xu and Jun Zhang “Reproducing Kernel Banach Spaces for Machine Learning” In 2009 International Joint Conference on Neural Networks Atlanta, Ga, USA: IEEE, 2009, pp. 3520–3527 DOI: 10.1109/IJCNN.2009.5179093