Approximation of relation functions and attention mechanisms (2402.08856v2)

Published 13 Feb 2024 in cs.LG and stat.ML

Abstract: Inner products of neural network feature maps arise in a wide variety of machine learning frameworks as a method of modeling relations between inputs. This work studies the approximation properties of inner products of neural networks. It is shown that the inner product of a multi-layer perceptron with itself is a universal approximator for symmetric positive-definite relation functions. In the case of asymmetric relation functions, it is shown that the inner product of two different multi-layer perceptrons is a universal approximator. In both cases, a bound is obtained on the number of neurons required to achieve a given accuracy of approximation. In the symmetric case, the function class can be identified with kernels of reproducing kernel Hilbert spaces, whereas in the asymmetric case the function class can be identified with kernels of reproducing kernel Banach spaces. Finally, these approximation results are applied to analyzing the attention mechanism underlying Transformers, showing that any retrieval mechanism defined by an abstract preorder can be approximated by attention through its inner product relations. This result uses the Debreu representation theorem in economics to represent preference relations in terms of utility functions.
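To make the abstract's construction concrete, here is a minimal sketch (hypothetical PyTorch code; the module name, layer sizes, and `symmetric` flag are ours, not the paper's) of modeling a relation function as the inner product of neural network feature maps. Tying the two feature maps together gives the symmetric positive-definite case; using two distinct MLPs gives the asymmetric case.

```python
import torch
import torch.nn as nn

class InnerProductRelation(nn.Module):
    """Approximates a relation r(x, y) by <phi(x), psi(y)>.

    symmetric=True ties psi to phi (the paper's symmetric
    positive-definite case); symmetric=False uses two distinct MLPs
    (the asymmetric case). All names and sizes are illustrative.
    """

    def __init__(self, in_dim: int, hidden_dim: int, feat_dim: int,
                 symmetric: bool = False):
        super().__init__()

        def mlp() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, feat_dim),
            )

        self.phi = mlp()
        self.psi = self.phi if symmetric else mlp()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # r(x, y) = <phi(x), psi(y)>, taken over the feature dimension.
        return (self.phi(x) * self.psi(y)).sum(dim=-1)


# Attention-style retrieval: each query scores all candidates by the
# learned inner-product relation, and softmax softly selects the most
# "preferred" candidate under the approximated preorder.
rel = InnerProductRelation(in_dim=8, hidden_dim=32, feat_dim=16)
queries = torch.randn(4, 8)        # 4 queries
candidates = torch.randn(4, 5, 8)  # 5 candidates per query
scores = rel(queries.unsqueeze(1).expand(-1, 5, -1), candidates)  # (4, 5)
weights = scores.softmax(dim=-1)   # attention weights
```

In this reading, the query and key maps of Transformer attention play the roles of the two feature maps, and the softmax over inner-product scores implements the soft retrieval of maximal elements that the abstract's preorder result describes.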

References (46)
  1. “Relational Convolutional Networks: A Framework for Learning Representations of Hierarchical Relations” arXiv, 2023 arXiv:2310.03240 [cs]
  2. “Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers” In 12th International Conference on Learning Representations, 2024
  3. Nachman Aronszajn “Theory of reproducing kernels” In Transactions of the American Mathematical Society 68.3, 1950, pp. 337–404
  4. Francis Bach “Breaking the Curse of Dimensionality with Convex Neural Networks” arXiv, 2016 arXiv:1412.8690 [cs, math, stat]
  5. “Neural Networks for Fingerprint Recognition” In Neural Computation 5.3 MIT Press, 1993, pp. 402–418
  6. A.R. Barron “Universal Approximation Bounds for Superpositions of a Sigmoidal Function” In IEEE Transactions on Information Theory 39.3, 1993, pp. 930–945 DOI: 10.1109/18.256500
  7. “Signature Verification Using a ‘Siamese’ Time Delay Neural Network” In Advances in Neural Information Processing Systems 6, 1993
  8. “Error bounds for approximation with neural networks” In Journal of Approximation Theory 112.2, 2001, pp. 235–250
  9. S. Chopra, R. Hadsell and Y. LeCun “Learning a similarity metric discriminatively, with application to face verification” In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 2005, pp. 539–546 DOI: 10.1109/CVPR.2005.202
  10. “Scaling Instruction-Finetuned Language Models” arXiv, 2022 arXiv:2210.11416 [cs]
  11. G. Cybenko “Approximation by Superpositions of a Sigmoidal Function” In Mathematics of Control, Signals, and Systems 2.4, 1989, pp. 303–314 DOI: 10.1007/BF02551274
  12. Gerard Debreu “Representation of a Preference Ordering by a Numerical Function” In Decision processes 3 Wiley New York, 1954, pp. 159–165
  13. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018 arXiv:1810.04805
  14. Linhao Dong, Shuang Xu and Bo Xu “Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, 2018, pp. 5884–5888
  15. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”, 2020 arXiv:2010.11929
  16. Alex Graves, Greg Wayne and Ivo Danihelka “Neural Turing Machines” arXiv, 2014 DOI: 10.48550/arXiv.1410.5401
  17. “Hybrid Computing Using a Neural Network with Dynamic External Memory” In Nature 538.7626 Nature Publishing Group UK London, 2016, pp. 471–476
  18. Jean-Yves Jaffray “Existence of a Continuous Utility Function: An Elementary Proof” In Econometrica 43.5/6, 1975, pp. 981 DOI: 10.2307/1911340
  19. “On Neural Architecture Inductive Biases for Relational Tasks”, 2022 DOI: 10.48550/arXiv.2206.05056
  20. Gregory Koch, Richard Zemel and Ruslan Salakhutdinov “Siamese Neural Networks for One-Shot Image Recognition” In ICML Deep Learning Workshop 2 Lille, 2015
  21. Kevin J. Lang “A Time-Delay Neural Network Architecture for Speech Recognition” In Technical Report Carnegie-Mellon University, 1988
  22. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022
  23. “Object-Centric Learning with Slot Attention” arXiv, 2020 DOI: 10.48550/arXiv.2006.15055
  24. V. Maiorov “Approximation by neural networks and learning theory” In Journal of Complexity 22.1, 2006, pp. 102–117
  25. Y. Makovoz “Uniform approximation by neural networks” In Journal of Approximation Theory 95.2, 1998, pp. 215–228
  26. J. Mercer “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations” In Philosophical Transactions of the Royal Society of London. Series A 209 The Royal Society, 1909, pp. 415–446 JSTOR: https://www.jstor.org/stable/91043
  27. Charles A. Micchelli, Yuesheng Xu and Haizhang Zhang “Universal Kernels” In Journal of Machine Learning Research 7.95, 2006, pp. 2651–2667 URL: http://jmlr.org/papers/v7/micchelli06a.html
  28. OpenAI “GPT-4 Technical Report” arXiv, 2023 arXiv:2303.08774 [cs]
  29. P. P. Petrushev “Approximation by ridge functions and neural networks” In SIAM Journal on Mathematical Analysis 30.1, 1998, pp. 155–189
  30. A. Pinkus “Approximation theory of the MLP model in neural networks” In Acta Numerica 8, 1999, pp. 143–195
  31. “Why and When Can Deep-but Not Shallow-Networks Avoid the Curse of Dimensionality: A Review” In International Journal of Automation and Computing 14.5, 2017, pp. 503–519 DOI: 10.1007/s11633-017-1054-2
  32. “Neural Episodic Control” In International Conference on Machine Learning PMLR, 2017, pp. 2827–2836
  33. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” In The Journal of Machine Learning Research 21.1 JMLR.org, 2020, pp. 5485–5551
  34. David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams “Learning Representations by Back-Propagating Errors” In Nature 323.6088 Nature Publishing Group UK London, 1986, pp. 533–536
  35. “Relational Recurrent Neural Networks” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018
  36. Caroline E. Seely “Non-Symmetric Kernels of Positive Type” In The Annals of Mathematics 20.3, 1919, pp. 172–176 DOI: 10.2307/1967866
  37. Hongwei Sun “Mercer Theorem for RKHS on Noncompact Sets” In Journal of Complexity 21.3, 2005, pp. 337–349 DOI: 10.1016/j.jco.2004.09.002
  38. “Alpaca: A Strong, Replicable Instruction-Following Model” Stanford Center for Research on Foundation Models, 2023
  39. “Llama 2: Open Foundation and Fine-Tuned Chat Models” arXiv, 2023 DOI: 10.48550/arXiv.2307.09288
  40. “Attention Is All You Need” In Advances in Neural Information Processing Systems 30, 2017
  41. “Graph Attention Networks” arXiv, 2017 arXiv:1710.10903
  42. Taylor W. Webb, Ishan Sinha and Jonathan D. Cohen “Emergent Symbols through Binding in External Memory”, 2021 DOI: 10.48550/arXiv.2012.14601
  43. Matthew A. Wright and Joseph E. Gonzalez “Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines”, 2021 arXiv:2106.01506 [cs.LG]
  44. “An Explicit Neural Network Construction for Piecewise Constant Function Approximation” arXiv, 2018 arXiv:1808.07390
  45. “Deep Reinforcement Learning with Relational Inductive Biases” In International Conference on Learning Representations, 2018
  46. Haizhang Zhang, Yuesheng Xu and Jun Zhang “Reproducing Kernel Banach Spaces for Machine Learning” In 2009 International Joint Conference on Neural Networks Atlanta, GA, USA: IEEE, 2009, pp. 3520–3527 DOI: 10.1109/IJCNN.2009.5179093