Inverse distance weighting attention (2310.18805v2)
Published 28 Oct 2023 in cs.LG
Abstract: We report the effects of replacing the scaled dot-product (within softmax) attention with the negative log of the Euclidean distance. This form of attention simplifies to inverse distance weighting interpolation. Used in simple one-hidden-layer networks trained with vanilla cross-entropy loss on classification problems, it tends to produce a key matrix containing prototypes and a value matrix containing the corresponding logits. We also show that the resulting interpretable networks can be augmented with manually constructed prototypes to perform low-impact handling of special cases.
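The simplification mentioned in the abstract can be checked numerically: applying softmax to scores of the form -log(d), where d is the Euclidean distance between a query and a key, gives weights proportional to 1/d, which is inverse distance weighting. The following is a minimal NumPy sketch of that identity, not the paper's implementation. The function names, the `eps` guard against log(0), and the use of a distance exponent of 1 are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(queries, keys):
    # Euclidean distance between every query and every key: shape (n_queries, n_keys)
    diff = queries[:, None, :] - keys[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def idw_attention(queries, keys, values, eps=1e-9):
    """Softmax attention with scores = -log(Euclidean distance).

    eps is an illustrative assumption that keeps log(0) out of the scores
    when a query coincides with a key.
    """
    dist = pairwise_distances(queries, keys)
    scores = -np.log(dist + eps)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

def idw_weights_direct(queries, keys, eps=1e-9):
    # Classic inverse distance weighting: normalize 1/d across keys
    inv = 1.0 / (pairwise_distances(queries, keys) + eps)
    return inv / inv.sum(axis=1, keepdims=True)

# Usage: the two weightings agree up to floating-point error.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))   # 4 queries in 3 dimensions
K = rng.normal(size=(5, 3))   # 5 keys (prototypes, in the abstract's terms)
V = rng.normal(size=(5, 2))   # 5 value rows (logits for 2 classes)
out, w = idw_attention(Q, K, V)
print(np.allclose(w, idw_weights_direct(Q, K)))
```

Under these assumptions, a query that lands exactly on a key receives nearly all of the weight, so the output is approximately that key's value row. This is consistent with the abstract's reading of keys as prototypes and values as their logits.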