Rethinking Attention: Polynomial Alternatives to Softmax in Transformers (2410.18613v2)
Abstract: This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes training. Motivated by this, we explore alternative activations, specifically polynomials, that achieve a similar regularization effect. Our theoretical analysis shows that certain polynomials can serve as effective substitutes for softmax, achieving strong performance across transformer applications despite violating softmax's typical properties of positivity, normalization, and sparsity. Extensive experiments support these findings, offering a new perspective on attention mechanisms.
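The substitution the abstract describes, replacing the row-wise softmax with a polynomial activation while keeping the scale of the attention matrix under control, can be sketched as follows. This is a minimal illustration rather than the paper's exact formulation: the cubic polynomial and the division by sequence length used here to bound the Frobenius norm of the attention matrix are assumptions made for the example.

```python
import torch

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: rows of the attention
    # matrix are positive and sum to one.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def polynomial_attention(q, k, v, degree=3):
    # Hypothetical sketch: apply an element-wise polynomial to the scaled
    # scores instead of softmax. The resulting rows are no longer positive,
    # normalized, or sparse. Dividing by the sequence length n is one simple
    # (assumed) way to keep the Frobenius norm of the attention matrix from
    # growing with n; the paper's own scaling may differ.
    n, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    attn = scores ** degree / n
    return attn @ v

# Usage: shapes are (batch, heads, seq_len, head_dim).
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
print(softmax_attention(q, k, v).shape)     # torch.Size([2, 4, 16, 32])
print(polynomial_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 32])
```

Both variants are drop-in replacements for the attention operator inside a transformer block; only the activation applied to the score matrix changes.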