Rethinking Attention: Polynomial Alternatives to Softmax in Transformers (2410.18613v2)

Published 24 Oct 2024 in cs.LG, cs.CV, and stat.ML

Abstract: This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes training. Motivated by this, we explore alternative activations, specifically polynomials, that achieve a similar regularization effect. Our theoretical analysis shows that certain polynomials can serve as effective substitutes for softmax, achieving strong performance across transformer applications despite violating softmax's typical properties of positivity, normalization, and sparsity. Extensive experiments support these findings, offering a new perspective on attention mechanisms.

Summary

  • The paper reveals that softmax’s key contribution lies in its implicit regularization of the attention matrix’s Frobenius norm, promoting stable training.
  • Empirical results show that scaled polynomial activations, such as cubic forms, rival or exceed softmax performance on image classification and object detection tasks.
  • These insights challenge the traditional probability-based view, opening avenues for norm-centric design of advanced attention mechanisms in transformers.

Rethinking Softmax: Self-Attention with Polynomial Activations

The paper entitled "Rethinking Softmax: Self-Attention with Polynomial Activations" presents a theoretical and empirical examination of the canonical role of softmax in transformer architectures, contesting the prevailing assumption that its primary success stems from generating a probability distribution for attention allocation. The authors argue that the effectiveness of softmax is more deeply rooted in its implicit regularization of the Frobenius norm of the attention matrix during training.

Theoretical Insights

The foundation of the paper lies in uncovering the theoretical underpinnings of softmax's regularization properties. The authors present a theorem showing that the Frobenius norm of the self-attention matrix under softmax grows sub-linearly with respect to the input sequence length. This bounded growth is mirrored in the gradient norms, indicating the stable training dynamics on which gradient-based optimization depends. The paper extends this insight by deriving polynomial activations that regularize the Frobenius norm in a similar way, showing that polynomial mappings of the form $\frac{1}{\sqrt{N}}x^p$ achieve comparable norm bounds.
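To make the scaling argument concrete, the following minimal sketch (not taken from the paper's code; the shapes, random inputs, and cubic exponent are illustrative assumptions) compares the Frobenius norm of a softmax attention matrix with that of a polynomial alternative of the form $\frac{1}{\sqrt{N}}x^p$ as the sequence length grows.

```python
# Illustrative sketch: how the Frobenius norm of the attention matrix behaves
# under softmax versus a polynomial activation x^3 / sqrt(N), for growing
# sequence length N. Random Gaussian queries/keys are an assumption here.
import torch

def attention_matrix(q, k, activation):
    # Standard scaled dot-product scores followed by the chosen activation.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (N, N) pre-activation scores
    return activation(scores)

def softmax_act(s):
    return torch.softmax(s, dim=-1)               # rows sum to 1

def poly_act(s):
    return s ** 3 / s.shape[-1] ** 0.5            # x^3 / sqrt(N), no normalization

torch.manual_seed(0)
for n in (64, 256, 1024):
    q, k = torch.randn(n, 64), torch.randn(n, 64)
    a_soft = attention_matrix(q, k, softmax_act)
    a_poly = attention_matrix(q, k, poly_act)
    print(f"N={n:5d}  softmax Frobenius norm={a_soft.norm(p='fro').item():.3f}  "
          f"polynomial Frobenius norm={a_poly.norm(p='fro').item():.3f}")
```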

Empirical Evaluation

The empirical investigations focused on validating the theoretical claims by benchmarking these polynomial activations against softmax across several transformer tasks. The experiments highlighted that polynomial forms, particularly $\frac{1}{16}x^3$ and $\frac{1}{16}x$, not only rival softmax in terms of accuracy but also surpass it under certain conditions. These findings were consistent across multiple architectures including ViT, DeiT, Swin Transformer, and XCiT, evaluated on standard datasets like ImageNet-1k and COCO.
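As an illustration of how such an activation could be swapped in, the sketch below implements a single-head attention block in which softmax is replaced by a scaled cubic activation of the form $\frac{1}{16}x^3$; the module name, layer layout, and hyperparameters are assumptions for illustration rather than the authors' implementation.

```python
# Illustrative sketch, assuming a standard single-head scaled dot-product
# attention block; the only change relative to the usual formulation is that
# the softmax is replaced by the scaled polynomial (1/16) * x**3.
import torch
import torch.nn as nn

class PolynomialAttention(nn.Module):
    def __init__(self, dim, scale=1.0 / 16.0, power=3):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = scale
        self.power = power
        self.dim = dim

    def forward(self, x):                            # x: (batch, seq_len, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / self.dim ** 0.5
        attn = self.scale * scores ** self.power     # (1/16) * x**3 in place of softmax
        return self.proj(attn @ v)

# Usage: drop this block in where a softmax attention layer would normally sit.
layer = PolynomialAttention(dim=64)
out = layer(torch.randn(2, 128, 64))                 # -> (2, 128, 64)
```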

In tasks such as image classification and object detection, these polynomial activations were shown to be highly effective when appropriately scaled. The paper offers visual insights into how these activations alter the attention landscapes, demonstrating distinct patterns that challenge our traditional understanding shaped by softmax.

Implications and Future Directions

The implications of this paper are pivotal: it redefines the rationale behind attention mechanisms' success, urging a shift from probability-centric interpretations towards norm-based regularization perspectives. The results suggest that transformer architectures might not be inherently dependent on probabilistic attention distributions, opening pathways to design novel activation functions that can further enhance model robustness and efficiency.

Future research could explore extending these theoretical constructs beyond self-attention to other types of attention mechanisms. Additionally, investigating the implications of these findings in other domains such as NLP, and further refining polynomial activations for optimized performance, would form a compelling continuation of this work.

In conclusion, this paper provides a rigorous theoretical examination and an insightful empirical study challenging established norms of softmax-based self-attention, paving the way for innovative attention mechanisms in AI systems.
