SimA: A Simple and Effective Softmax-Free Attention Mechanism for Vision Transformers
Introduction
Vision transformers have emerged as a competitive alternative to traditional convolutional neural networks (CNNs) for various computer vision tasks. Despite their strong performance, the deployment of vision transformers is limited by their computational demands, particularly the Softmax operation in their attention mechanisms. In this paper, the authors introduce SimA, a novel Softmax-free attention mechanism designed to alleviate this computational bottleneck while maintaining, and in some cases enhancing, model performance. SimA uses a simple normalization based on the ℓ₁-norm that preserves competition among tokens without the computationally intensive Softmax operation.
Methodology
SimA, short for Simple Attention, replaces the Softmax layer in the attention mechanism with a straightforward normalization step: the query and key matrices are each normalized by their ℓ₁-norms. This normalization bounds each token's contribution to the attention map, preserving the competition among tokens that Softmax normally provides while avoiding its cost. Moreover, because the resulting attention is a plain sequence of matrix products, the order of computation can be chosen dynamically based on the relative sizes of the token set and the channel dimension.
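To make the mechanism concrete, here is a minimal PyTorch sketch of the SimA idea: Q and K are each ℓ₁-normalized (assumed here to be per channel, across the token dimension) and attention reduces to plain matrix products with no Softmax. The tensor shapes, normalization axis, and epsilon are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def sima_attention(q, k, v, eps=1e-6):
    """q, k, v: tensors of shape (batch, tokens, channels)."""
    # l1-normalize each channel of Q and K across the token dimension
    # (an assumed axis), so each token's contribution stays bounded
    # without a Softmax.
    q = q / (q.abs().sum(dim=-2, keepdim=True) + eps)
    k = k / (k.abs().sum(dim=-2, keepdim=True) + eps)
    # Plain matrix products replace Softmax(Q K^T / sqrt(d)) V.
    scores = q @ k.transpose(-2, -1)   # (batch, tokens, tokens)
    return scores @ v

q, k, v = (torch.randn(2, 196, 64) for _ in range(3))
print(sima_attention(q, k, v).shape)  # torch.Size([2, 196, 64])
```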
Innovations in Attention Mechanism
SimA introduces several innovative features in its design:
- Softmax-Free Normalization: By normalizing the query and key matrices with their ℓ₁-norms, SimA dispenses with the Softmax operation, significantly reducing computational overhead.
- Dynamic Computational Strategy: Because matrix multiplication is associative, SimA can choose between two computational pathways at test time, optimizing cost based on whether the number of tokens or the channel dimension is larger (see the sketch after this list).
- Stability and Efficiency: SimA's normalization offers numerical stability, enabling the efficient use of half-precision floating-point arithmetic. This property is particularly beneficial for deployment on resource-constrained devices.
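The dynamic strategy follows directly from associativity: with no Softmax between the matrix products, (Q K̂ᵀ) V equals Q (K̂ᵀ V), so the cheaper ordering can be picked at test time. The sketch below illustrates this choice; the function name and the N ≤ D switching rule are illustrative assumptions, not the authors' code.

```python
import torch

def sima_attention_dynamic(q_hat, k_hat, v):
    """q_hat, k_hat: l1-normalized queries/keys of shape (batch, N, D).

    Both branches return the same result; only the cost differs.
    """
    n, d = q_hat.shape[-2], q_hat.shape[-1]
    if n <= d:
        # Few tokens relative to channels: (Q K^T) V costs O(N^2 D).
        return (q_hat @ k_hat.transpose(-2, -1)) @ v
    # Many tokens relative to channels: Q (K^T V) costs O(N D^2),
    # i.e. linear in the number of tokens.
    return q_hat @ (k_hat.transpose(-2, -1) @ v)
```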
Performance and Application
SimA was empirically tested across multiple state-of-the-art vision transformer architectures, including DeiT, XCiT, and CvT, achieving accuracy on par with their Softmax-based counterparts. The experiments spanned image classification on ImageNet, object detection and segmentation on MS-COCO, and self-supervised learning tasks. Notably, moving from multi-head to single-head attention within the SimA framework had a negligible impact on accuracy, further streamlining the attention mechanism.
Implications and Future Directions
The development of SimA has both theoretical and practical implications for the field of AI and machine learning. Theoretically, it challenges the conventional reliance on Softmax for attention mechanisms, providing a simpler and equally effective alternative. Practically, SimA presents an opportunity for the broader adoption of transformer models, particularly in settings where computational resources are limited.
The authors speculate on several future developments stemming from this research. These include further exploration of SimA's potential in different transformer-based models and applications, deeper investigation into the removal of Softmax and its effects on model interpretability and stability, and the adaptation of SimA for use in other forms of neural networks beyond vision transformers.
Conclusion
SimA represents a significant step forward in the optimization of attention mechanisms within transformer models. Through its Softmax-free design and dynamic computational flexibility, SimA offers an efficient, stable, and scalable solution that could potentially expand the applicability of transformers to a wider range of platforms and applications. The simplicity and effectiveness of SimA are poised to inspire further innovations in the field, driving towards more computationally efficient AI models without compromising performance.