An Analysis of SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
The paper proposes a novel approach to improving transformer efficiency for real-time mobile vision applications by introducing an efficient additive attention mechanism. This research addresses a significant issue in deploying vision transformers (ViTs) on resource-constrained mobile devices, where the quadratic computational complexity of traditional self-attention impedes real-time inference.
Key Contributions
- Efficient Additive Attention: The paper introduces an efficient additive attention mechanism that replaces the expensive query-key matrix multiplication with linear element-wise operations. The mechanism also removes explicit key-value interactions from the traditional query-key-value (QKV) pattern in favor of a simple linear transformation. These changes reduce computational complexity to linear in the number of tokens and allow attention to be included at all network stages, enhancing both speed and model robustness (see the first code sketch after this list).
- Hybrid Architecture: The authors propose a hybrid model, SwiftFormer, which combines convolutional blocks and efficient additive attention across all scales. Unlike previous hybrid models that reserve self-attention for the later, low-resolution stages, SwiftFormer employs attention consistently across all stages, enabling effective local-global feature representation at every resolution (see the stage sketch after this list).
- Performance Metrics: The SwiftFormer models demonstrate strong performance. The small variant achieves 78.5% top-1 ImageNet-1K accuracy with a latency of just 0.8 ms on the iPhone 14, outperforming counterparts such as MobileViT-v2 in both accuracy and speed. The larger variants further raise accuracy while maintaining latency efficiency, underscoring the practical applicability of SwiftFormer in mobile environments.
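
To make the attention mechanism concrete, below is a minimal PyTorch-style sketch of an efficient additive attention block as described above: token interactions are summarized by a learned global query vector, and the query-key interaction is an element-wise multiplication, so the cost grows linearly with the number of tokens. The layer names (to_query, to_key, w_a, proj, out) and the normalization details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientAdditiveAttention(nn.Module):
    """Sketch of efficient additive attention: global context is captured with a
    learned global query vector and element-wise multiplications, giving
    linear O(n*d) cost rather than the quadratic O(n^2*d) of self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)
        self.to_key = nn.Linear(dim, dim)
        self.w_a = nn.Parameter(torch.randn(dim, 1))  # learned attention vector
        self.proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        q = F.normalize(self.to_query(x), dim=-1)
        k = F.normalize(self.to_key(x), dim=-1)

        # Per-token scores from a single matrix-vector product: (batch, n, 1)
        attn = (q @ self.w_a) * self.scale
        attn = attn.softmax(dim=1)

        # Pool queries into one global query vector: (batch, 1, dim)
        q_global = (attn * q).sum(dim=1, keepdim=True)

        # Query-key interaction via broadcast element-wise multiplication;
        # no n x n attention matrix and no explicit key-value interaction.
        x = self.proj(q_global * k) + q
        return self.out(x)
```

Because no n × n attention matrix is ever formed, a block like this can be placed even in early, high-resolution stages where standard self-attention would be prohibitively expensive.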
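
The claim that attention appears consistently at every stage can be illustrated with a hypothetical stage builder. ConvEncoder, GlobalContextBlock, and make_stage are invented names for this sketch, and it reuses the EfficientAdditiveAttention module above; it is not the authors' exact block design.

```python
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Local feature mixing with a depthwise 3x3 followed by pointwise 1x1 convolutions."""

    def __init__(self, dim: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, kernel_size=1),  # pointwise
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual keeps the block easy to stack


class GlobalContextBlock(nn.Module):
    """Flattens the feature map, applies efficient additive attention, reshapes back."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = EfficientAdditiveAttention(dim)  # defined in the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = tokens + self.attn(tokens)         # linear-cost global mixing
        return tokens.transpose(1, 2).reshape(b, c, h, w)


def make_stage(dim: int, depth: int) -> nn.Sequential:
    """Each stage ends with an attention block, so global context is modeled at
    every resolution rather than only in the final, low-resolution stage."""
    return nn.Sequential(*[ConvEncoder(dim) for _ in range(depth)],
                         GlobalContextBlock(dim))
```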
Empirical Validation
The empirical results validate the proposed method's efficacy across multiple vision tasks, including image classification, object detection, and semantic segmentation. SwiftFormer models outperform other lightweight and hybrid models, such as MobileNetV3 and EfficientFormer, with especially notable gains in tasks that require real-time inference. The architecture achieves state-of-the-art accuracy-latency trade-offs by reducing computational cost while preserving or improving accuracy.
Implications and Future Directions
The implications of this research are significant for real-world mobile applications, where computational resources are limited and efficiency is paramount. The refinement of additive attention could reshape mobile deep learning models, providing an effective balance between model complexity, speed, and accuracy.
Future work could explore hardware-specific optimizations. Given the linear complexity of efficient additive attention, SwiftFormer could be adapted to a broader range of mobile architectures without sacrificing real-time inference. Extending efficient additive attention to other domains, such as natural language processing on mobile devices, may also prove fruitful.
The paper’s results and methodology point to a significant evolution in the design of transformer-based models, paving the way for broader applications in areas that demand real-time data processing with minimal computational overhead. Researchers and practitioners can build on this foundation by exploring tailored optimizations and extending efficient additive attention beyond vision applications.