Separable Self-attention for Mobile Vision Transformers
The paper introduces an approach to addressing the computational inefficiency of Mobile Vision Transformers (MobileViT), specifically focusing on reducing latency through a separable self-attention mechanism. MobileViT models have demonstrated state-of-the-art performance across several mobile vision tasks, including classification and detection. However, the efficiency bottleneck in these models lies in the multi-headed self-attention (MHA) component of transformers, which requires O(k²) time with respect to the number of tokens (or patches) k.
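As a rough illustration (not a measurement from the paper), the snippet below times one pass of standard scaled dot-product attention for growing token counts k; the explicit k × k attention matrix is what makes cost and memory grow quadratically, which is the bottleneck the authors target.

```python
import time
import torch

def mha_attention_time(k: int, d: int = 64) -> float:
    """Time one standard self-attention pass for k tokens of dimension d.

    Illustrative only: the k x k attention matrix makes compute and memory
    grow quadratically with the token count k.
    """
    q = torch.randn(1, k, d)
    key = torch.randn(1, k, d)
    v = torch.randn(1, k, d)
    t0 = time.perf_counter()
    attn = torch.softmax(q @ key.transpose(1, 2) / d ** 0.5, dim=-1)  # (1, k, k)
    _ = attn @ v                                                      # (1, k, d)
    return time.perf_counter() - t0

for k in (256, 512, 1024, 2048):
    print(f"k={k:4d}: {mha_attention_time(k) * 1e3:.2f} ms")
```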
Key Contributions
- Separable Self-attention: The primary contribution of this work is a separable self-attention mechanism with linear time complexity, i.e., O(k). It replaces the costly batch-wise matrix multiplications of MHA with more efficient element-wise operations, making it well suited for resource-constrained devices. Separable self-attention uses a single latent token to compute context scores, which re-weight the input tokens to encode global information efficiently (a minimal sketch appears after this list).
- MobileViTv2: By integrating the separable self-attention into the MobileViT architecture, the authors present MobileViTv2. This new model achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming its predecessor by approximately 1% and running 3.2 times faster on mobile devices. The MobileViTv2 architecture scales efficiently across different complexities by using a width multiplier.
- Comparative Analysis: The separable self-attention was compared against standard MHA and Linformer-based self-attention. Results highlight the efficiency of the proposed method at both the module and architecture level, showing significant speed-ups without compromising accuracy.
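The sketch below illustrates the separable self-attention idea in a single-head form. It follows the paper's description (scalar context scores, a score-weighted context vector, and an element-wise broadcast onto the value branch), but the layer names (`to_i`, `to_k`, `to_v`) and exact details such as activation placement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Minimal sketch of separable self-attention with O(k) complexity.

    Input: x of shape (batch, k, d) with k tokens of dimension d.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.to_i = nn.Linear(dim, 1)    # projects each token to a scalar logit
        self.to_k = nn.Linear(dim, dim)  # key branch
        self.to_v = nn.Linear(dim, dim)  # value branch
        self.out = nn.Linear(dim, dim)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Context scores: softmax over the k tokens of a scalar projection
        # (attention with respect to a single latent token).
        scores = torch.softmax(self.to_i(x), dim=1)                  # (B, k, 1)

        # Context vector: score-weighted sum of the key branch.
        context = (scores * self.to_k(x)).sum(dim=1, keepdim=True)   # (B, 1, d)

        # Broadcast the global context onto every value token element-wise,
        # avoiding the k x k attention matrix of standard MHA.
        out = torch.relu(self.to_v(x)) * context                     # (B, k, d)
        return self.out(out)

# Usage: 256 tokens of dimension 64
x = torch.randn(2, 256, 64)
attn = SeparableSelfAttention(64)
print(attn(x).shape)  # torch.Size([2, 256, 64])
```

Because the context vector is computed once per sample and then broadcast, both compute and memory scale linearly in the number of tokens k rather than quadratically.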
Experimental Validation
The paper provides a thorough experimental validation across various tasks:
- Object Classification: MobileViTv2 models outperformed existing transformer-based models and achieved performance levels comparable to CNN-based architectures while bridging the latency gap, particularly on resource-constrained devices.
- Semantic Segmentation and Object Detection: The integration of MobileViTv2 into standard architectures such as PSPNet and DeepLabv3 demonstrated efficient performance on ADE20k and PASCAL VOC datasets. For object detection on MS-COCO, MobileViTv2 showed competitive results with significantly fewer parameters and FLOPs.
Implications and Future Work
The development of separable self-attention suggests promising implications for deploying transformer models on resource-constrained devices, such as mobile phones. It lowers the computational burden, thus extending the practical application of vision transformers in real-time scenarios. The approach could potentially be adapted and extended to other transformer-based architectures, such as those used in natural language processing, to enhance performance on devices with limited resources.
Future work could explore further optimization of the separable self-attention mechanism, for example by investigating multiple latent tokens or alternative projection strategies to push efficiency further without sacrificing accuracy. Hardware-specific optimizations could also broaden the applicability of MobileViTv2 in diverse and constrained environments.
In summary, this work presents a significant step towards making vision transformers more viable for real-time applications on mobile and other resource-constrained platforms, opening up opportunities for expanded use and innovation in mobile computing.