Separable Self-attention for Mobile Vision Transformers (2206.02680v1)

Published 6 Jun 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}

Separable Self-attention for Mobile Vision Transformers

The paper introduces a novel approach to addressing the computational inefficiencies of Mobile Vision Transformers (MobileViT), specifically focusing on reducing latency through a separable self-attention mechanism. MobileViT models have demonstrated state-of-the-art performance across several mobile vision tasks, including classification and detection. However, the efficiency bottleneck in these models lies in the multi-headed self-attention (MHA) component of transformers, which traditionally requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$.
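
To make the quadratic term concrete, the following minimal PyTorch sketch (with illustrative tensor sizes, not the authors' code) computes standard single-head scaled dot-product self-attention; the $(k, k)$ score matrix is the source of both the $O(k^2)$ cost and the batch-wise matrix multiplications noted above.

```python
import torch

# Illustrative sizes: k tokens of embedding width d.
k, d = 256, 64
q = torch.randn(k, d)
key = torch.randn(k, d)
v = torch.randn(k, d)

scores = q @ key.T / d ** 0.5   # (k, k) attention matrix: the quadratic term in k
attn = scores.softmax(dim=-1)
out = attn @ v                  # (k, d) attended output
print(scores.shape)             # torch.Size([256, 256])
```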

Key Contributions

  1. Separable Self-attention: The primary contribution of this work is a separable self-attention mechanism with linear time complexity, $O(k)$. The approach replaces costly batch-wise matrix multiplication with more efficient element-wise operations, making it well-suited for resource-constrained devices. The separable self-attention uses a latent token to compute context scores, which re-weight the input tokens to encode global information efficiently (a minimal sketch follows this list).
  2. MobileViTv2: By integrating the separable self-attention into the MobileViT architecture, the authors present MobileViTv2. This new model achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming its predecessor by approximately 1% and running 3.2 times faster on mobile devices. The MobileViTv2 architecture scales efficiently across different complexities by using a width multiplier.
  3. Comparative Analysis: The separable self-attention was compared against traditional and Linformer-based self-attention. Results highlighted the efficiency of the proposed method both at module-level and architectural-level, showing significant improvements in speed without compromising accuracy.
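
The list above can be made concrete with a minimal PyTorch sketch of a separable self-attention layer whose cost is linear in the token count $k$: a $d \to 1$ projection yields per-token context scores, a score-weighted sum of projected keys yields a single context vector, and that vector modulates the values element-wise. The use of `nn.Linear` layers, the layer names, and other module details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Sketch of separable self-attention with cost linear in the token count k."""

    def __init__(self, d: int):
        super().__init__()
        self.to_scores = nn.Linear(d, 1)  # latent-token branch: d -> 1 per token
        self.to_key = nn.Linear(d, d)     # key branch
        self.to_value = nn.Linear(d, d)   # value branch
        self.proj = nn.Linear(d, d)       # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d) tokens/patches
        scores = F.softmax(self.to_scores(x), dim=1)    # (B, k, 1) context scores
        context = (scores * self.to_key(x)).sum(dim=1)  # (B, d) single context vector
        values = F.relu(self.to_value(x))               # (B, k, d)
        out = values * context.unsqueeze(1)             # element-wise broadcast, no (k, k) matrix
        return self.proj(out)                           # (B, k, d)

x = torch.randn(2, 196, 64)       # 2 images, 196 patches, width 64
y = SeparableSelfAttention(64)(x)
print(y.shape)                    # torch.Size([2, 196, 64])
```

No $k \times k$ matrix is ever formed: the most expensive steps are element-wise products and $d$-dimensional reductions, which is what makes the method suitable for resource-constrained hardware.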

Experimental Validation

The paper provides a thorough experimental validation across various tasks:

  • Object Classification: MobileViTv2 models outperformed existing transformer-based models and achieved performance levels comparable to CNN-based architectures while bridging the latency gap, particularly on resource-constrained devices.
  • Semantic Segmentation and Object Detection: The integration of MobileViTv2 into standard architectures such as PSPNet and DeepLabv3 demonstrated efficient performance on the ADE20K and PASCAL VOC datasets. For object detection on MS-COCO, MobileViTv2 showed competitive results with significantly fewer parameters and FLOPs (a backbone-loading sketch follows this list).
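
As a rough illustration of how such integrations are typically wired, the sketch below pulls multi-scale MobileViTv2 feature maps through the timm library. The variant name `mobilevitv2_100`, the `features_only` flag, and the input size are assumptions about a recent timm release (variant suffixes such as 050/100/200 roughly track the width multiplier mentioned earlier), not the authors' training setup.

```python
import timm
import torch

# Assumed: a recent timm release that ships MobileViTv2 variants and supports
# feature extraction; names and sizes here are illustrative.
backbone = timm.create_model("mobilevitv2_100", pretrained=False, features_only=True)

x = torch.randn(1, 3, 512, 512)
features = backbone(x)        # list of feature maps at increasing strides
for f in features:
    print(f.shape)
# A DeepLabv3 or PSPNet head would typically consume the deepest feature map.
```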

Implications and Future Work

The development of separable self-attention suggests promising implications for deploying transformer models on resource-constrained devices, such as mobile phones. It lowers the computational burden, thus extending the practical application of vision transformers in real-time scenarios. The approach could potentially be adapted and extended to other transformer-based architectures, such as those used in natural language processing, to enhance performance on devices with limited resources.

Future developments could explore further optimization of the separable self-attention mechanism, including the use of multiple latent tokens or alternative projection strategies, to push efficiency further without sacrificing accuracy. There is also potential for hardware-specific optimizations that could broaden the applicability of MobileViTv2 in diverse and constrained environments.

In summary, this work presents a significant step towards making vision transformers more viable for real-time applications on mobile and other resource-constrained platforms, opening up opportunities for expanded use and innovation in mobile computing.

Authors (2)
  1. Sachin Mehta (48 papers)
  2. Mohammad Rastegari (57 papers)
Citations (185)