Separable Self-attention for Mobile Vision Transformers
The paper introduces an approach to addressing the computational inefficiency of Mobile Vision Transformers (MobileViT), specifically focusing on reducing latency through a separable self-attention mechanism. MobileViT models have demonstrated state-of-the-art performance across several mobile vision tasks, including classification and detection. However, the efficiency bottleneck in these models lies in the multi-headed self-attention (MHA) component of transformers, which requires O(k²) time with respect to the number of tokens (or patches) k.
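As a rough illustration (not a measurement from the paper), the snippet below times one pass of standard scaled dot-product attention for growing token counts k; the explicit k × k attention matrix is what makes cost and memory grow quadratically, which is the bottleneck the authors target.

```python
import time
import torch

def mha_attention_time(k: int, d: int = 64) -> float:
    """Time one standard self-attention pass for k tokens of dimension d.

    Illustrative only: the k x k attention matrix makes compute and memory
    grow quadratically with the token count k.
    """
    q = torch.randn(1, k, d)
    key = torch.randn(1, k, d)
    v = torch.randn(1, k, d)
    t0 = time.perf_counter()
    attn = torch.softmax(q @ key.transpose(1, 2) / d ** 0.5, dim=-1)  # (1, k, k)
    _ = attn @ v                                                      # (1, k, d)
    return time.perf_counter() - t0

for k in (256, 512, 1024, 2048):
    print(f"k={k:4d}: {mha_attention_time(k) * 1e3:.2f} ms")
```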
Key Contributions
- Separable Self-attention: The primary contribution of this work is a separable self-attention mechanism with linear time complexity, i.e., O(k). It replaces the costly batch-wise matrix multiplications of MHA with more efficient element-wise operations, making it well suited for resource-constrained devices. Separable self-attention uses a single latent token to compute context scores, which re-weight the input tokens to encode global information efficiently (a minimal sketch appears after this list).
- MobileViTv2: By integrating the separable self-attention into the MobileViT architecture, the authors present MobileViTv2. This new model achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming its predecessor by approximately 1% and running 3.2 times faster on mobile devices. The MobileViTv2 architecture scales efficiently across different complexities by using a width multiplier.
- Comparative Analysis: The separable self-attention was compared against standard MHA and Linformer-based self-attention. Results highlight the efficiency of the proposed method at both the module and architecture level, showing significant speed-ups without compromising accuracy.
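The sketch below illustrates the separable self-attention idea in a single-head form. It follows the paper's description (scalar context scores, a score-weighted context vector, and an element-wise broadcast onto the value branch), but the layer names (`to_i`, `to_k`, `to_v`) and exact details such as activation placement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Minimal sketch of separable self-attention with O(k) complexity.

    Input: x of shape (batch, k, d) with k tokens of dimension d.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.to_i = nn.Linear(dim, 1)    # projects each token to a scalar logit
        self.to_k = nn.Linear(dim, dim)  # key branch
        self.to_v = nn.Linear(dim, dim)  # value branch
        self.out = nn.Linear(dim, dim)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Context scores: softmax over the k tokens of a scalar projection
        # (attention with respect to a single latent token).
        scores = torch.softmax(self.to_i(x), dim=1)                  # (B, k, 1)

        # Context vector: score-weighted sum of the key branch.
        context = (scores * self.to_k(x)).sum(dim=1, keepdim=True)   # (B, 1, d)

        # Broadcast the global context onto every value token element-wise,
        # avoiding the k x k attention matrix of standard MHA.
        out = torch.relu(self.to_v(x)) * context                     # (B, k, d)
        return self.out(out)

# Usage: 256 tokens of dimension 64
x = torch.randn(2, 256, 64)
attn = SeparableSelfAttention(64)
print(attn(x).shape)  # torch.Size([2, 256, 64])
```

Because the context vector is computed once per sample and then broadcast, both compute and memory scale linearly in the number of tokens k rather than quadratically.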
Experimental Validation
The paper provides a thorough experimental validation across various tasks:
- Object Classification: MobileViTv2 models outperformed existing transformer-based models and achieved performance levels comparable to CNN-based architectures while bridging the latency gap, particularly on resource-constrained devices.
- Semantic Segmentation and Object Detection: The integration of MobileViTv2 into standard architectures such as PSPNet and DeepLabv3 demonstrated efficient performance on ADE20k and PASCAL VOC datasets. For object detection on MS-COCO, MobileViTv2 showed competitive results with significantly fewer parameters and FLOPs.
Implications and Future Work
The development of separable self-attention suggests promising implications for deploying transformer models on resource-constrained devices, such as mobile phones. It lowers the computational burden, thus extending the practical application of vision transformers in real-time scenarios. The approach could potentially be adapted and extended to other transformer-based architectures, such as those used in natural language processing, to enhance performance on devices with limited resources.
Future work could explore further optimization of the separable self-attention mechanism, for example by investigating multiple latent tokens or alternative projection strategies to push efficiency further without sacrificing accuracy. Hardware-specific optimizations could also broaden the applicability of MobileViTv2 in diverse and constrained environments.
In summary, this work presents a significant step towards making vision transformers more viable for real-time applications on mobile and other resource-constrained platforms, opening up opportunities for expanded use and innovation in mobile computing.