MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features (2209.15159v2)

Published 30 Sep 2022 in cs.CV, cs.AI, and cs.LG

Abstract: MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simple and effective to create MobileViTv3-block, which addresses the scaling and simplifies the learning task. Our proposed MobileViTv3-block used to create MobileViTv3-XXS, XS and S models outperform MobileViTv1 on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpasses MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. Recently published MobileViTv2 architecture removes fusion block and uses linear complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy numbers on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets as compared to MobileViTv2. MobileViTv3-0.5 and MobileViTv3-0.75 outperforms MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K dataset. For segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU compared to MobileViTv2-1.0 on ADE20K dataset and PascalVOC2012 dataset respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3

An Expert Overview of MobileViTv3: A Mobile-Friendly Vision Transformer

The paper presents MobileViTv3, an advanced iteration of the MobileViT architecture that aims to enhance mobile vision tasks by effectively integrating convolutional neural networks (CNNs) with vision transformers (ViTs). Unlike its predecessor, MobileViTv1, MobileViTv3 uses a more streamlined feature-fusion block, reducing complexity while improving scalability and performance.

Introduction to the Architecture

MobileViT models are designed to run efficiently in resource-constrained environments such as mobile devices, balancing computational cost against model performance. MobileViTv3 achieves this by refining the fusion strategy within its network blocks: the authors redesign the fusion block found in MobileViTv1 to create the MobileViTv3 block, simplifying feature integration and easing the network's learning task. The resulting MobileViTv3-XXS, XS, and S models outperform the corresponding MobileViTv1 models on several benchmark datasets.

Core Contributions and Modifications

MobileViTv3 distinguishes itself by implementing four key modifications within its architecture (a code sketch of how they fit together follows the list):

  1. Replacement of the 3x3 Convolution with a 1x1 Convolution in the Fusion Block: This change simplifies fusion by combining features without mixing information from spatially adjacent locations, and it reduces both parameter count and computational cost as the model is scaled up.
  2. Fusion of Local and Global Features: By leveraging the similarity between the local features from the CNN path and the global features from the ViT path, MobileViTv3 concatenates and fuses these two feature maps, integrating the input features only at the final stage. This contrasts with MobileViTv1's strategy of fusing the input directly with the global features.
  3. Inclusion of a Residual Connection: Adding the block's input features to the fusion output gives the architecture the benefits of residual learning, a strategy known to ease optimization in deep networks.
  4. Depthwise Convolutions: Using a depthwise convolution in the local representation block further reduces model size and operation count while largely preserving accuracy.
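
The following is a minimal PyTorch sketch of how these four changes fit together in a single block. The channel counts, the SiLU activations, and the 1x1 convolution standing in for the unfold-transformer-fold global representation are simplifying assumptions for illustration, not the authors' implementation; the actual code is available in the linked repository.

```python
import torch
import torch.nn as nn


class MobileViTv3FusionSketch(nn.Module):
    """Illustrative sketch of a MobileViTv3-style block (not the authors' code)."""

    def __init__(self, channels: int):
        super().__init__()
        # (4) Local representation: depthwise 3x3 conv followed by a pointwise conv
        #     (channel counts simplified here for readability).
        self.local_rep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1, bias=False),
        )
        # Stand-in for the unfold -> transformer -> fold global representation.
        self.global_rep = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.SiLU(),
        )
        # (1) Fusion is a 1x1 conv (MobileViTv1 used a 3x3 conv here),
        # (2) applied to concatenated local + global features
        #     (MobileViTv1 concatenated input + global features instead).
        self.fusion = nn.Conv2d(2 * channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.local_rep(x)             # CNN-style local features
        global_feat = self.global_rep(local_feat)  # ViT-style global features (stand-in)
        fused = self.fusion(torch.cat([local_feat, global_feat], dim=1))
        # (3) Residual connection: add the block input to the fused output.
        return fused + x


if __name__ == "__main__":
    block = MobileViTv3FusionSketch(channels=64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

One consequence of modification 1 is visible directly in the fusion layer: a 1x1 kernel mapping 2C input channels to C output channels needs 2C·C weights, versus 9·2C·C for the 3x3 kernel used in MobileViTv1, which is what lets the block scale to wider models without a large parameter penalty.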

Empirical Evaluation

On datasets such as ImageNet-1K, ADE20K, COCO, and PascalVOC2012, MobileViTv3 models consistently outperform their predecessors and other comparable architectures. Notably, MobileViTv3-XXS and -XS improve Top-1 accuracy by 2% and 1.9%, respectively, over their MobileViTv1 counterparts at similar parameter budgets and computational costs, and these gains are achieved with a straightforward training regimen.

In segmentation tasks, MobileViTv3 also shows substantial gains, with a 2.07% improvement in mean IoU over MobileViTv2-1.0 on the ADE20K dataset. Object detection benefits as well, with the MobileViTv3-1.0 architecture offering a 0.5% mAP improvement over its predecessor.

Implications and Future Work

This research underscores the potential for hybrid CNN-ViT architectures to effectively serve vision tasks on edge devices without compromising on accuracy or computational efficiency. By refining the integration process between local and global features, MobileViTv3 sets a practical example for future research seeking to balance model complexity against real-world deployment constraints.

Future advancements may explore optimizing self-attention computational requirements even further, thereby improving the viability of ViTs on extremely resource-limited hardware. Moreover, further scaling opportunities and batch size adjustments could potentially yield additional performance improvements, as suggested by the authors' findings.

In conclusion, MobileViTv3 represents a practical step forward in designing sophisticated, yet efficient vision models for mobile devices, maintaining a delicate balance between accuracy, latency, and throughput. Its adaptable architecture and demonstrable performance gains make it a valuable contribution to the ongoing development of mobile-friendliness in deep learning applications.

Authors (2)
  1. Shakti N. Wadekar (4 papers)
  2. Abhishek Chaurasia (5 papers)
Citations (74)