TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
The paper introduces TopFormer, a novel vision transformer architecture designed to address the computational constraints of mobile devices in dense prediction tasks, particularly semantic segmentation. The goal is to outperform both traditional Convolutional Neural Networks (CNNs) and existing Vision Transformer (ViT) models at comparable or lower latency, yielding a better trade-off between accuracy and on-device cost.
Architecture and Methodology
TopFormer is built around a token pyramid produced by a stack of mobile-friendly convolutional layers inspired by MobileNetV2, which progressively downsample the high-resolution input image into multi-scale tokens. These tokens are pooled to a small common resolution and used as input to a Vision Transformer, preserving semantic richness while keeping the computational load low.
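The overall data flow can be summarized in a short PyTorch-style sketch. This is a minimal illustration of the pipeline, not the released implementation: the module names (TopFormerSketch and its sub-modules), the 1/64 pooling target, and the per-scale channel split follow the paper's description, while the concrete layer configurations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopFormerSketch(nn.Module):
    """Minimal sketch of the TopFormer forward pass (illustrative, not the official code)."""
    def __init__(self, token_pyramid, semantics_extractor, injection_modules, seg_head):
        super().__init__()
        self.token_pyramid = token_pyramid                 # MobileNetV2-style conv stages
        self.semantics_extractor = semantics_extractor     # stack of transformer blocks
        self.injection = nn.ModuleList(injection_modules)  # one Semantics Injection Module per scale
        self.seg_head = seg_head                           # lightweight convolutional head

    def forward(self, x):
        # 1. Token Pyramid Module: multi-scale local tokens (e.g. strides 4, 8, 16, 32).
        tokens = self.token_pyramid(x)                     # list of [B, C_i, H_i, W_i]

        # 2. Average-pool every scale to a small common resolution (1/64 of the input
        #    in the paper) and concatenate along channels, keeping the transformer input tiny.
        target = (x.shape[2] // 64, x.shape[3] // 64)
        pooled = torch.cat([F.adaptive_avg_pool2d(t, target) for t in tokens], dim=1)

        # 3. Semantics Extractor: transformer blocks produce scale-aware global semantics.
        semantics = self.semantics_extractor(pooled)

        # 4. Split the semantics back per scale and inject them into the local tokens.
        splits = semantics.split([t.shape[1] for t in tokens], dim=1)
        augmented = [sim(t, s) for sim, t, s in zip(self.injection, tokens, splits)]

        # 5. Segmentation head fuses the augmented tokens into the final prediction.
        return self.seg_head(augmented)
```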
Key components include:
- Token Pyramid Module: Built from lightweight MobileNetV2 blocks, this module generates multi-scale feature representations, which are average-pooled to a single small resolution and concatenated along the channel dimension to form a compact input for the subsequent transformer.
- Semantics Extractor: A stack of transformer blocks with Multi-Head Self-Attention (MHSA) that produces scale-aware global semantics. The blocks use batch normalization and a reduced number of channels for keys and queries, significantly decreasing the computational footprint (see the attention-block sketch after this list).
- Semantics Injection Module: Fuses the scale-aware global semantics back into the local tokens of each scale in a computationally efficient manner. A sigmoid attention map derived from the semantics gates the local tokens, which are then fused with the embedded semantics, yielding augmented tokens that carry both local detail and global context (see the injection sketch after this list).
- Segmentation Head: Upsamples the semantics-augmented tokens to a common resolution, fuses them, and produces the final segmentation map through a lightweight convolutional head, contributing to the model's accuracy-efficiency advantage over prior CNN and ViT designs.
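As referenced in the Semantics Extractor item, the following is a minimal sketch of one such attention block, assuming batch normalization throughout and a small key/query channel count. The class name LiteMHSA and the default values (key_dim=16, num_heads=8) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LiteMHSA(nn.Module):
    """Multi-head self-attention with reduced key/query channels and BatchNorm,
    operating on [B, C, H, W] feature maps (a sketch, not the official code)."""
    def __init__(self, dim, key_dim=16, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        # Keys and queries use far fewer channels than the values to cut FLOPs.
        self.to_q = nn.Sequential(nn.Conv2d(dim, key_dim * num_heads, 1, bias=False),
                                  nn.BatchNorm2d(key_dim * num_heads))
        self.to_k = nn.Sequential(nn.Conv2d(dim, key_dim * num_heads, 1, bias=False),
                                  nn.BatchNorm2d(key_dim * num_heads))
        self.to_v = nn.Sequential(nn.Conv2d(dim, dim, 1, bias=False),
                                  nn.BatchNorm2d(dim))
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 1, bias=False),
                                  nn.BatchNorm2d(dim))

    def forward(self, x):
        B, C, H, W = x.shape
        N = H * W
        q = self.to_q(x).reshape(B, self.num_heads, self.key_dim, N)
        k = self.to_k(x).reshape(B, self.num_heads, self.key_dim, N)
        v = self.to_v(x).reshape(B, self.num_heads, C // self.num_heads, N)
        # Attention over spatial positions; cheap because key_dim << C and N is small
        # after pooling to 1/64 resolution.
        attn = (q.transpose(-2, -1) @ k) * self.scale          # [B, heads, N, N]
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)   # [B, heads, C/heads, N]
        out = out.reshape(B, C, H, W)
        return self.proj(out)
```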
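The injection step itself is a small amount of arithmetic. The sketch below follows the description above (sigmoid gating of the local tokens, then fusion with the global semantics); the module and parameter names (SemanticsInjection, local_dim, semantic_dim, out_dim) and the bilinear upsampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsInjection(nn.Module):
    """Sketch of a Semantics Injection Module: local tokens are gated by a sigmoid
    attention map computed from the global semantics, then fused with those
    semantics (an illustrative re-implementation, not the released code)."""
    def __init__(self, local_dim, semantic_dim, out_dim):
        super().__init__()
        self.local_embed = nn.Sequential(nn.Conv2d(local_dim, out_dim, 1, bias=False),
                                         nn.BatchNorm2d(out_dim))
        self.semantic_gate = nn.Sequential(nn.Conv2d(semantic_dim, out_dim, 1, bias=False),
                                           nn.BatchNorm2d(out_dim))
        self.semantic_embed = nn.Sequential(nn.Conv2d(semantic_dim, out_dim, 1, bias=False),
                                            nn.BatchNorm2d(out_dim))

    def forward(self, local_tokens, semantics):
        # Upsample the low-resolution global semantics to the local token resolution.
        semantics = F.interpolate(semantics, size=local_tokens.shape[2:],
                                  mode='bilinear', align_corners=False)
        gate = torch.sigmoid(self.semantic_gate(semantics))   # sigmoid attention weights
        # Gate the local tokens and add the embedded global semantics.
        return self.local_embed(local_tokens) * gate + self.semantic_embed(semantics)
```

The augmented tokens from each scale can then be upsampled to a shared resolution, summed, and passed through the segmentation head to produce the per-pixel class scores.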
Experimental Results
TopFormer demonstrates substantial improvements over prior models across several datasets, including ADE20K, PASCAL Context, and COCO-Stuff, achieving higher mean Intersection over Union (mIoU) at lower latency and computational cost. Notably, on ADE20K, TopFormer surpasses MobileNetV3 by 5% in mIoU with reduced latency on an ARM-based mobile device. The tiny variant of TopFormer achieves real-time segmentation on such devices, underscoring its practical value in mobile contexts.
Implications and Future Directions
The proposed architecture has significant implications for mobile AI applications where efficiency is crucial. It sets a benchmark for balancing accuracy with resource constraints, paving the way for further exploration into lightweight transformers for various vision tasks. Subsequent research could extend the design to other dense prediction tasks such as object detection and refine model scaling across different device profiles.
In summary, TopFormer represents a significant stride in vision transformer design for mobile applications. By effectively leveraging the strengths of both CNNs and transformers and emphasizing scale-aware processing, it addresses the challenges inherent in deploying dense prediction models on resource-constrained platforms. Future iterations may focus on expanding application domains and refining architecture components for broader AI deployment scenarios.