Vision Transformers Adapted for Any Image Resolution: Introducing ViTAR
Introduction to ViTAR
In the domain of Vision Transformers (ViTs), handling image data across varying resolutions without compromising computational efficiency or model performance remains a significant challenge. This paper introduces the Vision Transformer with Any Resolution (ViTAR), an approach that substantially enhances the ability of ViTs to process images of varying resolutions efficiently. ViTAR is distinguished by two key innovations: the Adaptive Token Merger (ATM) and Fuzzy Positional Encoding (FPE), which together give the model strong resolution adaptability at low computational cost.
Key Innovations
Adaptive Token Merger (ATM)
The ATM module is at the heart of ViTAR's design for processing images across different resolutions efficiently. It partitions the incoming token map into a grid and, through a process called GridAttention, merges the tokens within each grid cell into a single token. Repeating this merging progressively reduces the token map to a fixed, resolution-independent number of tokens, irrespective of the input image size. This not only improves the model's resolution adaptability but also substantially reduces the computational load on high-resolution inputs: ViTAR shows a remarkably low computational cost while maintaining, or even surpassing, model performance across various resolutions.
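The merging process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: learned projection matrices, multi-head attention, and FFN layers are omitted, the 14×14 target grid and the halving schedule are illustrative assumptions, and the cell's mean token is used directly as the attention query.

```python
import numpy as np

def grid_attention(tokens, grid_size):
    """Merge an (H, W, D) token map down to (Gh, Gw, D).

    Tokens are partitioned into Gh x Gw cells; within each cell the
    mean token serves as a single attention query over the cell's
    tokens (single head, no learned projections in this sketch).
    """
    H, W, D = tokens.shape
    Gh, Gw = grid_size
    ch, cw = H // Gh, W // Gw  # cell extents (assume divisibility)
    out = np.zeros((Gh, Gw, D))
    for i in range(Gh):
        for j in range(Gw):
            cell = tokens[i*ch:(i+1)*ch, j*cw:(j+1)*cw].reshape(-1, D)
            query = cell.mean(axis=0)            # mean token as the query
            scores = cell @ query / np.sqrt(D)   # scaled dot-product
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()             # softmax over cell tokens
            out[i, j] = weights @ cell           # attention-weighted merge
    return out

def adaptive_token_merger(tokens, target=(14, 14)):
    """Progressively halve the token grid until the target size is reached."""
    H, W, _ = tokens.shape
    while (H, W) != target:
        H = max(H // 2, target[0])
        W = max(W // 2, target[1])
        tokens = grid_attention(tokens, (H, W))
    return tokens
```

Because the output grid is fixed regardless of input size, the transformer blocks that follow always see the same token count, which is what keeps the cost of high-resolution inputs bounded.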
Fuzzy Positional Encoding (FPE)
To further bolster ViTAR's resolution robustness, Fuzzy Positional Encoding (FPE) marks a second key advance. Traditional positional encodings are highly sensitive to resolution changes; FPE instead perturbs each token's positional reference with small random offsets during training. This prevents the model from overfitting to the exact positions seen at any particular training resolution, enabling it to generalize more robustly to unseen resolutions during inference. The perturbation can be seen as an implicit form of data augmentation on positional information, allowing ViTAR to learn more generalized positional cues and enhancing its overall adaptability and performance.
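The idea can be sketched as follows, again as an illustrative assumption rather than the paper's exact implementation: a learnable positional-embedding table is sampled by bilinear interpolation, and during training each token's coordinate is jittered by a uniform offset in (-0.5, 0.5) before the lookup; at inference the offsets are dropped, and interpolation alone handles arbitrary grid sizes. The table shape is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuzzy_positional_encoding(pos_table, H, W, training=True):
    """Sample positional embeddings for an H x W token grid.

    pos_table: (Th, Tw, D) learnable embedding grid (hypothetical shape).
    Token coordinates are mapped into table coordinates; in training
    mode a uniform offset in (-0.5, 0.5) is added first, so the exact
    positions the model sees are "fuzzy".
    """
    Th, Tw, D = pos_table.shape
    ys = np.arange(H) * (Th - 1) / max(H - 1, 1)
    xs = np.arange(W) * (Tw - 1) / max(W - 1, 1)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    if training:
        yy = yy + rng.uniform(-0.5, 0.5, yy.shape)
        xx = xx + rng.uniform(-0.5, 0.5, xx.shape)
    yy = np.clip(yy, 0, Th - 1)
    xx = np.clip(xx, 0, Tw - 1)
    # Bilinear interpolation on the embedding table.
    y0 = np.floor(yy).astype(int); x0 = np.floor(xx).astype(int)
    y1 = np.minimum(y0 + 1, Th - 1); x1 = np.minimum(x0 + 1, Tw - 1)
    wy = (yy - y0)[..., None]; wx = (xx - x0)[..., None]
    top = pos_table[y0, x0] * (1 - wx) + pos_table[y0, x1] * wx
    bot = pos_table[y1, x0] * (1 - wx) + pos_table[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

At inference (`training=False`) the lookup is deterministic, and when the token grid matches the table size it reduces to a plain table read; the training-time jitter is what discourages the model from memorizing any one resolution's positions.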
Experiments and Results
The efficacy of ViTAR was demonstrated through extensive experiments across a variety of tasks, including image classification, object detection, instance segmentation, semantic segmentation, and compatibility with self-supervised learning frameworks like MAE. ViTAR achieved remarkable results, particularly in image classification, where it exhibited superior adaptability across a wide range of input resolutions, significantly outperforming existing models like DeiT and ResFormer in both accuracy and computational efficiency. In downstream tasks requiring high-resolution inputs, ViTAR demonstrated performance comparable to state-of-the-art models while requiring considerably fewer computational resources.
Implications and Future Directions
The introduction of ViTAR carries several important implications for the future of high-resolution image processing and generative AI. ViTAR's ability to adapt efficiently to any given resolution, combined with its compatibility with self-supervised learning frameworks, paves the way for more versatile and computationally efficient vision transformers. This could significantly impact areas requiring high-resolution image handling, such as satellite image analysis and medical imaging.
Furthermore, the innovative approaches of ATM and FPE open new avenues for research into resolution adaptability and efficiency in vision transformers. Future work may explore deeper integration of these concepts with other transformer architectures and their applications beyond vision-based tasks, potentially extending to video processing and multimodal learning.
Conclusion
ViTAR represents a significant step forward in the development of Vision Transformers, offering a scalable, efficient solution for processing high-resolution images across varied tasks without sacrificing performance. By addressing the critical challenges of adaptability and computational efficiency, ViTAR sets a new benchmark for future research and applications of vision transformers in handling diverse and high-resolution image data.