
ViTAR: Vision Transformer with Any Resolution (2403.18361v2)

Published 27 Mar 2024 in cs.CV

Abstract: This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can be easily combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.

References (37)
  1. End-to-end object detection with transformers. In ECCV, 2020.
  2. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
  3. MMDetection: OpenMMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  4. Conditional positional encodings for vision transformers. In ICLR, 2023.
  5. MMSegmentation: An open source semantic segmentation toolbox, 2020.
  6. RandAugment: Practical automated data augmentation with a reduced search space. In CVPRW, 2020.
  7. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  8. DaViT: Dual attention vision transformers. In ECCV, 2022.
  9. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  11. Lightweight vision transformer with bidirectional interaction. In NeurIPS, 2023.
  12. CMT: Convolutional neural networks meet vision transformers. In CVPR, 2022.
  13. Neighborhood attention transformer. In CVPR, 2023.
  14. Mask R-CNN. In ICCV, 2017.
  15. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  16. Vision transformer with super token sampling. In CVPR, 2023.
  17. All tokens matter: Token labeling for training better vision transformers. In NeurIPS, 2021.
  18. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
  19. Microsoft COCO: Common objects in context. In ECCV, 2014.
  20. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  21. Swin Transformer V2: Scaling up capacity and resolution. In CVPR, 2022.
  22. Learning transferable visual models from natural language supervision. In ICML, 2021.
  23. Randomized positional encodings boost length generalization of transformers. In ACL, 2023.
  24. Inception Transformer. In NeurIPS, 2022.
  25. ResFormer: Scaling ViTs with multi-resolution training. In CVPR, 2023.
  26. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  27. Attention is all you need. In NeurIPS, 2017.
  28. DropPos: Pre-training vision transformers by reconstructing dropped positions. In NeurIPS, 2023.
  29. Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
  30. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 8(3):1–10, 2022.
  31. Unified perceptual parsing for scene understanding. In ECCV, 2018.
  32. SVFormer: Semi-supervised video transformer for action recognition. In CVPR, 2023.
  33. VOLO: Vision Outlooker for visual recognition. TPAMI, 2022.
  34. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
  35. mixup: Beyond empirical risk minimization. In ICLR, 2018.
  36. Random erasing data augmentation. In AAAI, 2020.
  37. Scene parsing through ADE20K dataset. In CVPR, 2017.
Authors (8)
  1. Qihang Fan
  2. Quanzeng You
  3. Xiaotian Han
  4. Yongfei Liu
  5. Yunzhe Tao
  6. Huaibo Huang
  7. Ran He
  8. Hongxia Yang

Summary

Vision Transformers Adapted for Any Image Resolution: Introducing ViTAR

Introduction to ViTAR

In the domain of Vision Transformers (ViTs), handling image data across varying resolutions without compromising computational efficiency and model performance presents a significant challenge. This paper introduces the Vision Transformer with Any Resolution (ViTAR), an innovative approach that substantially enhances the adaptability of ViTs to process images of various resolutions efficiently. ViTAR is distinguished by two key innovations: the Adaptive Token Merger (ATM) and Fuzzy Positional Encoding (FPE), which collectively empower the model with unprecedented resolution adaptability and computational efficiency.

Key Innovations

Adaptive Token Merger (ATM)

The ATM module is at the heart of ViTAR's design for processing images across different resolutions efficiently. It operates through a process called GridAttention, in which tokens are progressively merged within local grids until a fixed number of tokens is reached, irrespective of the input image size. This not only improves the model's resolution adaptability but also substantially reduces the computational load on high-resolution inputs, allowing ViTAR to maintain or even surpass baseline performance across a wide range of resolutions at a remarkably low cost.
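The summary does not spell out the exact GridAttention formulation, but the core idea, merging each local window of tokens into a single token via cross-attention until the grid reaches a fixed size, can be sketched in a few lines of PyTorch. Everything below (the `GridAttentionMerge` module, the 2x2 window, the mean-pooled query, and the 14x14 target grid) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GridAttentionMerge(nn.Module):
    """Illustrative grid-based token merging step (a sketch, not the official ViTAR code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token grid; H and W are assumed even here.
        B, H, W, C = x.shape
        # Group tokens into non-overlapping 2x2 windows -> (B * H/2 * W/2, 4, C).
        windows = (
            x.view(B, H // 2, 2, W // 2, 2, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(-1, 4, C)
        )
        # The mean token of each window queries the window's tokens (cross-attention).
        query = windows.mean(dim=1, keepdim=True)
        merged, _ = self.attn(query, windows, windows)
        merged = self.norm(merged + query)  # residual connection + layer norm
        # Each 2x2 window collapses into one token: the grid halves per side.
        return merged.view(B, H // 2, W // 2, C)


def adaptive_token_merger(x: torch.Tensor, merge: GridAttentionMerge,
                          target: int = 14) -> torch.Tensor:
    """Apply the merging step until the token grid is no larger than `target` per side."""
    while x.shape[1] > target and x.shape[1] % 2 == 0 and x.shape[2] % 2 == 0:
        x = merge(x)
    return x


# Usage: an 896x896 image with 16x16 patches yields a 56x56 token grid,
# which two merging steps reduce to the fixed 14x14 grid.
tokens = torch.randn(1, 56, 56, 384)
merged = adaptive_token_merger(tokens, GridAttentionMerge(384), target=14)
print(merged.shape)  # torch.Size([1, 14, 14, 384])
```

Because the grid handed to the subsequent transformer blocks has a fixed size, their cost stays roughly constant regardless of the input resolution, which is where the computational savings described above come from.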

Fuzzy Positional Encoding (FPE)

To further bolster ViTAR's resolution robustness, Fuzzy Positional Encoding (FPE) replaces exact positional information with lightly perturbed positions during training. Unlike traditional positional encodings, which are highly sensitive to resolution changes, FPE prevents the model from overfitting to the specific resolutions seen during training and lets it generalize more robustly to unseen resolutions at inference time. The perturbation acts as an implicit form of data augmentation, encouraging ViTAR to learn more generalized positional information and enhancing its overall adaptability and performance.
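As a purely illustrative sketch of the perturbation idea: during training, each token's grid coordinate is jittered by a uniform offset before a learnable positional map is bilinearly sampled at that coordinate, while inference uses the exact coordinates. The module name `FuzzyPositionalEncoding`, the 14x14 positional map, and the [-0.5, 0.5) offset range below are assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FuzzyPositionalEncoding(nn.Module):
    """Illustrative fuzzy positional encoding (a sketch, not the official ViTAR code)."""

    def __init__(self, dim: int, grid_size: int = 14):
        super().__init__()
        # Learnable positional map stored as (1, C, grid, grid) for grid_sample.
        self.pos_map = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size))
        nn.init.trunc_normal_(self.pos_map, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token grid.
        B, H, W, C = x.shape
        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=torch.float32, device=x.device),
            torch.arange(W, dtype=torch.float32, device=x.device),
            indexing="ij",
        )
        coords = torch.stack((xs, ys), dim=-1)                # (H, W, 2), (x, y) order
        if self.training:
            coords = coords + torch.rand_like(coords) - 0.5   # fuzzy offsets in [-0.5, 0.5)
        # Normalize coordinates to [-1, 1] so grid_sample stretches the map over the grid.
        norm = torch.tensor([max(W - 1, 1), max(H - 1, 1)],
                            dtype=torch.float32, device=x.device)
        grid = (coords / norm) * 2.0 - 1.0
        grid = grid.unsqueeze(0).expand(B, -1, -1, -1)        # (B, H, W, 2)
        pos = F.grid_sample(self.pos_map.expand(B, -1, -1, -1), grid,
                            mode="bilinear", padding_mode="border",
                            align_corners=True)                # (B, C, H, W)
        return x + pos.permute(0, 2, 3, 1)                    # add positions to tokens


# Usage: add fuzzy positions to a 56x56 token grid before the transformer blocks.
fpe = FuzzyPositionalEncoding(dim=384)
fpe.train()
out = fpe(torch.randn(2, 56, 56, 384))  # (2, 56, 56, 384)
```

Sampling the same positional map at arbitrary, possibly non-integer coordinates is also what allows the encoding to be evaluated at resolutions never seen during training.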

Experiments and Results

The efficacy of ViTAR was demonstrated through extensive experiments across a variety of tasks, including image classification, object detection, instance segmentation, and semantic segmentation, as well as compatibility with self-supervised learning frameworks like MAE. ViTAR achieved remarkable results, particularly in image classification, where it exhibited superior adaptability across a wide range of input resolutions, significantly outperforming existing models like DeiT and ResFormer in both accuracy and computational efficiency. In downstream tasks requiring high-resolution inputs, ViTAR demonstrated performance comparable to state-of-the-art models while requiring considerably fewer computational resources.

Implications and Future Directions

The introduction of ViTAR presents several important implications for the future of high-resolution image processing and generative AI fields. The ability of ViTAR to efficiently adapt to any given resolution, combined with its compatibility with self-supervised learning frameworks, paves the way for more versatile and computationally efficient vision transformers. This could significantly impact areas requiring high-resolution image handling, like satellite image analysis, medical imaging, and more.

Furthermore, the innovative approaches of ATM and FPE open new avenues for research into resolution adaptability and efficiency in vision transformers. Future work may explore deeper integration of these concepts with other transformer architectures and their applications beyond vision-based tasks, potentially extending to video processing and multimodal learning.

Conclusion

ViTAR represents a significant step forward in the development of Vision Transformers, offering a scalable, efficient solution for processing high-resolution images across varied tasks without sacrificing performance. By addressing the critical challenges of adaptability and computational efficiency, ViTAR sets a new benchmark for future research and applications of vision transformers in handling diverse and high-resolution image data.
