ViTAR: Vision Transformer with Any Resolution (2403.18361v2)
Abstract: This paper tackles a key limitation of Vision Transformers (ViTs): their poor scalability across image resolutions. ViTs typically suffer a performance drop when processing resolutions different from those seen during training. Our work introduces two innovations to address this issue. First, we propose a novel module for dynamic resolution adjustment, built from a single Transformer block and designed for highly efficient incremental token integration. Second, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at 1120x1120 resolution and 80.4% accuracy at 4032x4032 resolution, all while reducing computational costs. ViTAR also performs strongly on downstream tasks such as instance and semantic segmentation, and can easily be combined with self-supervised learning techniques such as Masked AutoEncoder. Our work provides a cost-effective solution for improving the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.
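The abstract describes its two mechanisms only at a high level, so the PyTorch sketch below illustrates one plausible reading of each. Everything here is an assumption for illustration: the class names, the fixed 14x14 target grid, and the single-step global cross-attention in the merger (the paper's module works iteratively and restricts attention for efficiency) are ours, and the jitter-based fuzzy positional encoding follows the stated idea of avoiding overfitting to exact positions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTokenMerger(nn.Module):
    """Single-step sketch of the dynamic-resolution module: pool an
    arbitrary H x W token grid down to a fixed G x G grid of queries,
    then let each query cross-attend to the incoming tokens. The paper's
    module is iterative and localizes attention for efficiency; this
    simplified variant only illustrates the merge-by-attention idea."""

    def __init__(self, dim: int, target: int = 14, heads: int = 8):
        super().__init__()
        self.target = target
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) patch tokens at any input resolution.
        B, H, W, C = x.shape
        g = self.target
        # Average-pool each target cell to initialize one query per cell.
        q = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), (g, g))
        q = q.flatten(2).transpose(1, 2)          # (B, g*g, C)
        kv = x.reshape(B, H * W, C)               # all incoming tokens
        merged, _ = self.attn(q, kv, kv)          # queries absorb token content
        return merged                             # (B, g*g, C), independent of H, W


class FuzzyPositionalEncoding(nn.Module):
    """Sketch of fuzzy positional encoding: a learnable embedding table is
    bilinearly sampled at token coordinates that are randomly jittered by
    up to half a grid cell during training, so the model learns positional
    neighborhoods instead of memorizing exact training-resolution offsets."""

    def __init__(self, dim: int, base: int = 14):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(1, dim, base, base))
        nn.init.trunc_normal_(self.table, std=0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        dev = self.table.device
        # Reference coordinates of the h x w token grid, normalized to [-1, 1].
        ys = torch.linspace(-1.0, 1.0, h, device=dev)
        xs = torch.linspace(-1.0, 1.0, w, device=dev)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0)  # (1, h, w, 2)
        if self.training:
            # Jitter each coordinate by at most half a cell (the "fuzz").
            cell = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)],
                                device=dev)
            grid = grid + (torch.rand_like(grid) - 0.5) * cell
        pos = F.grid_sample(self.table, grid, mode="bilinear",
                            padding_mode="border", align_corners=True)
        return pos.flatten(2).transpose(1, 2)  # (1, h*w, dim), added to tokens
```

At inference the jitter is disabled (`self.training` is False), so positions become deterministic, while the merger still maps any input resolution to the same fixed token count the downstream blocks were trained on.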
Authors: Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang