FlexiViT: One Model for All Patch Sizes
The paper presents FlexiViT, a model that addresses a key limitation of Vision Transformers (ViTs): the patch size, which controls the compute/accuracy trade-off, is normally fixed at training time and cannot be changed without retraining. FlexiViT randomizes the patch size during training, yielding a single model that performs well across a wide range of patch sizes and therefore supports flexible computation budgets at deployment. The paper evaluates FlexiViT on a variety of tasks, including classification and segmentation, where it often achieves results comparable or superior to models trained at a single, fixed patch size.
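As a rough illustration of the training recipe, the sketch below samples a patch size per batch, resizes the learnable patch-embedding and position-embedding parameters (stored once at a fixed base size) to match, and then runs an ordinary ViT forward pass. The model object and the helper functions (`bilinear_resize_posemb`, `cross_entropy`, and `pi_resize_patch_embed`, which is sketched after the Key Contributions list) are hypothetical stand-ins, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Patch sizes used in the paper; all divide the 240-pixel training resolution.
PATCH_SIZES = [48, 40, 30, 24, 20, 16, 15, 12, 10, 8]

def flexi_train_step(model, images, labels):
    """One FlexiViT-style training step (illustrative only)."""
    # 1. Sample a patch size for this batch.
    p = int(rng.choice(PATCH_SIZES))

    # 2. Resize the learnable parameters, which are stored at a single base
    #    size (32x32 patch weights and a 7x7 position grid in the paper),
    #    to match the sampled patch size p.
    w_emb = pi_resize_patch_embed(model.base_patch_embed, p)         # sketched below
    posemb = bilinear_resize_posemb(model.base_posemb, grid=240 // p)

    # 3. Standard ViT forward pass on the (240/p)^2 tokens, then the usual loss.
    logits = model.forward(images, patch_size=p,
                           patch_embed=w_emb, posemb=posemb)
    return cross_entropy(logits, labels)
```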
Key Contributions
- Patch Randomization Strategy: By randomizing the patch size at each training step, FlexiViT removes the need to retrain a separate model for each patch size, making the compute/accuracy trade-off adjustable at deployment time.
- Adaptive Resizing: The patch-embedding weights and positional embeddings are resized on the fly to match the sampled patch size, so the same architecture handles different sequence lengths without structural changes.
- Compatibility with Pretrained Models: FlexiViT remains compatible with existing pretrained ViT models. The key ingredient is PI-resize (pseudo-inverse resize), which resizes patch-embedding weights so that the resulting token embeddings approximate those computed at the original patch size, preserving their effectiveness across patch sizes (see the sketch after this list).
- Superior Flexibility in Practical Applications: Through extensive evaluations on ImageNet-1k and various downstream tasks including open-world detection and semantic segmentation, FlexiViT demonstrates flexibility without sacrificing performance.
- Resource-Efficient Transfer Learning: Because FlexiViT retains its flexibility across patch sizes, it can be pretrained cheaply with large patches and then transferred or evaluated with smaller patches when higher accuracy justifies the extra compute, yielding significant savings in training resources.
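A minimal NumPy sketch of PI-resize follows, assuming bilinear interpolation as the underlying resize operation; the construction of the resize matrix via scipy.ndimage.zoom and the function names are illustrative choices, not the paper's implementation. The idea is to pick new weights so that token embeddings computed from resized patches reproduce, as closely as possible, the embeddings the original weights produce on the original patches.

```python
import numpy as np
from scipy.ndimage import zoom  # bilinear resize; stands in for the paper's resize op

def resize_matrix(old_size, new_size):
    """Materialize the linear map B that resizes a flattened
    (old_size x old_size) patch to (new_size x new_size)."""
    n_in, n_out = old_size ** 2, new_size ** 2
    B = np.zeros((n_out, n_in))
    for i in range(n_in):
        basis = np.zeros(n_in)
        basis[i] = 1.0
        resized = zoom(basis.reshape(old_size, old_size),
                       new_size / old_size, order=1)  # order=1 -> bilinear
        B[:, i] = resized.reshape(-1)
    return B

def pi_resize_patch_embed(w, new_size):
    """Pseudo-inverse (PI) resize of patch-embedding weights.

    w: (old_size, old_size, in_channels, width) patch-embedding kernel.
    Returns a (new_size, new_size, in_channels, width) kernel chosen so that
    <resize(x), w_new> approximates <x, w> for input patches x, i.e. token
    embeddings are (approximately) preserved under patch resizing.
    """
    old_size = w.shape[0]
    B = resize_matrix(old_size, new_size)      # (new^2, old^2)
    P = np.linalg.pinv(B).T                    # least-squares solution of B^T w_new = w
    w_flat = w.reshape(old_size ** 2, -1)      # (old^2, in_channels * width)
    w_new = P @ w_flat                         # (new^2, in_channels * width)
    return w_new.reshape(new_size, new_size, *w.shape[2:])
```

Because the pseudo-inverse gives the least-squares (and, when upsampling, minimum-norm exact) solution of B^T w_new = w, this preserves the original token embeddings far better than naively resizing the weight kernel itself, which is what makes existing fixed-patch checkpoints reusable across patch sizes.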
Strong Numerical Results
FlexiViT shows robust performance across tasks compared to standard ViT models trained at a single patch size. It matches or exceeds the performance of these models, particularly when they are evaluated at patch sizes different from those used during their training. In the reported experiments, FlexiViT also clearly outperforms fixed-patch-size models on tasks such as semantic segmentation (e.g., Cityscapes mIoU), indicating enhanced adaptability and practical application potential.
Implications and Future Directions
The introduction of FlexiViT provides valuable insights into the flexibility of neural architectures, particularly in handling variable computational demands and model scalability. Conceptually, it challenges the assumption that choices such as patch size must be fixed at training time, a shift that may influence future research on adapting models to dynamic environments.
Practically, FlexiViT's adaptability makes it a viable candidate for real-world scenarios where computational resources can be unpredictable or constrained. The principles introduced could inspire future explorations in automated adaptability, extending beyond vision tasks to broader AI applications. Moreover, this research may prompt further investigations into adaptive training methodologies, fostering deeper integration of flexibility within model architectures.
Overall, FlexiViT presents a significant step toward bridging the gap between performance optimization and resource efficiency in the context of neural networks dealing with vision tasks. By simplifying the adaptation to different computational settings, FlexiViT sets a foundation for more resilient and versatile AI systems in practical applications.