
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution (2307.06304v1)

Published 12 Jul 2023 in cs.CV, cs.AI, and cs.LG

Abstract: The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

Authors (15)
  1. Mostafa Dehghani (64 papers)
  2. Basil Mustafa (32 papers)
  3. Josip Djolonga (21 papers)
  4. Jonathan Heek (13 papers)
  5. Matthias Minderer (19 papers)
  6. Mathilde Caron (25 papers)
  7. Andreas Steiner (17 papers)
  8. Joan Puigcerver (20 papers)
  9. Robert Geirhos (28 papers)
  10. Ibrahim Alabdulmohsin (31 papers)
  11. Avital Oliver (9 papers)
  12. Piotr Padlewski (9 papers)
  13. Alexey Gritsenko (16 papers)
  14. Mario Lučić (51 papers)
  15. Neil Houlsby (62 papers)
Citations (63)

Summary

Analysis of "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

The paper introduces NaViT (Native Resolution Vision Transformer), an approach that leverages the flexibility of Vision Transformers (ViTs) to handle images of arbitrary resolutions and aspect ratios, removing the need to resize inputs to a fixed resolution, a practice that is ubiquitous yet demonstrably suboptimal. The work builds on the sequence-based modeling capacity of ViTs, introducing methods for efficient training and inference.

Key Contributions

NaViT rests on two primary innovations: "Patch n' Pack" and factorized positional embeddings. The former packs patches from multiple images into a single sequence, akin to example packing in NLP, thereby supporting images of varying resolutions and aspect ratios. The latter addresses positional embeddings for variable input sizes: factorizing the embeddings into separate x and y components, with absolute or fractional coordinate variants, enables generalization across diverse scales.
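The packing idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the helper names `patchify` and `pack`, the greedy first-fit strategy, and the 16-pixel patch size are all assumptions made here for clarity.

```python
import numpy as np

def patchify(image, p=16):
    """Split an (H, W, C) image into (H//p)*(W//p) flattened p x p patches."""
    h, w, c = image.shape
    gh, gw = h // p, w // p
    x = image[:gh * p, :gw * p].reshape(gh, p, gw, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * c)

def pack(images, seq_len, p=16):
    """Greedily pack patch sequences from several images into fixed-length
    rows. Returns (rows, ids): ids[r, t] is the index of the image that
    token t in row r came from, or -1 for padding. Assumes each image
    fits within seq_len tokens on its own."""
    rows, ids, cur, cur_ids = [], [], [], []
    d = p * p * images[0].shape[-1]  # flattened patch dimension

    def flush():
        pad = seq_len - len(cur)
        rows.append(np.concatenate([np.array(cur).reshape(-1, d),
                                    np.zeros((pad, d))]))
        ids.append(cur_ids + [-1] * pad)

    for i, img in enumerate(images):
        patches = patchify(img, p)
        if len(cur) + len(patches) > seq_len:  # start a new packed row
            flush()
            cur, cur_ids = [], []
        cur.extend(patches)
        cur_ids.extend([i] * len(patches))
    if cur:
        flush()
    return np.stack(rows), np.array(ids)
```

Because every packed row has the same shape, batches remain static, which is what fixed-shape accelerator hardware requires; the per-token image ids are what the masked attention described below consumes.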

The architectural modifications include the integration of masked self-attention and pooling to avoid cross-example interference within sequences. Additionally, NaViT introduces strategies for handling variable resolution sampling and token dropping, enabling more efficient training processes, notably within the constraints of fixed batch shapes common in modern deep learning hardware architectures.

Experimental Results

The experimental outcomes of NaViT surpass those of conventional fixed-resolution ViTs in numerous aspects:

  1. Training Efficiency: NaViT consistently outperforms ViT baselines across computational scales by processing substantially more training examples within the same compute budget. This gain stems from the combination of Patch n' Pack, variable resolution sampling, and adaptive token dropping strategies.
  2. Performance at Variable Resolutions: The paper observes significant improvements when employing variable resolution during both pre-training and fine-tuning phases. Models trained under this paradigm showcase superior flexibility and performance across a range of resolutions, leading to compelling cost-performance trade-offs during inference.
  3. Semantic and Object Recognition: The efficiency of NaViT extends to downstream applications including image and video classification, object detection, and semantic segmentation, with improved results on established benchmarks.
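The test-time cost-performance trade-off in point 2 can be illustrated with a small helper that chooses an input resolution under a token budget while preserving aspect ratio. The function name `resolution_for_budget` and the 16-pixel patch size are illustrative assumptions, not part of the paper.

```python
def resolution_for_budget(h, w, max_tokens, p=16):
    """Pick the largest patch grid (gh, gw) that roughly preserves the
    h:w aspect ratio while keeping gh * gw <= max_tokens. Smaller budgets
    trade accuracy for inference speed."""
    # Scale factor so the patch grid area approximately hits the budget.
    scale = (max_tokens * p * p / (h * w)) ** 0.5
    gh = max(1, int(h * scale) // p)
    gw = max(1, int(w * scale) // p)
    while gh * gw > max_tokens:  # trim if rounding overshot the budget
        if gh >= gw:
            gh -= 1
        else:
            gw -= 1
    return gh * p, gw * p
```

At inference time the same model can then be run at whatever resolution the budget allows, since NaViT does not assume a fixed input size.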

Theoretical and Practical Implications

The implications of this research are both practical and theoretical. Practically, NaViT offers a feasible route to improve model efficiency, reduce computational overhead, and enhance robustness and fairness across a wide range of applications without adhering to fixed input sizes. Theoretically, the introduction of factorized positional embeddings and the exploration of new resolution sampling strategies open pathways for further exploration in the scalability and adaptability of machine learning models.

Future Directions

The findings encourage directions such as adaptive computation and further integrations of hierarchical architectures with flexible resolution capabilities. Given the substantial impact of Patch n' Pack, exploring more sophisticated packing algorithms and the potential incorporation of multiscale feature hierarchies into the NaViT framework could yield additional gains. These developments may contribute to reducing the resource intensity of training large-scale models, an endeavor increasingly critical as model sizes and dataset complexities continue to escalate.

The contributions of this paper mark a noteworthy shift away from the conventional CNN-designed input pipeline toward more adaptable and resource-efficient paradigms built on Vision Transformers, setting a promising precedent for future advances in computer vision research.
