Convolutional Bypasses Are Better Vision Transformer Adapters
This paper addresses the challenges of the parameter-efficient transfer learning (PETL) paradigm when adapting large Vision Transformer (ViT) models to downstream visual tasks. As ViT models keep growing in size, full fine-tuning becomes impractical because a separate copy of every parameter must be stored per task. PETL strategies, which originated in NLP, have been carried over to ViT, but the authors argue that these methods lack the inductive biases suited to visual tasks, and that this is a key limitation.
The authors propose Convolutional Bypasses (Convpass) as an adaptation module tailored to ViT. Convpass inserts lightweight convolutional modules alongside the frozen ViT layers, introducing a hard-coded inductive bias aligned with visual data. The added modules account for less than 0.5% of the model's parameters, so adaptation remains efficient even in data-constrained scenarios.
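To make the parameter budget concrete, the sketch below shows one way to freeze a pretrained backbone and leave only the inserted bypass modules (and the task head) trainable. The `convpass` naming convention and the helper itself are assumptions for illustration, not the paper's released code.

```python
import torch.nn as nn

def mark_bypass_trainable(model: nn.Module, keyword: str = "convpass") -> float:
    """Freeze every parameter except those of the inserted bypass modules.

    `keyword` is a hypothetical naming convention: any parameter whose name
    contains it, plus the task head (assumed to be named "head"), stays
    trainable. Returns the trainable fraction, which for ViT-B with small
    bottleneck modules stays well below 1% of the full model.
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = (keyword in name) or name.startswith("head")
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total
```

Only the small set of trainable weights needs to be stored per downstream task, which is the core storage argument for PETL.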
Key Insights and Methodology
Building on the observed limitations of language-oriented PETL methods, the paper argues that ViT adaptation benefits from explicit visual inductive biases. Existing PETL strategies such as Adapters, LoRA, and VPT were designed with NLP in mind: they operate on each token independently and do not exploit the spatial structure that is crucial for visual recognition, as the sketch below illustrates.
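For contrast, a standard NLP-style adapter is just a per-token bottleneck MLP. The sketch below (dimensions chosen for illustration) makes explicit that no operation mixes neighbouring patches, so the module carries no spatial inductive bias.

```python
import torch
import torch.nn as nn

class NlpStyleAdapter(nn.Module):
    """Token-wise bottleneck adapter (Houlsby-style), shown for contrast.

    Each token is projected down, passed through a non-linearity, and
    projected back up; tokens never interact inside the adapter.
    """

    def __init__(self, embed_dim: int = 768, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(embed_dim, bottleneck)
        self.up = nn.Linear(bottleneck, embed_dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim); every token is independent.
        return self.up(self.act(self.down(tokens)))
```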
The proposed Convpass modules run in parallel with the Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) blocks inside each ViT layer. Each module reshapes the flattened sequence of image tokens back into a 2D grid so that a small convolution can exploit spatial locality, while the [cls] token is processed separately through the same projections; a sketch follows below.
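A minimal sketch of such a bypass is given below, assuming the bottleneck layout described above (1x1 down-projection, 3x3 convolution over the restored patch grid, 1x1 up-projection). The handling of the [cls] token as a 1x1 map and the exact dimensions are simplifications for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ConvpassSketch(nn.Module):
    """Sketch of a convolutional bypass applied to a sequence of ViT tokens.

    Bottleneck layout: 1x1 conv down-projection -> 3x3 conv over the
    restored 2D patch grid -> 1x1 conv up-projection.
    """

    def __init__(self, embed_dim: int = 768, dim: int = 8):
        super().__init__()
        self.down = nn.Conv2d(embed_dim, dim, kernel_size=1)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.up = nn.Conv2d(dim, embed_dim, kernel_size=1)
        self.act = nn.GELU()

    def _bottleneck(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, embed_dim, H, W) -> (B, embed_dim, H, W)
        return self.up(self.act(self.conv(self.act(self.down(feat)))))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + H*W, embed_dim), [cls] token first, then image tokens.
        B, N, C = x.shape
        cls_tok, img_tok = x[:, :1], x[:, 1:]
        H = W = int((N - 1) ** 0.5)

        # Restore the 2D layout so the 3x3 conv sees neighbouring patches.
        img_map = img_tok.transpose(1, 2).reshape(B, C, H, W)
        img_out = self._bottleneck(img_map).flatten(2).transpose(1, 2)

        # Treat the [cls] token as a 1x1 "image" sharing the same weights.
        cls_map = cls_tok.transpose(1, 2).reshape(B, C, 1, 1)
        cls_out = self._bottleneck(cls_map).flatten(2).transpose(1, 2)

        return torch.cat([cls_out, img_out], dim=1)
```

Inside a ViT block such a module would sit in parallel with the frozen sublayers, roughly `x = x + MHSA(LN(x)) + s * Convpass(LN(x))` with `s` a small scaling factor, and the same pattern repeated for the MLP sublayer.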
Through extensive experiments on the VTAB-1K benchmark and several few-shot learning datasets, Convpass outperforms the language-oriented PETL baselines, with particularly strong gains in low-data regimes where training samples are scarce.
Experimental Results
Empirical results on the VTAB-1K benchmark show that Convpass consistently outperforms state-of-the-art language-oriented PETL methods across a wide range of visual tasks, improving the average accuracy and surpassing the majority of the compared methods. In few-shot learning experiments on fine-grained datasets, Convpass likewise provides robust improvements, reinforcing its data-efficiency advantage.
In comparisons with architectures that already possess visual inductive biases, such as Swin Transformer and ConvNeXt, Convpass is shown to bridge the inductive-bias gap in ViT: the adapted ViT reaches competitive performance and often surpasses full fine-tuning of its convolutional counterparts.
Implications and Future Directions
The introduction of vision-oriented adaptation modules sets a precedent for customizing transfer-learning techniques to vision transformers. Convpass not only mitigates the parameter inefficiency of full fine-tuning but also opens a promising path toward further convolution-based modules that strengthen spatial representation.
As ViT and other transformer models continue to proliferate in visual domains, integrating structured inductive biases will be crucial for handling diverse datasets and tasks efficiently. Future work could explore hybrid architectures that further refine the trade-off between efficiency and inductive bias, possibly with dynamic architectural adjustments driven by the task at hand.
In conclusion, the paper offers a thorough investigation and a compelling case for Convpass as a PETL method tailored to vision transformers. The approach widens the frontier of vision-oriented adaptation and lays a methodological foundation for subsequent work on the efficient fine-tuning of large-scale visual models.