Tip-Adapter: Advances in Training-Free Vision-Language Modeling
The paper presents Tip-Adapter, a training-free method for enhancing CLIP (Contrastive Language-Image Pre-training) in few-shot settings. By constructing its adapter directly from the few-shot data, Tip-Adapter preserves CLIP's training-free nature while achieving performance competitive with training-intensive methods such as CLIP-Adapter.
Background and Motivation
CLIP has demonstrated impressive zero-shot performance by applying contrastive learning to large-scale image-text pair datasets, but its few-shot performance leaves room for improvement. CLIP-Adapter addresses this with a lightweight feature adapter, yet that adapter must be trained, reintroducing computational demands. Tip-Adapter is designed to close this gap: it keeps the setup training-free while delivering competitive, and often superior, few-shot performance.
Methodology
Tip-Adapter takes a non-parametric approach: it constructs a key-value cache model from the few-shot training data and converts it directly into adapter weights. The method involves:
- Feature Extraction: The CLIP visual encoder maps the few-shot images to visual features, and their labels are converted into one-hot vectors.
- Cache Construction: The features serve as keys and the one-hot labels as values, yielding a cache model built without backpropagation or gradient updates (a code sketch follows this list).
- Adapter Integration: The cache is then used to initialize the weights of the adapter's two-layer structure, enabling immediate deployment without any training or fine-tuning.
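As a rough illustration of the first two steps, the cache construction might look like the following PyTorch sketch. The function name, the `visual_encoder` callable, and the tensor shapes are assumptions made here for clarity, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def build_cache(visual_encoder, images, labels, num_classes):
    """Build the key-value cache from the N-way K-shot training set:
    keys are L2-normalized CLIP visual features, values are one-hot labels.
    No backpropagation or gradient updates are involved."""
    with torch.no_grad():
        feats = visual_encoder(images)                   # (N*K, D) visual features
        keys = F.normalize(feats, dim=-1)                # (N*K, D) cache keys
    values = F.one_hot(labels, num_classes).float()      # (N*K, C) cache values
    return keys, values
```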
This design ensures that Tip-Adapter remains computationally efficient while still enhancing CLIP's baseline performance by utilizing few-shot data effectively.
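At inference time, the cache is queried with the test image's CLIP feature and the retrieved few-shot knowledge is blended with CLIP's zero-shot prediction. The sketch below assumes L2-normalized features; the function name, the logit scale of 100, and the default values of alpha and beta are illustrative placeholders rather than fixed choices from the paper.

```python
import torch

def tip_adapter_logits(query_feats, keys, values, text_weights, alpha=1.0, beta=5.5):
    """Training-free prediction: retrieve few-shot knowledge from the cache and
    blend it with CLIP's zero-shot classifier.
    query_feats:  (B, D) L2-normalized test features
    text_weights: (D, C) L2-normalized CLIP text embeddings of the class prompts"""
    affinity = query_feats @ keys.t()                             # (B, N*K) cosine similarities
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values   # (B, C) cache-retrieved logits
    clip_logits = 100.0 * query_feats @ text_weights              # (B, C) zero-shot CLIP logits
    return clip_logits + alpha * cache_logits                     # blended prediction
```

Here alpha balances the few-shot cache against the zero-shot classifier, and beta sharpens the exponential similarity activation applied to the cache affinities.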
Results and Analysis
The results, evaluated across 11 datasets including ImageNet, underscore the efficacy of Tip-Adapter. Its performance is comparable to CLIP-Adapter under few-shot conditions and often surpasses other state-of-the-art methods such as CoOp and linear-probe CLIP, especially with fewer shots:
- On ImageNet, Tip-Adapter improves over zero-shot CLIP by 1.70% with zero additional training epochs, whereas CLIP-Adapter requires 200 epochs of training to achieve its gains.
- Across the remaining datasets, Tip-Adapter consistently outperforms zero-shot CLIP, and its fine-tuned variant, Tip-Adapter-F, achieves the highest accuracy among the compared methods (sketched below).
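Tip-Adapter-F obtains these further gains by unfreezing only the cached keys and fine-tuning them with standard supervised training, while the CLIP encoders and the cached values remain frozen. A minimal sketch of that setup follows; the class name, default hyperparameters, and the optimizer choice in the comment are placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TipAdapterF(nn.Module):
    """Tip-Adapter-F sketch: the cached keys are the only learnable parameters;
    the cached values and the CLIP encoders stay frozen."""
    def __init__(self, keys, values):
        super().__init__()
        self.keys = nn.Parameter(keys.clone())     # (N*K, D) learnable cache keys
        self.register_buffer("values", values)     # (N*K, C) frozen one-hot values

    def forward(self, query_feats, text_weights, alpha=1.0, beta=5.5):
        affinity = query_feats @ self.keys.t()
        cache_logits = torch.exp(-beta * (1.0 - affinity)) @ self.values
        return 100.0 * query_feats @ text_weights + alpha * cache_logits

# A standard optimizer over model.parameters() then updates only the keys, e.g.
# torch.optim.AdamW(model.parameters(), lr=1e-3)  -- hyperparameters are placeholders.
```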
Implications and Future Work
Tip-Adapter is particularly valuable in resource-constrained environments where training budgets and compute are limited. Its non-parametric construction and efficiency gains offer a practical route to high-performing few-shot learning without the overhead of training large models.
Looking forward, future work could refine how the adapter weights are constructed from the cache, explore dynamic cache updates, incorporate larger pre-trained backbones, or extend the non-parametric approach to broader multimodal tasks.
This paper positions Tip-Adapter as a compelling intersection of efficiency and performance in vision-language modeling, setting the stage for broader applications and innovations with minimal computational burden.