Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
The paper introduces Tip-Adapter, a method that significantly enhances the adaptability of Contrastive Language-Image Pre-training (CLIP) to few-shot classification tasks without relying on additional training. Existing approaches to adapting CLIP typically involve fine-tuning, which demands extra training time and computational resources. In contrast, Tip-Adapter retains the efficiency of zero-shot CLIP while performing on par with these training-intensive methods.
Tip-Adapter’s architecture uses a novel key-value cache model to incorporate knowledge from the few-shot training set directly. The cache stores the CLIP visual features of the few-shot images as keys and their one-hot encoded labels as values, and it augments the pre-trained CLIP prediction by retrieval from this cache without any training updates. During inference, the feature of a test image is matched against the cached keys; the resulting affinities are used to aggregate the cached labels into an adapter prediction, which is then combined with CLIP's zero-shot prediction via a residual connection.
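The following is a minimal PyTorch-style sketch of this mechanism, following the description above. The exponential affinity activation and the residual ratio (alpha) and sharpness (beta) hyperparameters mirror the paper's formulation, but the function names and default values shown here are illustrative, not the official implementation.

    import torch
    import torch.nn.functional as F

    def build_cache(train_features, train_labels, num_classes):
        """Build the key-value cache from the few-shot training set.

        train_features: (N*K, D) CLIP visual features of the few-shot images (keys)
        train_labels:   (N*K,)   integer class labels
        """
        keys = F.normalize(train_features, dim=-1)              # cache keys
        values = F.one_hot(train_labels, num_classes).float()   # cache values (one-hot labels)
        return keys, values

    def tip_adapter_logits(test_features, clip_logits, keys, values,
                           alpha=1.0, beta=5.5):
        """Training-free inference: blend cache retrieval with zero-shot CLIP.

        test_features: (B, D) L2-normalized CLIP features of test images
        clip_logits:   (B, C) zero-shot CLIP logits (image-text similarities)
        alpha: residual ratio weighting the cache prediction
        beta:  sharpness of the affinity activation
        """
        affinity = test_features @ keys.t()                     # (B, N*K) cosine similarities
        # Exponential activation turns similarities into non-negative retrieval weights.
        cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values   # (B, C)
        # Residual connection: adapter prediction added on top of the CLIP output.
        return clip_logits + alpha * cache_logits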
Key results demonstrate that Tip-Adapter performs comparably to state-of-the-art methods such as CoOp and CLIP-Adapter while remaining training-free. For instance, in the 16-shot ImageNet setting, Tip-Adapter achieves an accuracy gain of +1.70% over zero-shot CLIP. When combined with minimal fine-tuning (20 epochs), the variant Tip-Adapter-F sets a new state of the art while requiring notably fewer training resources.
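A sketch of how this fine-tuned variant can be set up, based on the description of Tip-Adapter-F: the cache keys become learnable parameters (wrapped in a linear layer) while CLIP's encoders and the cached values stay frozen. The optimizer choice, learning rate, and the `clip_logits_fn` helper are illustrative assumptions, not the paper's exact recipe.

    import torch

    def make_finetunable_cache(keys):
        """Tip-Adapter-F: turn the cache keys into learnable parameters."""
        adapter = torch.nn.Linear(keys.shape[1], keys.shape[0], bias=False)
        adapter.weight = torch.nn.Parameter(keys.clone())   # initialize from the cache keys
        return adapter

    def finetune(adapter, loader, clip_logits_fn, values, alpha, beta, epochs=20):
        """Short fine-tuning loop over pre-extracted CLIP features (illustrative)."""
        optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
        for _ in range(epochs):
            for feats, labels in loader:          # feats: frozen CLIP features of few-shot images
                affinity = adapter(feats)         # learnable keys replace feats @ keys.T
                cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values
                logits = clip_logits_fn(feats) + alpha * cache_logits
                loss = torch.nn.functional.cross_entropy(logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()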
Through extensive experiments across 11 datasets, including ImageNet, StanfordCars, and OxfordPets, Tip-Adapter consistently demonstrates a superior accuracy-efficiency trade-off. The approach combines the cache-model formulation with the frozen weights of the underlying CLIP, effectively bypassing the heavy computational demands typical of fine-tuning deep networks.
The ablation analysis examines the factors that most influence performance, such as the residual ratio, which balances the cache prediction against CLIP's zero-shot prediction, and the cache size. Furthermore, Tip-Adapter-F achieves state-of-the-art results with rapid convergence, outperforming methods that require 200 training epochs by a meaningful margin on the few-shot benchmarks. Even with few shots, Tip-Adapter delivers notable accuracy boosts, exemplified by a +33.02% gain over the zero-shot CLIP baseline on EuroSAT.
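Since the residual ratio and the sharpness of the affinity activation are the main tuning knobs, one simple way to set them is a grid search on a held-out validation split. The sketch below reuses `tip_adapter_logits` from the earlier snippet; the search ranges are assumptions for illustration.

    def search_hyperparams(val_features, val_labels, clip_logits, keys, values):
        """Grid search over residual ratio (alpha) and sharpness (beta); ranges are illustrative."""
        best_acc, best_params = 0.0, None
        for alpha in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
            for beta in [1.0, 3.0, 5.5, 7.0, 10.0]:
                logits = tip_adapter_logits(val_features, clip_logits, keys, values,
                                            alpha=alpha, beta=beta)
                acc = (logits.argmax(dim=-1) == val_labels).float().mean().item()
                if acc > best_acc:
                    best_acc, best_params = acc, (alpha, beta)
        return best_acc, best_params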
The paper closes with a discussion of the method's implications and potential future directions. In particular, Tip-Adapter's training-free design opens possibilities for leveraging large pre-trained models in resource-constrained scenarios. Future research could extend the caching framework to more complex tasks or to domains beyond the current experimental settings.
In conclusion, Tip-Adapter offers a compelling solution for improving few-shot performance with minimal computational overhead. Its non-parametric cache design and direct incorporation of few-shot knowledge make it a practical option within the evolving landscape of adapting large pre-trained models.
This work, grounded in empirical evidence and careful design, highlights an efficient path for exploiting large-scale pre-trained models without the burdensome cost of retraining, underscoring its practical and methodological contributions to the field.