- The paper introduces SHIP, a VAE-based synthetic prompt generator that enhances CLIP’s zero-shot recognition performance.
- It leverages generated visual features from synthesized prompts to reduce reliance on extensive labeled datasets.
- Experiments show significant accuracy improvements on benchmarks like ImageNet, outperforming existing baselines.
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts
The paper presents an approach to improving the zero-shot generalization of CLIP through synthesized prompts. Noting that traditional fine-tuning methods require labeled data for every class, while real-world class frequencies follow a long-tailed, Zipf-like distribution in which many classes have few or no labeled examples, the study introduces a generative method, SyntHesIzed Prompts (SHIP), aimed at improving CLIP's adaptability to emerging concepts and data-sparse classes.
Method and Approach
The method builds on the variational autoencoder (VAE) to train a generative model that synthesizes visual features corresponding to the textual prompts and class names fed into CLIP's language encoder. This sidesteps the need for large labeled datasets, especially for classes with little or no image data.
The SHIP framework uses the VAE to generate prompts that align with CLIP's learned representations. The generator reconstructs visual features by passing synthesized prompts through the language encoder, strengthening feature representations for classes where only class names, and no images, are available. The authors choose a VAE over an adversarial network because VAEs train more stably and are more data-efficient in low-data regimes.
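The generative pipeline can be pictured as a conditional VAE: an encoder compresses a CLIP visual feature into a latent code, and a decoder, conditioned on a class embedding, reconstructs the feature. Below is a minimal NumPy sketch of that structure; the dimensions and the single linear layers standing in for the encoder/decoder networks are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_COND, D_LAT = 512, 512, 64  # feature, condition, latent dims (illustrative)

# Single linear maps stand in for the encoder/decoder networks.
W_mu = rng.normal(0, 0.02, (D_FEAT + D_COND, D_LAT))
W_logvar = rng.normal(0, 0.02, (D_FEAT + D_COND, D_LAT))
W_dec = rng.normal(0, 0.02, (D_LAT + D_COND, D_FEAT))

def encode(x, c):
    """Map a visual feature x and class condition c to latent mean / log-variance."""
    h = np.concatenate([x, c])
    return h @ W_mu, h @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, c):
    """Reconstruct a visual feature from latent z and class condition c."""
    return np.concatenate([z, c]) @ W_dec

x = rng.standard_normal(D_FEAT)   # stand-in CLIP image feature
c = rng.standard_normal(D_COND)   # stand-in class-name text embedding
mu, logvar = encode(x, c)
x_hat = decode(reparameterize(mu, logvar), c)
```

At inference time, sampling z from the standard normal prior and conditioning on a new class name yields synthetic features for classes that have no training images. Note that in SHIP itself the decoder does not emit the feature directly: it emits a prompt that is passed through CLIP's frozen text encoder to produce the feature.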
The generated prompts combine global learnable vectors, shared across classes, with instance-specific local biases produced by a lightweight MLP, while CLIP's pretrained encoders remain frozen. The synthesized features are then pooled with the existing labeled data, and off-the-shelf fine-tuning methods are applied to the combined set to improve zero-shot recognition.
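The prompt construction just described can be sketched as follows: a shared bank of learnable context vectors is offset by a bias that a small MLP derives from the latent code, and the class-name token is appended before the sequence goes into the frozen text encoder. Context length, dimensions, and the two-layer MLP are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_CTX, D_TOK, D_LAT = 4, 512, 64  # context length, token dim, latent dim (illustrative)

global_ctx = rng.normal(0, 0.02, (N_CTX, D_TOK))  # learnable prompt vectors, shared across classes

# Lightweight two-layer MLP mapping a latent code to a bias in token space.
W1 = rng.normal(0, 0.02, (D_LAT, 128))
W2 = rng.normal(0, 0.02, (128, D_TOK))

def prompt_from_latent(z, class_token):
    """Global context plus an instance-specific bias, followed by the class-name token."""
    bias = np.maximum(z @ W1, 0.0) @ W2            # MLP(z) -> local bias
    ctx = global_ctx + bias                        # bias broadcast over all context slots
    return np.vstack([ctx, class_token[None, :]])  # token sequence for the frozen text encoder

z = rng.standard_normal(D_LAT)            # latent sampled from the VAE prior
class_token = rng.standard_normal(D_TOK)  # stand-in embedding of the class name
prompt = prompt_from_latent(z, class_token)
```

Because only `global_ctx` and the MLP weights would be trained, the parameter count stays small relative to CLIP itself, which is what makes the approach data-efficient.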
The experimental evaluation spans three settings: base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning. SHIP delivers consistent performance gains across multiple benchmarks, with notable improvements on traditionally challenging image datasets.
For instance, the proposed method raises average accuracy and the harmonic mean of base- and new-class accuracy significantly over baseline models. SHIP also composes with different established tuning methods, yielding superior results on ImageNet and other datasets and outperforming the CoOp and CLIP-Adapter baselines.
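The downstream step, pooling real few-shot features for base classes with synthesized features for new classes and training an existing method on the union, can be sketched with a nearest-class-mean classifier as a hypothetical stand-in for the actual tuning methods (CoOp, CLIP-Adapter) used in the paper; all counts and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N_BASE, N_NEW, N_SYN = 512, 2, 2, 16  # dims/counts are illustrative

# Real few-shot features exist only for base classes; new classes get synthetic ones.
real = {c: rng.standard_normal((8, D)) for c in range(N_BASE)}
synth = {c: rng.standard_normal((N_SYN, D)) for c in range(N_BASE, N_BASE + N_NEW)}

# Pool both sources, then fit the simplest possible classifier on the union.
pool = {**real, **synth}
prototypes = np.stack([pool[c].mean(axis=0) for c in sorted(pool)])

def classify(x):
    """Assign x to the class whose prototype is most cosine-similar."""
    sims = (prototypes @ x) / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(x))
    return int(np.argmax(sims))

pred = classify(rng.standard_normal(D))
```

The key point the sketch captures is that, after synthesis, new classes are no longer feature-less, so any existing supervised tuning method can be applied unchanged to all classes.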
Implications and Future Developments
Practically, the research highlights the transformative potential of generative models in the zero-shot learning paradigm. By reducing reliance on labeled data, SHIP broadens the applicability of vision-language models to emerging concepts and underrepresented classes, which is essential in real-world settings with long-tailed class distributions.
Theoretically, the work suggests a shift in vision-language modeling, opening avenues for fine-tuning methods that leverage synthetic data. Future research could streamline SHIP's computational demands, investigate its application to more complex dense prediction tasks, and test its adaptability across other large-scale generative frameworks.
In conclusion, the paper contributes substantially to zero-shot learning methodology, challenging data-heavy paradigms while offering a feasible alternative through synthetic prompt generation. The approach addresses existing limitations and opens prospects for further refinement and broader application.