An Examination of "Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners"
This paper introduces CaFo (Cascade of Foundation models), a framework designed to enhance visual recognition in low-data regimes. The approach cascades foundation models to exploit the pre-training knowledge embedded in large-scale generative and contrastive models, specifically CLIP, DINO, DALL-E, and GPT-3, each of which contributes a distinct form of prior knowledge to the few-shot learning task.
Framework Overview
The proposed methodology follows a "Prompt, Generate, then Cache" strategy that systematically integrates three distinct forms of pre-training knowledge to improve few-shot learning performance:
- Prompt with GPT-3: The first step generates enriched textual prompts with GPT-3. Hand-crafted templates are fed to GPT-3, which expands them into prompts rich in linguistic semantics for CLIP's textual encoder. This enrichment is shown to improve text-image alignment and to strengthen CLIP's classification performance in few-shot scenarios (a minimal encoding sketch appears after this list).
- Generate via DALL-E: The second step uses DALL-E's vision-generative capability to synthesize images for each category. This augmentation enriches the few-shot training set, mitigating data scarcity without requiring any additional human-labeled data (see the filtering sketch after this list).
- Cache by CLIP and DINO: The final step constructs a key-value cache model that adaptively blends predictions from CLIP and DINO, leveraging their language-contrastive and vision-contrastive knowledge, respectively. This ensemble refines the classification probabilities, improving accuracy by weighting each prediction according to its distribution (see the cache-model sketch after this list).
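To make the prompting step concrete, the sketch below shows how GPT-3-enriched prompts could be encoded and ensembled into a single CLIP classifier weight. It assumes OpenAI's `clip` package; the class name and prompt strings are illustrative placeholders, not the paper's actual templates or GPT-3 outputs.

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

# One hand-crafted template plus hypothetical GPT-3 expansions of it.
prompts = [
    "a photo of a golden retriever.",
    "a photo of a golden retriever, a friendly dog with a dense golden coat.",
    "a golden retriever, a medium-large gun dog bred to retrieve waterfowl.",
]

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Ensemble the enriched prompts into one classifier weight for this class.
class_weight = text_feats.mean(dim=0)
class_weight = class_weight / class_weight.norm()
```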
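For the generation step, a natural quality control is to rank DALL-E's synthetic candidates by CLIP image-text similarity and keep only the top K per class. The sketch below assumes the candidates have already been generated and saved to disk; the function name, directory layout, and value of `k` are assumptions made for illustration.

```python
from pathlib import Path

import torch
from PIL import Image
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def top_k_synthetic(image_dir: str, class_prompt: str, k: int = 8):
    """Rank synthetic candidates for one class by CLIP similarity, keep the k best."""
    paths = sorted(Path(image_dir).glob("*.png"))
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    text = clip.tokenize([class_prompt]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(text)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        scores = (img_feats @ txt_feats.T).squeeze(1)  # one score per image
    keep = scores.topk(min(k, len(paths))).indices.tolist()
    return [paths[i] for i in keep]
```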
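Finally, a loose sketch of the cache step: few-shot features serve as keys, one-hot labels as values, and a test feature retrieves class logits by similarity. The `beta` sharpening and the agreement-based blending weights are assumptions in the spirit of Tip-Adapter-style cache models, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cache_logits(query, keys, values, beta=5.5):
    """Key-value cache lookup. `query` (N, D) and `keys` (K, D) are assumed
    L2-normalized features; `values` (K, C) are one-hot labels; `beta` is an
    illustrative sharpening temperature."""
    affinity = query @ keys.T                            # cosine similarity, (N, K)
    return torch.exp(-beta * (1.0 - affinity)) @ values  # class logits, (N, C)

def adaptive_ensemble(zero_shot, clip_cache, dino_cache):
    """Blend the CLIP- and DINO-based cache predictions, weighting each by how
    closely its distribution agrees with CLIP's zero-shot prediction."""
    p_zs = F.softmax(zero_shot, dim=-1)
    w_clip = (F.softmax(clip_cache, dim=-1) * p_zs).sum(-1, keepdim=True)
    w_dino = (F.softmax(dino_cache, dim=-1) * p_zs).sum(-1, keepdim=True)
    total = w_clip + w_dino
    return zero_shot + (w_clip / total) * clip_cache + (w_dino / total) * dino_cache
```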
Empirical Validation
Through extensive empirical evaluation on 11 benchmark datasets, CaFo demonstrates superior performance over existing few-shot learning methods. Notably, it surpasses competing models' accuracy in low-shot settings without using any extra annotated data, underscoring the efficacy of integrating multiple foundation models.
Implications and Future Directions
The findings have practical significance for environments where labeled data is scarce. Theoretically, they open avenues for strengthening pre-trained models by synergizing seemingly disparate knowledge sources. The paper encourages future work on assimilating broader pre-training paradigms, including self-supervised and transfer learning approaches not covered by the current models.
Conclusion
This paper articulates a forward-thinking approach to improving few-shot learning performance by harmonizing pre-training knowledge from diverse models. While CaFo already sets a high standard, integrating additional models and exploring broader application use-cases could yield further gains in recognition tasks.