Learning to Compose Soft Prompts for Compositional Zero-Shot Learning (2204.03574v3)

Published 7 Apr 2022 in cs.LG, cs.CL, and cs.CV

Abstract: We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) like CLIP. We develop CSP for compositional zero-shot learning, the task of predicting unseen attribute-object compositions (e.g., old cat and young tiger). VLMs have a flexible text encoder that can represent arbitrary classes as natural language prompts but they often underperform task-specific architectures on the compositional zero-shot benchmark datasets. CSP treats the attributes and objects that define classes as learnable tokens of vocabulary. During training, the vocabulary is tuned to recognize classes that compose tokens in multiple ways (e.g., old cat and white cat). At test time, we recompose the learned attribute-object vocabulary in new combinations to recognize novel classes. We show that CSP outperforms CLIP on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that fine-tunes the prefix context tokens, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to higher-order attribute-attribute-object compositions (e.g., old white cat) and combinations of pretrained attributes and fine-tuned objects. The code is available at https://github.com/BatsResearch/csp.

An Analysis of Compositional Soft Prompting for Vision-Language Models

The paper "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning" introduces a novel learning method called Compositional Soft Prompting (CSP) to advance the zero-shot compositionality of large-scale vision-LLMs (VLMs), specifically for the task of predicting unseen attribute-object compositions. The research focuses on the CLIP model, aiming to enhance its performance on benchmark datasets against task-specific architectures. The methodology and results of the paper are pivotal for improving the integration of pre-trained vision and language encoders in compositional scenarios.

The authors address the limitations of prior approaches that compose separately trained word embeddings and image encoders, which adapt poorly to higher-order compositions such as "old white cat." They propose instead to leverage the intrinsic compositional capabilities of VLMs like CLIP, which, despite extensive pre-training on image-text pairs, fall short when tasked with compositional zero-shot inference.

CSP is a parameter-efficient learning technique that represents attributes and objects as learnable tokens in the model's vocabulary. Unlike prior soft-prompting approaches such as CoOp, which tune a fixed prefix context for a single task, CSP tunes the attribute and object tokens themselves across many attribute-object combinations. The result is a vocabulary whose learned components can be recomposed at test time to classify unseen classes. By treating attributes and objects as distinct vocabulary tokens, the model improves its ability to recognize and generalize compositional patterns.
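
As a concrete illustration of this mechanism, the sketch below splices learnable attribute and object tokens into a CLIP prompt template. It assumes OpenAI's open-source `clip` package; the prompt template, placeholder positions, and helper names such as `compose_prompt` are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of CSP-style prompt construction (assumes OpenAI's `clip` package;
# helper names and the prompt template are illustrative, not the authors' code).
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

attributes = ["old", "young", "white"]   # attribute primitives
objects = ["cat", "tiger"]               # object primitives
embed_dim = model.token_embedding.embedding_dim

# One learnable soft token per attribute and per object; CLIP itself stays frozen.
attr_tokens = nn.Parameter(0.02 * torch.randn(len(attributes), embed_dim, device=device))
obj_tokens = nn.Parameter(0.02 * torch.randn(len(objects), embed_dim, device=device))
for p in model.parameters():
    p.requires_grad_(False)

def compose_prompt(attr_idx: int, obj_idx: int):
    """Token ids and embeddings for 'a photo of [attr] [obj]' with soft attr/obj tokens."""
    tokens = clip.tokenize("a photo of x x").to(device)       # 'x x' are placeholders
    embeds = model.token_embedding(tokens).squeeze(0).clone()  # (seq_len, dim)
    embeds[4] = attr_tokens[attr_idx].to(embeds.dtype)         # placeholder slot for the attribute
    embeds[5] = obj_tokens[obj_idx].to(embeds.dtype)           # placeholder slot for the object
    return tokens, embeds.unsqueeze(0)
```

During training, only `attr_tokens` and `obj_tokens` receive gradients: composed prompts for the seen classes are passed through CLIP's frozen text encoder and matched against image features under a cross-entropy loss.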

The model's efficacy is validated across three benchmark datasets—MIT-States, UT-Zappos, and C-GQA—demonstrating significant improvements. For instance, CSP outperforms CLIP by an average of 10.9 percentage points and CoOp by 5.8 percentage points on the AUC metric. Notably, the results underline CSP's ability to generalize to more complex scenarios, such as higher-order compositions and the fusion of pre-trained with novel attributes. This ability is crucial for advancing AI models' comprehension of nuanced attribute-object relationships without necessitating extensive retraining.

The authors also explore CSP's performance against fully fine-tuned models and specialized architectures, illustrating significant advantages in terms of parameter efficiency and adaptability. While CLIP fine-tuning enhances zero-shot capabilities, it still falls short of CSP's results on certain datasets like UT-Zappos, underscoring the benefits of a compositional approach.

The implications of this research are both practical and theoretical. Practically, CSP enables more efficient adaptation of VLMs in scenarios where extensive retraining is infeasible or data is limited. Theoretically, it opens further avenues for prompt engineering and compositional reasoning within machine learning frameworks. The authors suggest extensions to a broader range of vision-language tasks and to more complex compositional reasoning, and the approach could influence future architectural designs that optimize for composability and parameter efficiency.

In conclusion, the proposed CSP technique is a significant contribution to compositional zero-shot learning, tailoring soft prompting to compositional primitives rather than to whole prompts. The paper deepens our understanding of how large-scale pre-trained models can be adapted, with few additional parameters, to achieve stronger zero-shot inference, advancing the ability of AI systems to reason about the world through language-grounded imagery.

Authors (3)
  1. Nihal V. Nayak
  2. Peilin Yu
  3. Stephen H. Bach
Citations (49)