
Prompt Tuning for Zero-shot Compositional Learning (2312.02191v1)

Published 2 Dec 2023 in cs.CV and cs.AI

Abstract: Open World Compositional Zero-Shot Learning (OW-CZSL) is an extremely challenging task that aims to recognize unseen compositions formed from seen attributes and objects without any prior assumption about the output space. To achieve this goal, a model has to be both "smart" and "knowledgeable". To be smart, a model should be good at reasoning about the interactions between attributes and objects from the seen compositions, while being "knowledgeable" means the model has enough "common sense" about the open world to "foresee" some features of the unseen compositions. Most previous work focuses on the "smart" part, while few provide an effective solution for the "knowledgeable" goal. In this paper, we propose a framework named Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from large pre-trained vision-language models. Extensive experiments show that MMPT obtains new state-of-the-art results on the OW-CZSL task. On the UT-Zappos dataset, MMPT pushes the AUC score to $29.8$, while the previous best score is $26.5$. On the more challenging MIT-States dataset, the AUC score of MMPT is 1.5 times better than the current state-of-the-art.


Summary

  • The paper introduces MMPT, a multi-modal prompt tuning framework that improves zero-shot compositional learning by aligning visual and textual prompts.
  • The framework uses a shared learnable prompt that is projected into both the text and vision branches, with prompt length and insertion depth tuned for performance.
  • MMPT achieves strong results, notably a 29.8 AUC on UT-Zappos and a 4.1 AUC on MIT-States, setting a new state of the art.

The paper introduces Multi-Modal Prompt Tuning (MMPT), an approach designed to improve a model's ability to recognize new, unseen compositions of known attributes and objects, in other words, to make it more "knowledgeable." The authors frame this challenge as Open World Compositional Zero-Shot Learning (OW-CZSL), where a model must identify combinations of attributes and objects that it never encountered during training, without any assumptions about which outputs are possible.
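To make the open-world setting concrete, the sketch below (using made-up primitives rather than either dataset's actual vocabulary) builds the candidate label space as the full Cartesian product of seen attributes and seen objects; most of these pairs never appear during training, yet all of them remain valid prediction targets.

```python
# Illustrative sketch only: the open-world output space is the full Cartesian
# product of seen attribute and object primitives, so unseen pairs are still
# legal predictions. The primitives below are made up for the example.
from itertools import product

attributes = ["wet", "dry", "ripe", "rusty"]          # seen attribute primitives
objects = ["apple", "car", "towel"]                    # seen object primitives
seen_compositions = {("wet", "towel"), ("ripe", "apple"), ("rusty", "car")}

# Open-world output space: every attribute-object pair is a candidate label.
open_world_space = set(product(attributes, objects))

# The compositions the model must recognize despite never seeing them in training.
unseen_compositions = open_world_space - seen_compositions
print(f"{len(open_world_space)} candidate pairs, "
      f"{len(unseen_compositions)} never seen during training")
```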

MMPT leverages large pre-trained vision-language models by applying a structure of learnable prompts tailored to this task. The framework includes text prompts that describe attributes and objects, as well as visual prompts attached to the image input. The idea is to project shared prompts into both the vision and text branches of the framework and keep them aligned, which supports better reasoning about the compositions present in a given image.
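As a rough illustration of this shared-prompt idea (not the paper's exact architecture: the dimensions, single linear projections, and "prepend to the input sequence" choice below are assumptions), a minimal PyTorch-style sketch might look like this:

```python
# Hedged sketch of multi-modal prompt tuning in general: shared learnable
# prompt vectors are projected into the text and vision token spaces and
# prepended to each encoder's input sequence.
import torch
import torch.nn as nn

class SharedPromptAdapter(nn.Module):
    def __init__(self, prompt_len=8, shared_dim=512, text_dim=512, vision_dim=768):
        super().__init__()
        # One set of prompts shared by both modalities (illustrative initialization).
        self.shared_prompts = nn.Parameter(torch.randn(prompt_len, shared_dim) * 0.02)
        self.to_text = nn.Linear(shared_dim, text_dim)      # project into text token space
        self.to_vision = nn.Linear(shared_dim, vision_dim)  # project into patch token space

    def forward(self, text_tokens, vision_tokens):
        # text_tokens:   (B, T, text_dim)   embeddings of e.g. the phrase "wet towel"
        # vision_tokens: (B, P, vision_dim) patch embeddings of the input image
        b = text_tokens.size(0)
        text_prompts = self.to_text(self.shared_prompts).expand(b, -1, -1)
        vision_prompts = self.to_vision(self.shared_prompts).expand(b, -1, -1)
        # Prepend the projected prompts so the (frozen) encoders attend to them.
        return (torch.cat([text_prompts, text_tokens], dim=1),
                torch.cat([vision_prompts, vision_tokens], dim=1))
```

Because both modalities read projections of the same underlying prompts, gradients from the composition-classification loss update a single shared set of parameters, which is what keeps the text and vision cues aligned.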

The proposed framework was evaluated on two standard CZSL benchmarks and surpasses the current state-of-the-art methods. On the UT-Zappos dataset, MMPT achieves an AUC score of 29.8, exceeding the previous best score of 26.5. On the more challenging MIT-States dataset, MMPT pushes the AUC to 4.1, roughly 1.5 times the previous best.
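For context, the AUC reported here is the usual compositional zero-shot metric: a calibration bias added to the scores of unseen compositions is swept over a range, seen and unseen accuracies are measured at each bias, and the area under the resulting seen-versus-unseen accuracy curve is reported. The sketch below follows that common protocol; the paper's exact evaluation code may differ in details such as the bias range.

```python
# Hedged sketch of the standard CZSL AUC computation (not this paper's code).
import numpy as np

def czsl_auc(scores, labels, unseen_mask, is_seen_sample, n_bias=50):
    # scores:         (N, C) model score for every candidate composition per image
    # labels:         (N,)   index of each image's ground-truth composition
    # unseen_mask:    (C,)   1.0 for compositions never seen in training, else 0.0
    # is_seen_sample: (N,)   True if the image's true composition was seen in training
    seen_accs, unseen_accs = [], []
    for bias in np.linspace(scores.min(), scores.max(), n_bias):
        biased = scores + bias * unseen_mask            # nudge unseen compositions up or down
        correct = biased.argmax(axis=1) == labels
        seen_accs.append(correct[is_seen_sample].mean())
        unseen_accs.append(correct[~is_seen_sample].mean())
    order = np.argsort(seen_accs)                        # integrate along seen accuracy
    seen = np.asarray(seen_accs)[order]
    unseen = np.asarray(unseen_accs)[order]
    # Trapezoid rule over the seen-vs-unseen accuracy curve.
    return float(np.sum(0.5 * (unseen[1:] + unseen[:-1]) * np.diff(seen)))
```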

A significant portion of the paper is dedicated to studying design choices such as the length of the shared prompt and the number of encoder layers that receive these prompts. These investigations indicate that MMPT's performance gain comes from the proposed combination of visual and text prompt tuning, which effectively bridges the gap between the vision and language modalities. The ability to improve zero-shot learning in this way opens up new possibilities for AI applications that need a nuanced understanding of new visual concepts without extensive retraining.
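A sweep over those two design choices could be organized along the lines of the sketch below; `train_and_evaluate` is a hypothetical placeholder standing in for a full training-plus-validation run, not a function from the paper or any library.

```python
# Illustrative hyperparameter sweep over prompt length and insertion depth.
from itertools import product

PROMPT_LENGTHS = [2, 4, 8, 16]    # number of shared prompt tokens (assumed grid)
PROMPT_DEPTHS = [1, 3, 6, 9, 12]  # how many encoder layers receive the prompts (assumed grid)

def train_and_evaluate(prompt_len: int, prompt_depth: int) -> float:
    # Hypothetical placeholder: train MMPT-style prompts with these settings
    # and return the validation AUC. A dummy value keeps the sketch runnable.
    return 0.0

best_auc, best_cfg = float("-inf"), None
for length, depth in product(PROMPT_LENGTHS, PROMPT_DEPTHS):
    auc = train_and_evaluate(prompt_len=length, prompt_depth=depth)
    if auc > best_auc:
        best_auc, best_cfg = auc, (length, depth)

print(f"best validation AUC {best_auc:.1f} with (prompt length, depth) = {best_cfg}")
```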
