
Prompt Tuning for Zero-shot Compositional Learning (2312.02191v1)

Published 2 Dec 2023 in cs.CV and cs.AI

Abstract: Open World Compositional Zero-Shot Learning (OW-CZSL) is an extremely challenging task that aims to recognize unseen compositions formed from seen attributes and objects without any prior assumption about the output space. To achieve this goal, a model has to be both "smart" and "knowledgeable". To be smart, a model should be good at reasoning about the interactions between attributes and objects from the seen compositions, while being "knowledgeable" means the model has enough "common sense" about the open world to "foresee" some features of the unseen compositions. Most previous work focuses on the "smart" part, while few provide an effective solution for the "knowledgeable" goal. In this paper, we propose a framework named Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from large pre-trained vision-language models. Extensive experiments show that MMPT obtains new state-of-the-art results on the OW-CZSL task. On the UT-Zappos dataset, MMPT pushes the AUC score to $29.8$, while the previous best score is $26.5$. On the more challenging MIT-States dataset, the AUC score of MMPT is 1.5 times better than the current state-of-the-art.


Summary

  • The paper introduces MMPT, a multi-modal prompt tuning framework that improves zero-shot compositional learning by aligning visual and textual prompts.
  • The framework uses a shared learnable prompt that is projected into both the text and vision branches, with prompt length and insertion depth tuned for performance.
  • MMPT achieves strong results, notably a 29.8 AUC on UT-Zappos and a 4.1 AUC on MIT-States, setting a new state of the art.

The paper introduces Multi-Modal Prompt Tuning (MMPT), an approach designed to improve a model's ability to recognize new, unseen compositions of known attributes and objects, in other words, to make it more "knowledgeable." The authors frame this challenge as Open World Compositional Zero-Shot Learning (OW-CZSL), where a model must identify combinations of attributes and objects that it never encountered during training, without any assumptions about which outputs are possible.
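To make the open-world setting concrete, the sketch below (using made-up primitives rather than either dataset's actual vocabulary) builds the candidate label space as the full Cartesian product of seen attributes and seen objects; most of these pairs never appear during training, yet all of them remain valid prediction targets.

```python
# Illustrative sketch only: the open-world output space is the full Cartesian
# product of seen attribute and object primitives, so unseen pairs are still
# legal predictions. The primitives below are made up for the example.
from itertools import product

attributes = ["wet", "dry", "ripe", "rusty"]          # seen attribute primitives
objects = ["apple", "car", "towel"]                    # seen object primitives
seen_compositions = {("wet", "towel"), ("ripe", "apple"), ("rusty", "car")}

# Open-world output space: every attribute-object pair is a candidate label.
open_world_space = set(product(attributes, objects))

# The compositions the model must recognize despite never seeing them in training.
unseen_compositions = open_world_space - seen_compositions
print(f"{len(open_world_space)} candidate pairs, "
      f"{len(unseen_compositions)} never seen during training")
```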

MMPT leverages large pre-trained vision-language models by applying a structure of learnable prompts tailored to this task. The framework includes text prompts that describe attributes and objects, as well as visual prompts attached to the image input. The idea is to project shared prompts into both the vision and text branches of the framework and keep them aligned, which supports better reasoning about the compositions present in a given image.
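As a rough illustration of this shared-prompt idea (not the paper's exact architecture: the dimensions, single linear projections, and "prepend to the input sequence" choice below are assumptions), a minimal PyTorch-style sketch might look like this:

```python
# Hedged sketch of multi-modal prompt tuning in general: shared learnable
# prompt vectors are projected into the text and vision token spaces and
# prepended to each encoder's input sequence.
import torch
import torch.nn as nn

class SharedPromptAdapter(nn.Module):
    def __init__(self, prompt_len=8, shared_dim=512, text_dim=512, vision_dim=768):
        super().__init__()
        # One set of prompts shared by both modalities (illustrative initialization).
        self.shared_prompts = nn.Parameter(torch.randn(prompt_len, shared_dim) * 0.02)
        self.to_text = nn.Linear(shared_dim, text_dim)      # project into text token space
        self.to_vision = nn.Linear(shared_dim, vision_dim)  # project into patch token space

    def forward(self, text_tokens, vision_tokens):
        # text_tokens:   (B, T, text_dim)   embeddings of e.g. the phrase "wet towel"
        # vision_tokens: (B, P, vision_dim) patch embeddings of the input image
        b = text_tokens.size(0)
        text_prompts = self.to_text(self.shared_prompts).expand(b, -1, -1)
        vision_prompts = self.to_vision(self.shared_prompts).expand(b, -1, -1)
        # Prepend the projected prompts so the (frozen) encoders attend to them.
        return (torch.cat([text_prompts, text_tokens], dim=1),
                torch.cat([vision_prompts, vision_tokens], dim=1))
```

Because both modalities read projections of the same underlying prompts, gradients from the composition-classification loss update a single shared set of parameters, which is what keeps the text and vision cues aligned.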

The proposed framework was evaluated on two standard CZSL benchmarks and surpasses the current state-of-the-art methods. On the UT-Zappos dataset, MMPT achieves an AUC score of 29.8, exceeding the previous best score of 26.5. On the more challenging MIT-States dataset, MMPT pushes the AUC to 4.1, roughly 1.5 times the previous best.
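For context, the AUC reported here is the usual compositional zero-shot metric: a calibration bias added to the scores of unseen compositions is swept over a range, seen and unseen accuracies are measured at each bias, and the area under the resulting seen-versus-unseen accuracy curve is reported. The sketch below follows that common protocol; the paper's exact evaluation code may differ in details such as the bias range.

```python
# Hedged sketch of the standard CZSL AUC computation (not this paper's code).
import numpy as np

def czsl_auc(scores, labels, unseen_mask, is_seen_sample, n_bias=50):
    # scores:         (N, C) model score for every candidate composition per image
    # labels:         (N,)   index of each image's ground-truth composition
    # unseen_mask:    (C,)   1.0 for compositions never seen in training, else 0.0
    # is_seen_sample: (N,)   True if the image's true composition was seen in training
    seen_accs, unseen_accs = [], []
    for bias in np.linspace(scores.min(), scores.max(), n_bias):
        biased = scores + bias * unseen_mask            # nudge unseen compositions up or down
        correct = biased.argmax(axis=1) == labels
        seen_accs.append(correct[is_seen_sample].mean())
        unseen_accs.append(correct[~is_seen_sample].mean())
    order = np.argsort(seen_accs)                        # integrate along seen accuracy
    seen = np.asarray(seen_accs)[order]
    unseen = np.asarray(unseen_accs)[order]
    # Trapezoid rule over the seen-vs-unseen accuracy curve.
    return float(np.sum(0.5 * (unseen[1:] + unseen[:-1]) * np.diff(seen)))
```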

A significant portion of the paper is dedicated to studying design choices such as the length of the shared prompt and the number of encoder layers that receive these prompts. These investigations indicate that MMPT's performance gain comes from the proposed combination of visual and text prompt tuning, which effectively bridges the gap between the vision and language modalities. The ability to improve zero-shot learning in this way opens up new possibilities for AI applications that need a nuanced understanding of new visual concepts without extensive retraining.
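A sweep over those two design choices could be organized along the lines of the sketch below; `train_and_evaluate` is a hypothetical placeholder standing in for a full training-plus-validation run, not a function from the paper or any library.

```python
# Illustrative hyperparameter sweep over prompt length and insertion depth.
from itertools import product

PROMPT_LENGTHS = [2, 4, 8, 16]    # number of shared prompt tokens (assumed grid)
PROMPT_DEPTHS = [1, 3, 6, 9, 12]  # how many encoder layers receive the prompts (assumed grid)

def train_and_evaluate(prompt_len: int, prompt_depth: int) -> float:
    # Hypothetical placeholder: train MMPT-style prompts with these settings
    # and return the validation AUC. A dummy value keeps the sketch runnable.
    return 0.0

best_auc, best_cfg = float("-inf"), None
for length, depth in product(PROMPT_LENGTHS, PROMPT_DEPTHS):
    auc = train_and_evaluate(prompt_len=length, prompt_depth=depth)
    if auc > best_auc:
        best_auc, best_cfg = auc, (length, depth)

print(f"best validation AUC {best_auc:.1f} with (prompt length, depth) = {best_cfg}")
```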
