Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models (2404.10357v2)

Published 16 Apr 2024 in cs.CV

Abstract: Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which are capable of generating Multi-Knowledge Representation for an input image without requiring additional priors. Experimentally, we conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.

This paper introduces a novel approach, Context Optimization with Multi-Knowledge Representation (CoKnow), to enhance the performance of Vision-Language Models (VLMs) such as CLIP in downstream tasks. The primary focus is on addressing the limitation of current prompt learning methods, which often lack diversity in prompt templates, thereby restricting the potential of pre-trained VLMs. The authors posit that a single text prompt may not fully capture the complexity of an image, and propose enriching the prompt context by incorporating knowledge from multiple perspectives and abstraction levels, termed Multi-Knowledge Representation.

The authors define Multi-Knowledge as comprising three types: visual knowledge (VK), non-visual knowledge (NVK), and panoramic knowledge (PK). VK consists of captions describing the image or its category. NVK captures more abstract knowledge that goes beyond visual appearance. PK combines multi-level descriptions, such as both VK and NVK, into a single comprehensive description. The authors use GPT-4 to automatically generate Multi-Knowledge Representation from a set of simple prompt templates.
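
The paper's exact prompt wording is not reproduced in this summary, so the snippet below is only a hypothetical sketch of how per-type templates for GPT-4 might be organized; the template strings and function names are placeholders, not the authors' prompts.

```python
# Hypothetical templates for generating Multi-Knowledge with GPT-4.
# The wording is illustrative only and is not taken from the paper.
MULTI_KNOWLEDGE_PROMPTS = {
    # Visual knowledge (VK): concrete, caption-like descriptions.
    "VK": "Describe the visual appearance of a {classname} in one sentence.",
    # Non-visual knowledge (NVK): abstract knowledge beyond visual aspects.
    "NVK": "Describe a {classname} in terms of its function, origin, or typical use, without visual details.",
    # Panoramic knowledge (PK): a comprehensive, multi-level description.
    "PK": "Give one comprehensive sentence about a {classname}, covering both its appearance and its broader context.",
}

def build_prompt(kind: str, classname: str) -> str:
    """Fill the chosen knowledge-generation template for a category name."""
    return MULTI_KNOWLEDGE_PROMPTS[kind].format(classname=classname)

# Example: build_prompt("PK", "golden retriever")
```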

The CoKnow framework consists of two key modules: a prompt optimizer guided by Multi-Knowledge representation and a lightweight semantic knowledge mapper. The prompt optimizer facilitates adaptive learning of prompt templates rich in domain knowledge. The semantic knowledge mapper generates Multi-Knowledge Representation from images without requiring additional input. The framework is designed to be plug-and-play compatible with VLMs beyond CLIP.

The method involves inputting three types of templates into the text encoder: learnable context (soft prompt), Multi-Knowledge, and hand-crafted templates. The image encoder outputs are processed through the semantic knowledge mappers. Contrastive loss is calculated between the image embeddings and the target template embeddings. The original image representation and the mapped image representations are combined using a weighting parameter $\beta$ before undergoing contrastive loss calculation with the learnable contexts. The semantic knowledge mappers, $w_1$ and $w_2$, are implemented as three-layer fully connected neural networks, where the hidden layer dimension is one-fourth of the input dimension, followed by a ReLU activation function.
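
As a minimal sketch of that mapper architecture, assuming PyTorch and placing a ReLU after each hidden layer (a detail the summary does not pin down):

```python
import torch
import torch.nn as nn

class SemanticKnowledgeMapper(nn.Module):
    """Three-layer fully connected mapper whose hidden dimension is one
    fourth of the input dimension, with ReLU activations. This is an
    illustrative sketch, not the authors' released implementation."""

    def __init__(self, dim: int):
        super().__init__()
        hidden = dim // 4
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Map a CLIP image embedding to a knowledge-specific representation.
        return self.net(x)

# Two mappers w_1 and w_2, e.g. for 1024-dim ResNet-50 CLIP image features.
w1 = SemanticKnowledgeMapper(1024)
w2 = SemanticKnowledgeMapper(1024)
```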

During inference, the probability of an image $I$ belonging to category $i$ is calculated using the following equations (a short code sketch of this step is given after the symbol definitions below):

$$x = \beta \cdot x_0 + \frac{1-\beta}{2} \cdot x_1 + \frac{1-\beta}{2} \cdot x_2$$

  • $x$: combined image representation
  • $\beta$: weighting parameter
  • $x_0$: output of the image encoder for the given image $I$
  • $x_1$, $x_2$: outputs of the semantic knowledge mappers $w_1$ and $w_2$ for the given image $I$

$$p(y = i \mid I) = \frac{\exp(\mathrm{sim}(w_i, x) / \tau)}{\sum_{k=1}^{K} \exp(\mathrm{sim}(w_k, x) / \tau)}$$

  • $p(y = i \mid I)$: probability of image $I$ belonging to category $i$
  • $w_i$: output of the text encoder when the category is $i$
  • $\mathrm{sim}(w_i, x)$: cosine similarity between $w_i$ and $x$
  • $\tau$: temperature parameter
  • $K$: total number of categories
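
To make the two equations concrete, here is a minimal PyTorch sketch of the inference step for a single image; the values of $\beta$ and $\tau$ are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def coknow_inference(x0, x1, x2, text_embeds, beta=0.7, tau=0.01):
    """x0: image-encoder output (D,); x1, x2: mapper outputs (D,);
    text_embeds: text-encoder outputs w_k for the K categories, shape (K, D).
    Returns p(y = i | I) for all K categories. Illustrative sketch only."""
    # Combined representation: x = beta*x0 + (1-beta)/2 * x1 + (1-beta)/2 * x2
    x = beta * x0 + (1 - beta) / 2 * x1 + (1 - beta) / 2 * x2
    # Cosine similarity between x and each class embedding, scaled by tau.
    sims = F.cosine_similarity(text_embeds, x.unsqueeze(0), dim=-1)  # (K,)
    return F.softmax(sims / tau, dim=-1)

# Example with random features: 10 classes, 1024-dim embeddings.
# probs = coknow_inference(torch.randn(1024), torch.randn(1024),
#                          torch.randn(1024), torch.randn(10, 1024))
```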

Experiments were conducted on 11 publicly available datasets: ImageNet, Caltech, Oxford-Pets, Flowers, Food101, Stanford Cars, FGVC Aircraft, EuroSAT, UCF101, DTD, and SUN397. The authors followed the few-shot evaluation protocol of CoOp, training with 1, 2, 4, 8, and 16 shots. ResNet-50 was used as the backbone for the CLIP image encoder, with ViT-B/16 also evaluated. CoKnow consistently outperformed previous methods, and its average top-1 accuracy surpasses CoOp on every dataset. With 4 shots, CoKnow approaches the results of CoOp with 8 shots and surpasses Wise-FT.
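
For reference, an N-shot training subset in the spirit of the CoOp protocol keeps a fixed number of labelled images per class; the helper below is a generic sketch (the dataset interface is an assumption), not the authors' data pipeline.

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, shots, seed=0):
    """Keep `shots` (image, label) pairs per class. `dataset` is assumed to
    be an iterable of (image, label) pairs; this is illustrative only."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append((image, label))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:shots])
    return subset

# Example: build the 1-, 2-, 4-, 8- and 16-shot training splits.
# splits = {n: sample_few_shot(train_set, n) for n in (1, 2, 4, 8, 16)}
```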

Ablation studies were performed to analyze the impact of the different Multi-Knowledge types (VK, NVK, PK) and of the weighting parameter $\beta$. The results indicate that PK generally provides the best performance. The authors also explored the impact of different context lengths and classname positions. Additionally, the paper investigates the effect of mapping the original CLIP image representations through the semantic knowledge mappers; the results indicate that the original CLIP image representations have a significant impact on Prompt Learning, especially in the 1-shot setting.

The paper also evaluated the robustness of CoKnow under out-of-distribution (OOD) conditions, where prompt learning was conducted on ImageNet and tested on ImageNetV2. The results suggest that the method effectively generalizes to out-of-distribution datasets.

References (37)
  1. Cifar-10: Knn-based ensemble of classifiers. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1192–1195. IEEE, 2016.
  2. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  3. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014.
  4. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  5. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
  6. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  7. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004.
  8. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
  9. Texts as images in prompt tuning for multi-label image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2808–2817, 2023a.
  10. Calip: Zero-shot enhancement of clip with parameter-free attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 746–754, 2023b.
  11. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
  12. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
  13. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  14. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  15. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
  16. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19325–19337, 2023.
  17. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  18. Enhancing clip with gpt-4: Harnessing visual descriptions as prompts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 262–271, 2023.
  19. Context-aware robust fine-tuning. International Journal of Computer Vision, pages 1–16, 2023.
  20. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
  21. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
  22. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021a.
  23. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021b.
  24. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019.
  25. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
  26. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  27. Fine-grained retrieval prompt tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2644–2652, 2023.
  28. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022.
  29. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010.
  30. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6757–6767, 2023.
  31. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  32. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.
  33. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022.
  34. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
  35. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022a.
  36. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022b.
  37. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15659–15669, 2023.
Authors (6)
  1. Enming Zhang (14 papers)
  2. Yingying Chen (37 papers)
  3. Qinghai Miao (5 papers)
  4. Ming Tang (199 papers)
  5. Jinqiao Wang (76 papers)
  6. Bingke Zhu (13 papers)