Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models (2312.06323v1)
Abstract: Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. With the emergence of LLMs, recent studies have explored using category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack the structured information needed to represent the interconnections among the entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates leveraging LLMs to build, for each description, a graph that models the entities and attributes describing the category, as well as their correlations. Existing prompt tuning methods are ill-equipped to handle this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module that captures pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interconnections and empowers the model to handle more complex, long-range relationships. Extensive experiments demonstrate that HPT is highly effective and generalizes much better than existing state-of-the-art (SOTA) methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT.
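To make the core idea concrete, below is a minimal, illustrative sketch of what a relationship-guided attention module could look like: self-attention over the entity/attribute tokens of a description graph, with the pairwise relationship structure injected as an additive bias on the attention logits. All names, shapes, and the bias construction here are assumptions for exposition, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class RelationshipGuidedAttention(nn.Module):
    """Illustrative sketch (not the authors' code): multi-head self-attention
    over entity/attribute token embeddings, where a pairwise relationship
    matrix derived from the description graph biases the attention logits."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, rel_bias: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, N, dim)  -- embeddings of N entity/attribute tokens
        # rel_bias: (B, N, N)    -- e.g. log-adjacency of the description graph
        B, N, C = tokens.shape
        qkv = self.qkv(tokens).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # raw logits: (B, heads, N, N)
        attn = attn + rel_bias.unsqueeze(1)             # inject graph structure per pair
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Toy usage with hypothetical numbers: 6 entity/attribute tokens for one category.
tokens = torch.randn(1, 6, 512)
adj = torch.randint(0, 2, (1, 6, 6)).float()   # stand-in graph adjacency
rel_bias = torch.log(adj + 1e-6)               # unrelated pairs get a large negative bias
low_level_prompt = RelationshipGuidedAttention(dim=512)(tokens, rel_bias)
```

Under this reading, the low-level prompts attend strongly only along edges of the LLM-built graph, while the high-level and global-level prompts described in the abstract would sit above this module to capture overall semantics and cross-level links.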