Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning (2403.01209v1)

Published 2 Mar 2024 in cs.CV

Abstract: This paper proposes a novel data-free framework for multi-label image recognition that requires no training data: it leverages knowledge from a pre-trained large language model (LLM) to learn prompts that adapt a pre-trained vision-language model (VLM), such as CLIP, to multi-label classification. By posing well-designed questions to the LLM, we acquire comprehensive knowledge about the characteristics and contexts of objects, which provides valuable text descriptions for learning prompts. We then propose a hierarchical prompt learning method that takes multi-label dependencies into account, wherein a subset of category-specific prompt tokens is shared when the corresponding objects exhibit similar attributes or are likely to co-occur. Benefiting from the strong alignment between the visual and linguistic semantics of CLIP, the hierarchical prompts learned from text descriptions are applied to classify images at inference time. Our framework presents a new way to explore synergies among multiple pre-trained models for novel category recognition. Extensive experiments on three public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that our method outperforms state-of-the-art methods, in particular surpassing zero-shot multi-label recognition methods by 4.7% mAP on MS-COCO.
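The core scoring idea described in the abstract can be sketched in a few lines: hierarchical prompts give each category an embedding built from category-specific tokens plus tokens shared with related categories, and an image is then scored against every category independently via CLIP-style cosine similarity. The toy sketch below uses random NumPy vectors in place of real CLIP text/image embeddings, and the category names, group structure, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size; real CLIP embeddings are 512-768 dims

categories = ["dog", "cat", "frisbee"]
# Hierarchical prompts: a shared token for categories that often co-occur
# or share attributes ("dog" and "cat" here), plus one category-specific
# token each. All are toy stand-ins for learnable prompt-token embeddings.
shared = {"animals": rng.normal(size=dim)}
group_of = {"dog": "animals", "cat": "animals", "frisbee": None}
specific = {c: rng.normal(size=dim) for c in categories}

def prompt_embedding(cat: str) -> np.ndarray:
    """Combine the shared-group token and the category-specific token,
    then L2-normalize (mimicking a text encoder mapping a prompt to one
    unit vector)."""
    vec = specific[cat].copy()
    if group_of[cat] is not None:
        vec += shared[group_of[cat]]
    return vec / np.linalg.norm(vec)

def multilabel_scores(image_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between the image embedding and each category
    prompt, passed through a sigmoid so each label gets an independent
    presence probability (multi-label, not softmax)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = np.array([prompt_embedding(c) @ image_emb for c in categories])
    return 1.0 / (1.0 + np.exp(-sims / 0.07))  # 0.07: CLIP-like temperature

probs = multilabel_scores(rng.normal(size=dim))
print(dict(zip(categories, probs.round(3))))
```

Because each label is scored with its own sigmoid rather than a shared softmax, several labels can fire at once, which is what distinguishes this multi-label setup from standard zero-shot single-label CLIP classification.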

Authors (7)
  1. Shuo Yang (244 papers)
  2. Zirui Shang (5 papers)
  3. Yongqi Wang (24 papers)
  4. Derong Deng (1 paper)
  5. Hongwei Chen (37 papers)
  6. Qiyuan Cheng (9 papers)
  7. Xinxiao Wu (21 papers)
Citations (2)