TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt (2405.06926v1)

Published 11 May 2024 in cs.CV

Abstract: The recent introduction of prompt tuning based on pre-trained vision-language models has dramatically improved the performance of multi-label image classification. However, existing strategies still have drawbacks: they either exploit massive labeled visual data at high cost, or use text data only for text prompt tuning and thus fail to learn the diversity of visual knowledge. Hence, the application scenarios of these methods are limited. In this paper, we propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem. Specifically, we first learn a pseudo-visual prompt for each category, mining diverse visual knowledge through the well-aligned embedding space of pre-trained vision-language models. Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from the pseudo-visual prompts to the text prompts, enhancing their visual representation abilities. Experimental results on the VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method surpasses state-of-the-art (SOTA) methods across various settings for multi-label image classification tasks. The code is available at https://github.com/njustkmg/PVP.
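The abstract's core idea can be sketched in a toy form: a learnable "pseudo-visual" vector per category lives in the shared vision-language embedding space, a co-learning step transfers its visual knowledge into the matching text prompt, and an image is then scored independently against each category prompt (multi-label, not softmax). This is a minimal illustrative sketch only; all function names and the simple interpolation-style update are assumptions, not the paper's actual dual-adapter implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def multilabel_scores(image_emb, class_prompts):
    """Score an image embedding against each category's prompt embedding.

    In multi-label classification each class is scored independently
    (e.g. thresholded per class), not normalised across classes.
    """
    return {name: cosine(image_emb, emb) for name, emb in class_prompts.items()}

def co_learning_step(text_prompts, pseudo_visual_prompts, lr=0.1):
    """Toy knowledge-transfer step: nudge each text prompt toward the
    pseudo-visual prompt of the same category (a stand-in for the paper's
    dual-adapter co-learning strategy)."""
    updated = {}
    for name, t in text_prompts.items():
        p = pseudo_visual_prompts[name]
        updated[name] = [ti + lr * (pi - ti) for ti, pi in zip(t, p)]
    return updated

# Tiny worked example in a 3-dimensional embedding space.
pvp = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
text = {"cat": [0.6, 0.4, 0.0], "dog": [0.3, 0.7, 0.0]}
text = co_learning_step(text, pvp, lr=0.5)  # text prompts move toward PVPs
image = [0.9, 0.1, 0.0]                     # an image embedding, mostly "cat"
scores = multilabel_scores(image, text)     # per-class similarity scores
```

In the real method the pseudo-visual prompts are learned from text-only supervision in the pre-aligned CLIP-style space, so no labeled images are needed; the sketch above only shows the direction of knowledge transfer, from visual-side prompts into text prompts.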

Authors (6)
  1. Xiangyu Wu (40 papers)
  2. Qing-Yuan Jiang (12 papers)
  3. Yang Yang (883 papers)
  4. Yi-Feng Wu (3 papers)
  5. Qing-Guo Chen (19 papers)
  6. Jianfeng Lu (273 papers)
Citations (5)