SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition (2407.20920v1)
Abstract: Multi-label image recognition is a fundamental task in computer vision. Recently, Vision-Language Models (VLMs) have made notable advancements in this area. However, previous methods fail to effectively leverage the rich knowledge in large language models (LLMs) and often incorporate label semantics into visual features unidirectionally. To overcome these problems, we propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of VLMs. Specifically, we develop an in-context learning approach to associate the inherent knowledge from LLMs. Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually, and then aggregate them carefully through a quaternion network. Moreover, we present Gated Dual-Modal Alignments (GDMA) to enable bidirectional interaction between the visual and linguistic modalities while eliminating redundant cross-modal information, enabling more efficient region-level alignments. Rather than making the final prediction in a sharp manner as in previous works, we propose a soft aggregator to jointly consider results from all image regions. With the help of flexible prompting and gated alignments, SSPA is generalizable to specific domains. Extensive experiments on nine datasets from three domains (i.e., natural images, pedestrian attributes, and remote sensing) demonstrate the state-of-the-art performance of SSPA. Further analyses verify the effectiveness of SSP and the interpretability of GDMA. The code will be made public.
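To make the pipeline sketched in the abstract more concrete, the snippet below is a minimal PyTorch sketch of two of the stated ideas: a gated, bidirectional interaction between region-level visual features and label embeddings, followed by a soft (softmax-weighted) aggregation over regions rather than a hard maximum. This is not the authors' implementation; the module name `GatedAlignmentSketch`, the gating form, the feature dimensions, and the aggregation weights are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the SSPA reference code) of gated
# cross-modal alignment plus a soft aggregator over image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAlignmentSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Gates deciding how much of the other modality to admit (assumed form).
        self.vis_gate = nn.Linear(2 * dim, dim)
        self.txt_gate = nn.Linear(2 * dim, dim)

    def forward(self, region_feats: torch.Tensor, label_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, D) visual features for R image regions
        # label_embeds: (C, D) text embeddings for C labels
        B, R, D = region_feats.shape
        C = label_embeds.shape[0]
        txt = label_embeds.unsqueeze(0).expand(B, C, D)

        # Bidirectional gating: each modality is modulated by a sigmoid gate
        # computed from a pooled summary of the other modality (an assumption).
        vis_summary = region_feats.mean(dim=1, keepdim=True).expand(B, C, D)
        txt_summary = txt.mean(dim=1, keepdim=True).expand(B, R, D)
        gated_txt = txt * torch.sigmoid(self.txt_gate(torch.cat([txt, vis_summary], dim=-1)))
        gated_vis = region_feats * torch.sigmoid(self.vis_gate(torch.cat([region_feats, txt_summary], dim=-1)))

        # Region-level alignment scores between gated features: (B, R, C)
        scores = torch.einsum(
            "brd,bcd->brc",
            F.normalize(gated_vis, dim=-1),
            F.normalize(gated_txt, dim=-1),
        )

        # Soft aggregator: weight each region by a softmax over its scores
        # instead of taking a hard max over regions.
        weights = scores.softmax(dim=1)
        return (weights * scores).sum(dim=1)  # (B, C) per-label logits

# Example: 2 images, 49 regions, 512-d features, 20 candidate labels.
logits = GatedAlignmentSketch(512)(torch.randn(2, 49, 512), torch.randn(20, 512))
print(logits.shape)  # torch.Size([2, 20])
```

The softmax-weighted pooling is one way to realize a "soft" decision over all regions; the paper's exact aggregator and gating functions may differ.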
Authors: Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei, Stan Z. Li